How Developers’ Experience and Ability Influence
Web Application Comprehension Tasks Supported
by UML Stereotypes: a Series of Four Experiments
Filippo Ricca (1), Massimiliano Di Penta (2), Marco Torchiano (3), Paolo Tonella (4), Mariano Ceccato (4)
(1) Unita’ CINI at DISI - Iniziativa Software Finmeccanica, Genova, Italy
(2) University of Sannio, Dept. of Engineering, Benevento, Italy
(3) Politecnico di Torino, Italy
(4) Fondazione Bruno Kessler, Trento, Italy
Abstract
In recent years, several design notations have been proposed to model domain-specific applications or reference
architectures, often extending UML with stereotypes, constraints, and tagged values. In particular, James Conallen
has proposed the UML Web Application Extension (WAE): a UML extension to model Web applications.
The aim of our empirical investigation is to test whether the usage of the Conallen notation supports comprehension and maintenance activities with significant benefits, and whether such benefits depend on developers’
ability and experience.
This paper reports and discusses the results of a series of four experiments performed in different locations
and with subjects possessing different experience—namely undergraduate students, graduate students, and research
associates—and different ability levels. The experiments aim at comparing performances of subjects in comprehension tasks where they have the source code complemented either by standard UML diagrams or by diagrams
stereotyped using the Conallen notation.
Results indicate that, although in general it is not possible to observe any significant benefit associated with the
usage of stereotyped diagrams, significant improvements are visible for subjects with a lower ability or experience.
In conclusion, the availability of stereotypes reduces the gap between subjects with low skill or experience
and highly skilled or experienced subjects. Organizations employing developers with low experience can achieve
a significant performance improvement by adopting stereotyped UML diagrams for Web applications.
I. INTRODUCTION
UML (Unified Modeling Language [1]) is a general purpose language used especially for the analysis
and design of object-oriented systems. It offers a wide spectrum of diagrams, useful for the representation
of various static and dynamic aspects of the system being designed. When modeling a problem belonging
to a specific domain, we face the need to assign a specific semantics to UML diagram elements: we
may want to indicate that a UML class represents an antenna in a telecommunication system, or a Web page
in a Web application, or that an association represents a wireless communication or a hyperlink between
two Web pages. In these cases, the designer has two options: one is to use standard UML, and specify the
additional semantics through informal notes; another is to take advantage of the extension mechanisms
that UML provides [1]. One such mechanism is based on the definition of stereotypes. Stereotypes are
UML modeling entities for which a specific semantics is defined by the user. Stereotyped UML elements
can be represented with a user-defined, intuitive graphical notation (i.e., icons), that replaces the standard
stereotype specification.
A domain where stereotypes can be used to add semantics (otherwise implicit) to diagram elements
is Web application modeling. In fact, several entities have very special meaning in such a domain. For
example, classes may be used to represent Web pages, which in turn can be server side or client side
pages. Other special elements that deserve an ad-hoc representation in design diagrams include forms,
form fields, the form-submission relationship between client and server pages, and hyperlinks between
pages. WAE (Web Application Extension [2]) is an example of a Web application modeling notation built
on top of UML thanks to the stereotype extension mechanism.
While a notation such as WAE gives developers a way to model Web application specific elements in
UML diagrams, like any new notation it requires proper training to be learned. Moreover, unlike
standard UML diagrams, stereotyped diagrams may limit information exchange with other members of
the project team whenever they are not properly understood by everybody.
It is important to remark that, usually during a software development or maintenance task, design
diagrams—either stereotyped or not—are accessed as auxiliary information sources, when the source
code is being understood and modified. As a consequence, they should provide information that is hard
to get directly from the source code, in order to be regarded as useful by developers. Although the
benefits of Web application stereotypes are widely assumed, there is no empirical evidence of the
benefits they actually deliver.
We have designed and conducted a series of four controlled experiments to investigate the effect
of the use of UML stereotypes in Web application design on source code maintenance activities. We
have evaluated how the use of WAE affects the fundamental activity of program understanding. Subjects
were requested to answer a comprehension questionnaire. They could acquire knowledge about the Web
application to be understood by accessing the source code and either standard UML diagrams (first
treatment) or stereotyped WAE diagrams (second treatment) of the application, in accordance with a
counter-balanced design. Thus, differently from previous experiments [3], [4], we put subjects in a context
where they had both the source code and diagrams available, as this is representative of a realistic setting for
software maintenance tasks, where developers can access the source code, plus diagrams in case the latter
are available. The four studies involved subjects with different experience levels, namely undergraduate
students, graduate students and research associates (footnote 1), and having different levels of ability. This allowed
us to analyze the influence of subjects’ ability and experience on the use of a stereotyped notation in
software comprehension tasks.
Footnote 1: Post-docs and non-permanent staff working in research institutes.
Results indicate that the usage of stereotypes per se does not have a significant effect on program
comprehension. However, results also indicate an interaction between the usage of stereotypes and subjects’
ability and experience: stereotypes are more useful for low-ability and low-experience subjects and, when
they are available, they help to reduce the gap between these subjects and high-ability or
high-experience ones.
The paper is organized as follows: Section II provides an overview of graphical Web modeling techniques, focusing in particular on the use of UML stereotypes. Section III describes the design of the series
of controlled experiments we conducted. Results are presented in Section IV, while Section V discusses
the threats to validity, and Section VI summarizes lessons learned and the empirical evidence
collected. Conclusions and future work are described in Section VIII, after comparing the present study
with related work (Section VII).
II. BACKGROUND ON UML-BASED WEB MODELING TECHNIQUES
Web applications, like traditional software systems, are often represented by a set of models, such
as: analysis models, design models, implementation models, and deployment models [2]. Models help
designers to better understand the system, by abstracting away some of the details and by identifying the
artifacts involved in the development. Moreover, during comprehension and maintenance tasks, design
models are used to identify the portion of a system that is impacted by a change request.
When Web applications are modeled using traditional modeling languages, such as UML, the resulting models lack some information about the Web application structure and behavior [5], for instance
information related to:
• the hypertext navigational structure;
• the way data is submitted by means of forms and processed by server-side scripting;
• the way client-side HTML pages are dynamically built by server-side pages; or
• the mapping between the application data model and the Web pages content.
To this aim, during recent years, several methodologies have been proposed (e.g., [2], [6], [7]). Each
one emphasizes some particular aspects of a Web application, such as, for example, navigation structure,
content, security, data structure and presentation.
Among the most widely known design methodologies [2], [6], [7] proposed in support of Web development, those conceived for the specification of the application structure and navigational model (for example Conallen’s Web Application Extension (WAE) [2] and the Web Site Design Method (WSDM) [7]) are
closer to the implementation. Among the other notations that support Web engineering from modeling to
implementation, we can mention WebML [6] and UWE [8]. The Web Modelling Language (WebML) [6]
is a visual notation used to specify the content, composition, and navigation features of Web applications.
The UML-based Web Engineering (UWE) [8] methodology extends UML and provides an iterative and
incremental approach for the development of Web applications.
A. Web Application Extension
In this work, we focus on WAE [2]. Unlike the other notations mentioned above, WAE simply
extends UML to make it applicable to modeling Web applications and requires, as will be recalled in
this paper, relatively quick training to make a developer who is knowledgeable of UML proficient with
the notation. This section recalls how the static part of a Web application can be represented using a
UML class diagram and describes how WAE, the UML extension chosen in our experiment, can be used to
better capture the peculiar concerns of Web applications.
UML [1] is today recognized as the de-facto standard modeling language for software systems. It can
be used to represent several aspects of a software system (i.e., static view, interaction view, and behavioral
view) but also to represent the design of a Web application. In a standard UML class diagram, i.e., without
stereotypes, usually:
• Web pages are mapped to classes;
• relationships between pages are mapped by means of the use relationship;
• page scripts and variables are mapped to class operations and attributes respectively.
Fig. 1. Standard UML class diagram of WfMS.
As noted in [5], following this mapping, a problem arises when we consider that a Web page may
contain a set of scripts executed on the server (e.g., CGI, Java Server Pages—JSP, or Servlets) and a
completely different set of scripts executed on the client (e.g., Javascript). Considering a Web page class,
no distinction can be found between methods executed on the server and on the client. Thus a simple
mapping of Web pages to UML classes does not help designers and programmers in the comprehension of
the system. Moreover, standard class diagrams explicitly support neither the representation of
hyperlinks between HTML pages nor that of how variables are submitted through a form to a
server-side application.
Fig. 1 shows the model of WfMS, one of the two Web applications used in the experiment, as modeled
in standard UML. The considered Java Web application implements a workflow management system
that permits the definition of processes and their enactment. The diagram models servlets and JSP pages
composing the application, use relationships between them, together with other Java classes—e.g., the
HttpSession—directly involved in the Web application view (the application has been designed following
a Model-View-Controller—MVC pattern).
UML is typically customized by using profiles. Essentially, a profile defines a number of stereotype
classes that can extend one or more UML metaclasses with additional properties and relationships. A
profile consists of the following elements:
1) stereotypes, that permit the definition of new UML elements, extending the existing ones, providing
them a specific meaning and specific properties, often related to a particular problem domain or
otherwise specialized usage. Stereotypes can either be represented by a name enclosed within
guillemets (e.g., <<Server Page>>), or by means of adornments or icons;
2) tagged values, i.e., key-value pairs, used to extend model properties and to assign values to model
elements;
3) constraints, used to refine the semantics of UML models, specifying conditions to which model
elements must conform.
To avoid ambiguity and to better represent the relationships among Web pages, Conallen [2] proposed
a profile, called WAE, for the design, construction and maintenance of Web applications. With respect to
standard UML, the most important novelty [5] is that the server-side aspects of a Web page are modeled
with classes, stereotyped as <<Server Page>> and the client-side aspects with other classes, stereotyped
as <<Client Page>>. A client page is an HTML document that includes content and presentation
elements, plus client-side scripts. The two class types—server and client—are related through a directional
relationship.
The relationship between a server page and the corresponding client pages is stereotyped as a <<Build>>
association. The usage of these stereotypes facilitates the modeling of a page’s scripts and relationships.
Fig. 2. Conallen’s WAE class diagram of WfMS.
The <<Server Page>> class operations are functions in the server-side scripts of a page, while the
<<Client Page>> class operations are functions visible on the client-side. By separating the server-side
and client-side aspects of a page into different classes, it is possible to highlight the relationships between
pages and other classes of the system [5]. Server pages are modeled with relationships to server-side
resources—e.g., databases, Java components, etc.—while client pages are modeled with relationships to
client side resources, e.g., Java Applets, ActiveX and Javascript functions.
Fig. 2 shows the view of WfMS as expressed by WAE. An example of the extra information provided by WAE, compared to what is represented in standard UML class diagrams (see Fig. 1), is page
Main, represented in the model as two classes Main.jsp (server page) and Main.html (client page)
connected with the association <<Build>>. Hyperlinks between Web pages represent a navigational
path and are expressed in the model with a <<Link>> stereotyped association (see for example pages
Process.html and Main.jsp in Fig. 2). Server page stereotypes can be represented visually by means
of a gear, while client pages by means of an icon representing a browser. The main data entry mechanism
for Web pages is the Form. Forms are represented as classes stereotyped as <<Form>> (represented
visually as a page in a browser containing input fields) and connected to the enclosing page through a
composition (part-of) relationship. In Fig. 2, page Main.html contains two forms: form_3 submits
(stereotype <<Submit>>) the parameter process to the server page Process.jsp, while form_2
submits the parameter workItem to the server page Start.jsp. As can be seen from Fig. 1, the
standard UML diagram does not provide any information about Web forms and their related actions, nor
about hyperlinks between pages. In this case, the only way a developer can get this information is by
browsing the source code.
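The extra information carried by the WAE diagram can be illustrated with a minimal sketch that encodes the Main page example as stereotyped elements and relations. The `Model` class, its method names, and the `Contains` label (standing in for the composition relationship, which is not a WAE stereotype) are assumptions made for this illustration and are not part of WAE or of any tool used in the study.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    name: str
    stereotype: str  # e.g. "Server Page", "Client Page", "Form"


@dataclass
class Model:
    """Toy container for stereotyped classes and stereotyped relations."""
    elements: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (source, stereotype, target)

    def add(self, name, stereotype):
        self.elements[name] = Element(name, stereotype)

    def relate(self, source, stereotype, target):
        self.relations.append((source, stereotype, target))

    def targets(self, source, stereotype):
        return [t for s, st, t in self.relations
                if s == source and st == stereotype]


# Fragment of the WfMS diagram described in the text (page Main and its forms).
wae = Model()
wae.add("Main.jsp", "Server Page")
wae.add("Main.html", "Client Page")
wae.add("form_3", "Form")
wae.add("form_2", "Form")
wae.add("Process.jsp", "Server Page")
wae.add("Start.jsp", "Server Page")
wae.relate("Main.jsp", "Build", "Main.html")
wae.relate("Main.html", "Contains", "form_3")  # composition (part-of), our label
wae.relate("Main.html", "Contains", "form_2")
wae.relate("form_3", "Submit", "Process.jsp")
wae.relate("form_2", "Submit", "Start.jsp")


def submit_targets(model, client_page):
    """Server pages that receive data from the forms of a client page."""
    return [server
            for form in model.targets(client_page, "Contains")
            for server in model.targets(form, "Submit")]


print(submit_targets(wae, "Main.html"))  # ['Process.jsp', 'Start.jsp']
```

With the standard UML diagram only the use relationships among Main, Process and Start would be recorded, so a query of this kind could not be answered from the model alone.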
The book by Conallen [2] provides further details about the WAE methodology and the related
notation.
III. EXPERIMENTATION DEFINITION, DESIGN AND PLANNING
This section provides the definition—summarized in Table I—design and planning of the experiment,
structured according to the guidelines by Wohlin et al. [9] and Juristo and Moreno [10].
The goal of the study is to analyze the use of stereotyped UML diagrams, with the purpose of evaluating
their usefulness in Web application comprehension for different categories of users. The quality focus is to
ensure high comprehensibility, while the perspective is both that of Researchers, evaluating how
effective the stereotyped diagrams are during maintenance for different categories of users, and that
of Project managers, evaluating the possibility of adopting the Web modeling technique WAE in their
organization, depending on the skills of the involved developers. The context of the experiment
consists of two Web applications
(objects) and of four groups of subjects: research associates, students from an undergraduate course, and
students from two graduate courses.
TABLE I
OVERVIEW OF THE EXPERIMENTATION.
Goal                 Analyze the support given by Conallen’s stereotypes on comprehension tasks
                     and the influence of subjects’ ability and experience.
Context              Diagrams (std UML and Conallen’s)
Null hypothesis      No effect on comprehension.
Main factor          Design notation used: std UML (UML) vs. stereotypes (Conallen).
Other factors        Subjects’ Experience and Ability, Systems, Labs, and Questions.
Dependent variables  Comprehension level.
TABLE II
CHARACTERISTICS (A) OF THE SYSTEMS UNDER STUDY AND (B) OF THE EXPERIMENTAL SUBJECTS.
         Claros           WfMS
         Files   LOC      Files   LOC
Java     44      6288     85      2378
JSP      34      1996     7       431
Total    78      8284     92      2809
A. Context description
The experimentation objects are two Java-based Web applications, Claros (footnote 2) and WfMS
(footnote 3) [11]. Both are small/medium size open source applications (see Table II) based on the
Servlet/JSP technology. They are small enough to fit the time constraints of the experimental sessions.
Although commercial or institutional Web applications may be larger, the application domain of the
selected systems is fairly typical of existing Web applications. Claros is an on-line Web mail
management application. WfMS is a Workflow management system that permits the definition of processes
and their enactment. While WfMS is larger in terms of classes, Claros has a larger design view and a
more complex navigational model. Both applications were designed using the Model-View-Controller (MVC)
pattern and the View, on which stereotypes were used, comprises 19 UML classes (38 in Conallen’s
diagram) for Claros and 13 (24 in Conallen’s diagram) for WfMS. Diagrams of Claros and WfMS (limited to
the View) are shown in a technical report [12].
As is often the case, thorough UML documentation was not available; therefore the diagrams of Claros
and WfMS were reverse engineered from the code by the authors for the purpose of the experiment. The
diagrams were then adjusted so as to reproduce a situation where diagrams are aligned with the
code and at the same time represent a meaningful and compact abstraction of the implementation. Finally,
for the two systems, both the source code and UML class diagrams (drawn with and without Conallen’s
stereotypes [2]) were available and used in the experiment.
Footnote 2: http://www.claros.org
Footnote 3: http://www.pearsoned.co.uk/HigherEducation/Booksby/BrugaliTorchiano/
The study was executed twice at the University of Trento (Exp 1 and Exp 2), and twice at the University
of Sannio (Exp 3 and Exp 4), every time with different subjects. The subjects participating in the two
replications in Trento were 13 Master students (2nd year M.Sc.) attending the Laboratory of Software
Analysis and Testing (Exp 1), and 35 Bachelor students (2nd year B.Sc.) attending the Laboratory of
Software Engineering (Exp 2). At the University of Sannio, there were 18 Master students (1st year
M.Sc.) attending the course on Development of Web-based Systems (Exp 3), and 8 research associates
(Exp 4), with 4-5 years of experience, mainly working on research projects with industry. Bachelor students
had previously attended Programming and Software Engineering courses (which is of course also true of
Master students). All subjects had a good knowledge of UML and Java.
B. Hypotheses formulation
The objective of our study is twofold: first we want to investigate the effect of Conallen’s UML WAE
stereotypes on Web application comprehension, second we aim at finding out which category of maintainer
(high/low ability and high/low experience) will draw more benefits from their use.
H0: When performing a comprehension task, the use of WAE stereotyped class diagrams (versus
non-stereotyped class diagrams) does not significantly improve the comprehension level achieved
by maintainers.
Ha: When performing a comprehension task, the use of WAE stereotyped class diagrams (versus
non-stereotyped class diagrams) significantly improves the comprehension level achieved by
maintainers.
As far as the effect of UML stereotypes on software comprehension is concerned, we are interested in
investigating whether such additional detail increases the comprehension level. Therefore, our null
hypothesis is one-tailed, as in studies such as the one by Briand et al. [13], where the effect of
additional details in UML models on the comprehension level is investigated.
Moreover, we want to investigate whether the experience and ability of subjects are co-factors with a
significant effect. In particular, we are interested in studying how these two factors (ability and
experience) interact with the presence of stereotypes, possibly affecting the comprehension level.
Therefore we formulate the following additional hypotheses:
H0e: Subjects’ experience does not significantly interact with the use of stereotyped class diagrams
to influence the comprehension level achieved by maintainers.
H0a: Subjects’ ability does not significantly interact with the use of stereotyped class diagrams to
influence the comprehension level achieved by maintainers.
H0ea: Subjects’ ability and experience do not significantly interact with the use of stereotyped class
diagrams to influence the comprehension level achieved by maintainers.
The alternative hypotheses corresponding to these latter three null ones can be easily deduced from the
null ones along the line of the first two presented in this section.
It is also important to collect further, more subjective, information from the subjects to provide a
basis for the explanation of the observed phenomena. Such information can be collected by means of a
post-experiment questionnaire. It is important to remark that information collected by means of survey
questionnaires represents only subjective feedback provided by subjects, and cannot be considered a
replacement for more objective measurements such as those used to assess the comprehension level during
the experimental task.
TABLE III
EXPERIMENTAL DESIGN.
        Group 1          Group 2          Group 3          Group 4
Lab 1   Claros-Conallen  Claros-UML       WfMS-Conallen    WfMS-UML
Lab 2   WfMS-UML         WfMS-Conallen    Claros-UML       Claros-Conallen
C. Experimental design, material and procedure
We adopted a counter-balanced design using four groups; each group performed two assignments, one
in each experimental session (Lab 1 and Lab 2), on two objects (Claros and WfMS) with two treatments
(Conallen and UML), according to the schema summarized in Table III.
The selected design ensures that each subject worked on different Systems in the two Labs, receiving
each time a different Method treatment. In addition, the design allows us to consider different combinations
of System and Method treatment in different order across Labs, thus balancing the effect of learning.
Moreover, the chosen design allows the use of statistical tests (e.g., Two-Way and Three-Way ANOVA)
for studying the effect of multiple factors.
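The balancing properties claimed for the design can be checked mechanically. The following sketch encodes the schema of Table III and verifies that every group changes both System and Method between the two Labs, and that no System/Method pair is repeated within a Lab. The function name and data layout are illustrative assumptions, not artifacts of the study.

```python
from collections import Counter

# Table III: group -> [(System, Method) in Lab 1, (System, Method) in Lab 2]
DESIGN = {
    "Group 1": [("Claros", "Conallen"), ("WfMS", "UML")],
    "Group 2": [("Claros", "UML"), ("WfMS", "Conallen")],
    "Group 3": [("WfMS", "Conallen"), ("Claros", "UML")],
    "Group 4": [("WfMS", "UML"), ("Claros", "Conallen")],
}


def check_counterbalanced(design):
    """Raise ValueError if the two-session design is not counter-balanced."""
    for group, (lab1, lab2) in design.items():
        # Each subject must see a different System and a different Method per Lab.
        if lab1[0] == lab2[0]:
            raise ValueError(f"{group}: same System in both Labs")
        if lab1[1] == lab2[1]:
            raise ValueError(f"{group}: same Method in both Labs")
    for lab in (0, 1):
        combos = Counter(sessions[lab] for sessions in design.values())
        # No System/Method pair may be repeated within a Lab.
        if any(count != 1 for count in combos.values()):
            raise ValueError(f"Lab {lab + 1}: a System/Method pair is repeated")
    return True


print(check_counterbalanced(DESIGN))  # True
```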
Before the experiments, subjects were trained on Conallen’s notation as well as on all the technologies
used in the target applications (e.g., Servlets/JSP). The introduction to the notation required a
two-hour lecture, followed by a further two-hour laboratory, in which subjects had to understand a small
glossary system reported in the appendix of Conallen’s book [2], and another simple Web application
implementing a shopping cart. All subjects were able to understand the questions we asked about the
simple applications without major problems. Also, the experiments were preceded by a detailed
presentation of instructions related to the tasks to be performed and the goal of the experiment, but
not the experimental hypotheses. Subjects were requested to work individually, and this was enforced by
two of the authors checking the subjects’ activity in the lab. Also, we advised the subjects to keep
track of the time needed for various activities (looking at diagrams, browsing the source code), as
they would be asked to provide us with this information at the end of the lab.
To perform the experimental tasks, subjects received the following material:
• a short, textual description of the system to be understood;
• the system source code, and an Integrated Development Environment to browse it (Eclipse-JDT);
• UML class diagrams of the system, with or without Conallen’s stereotypes;
• an installed version of the Web application;
• a questionnaire containing the comprehension questions;
• a post-experiment survey questionnaire.
According to the design, each subject was involved in two experimental sessions (laboratories), each
lasting approximately 2 hours. Each laboratory consisted of a comprehension task on the assigned System
(Claros or WfMS) documented either by Conallen or standard UML diagrams. The comprehension task
was carried out by answering 12 open questions (the same number of questions as in [14]) on the assigned
System. Three of the 12 questions were the same for both systems, as they were related to changing
style, architecture, and input mechanism, and the other 9 were specific to each system (although
conceived so as to avoid questions that were too simple or too complex in one system or the other).
Questions for the WfMS application are shown in Table IV, while both questionnaires are accessible in a
longer report [12]. Most of the questions refer to realistic program understanding scenarios. In fact,
we inspected a few randomly selected change requests and bug reports from open-source projects hosted
on SourceForge (footnote 4), then we selected those compatible with the object systems, and eventually
we adjusted them to be applicable to the same objects. For example, change request 1117 of the Sequoia
(footnote 5) project, posted on the Jira bug tracking system (footnote 6), asks to “Add a logger to
profile request execution time”. This is similar to question q9 in Table IV. A table with other
examples of change requests that inspired our questions is reported in the technical report [12].
To answer the questions, subjects had the possibility to look at the diagrams, to run the Web applications
and to browse the source code. After each laboratory session, subjects were asked to fill in a
post-experiment survey questionnaire regarding objectives clarity, artifacts comprehension, and cognitive effects.
Footnote 4: http://sourceforge.net/
Footnote 5: http://sequoia.continuent.org/
Footnote 6: http://forge.continuent.org/jira/browse/SEQUOIA-1117
TABLE IV
COMPREHENSION QUESTIONNAIRE FOR WfMS.
ID   Question
q1   Suppose that you have to set the background color of each Web page using CSS (Cascading Style Sheets). Which classes/pages does this change impact?
q2   Suppose that you have to substitute, in the entire application, the form-based communication mechanism between pages with another mechanism (i.e. Applet, ActiveX, ...). Which classes/pages does this change impact?
q3   Does the application conform to the Model-View-Controller (MVC) pattern? If yes which class (or classes) implements the controller component?
q4   The description of a process is made up of three main types of elements (activity, participant, and transition) and stored in an XPDL file. Which are the process modeling classes (i.e. the classes used to represent the processes in memory)?
q5   Which classes are initialized when the JSP container starts and are destroyed when it shuts down? These classes keep the long lived information and are used by almost all Web pages.
q6   Suppose that you have to divide the main.jsp page in three different pages. One containing the workList (i.e. the work items the user have to complete), one containing Processes (the processes the user started) and the last containing the Process Catalog. Which classes/pages does this change impact?
q7   Suppose that you have to add the management of new rules for the sequencing of activities. They are: split of thread of control among two branches (that can be executed in parallel) and the join of multiple branches. Which classes/pages does this change impact?
q8   Which is the class that manages the work list of pending activities? (i.e the list of activities that the user is required to perform at a given instant)
q9   Suppose that you have to add a logger able to store in a file the sequence of activities performed from the users. This sequence should be visible from each page of the View. Which classes/pages does this change impact?
q10  Suppose that you have to introduce the capability of using complex data in the processes (for example files); this extension must consider two aspects: the definition of complex data types and the presentation and editing of such data. Which classes/pages does this change impact?
q11  Which are the classes/pages in the session scope?
q12  Suppose that you have to substitute the way how the session is managed. You have to handle the session including all state values as parameters in every URL of the system. Which classes/pages does this change impact?
D. Variable selection
The main factor or treatment (hereafter referred to as Method) of this experimentation is the use of UML
stereotypes, in particular of Conallen’s stereotypes for Web application modeling [2]. Since this notation
extends UML, the notation used for comparison (control group) is standard UML, with no Web-specific
stereotype. Thus, the Method factor can assume one of the values in {UML, Conallen}.
Other than the Method, the experimental hypotheses are defined in terms of two other factors, Experience
and Ability. Regarding the Experience, Exp 1 and Exp 3 subjects were classified as Graduate (G), Exp 2
subjects as Undergraduate (U), and Exp 4 subjects as Research Associates (RA).
A quantitative assessment of the Ability level of each involved subject was obtained from the average
grades obtained in previous exams related to software engineering or Web technologies. Subjects with
average grades below 25 (footnote 7) were classified as low (l) Ability, and the remaining ones as high
(h) Ability. Unlike other countries, where master students are often selected among the best
bachelor students, in Italy anybody holding a bachelor's degree can sign up for a master's degree, and
most students do so due to the limited availability of good job opportunities for bachelors.
For this reason, it is possible to find high and low Ability subjects both among undergraduate and graduate
students. The analysis by Ability, instead, did not include research associates, since the information
required to compute it was not available; moreover, in this particular case Ability may be confounded
with experience. For this reason, research associates were only considered in the analysis by Experience.
Finally, given the design selected for the experiment (see Section III-C), another two experimental
factors must be considered: the object used in a task, represented by System (Claros or WfMS), and the
experimental session in which a task was performed, i.e. Lab (Lab 1 or Lab 2).
The main outcome observed in the study is the comprehension level. To evaluate it, we asked the
subjects to answer a questionnaire and we assessed the answers using an Information Retrieval approach.
Since the answer to each question consists of a list of system elements, i.e. classes, JSPs, HTML pages,
we can count:
• $A_{s,i}$: the set of elements mentioned in the answer to question $i$ by subject $s$; and
• $C_i$: the correct set of elements expected for question $i$.

Footnote 7: In Italy exam grades are expressed as integer values in the interval [18,30], where 18 is the lowest grade and 30 the highest.
Based on the above definition, we computed precision and recall for each answer [15]. Precision
measures the fraction of items in the answer that are correct:
\[
\mathit{precision}_{s,i} = \frac{|A_{s,i} \cap C_i|}{|A_{s,i}|}
\]
Recall measures the fraction of expected items that are in the answer:
\[
\mathit{recall}_{s,i} = \frac{|A_{s,i} \cap C_i|}{|C_i|}
\]
Since the two above metrics measure two different concepts, it may be difficult to balance between
them. We used an aggregate measure, F-Measure [15], which is a standard combination of the two,
defined as their harmonic mean:
\[
\text{F-Measure}_{s,i} = \frac{2 \cdot \mathit{precision}_{s,i} \cdot \mathit{recall}_{s,i}}{\mathit{precision}_{s,i} + \mathit{recall}_{s,i}}
\]
For example, consider question q2 of the WfMS experiment: “Suppose that you have to substitute, in the
entire application, the form-based communication mechanism between pages with another mechanism (i.e.
Applet, ActiveX, ...).” The correct answer is:
\[
C_2 \equiv \{\text{main.jsp}, \text{login.jsp}, \text{start.jsp}\}
\]
Suppose that subject x provides the answer:
\[
A_{x,2} \equiv \{\text{main.jsp}, \text{login.jsp}, \text{complete.jsp}, \text{instantiate.jsp}\}
\]
In this case:
\[
\mathit{precision}_{x,2} = \frac{|\{\text{main.jsp}, \text{login.jsp}\}|}{|\{\text{main.jsp}, \text{login.jsp}, \text{complete.jsp}, \text{instantiate.jsp}\}|} = \frac{2}{4} = 0.50
\]
\[
\mathit{recall}_{x,2} = \frac{|\{\text{main.jsp}, \text{login.jsp}\}|}{|\{\text{main.jsp}, \text{login.jsp}, \text{start.jsp}\}|} = \frac{2}{3} = 0.67
\]
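The worked example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `precision_recall_f` is ours and is not part of the experimental package, and returning 0 when a denominator would be empty is an assumption that the paper does not discuss.

```python
def precision_recall_f(answer, correct):
    """Return (precision, recall, F-Measure) for one answer to one question."""
    answer, correct = set(answer), set(correct)
    hits = len(answer & correct)
    # Assumption: an empty answer or an empty expected set yields 0 instead of an error.
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Question q2 of the WfMS experiment, subject x (sets taken from the example in the text).
C2 = {"main.jsp", "login.jsp", "start.jsp"}
Ax2 = {"main.jsp", "login.jsp", "complete.jsp", "instantiate.jsp"}

p, r, f = precision_recall_f(Ax2, C2)
print(f"precision={p:.2f} recall={r:.2f} F-Measure={f:.2f}")
```

For this answer the F-Measure evaluates to 4/7, i.e., about 0.57, which lies between the precision (0.50) and the recall (0.67).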
We made sure, during the experiments, that subjects provided answers in a form such that the above
analysis could be carried out without any ambiguity. We carefully scrutinized the answers after each lab,
and in all cases we found that answers were provided in the appropriate form. To avoid mistakes in the
analyses, two authors compared the correct answers with the provided ones, cross-checking their results.
The correct answers were produced by three of the authors, again cross-checking their answers. Note that
one of the authors was also an author of WfMS, and another actually performed the tasks mentioned on
the comprehension questionnaire on Claros to verify the impact of the requested changes.
To obtain a single measure representing the comprehension level achieved by a subject for an object
application we use the averaged F-Measure over all the questions. However, it has to be considered that
in each experimental session (i) subjects have to answer questions having a variable level of difficulty—
although we tried to limit big discrepancies—and (ii) subjects could exhibit some learning as they proceed
in answering questions. For this reason, also the Question has to be considered an experiment co-factor
and an appropriate analysis of its influence, as described in Section III-E, is needed. Table V summarizes
dependent and independent variables of our experiments.
This study did not consider the time spent on answering questions as a dependent variable. Instead,
we controlled this variable by having all lab sessions last a fixed amount of time (2 hours), and we
observed that all subjects worked for almost the full duration of the laboratories.
As mentioned in Section III-C, after the experiment we collected subjects’ feedback by means of a
survey questionnaire. The questionnaire (shown in Table VI) consists of 7 common questions plus 2
specific questions (Q8 and Q9) asked only to subjects using Conallen diagrams. Answers to Q1-Q5 and
to Q8, Q9 are on a Likert scale [16] from 1 (strongly agree) to 5 (strongly disagree). Answers to Q6 and
Q7 are based on a 5 points ordinal scale: {A, B, C, D, E} (A. <20%; B. ≥20% and <40%; C. ≥40%
and <60%; D. ≥60% and <80%; E. ≥80%).

TABLE V
EXPERIMENT VARIABLES.

         Variable                Type
Factors  Method                  Nominal: { Conallen, UML }
         Experience              Nominal: { Undergraduate, Graduate, Research Associate }
         Ability                 Nominal: { low, high }
Output   Precision               Interval [0..1]
         Recall                  Interval [0..1]
         F-Measure               Interval [0..1]
Context  System                  Nominal: { Claros, WfMS }
         Lab                     Ordinal: { Lab1, Lab2 }
         Comprehension question  Ordinal: { q1, ..., q12 }

As was done in previous studies (e.g., [13], [17]), the survey questionnaire deals with three main issues:
• Objectives clarity: were the task and lab objectives clear and was enough time given? (Questions
Q1, Q2, Q3)
• Artifacts comprehension: were the subjects able to make use of the provided artifacts (i.e., source
code and diagrams) in order to extract knowledge useful to perform the tasks? (Questions Q4, Q5,
Q8)
• Cognitive effects: is there any noticeable effect of the main factor on the cognitive behavior of
the subjects? (Questions Q6, Q7, Q9)
E. Analysis procedure
In all our statistical tests we decided to accept a probability of 5% of committing a Type I error, i.e., of
rejecting the null hypothesis when it is actually true. Thus our procedure is to reject the null hypothesis
when the appropriate statistical test provides a p-value less than the standard α-level of 5%.
To compare samples of two populations we first check for the applicability of the t-test (mainly normality
of the distribution, using the Shapiro-Wilk test) and, in case of negative results, we use a non-parametric
test. In particular, we perform paired tests (Wilcoxon) when applicable, i.e., with subjects who took part
in both labs of each experiment, and non-paired tests (Mann-Whitney) on all samples. We also repeat
the analyses performed by using non-parametric tests (Mann-Whitney and Wilcoxon) with the equivalent
parametric ones (non-paired and paired t-test).

TABLE VI
POST-EXPERIMENT SURVEY QUESTIONNAIRE.

Issue                    ID  Question
Objectives clarity       Q1  I had enough time to perform the lab tasks (1–5).
                         Q2  The objectives of the lab were perfectly clear to me (1–5).
                         Q3  The questions were clear to me (1–5).
Artifacts comprehension  Q4  I experienced no difficulty in reading the diagrams (1–5).
                         Q5  I experienced no difficulty in reading the source code (1–5).
                         Q8  I understood the meaning of Conallen’s stereotypes (1–5).
Cognitive effects        Q6  How much time (as a percentage) did you spend looking at class diagrams?
                         Q7  How much time (as a percentage) did you spend for source code browsing?
                         Q9  I found Conallen’s stereotyped diagrams useful (1–5).
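To make the unpaired comparison concrete, the following standard-library sketch computes the Mann-Whitney U statistic and a two-sided p-value. It rests on several assumptions that are ours: the p-value uses the plain normal approximation without tie or continuity correction (a statistical package would normally refine this, and the paper's main test is one-tailed), the function names are hypothetical, and the F-Measure values are invented for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """U statistic of x against y: each pair with x > y counts 1, each tie counts 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # accepted probability of a Type I error

# Hypothetical per-subject F-Measures, NOT data from the experiments.
uml = [0.50, 0.55, 0.60, 0.62, 0.58, 0.66]
conallen = [0.61, 0.70, 0.72, 0.68, 0.75, 0.64]

p = mann_whitney_p(conallen, uml)
print(f"U={mann_whitney_u(conallen, uml)}, p={p:.3f}, reject H0: {p < ALPHA}")
```

With samples this small an exact test would be preferable; the approximation is used here only to keep the sketch self-contained.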
Other than testing the hypotheses formulated in Section III-B, it is of practical interest to estimate the
magnitude of performance difference achieved with different treatments. To this aim, we use the Cohen
d effect size, which indicates the magnitude of a main factor treatment effect on the dependent variables
(the effect size is considered small for 0.2 ≤ d < 0.5, medium for 0.5 ≤ d < 0.8 and large for d ≥ 0.8).
For independent samples (to be used in the context of unpaired analyses), it is defined as the difference
between the means ($M_1$ and $M_2$), divided by the pooled standard deviation
($\sigma = \sqrt{(\sigma_1^2 + \sigma_2^2)/2}$) of both groups:
\[
d = \frac{M_1 - M_2}{\sigma}
\]
while for dependent samples (to be used in the context of paired analyses) it is defined as the difference
between the means ($M_1$ and $M_2$), divided by the standard deviation of the (paired) differences between
samples ($\sigma_D$):
\[
d = \frac{M_1 - M_2}{\sigma_D}
\]
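Both definitions translate directly into code. The sketch below is illustrative only: the function names are ours, the sample standard deviation (n − 1 denominator) is an assumption because the paper does not say which estimator was used, and the label "negligible" for |d| < 0.2 simply names the range below the "small" threshold.

```python
from math import sqrt
from statistics import mean, stdev


def cohen_d_independent(sample1, sample2):
    """Cohen d for unpaired samples: mean difference over the pooled standard deviation."""
    s1, s2 = stdev(sample1), stdev(sample2)  # assumption: sample (n - 1) standard deviation
    pooled = sqrt((s1 ** 2 + s2 ** 2) / 2)
    return (mean(sample1) - mean(sample2)) / pooled


def cohen_d_paired(sample1, sample2):
    """Cohen d for paired samples: mean difference over the std. dev. of the differences."""
    diffs = [a - b for a, b in zip(sample1, sample2)]
    return (mean(sample1) - mean(sample2)) / stdev(diffs)


def magnitude(d):
    """Classify |d| with the thresholds quoted in the text."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"


# Hypothetical scores, NOT data from the experiments.
print(cohen_d_independent([2, 4, 6], [1, 3, 5]))
print(magnitude(0.56), magnitude(-0.29))
```

The sign of d follows the order of the arguments, so a negative value means that the second sample has the higher mean.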
To analyze the interaction of two or more factors we use two-way Analysis of Variance (ANOVA). We
chose ANOVA because, unlike non-parametric alternatives such as the Friedman test (which could have
been considered in this case), it allows testing for the presence of interactions between
factors, which we also represent by means of interaction plots. ANOVA is quite robust to deviations
from normality, although further checks are needed to assess the reliability of its results, such as
inspecting the skewness of distributions or the presence of trends in residuals. Also, its results can be
verified by using two-means non-parametric tests (Mann-Whitney, as done for the main factor analysis) to
compare data samples belonging to different combinations of factors. Clearly, in this case multiple tests
are done to verify a single hypothesis; thus, to avoid hypothesis fishing, the Bonferroni correction is
needed to interpret the statistical significance of results.
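The correction amounts to comparing each p-value with α divided by the number of tests performed. A minimal sketch follows; the function name and the p-values are hypothetical and serve only as an illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant if it is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Three hypothetical tests verifying a single hypothesis: the threshold becomes 0.05 / 3.
print(bonferroni_significant([0.01, 0.04, 0.20]))
```

Note that a p-value of 0.04, significant for a single test at the 5% level, no longer counts as significant once three tests are performed.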
When analyzing the interaction of the Comprehension question factor, since multiple measures (corresponding to different questions) are performed in each lab, we use a repeated measures ANOVA to
separate the analysis of the within subjects effect (a potential effect of the question) from the between
subjects effect (an expected variance between different subjects).
As far as the validation issues are concerned, we analyze the questionnaire results and, since all questions
are measured on a five-point ordinal scale, we adopt non-parametric tests.
We evaluate objectives clarity by verifying that the answers to questions Q1 through Q3 are either
Strongly agree (1) or Agree (2). Artifact comprehension is assessed by looking at questions Q4, Q5,
and Q8 (only for subjects receiving stereotyped diagrams). We deem an artifact comprehensible by the
subject if the answers are either Strongly agree (1) or Agree (2). For both validation issues, since values
are derived from an ordinal scale we test medians, using a one-tailed Mann-Whitney test for the null
hypothesis $\widetilde{Qx} \geq 3$, where 3 corresponds to “Undecided”, and $\widetilde{Qx}$ is the median
for question Qx. For questions Q1-Q5, answers of subjects receiving standard UML diagrams were compared
with answers of subjects receiving Conallen’s diagrams. In this case a two-tailed Mann-Whitney test was
used for the null hypothesis $\widetilde{Q}_{Conallen} = \widetilde{Q}_{UML}$.
The cognitive effects we may observe correspond to different fractions of time devoted to code vs.
diagram browsing (i.e., the answers to questions Q6 and Q7). Since the values are ordinal but not numeric,
to analyze any difference here we build a contingency table and then apply a χ² test. In addition, to treat
these data in a more intuitive way, we also convert each value into the mid-point of the corresponding
interval (e.g., B, which corresponds to the interval ≥20% and <40%, can be converted into 30%). Such
an approach is often applied to convert categorical data into pseudo-interval data [18].
The percentages alone convey limited information. More interesting information can be gained by
comparing pairs of values. Therefore we compute a derived measure as the ratio of two converted values.
Such a number can be interpreted as the odds ratio of looking at diagrams vs. browsing code. The odds
ratio is a measure of effect size that can be used for dichotomous categorical data. An odds [19] indicates
how likely it is that an event will occur as opposed to it not occurring. Odds ratio is defined as the ratio
of the odds of an event occurring in one group (e.g., experimental group) to the odds of it occurring in
another group (e.g., control group), or to a sample-based estimate of that ratio. If the probabilities of the
event in each of the groups are indicated as p (experimental group) and q (control group), then the odds
ratio is defined as:
\[
OR = \frac{p/(1 - p)}{q/(1 - q)}
\]
An odds ratio of 1 indicates that the condition or event under study is equally likely in both groups.
An odds ratio greater than 1 indicates that the condition or event is more likely in the first group. Finally,
an odds ratio less than 1 indicates that the condition or event is less likely in the first group.
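The conversion and the ratios described above can be sketched as follows. The mid-points for A (10%) and E (90%) assume that these categories span 0-20% and 80-100%; the text only gives the mid-point for B explicitly. The function names are ours, and how per-subject values are aggregated into group-level odds is not specified in the text, so the sketch stops at the individual ratio.

```python
# Mid-points (in %) of the five answer intervals used for Q6 and Q7.
# Assumption: A covers [0, 20) and E covers [80, 100].
MIDPOINT = {"A": 10, "B": 30, "C": 50, "D": 70, "E": 90}


def diagram_vs_code_odds(q6_answer, q7_answer):
    """Ratio of time spent on diagrams (Q6) to time spent on code (Q7), from mid-points."""
    return MIDPOINT[q6_answer] / MIDPOINT[q7_answer]


def odds_ratio_from_probabilities(p, q):
    """OR from event probabilities p (experimental group) and q (control group)."""
    return (p / (1 - p)) / (q / (1 - q))


def odds_ratio_from_odds(odds_experimental, odds_control):
    """OR when the odds of each group are already available."""
    return odds_experimental / odds_control


# A subject answering D for diagrams and B for code: 70% vs. 30%.
print(diagram_vs_code_odds("D", "B"))

# Overall odds reported in Table XIX of the paper: Conallen 1.96, UML 0.73.
print(round(odds_ratio_from_odds(1.96, 0.73), 2))
```

Dividing the two overall odds reproduces the odds ratio of 2.68 that the paper reports for the whole data set.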
In addition, an important feature to be considered is the perceived usefulness of the WAE notation.
Question Q9 provides a measure for this. We compute a 95% confidence interval to estimate the magnitude
of usefulness.
IV. EXPERIMENTAL RESULTS
This section reports and analyzes results obtained from the series of experiments defined in Section III.
For replication purposes, the experimental package and raw data from the four experiments are available
for download (see footnote 8).

Fig. 3. Boxplots of F-Measure across the experiments (y-axis: F-Measure; x-axis: Method, Conallen vs. UML, for each of Experiments 1 to 4).
A. Analysis of main factor
First, we analyze whether the main factor (use of UML or Conallen’s diagrams) has a significant
effect on the achieved comprehension. Since the Shapiro-Wilk test indicates deviations from normality,
we perform this analysis using a non-parametric test, and then we repeat it using parametric tests as well.
Fig. 3 shows boxplots of the F-Measures (averaged across the 12 questions) without and with treatment,
i.e., using standard UML diagrams and Conallen’s diagrams respectively, for the four experiments. First,
we test H0 using paired, one-tailed tests for subjects who attended both laboratories of each experiment
(51 subjects in total). Table VII reports the results of the paired analysis: the mean and median
F-Measure difference, the Wilcoxon test p-value, the paired t-test p-value, and the
Cohen d effect size for dependent samples. For the overall data set and for all the experiments, differences
are not significant (except Exp 2, i.e., the undergraduate students), while the effect size is always small
(almost medium for Exp 2).

Footnote 8: http://www.rcost.unisannio.it/mdipenta/WAE-experiment-material.zip
TABLE VII
F-MEASURE PAIRED ANALYSIS RESULTS.

               Difference         Wilcoxon   t-test    Effect
Exp      N     mean     median    p-value    p-value   size
All      51     0.02     0.00     0.27       0.19       0.12
Exp 1    13    −0.00    −0.11     0.61       0.53      −0.02
Exp 2    20     0.08     0.05     0.04       0.03       0.45
Exp 3    10    −0.06    −0.09     0.88       0.80      −0.27
Exp 4     8     0.02     0.02     0.31       0.34       0.15
Then, we resort to an unpaired analysis to also take into consideration those subjects (24) who did not
participate in both labs. Results are shown in Table VIII, which reports the F-Measure mean, median,
and standard deviation for both treatments (UML and Conallen), the Mann-Whitney p-value, the unpaired
t-test p-value, and the Cohen d effect size for independent samples. Again, H0 can only be rejected for
Exp 2, where the effect size is medium and positive. The effect size is small in all the other cases (negative
for Exp 1 and Exp 3). It is not possible to reject H0 if the overall data set is considered.
TABLE VIII
F-MEASURE DESCRIPTIVE STATISTICS AND UNPAIRED ANALYSIS RESULTS.

         UML                           Conallen                      M-W       t-test    Effect
Exp      N    mean   median  σ         N    mean   median  σ         p-value   p-value   size
All      64   0.64   0.67    0.15      62   0.67   0.70    0.14      0.19      0.13       0.20
Exp 1    13   0.64   0.72    0.17      13   0.63   0.62    0.08      0.82      0.82      −0.03
Exp 2    28   0.58   0.57    0.15      27   0.67   0.73    0.16      0.01      0.01       0.56
Exp 3    15   0.71   0.74    0.12      14   0.67   0.69    0.16      0.76      0.76      −0.29
Exp 4     8   0.72   0.70    0.13       8   0.73   0.74    0.13      0.36      0.36       0.12
B. Separate analysis of Precision and Recall
In Section IV-A we have reported results considering the F-Measure as an aggregate measure of precision
and recall. This raises the question of whether the use of stereotypes could have different effects on
precision and recall, i.e., whether it could help developers to improve the accuracy in identifying the set
of artifacts related to a particular feature or impacted by a change (higher precision), or the completeness
of this set (higher recall).
However, our results indicate that the usage of Conallen’s stereotypes has a similar effect on precision
and recall, not only for what concerns the main factor, but also for the influence of co-factors. For this
reason, we omit separate results for precision and recall; the interested reader can find them in a
longer technical report [12].
C. Influence of Subjects’ Ability and Experience
In analyzing the effects of the co-factors (Experience and Ability), first of all we check whether they
have a direct effect on the comprehension level. Fig. 4 shows how the comprehension level varies as
a function of the subjects’ Experience: the results of all subjects (N=75) were analyzed (all laboratory
sessions, without distinguishing between UML and Conallen treatments). By observing the variation of the
F-Measure across different experience levels, we found an average 5% improvement from undergraduate
to graduate and from graduate to research associates.

Fig. 4. Boxplots of F-Measure for different Experience levels (x-axis: Experience, with levels U, G, RA; y-axis: F-Measure).
As far as hypotheses H0e, H0a, and H0ea are concerned, we perform two-way and three-way ANOVA
analyses on the overall data set (all four experiments for H0e, the first three only for H0a and H0ea).
Descriptive statistics of the F-Measure for different Experience levels are reported in Table IX. Results
indicate an increase (15%) of the mean F-Measure from undergraduate to graduate with UML, while it
is similar (even better for undergraduate) with Conallen. Instead, the gap between graduate and research
associates remains between 5% and 10% (it slightly increases with Conallen). The two-way ANOVA table
by Method & Experience is shown in Table X. We observe a significant effect of the Experience alone
(p-value=0.03, essentially confirming the above result), but no significant interaction with the Method
(p-value=0.13). The interaction can be represented graphically as shown in Fig. 5, where the “tangled”
situation can be interpreted as the absence of a clear interaction. Overall, we cannot reject H0e. In addition,
we performed an ANOVA analysis considering Graduate and Undergraduate students only. In this case
we obtained a p-value of 0.01, which means that when considering only students, the interaction is
significant [20]. Such an interaction can be observed in Fig. 5, by looking at the two bottom segments (G
and U lines). Also, we performed a Mann-Whitney test to compare the U and G subjects with UML and
Conallen and we found that with UML there is a significant difference (p-value=0.01) with a medium
effect size (d=0.65); when using Conallen there is also a significant difference, but in the opposite
direction (p-value=0.01), even though the effect size is negligible (d=−0.11). It should be noted that,
although we are performing multiple tests here, p-values can still be considered significant according to
the Bonferroni correction (see footnote 9). Also, it must be clear that the comparison between graduate and
undergraduate is only performed to analyze how Conallen’s stereotypes eliminate differences of performance
between them, while according to our ANOVA the hypothesis H0e on the whole data set cannot be rejected.
TABLE IX
F-MEASURE DESCRIPTIVE STATISTICS FOR DIFFERENT EXPERIENCE LEVELS.

              UML                           Conallen
Experience    N    mean   median  σ         N    mean   median  σ
U             28   0.58   0.57    0.15      27   0.67   0.73    0.16
G             28   0.68   0.73    0.15      27   0.65   0.63    0.13
RA             8   0.72   0.70    0.13       8   0.73   0.74    0.13
Footnote 9: Dividing the significance level α by the number of tests performed.

TABLE X
ANOVA OF F-MEASURE BY METHOD & EXPERIENCE.

                     Df    Sum Sq   Mean Sq   F value   Pr(>F)
Method                 1    0.03     0.03      1.31      0.25
Experience             2    0.14     0.07      3.47      0.034
Method:Experience      2    0.08     0.04      2.05      0.13
Residuals            120    2.48     0.02

Fig. 5. Interaction of Experience and Method (x-axis: Method, Conallen vs. UML; y-axis: mean of F-Measure; one line per Experience level: RA, G, U).

Descriptive statistics of the F-Measure for different Ability levels are reported in Table XI. As shown,
with UML high Ability subjects perform, on average, 19% better, while with Conallen the difference is
much smaller (low Ability subjects perform 3% better). The two-way ANOVA by Method & Ability is
shown in Table XII. We can observe a significant interaction between Method and Ability (p-value=0.03).
Such an interaction can be observed by looking at the interaction plot of Fig. 6. Overall, we can reject
H0a. The presence of a significant difference (with UML) between low and high Ability subjects is also
confirmed by the Mann-Whitney test (p-value=0.02, effect size=0.76), while the test does not indicate any
significant difference with Conallen (p-value=0.71, effect size=−0.17).
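As a quick check, the two percentages can be recovered from the mean F-Measures in Table XI, assuming that each gap is expressed relative to the lower-scoring group (the text does not state which base was used):

```latex
\[
\frac{0.68 - 0.57}{0.57} \approx 0.19 \quad \text{(UML: high vs. low Ability)},
\qquad
\frac{0.69 - 0.67}{0.67} \approx 0.03 \quad \text{(Conallen: low vs. high Ability)}
\]
```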
TABLE XI
F-MEASURE DESCRIPTIVE STATISTICS FOR DIFFERENT ABILITY LEVELS.

           UML                           Conallen
Ability    N    mean   median  σ         N    mean   median  σ
l          12   0.57   0.58    0.17      18   0.69   0.70    0.08
h          29   0.68   0.71    0.13      25   0.67   0.68    0.14
TABLE XII
ANOVA OF F-MEASURE BY METHOD & ABILITY.

                  Df    Sum Sq   Mean Sq   F value   Pr(>F)
Method             1    0.01     0.01      0.66      0.42
Ability            1    0.03     0.03      1.66      0.20
Method:Ability     1    0.08     0.08      4.86      0.030
Residuals         80    1.39     0.02

Fig. 6. Interaction of Ability and Method (x-axis: Method, Conallen vs. UML; y-axis: mean of F-Measure, ranging from 0.58 to 0.68; one line per Ability level: h, l).

For hypothesis H0ea we perform a three-way ANOVA. The results show no significant interaction
among all three factors (Method, Experience and Ability) together (p-value=0.64). H0ea cannot therefore
be rejected.
D. Effect of other factors
Since the design of our replicated experiments included two separate laboratory sessions, it is important
to check whether different results emerge in the two labs. To this aim, we performed a two-way ANOVA
by Method & Lab, considering data from each experiment, as well as the whole data set. For the sake of
brevity, we only report, in Table XIII, the p-values of the influence of Method, Lab, and of the interaction
of Method and Lab on the dependent variable. As shown in the table, the Lab factor has no significant
effect and it does not interact with the Method.
The analyses performed for the System factor are similar to those conducted for Lab. As Table XIV
shows, we did not find any statistically significant influence of the factor, nor any interaction with the
Method.
TABLE XIII
INFLUENCE OF LAB: P-VALUES OF TWO-WAY ANOVA BY METHOD & LAB.

         Method     Lab        Method:Lab
Exp      p-value    p-value    p-value
All      0.27       0.07       0.99
Exp 1    0.93       0.31       0.46
Exp 2    0.04       0.58       0.70
Exp 3    0.44       0.11       0.90
Exp 4    0.82       0.32       0.99

TABLE XIV
INFLUENCE OF SYSTEM: P-VALUES OF TWO-WAY ANOVA BY METHOD & SYSTEM.

         Method     System     Method:System
Exp      p-value    p-value    p-value
All      0.27       0.66       0.81
Exp 1    0.93       0.10       0.18
Exp 2    0.04       0.74       0.27
Exp 3    0.45       0.32       0.46
Exp 4    0.82       0.47       0.51

Last, but not least, we need to analyze the longitudinal effect over different questions answered during
one lab. Table XV reports results of the unpaired analysis (paired analysis cannot be done as subjects
answered questions belonging to different systems in the two labs) on the whole data set. Results clearly
show that some questions were answered more correctly than others; however, at least on the whole data
set, only in some cases did the use of stereotypes introduce a significant improvement of the comprehension
level. For Claros, there are some questions (q6, q10, and q11 in particular) for
which the comprehension increase when using stereotypes is significant and with a medium effect size.
Question q6 “Which fields are set from the preference form?” regards seeking HTML forms, a task
that, for subjects having standard UML diagrams, required seeking fields in the JSP pages, while in
stereotyped diagrams fields are clearly visible as attributes of a <<Form>> stereotyped class. Question
q10 “Suppose that you want to make Claros accessible for systems that do not support Javascript. Which
classes should be changed?” is easier to understand using stereotypes, as one has to seek <<Javascript>>
stereotyped classes in the model instead of searching for Javascript code in JSP pages, also considering
that sometimes the Javascript could be dynamically generated on the server side. Finally, for question
q11 “Which classes contain information about the size and type of message attachments?” the usefulness
of stereotypes could appear less clear. However, when stereotypes are available, at least one is aware that
the Web form where the attachment is specified sends information related to attachments to specific classes.
For WfMS, stereotypes are clearly useful only for question q12 “Suppose that you have to substitute the
way how the session is managed. . . . Which classes/pages does this change impact?”, where stereotyped
models clearly show classes representing HTTP sessions. For all the other questions, the comprehension
improvement with stereotypes depends on co-factors related to Ability and Experience.

TABLE XV
ANALYSIS BY QUESTION: F-MEASURE DESCRIPTIVE STATISTICS AND UNPAIRED ANALYSIS RESULTS.

Claros
       UML                           Conallen                      M-W       t-test    Effect
q      N    mean   median  σ         N    mean   median  σ         p-value   p-value   size
q1     30   0.61   0.73    0.30      32   0.63   0.67    0.29      0.38      0.39       0.07
q2     30   0.88   1.00    0.27      32   0.90   1.00    0.22      0.66      0.43       0.04
q3     30   0.60   0.73    0.40      32   0.49   0.58    0.43      0.90      0.85      −0.28
q4     30   0.93   1.00    0.26      32   0.93   1.00    0.26      0.49      0.49       0.01
q5     30   0.62   0.67    0.34      32   0.60   0.80    0.41      0.44      0.61      −0.07
q6     30   0.72   0.80    0.35      32   0.88   1.00    0.25      0.01      0.02       0.53
q7     30   0.73   0.80    0.15      32   0.73   0.80    0.17      0.50      0.46       0.03
q8     30   0.75   0.80    0.24      32   0.70   0.80    0.27      0.88      0.78      −0.21
q9     30   0.32   0.31    0.26      32   0.29   0.29    0.29      0.65      0.64      −0.10
q10    30   0.40   0.17    0.44      32   0.61   1.00    0.45      0.04      0.04       0.47
q11    30   0.55   0.83    0.49      32   0.75   1.00    0.38      0.08      0.04       0.46
q12    30   0.39   0.00    0.44      32   0.37   0.00    0.44      0.57      0.57      −0.04

WfMS
       UML                           Conallen                      M-W       t-test    Effect
q      N    mean   median  σ         N    mean   median  σ         p-value   p-value   size
q1     34   0.87   1.00    0.27      30   0.92   1.00    0.23      0.27      0.25       0.17
q2     34   0.95   1.00    0.12      30   0.91   1.00    0.20      0.80      0.79      −0.22
q3     34   0.35   0.00    0.48      30   0.36   0.00    0.49      0.46      0.48       0.01
q4     34   0.56   0.80    0.40      30   0.59   0.80    0.39      0.38      0.40       0.07
q5     34   0.62   0.80    0.45      30   0.72   1.00    0.43      0.14      0.20       0.22
q6     34   0.63   0.67    0.28      30   0.66   0.80    0.35      0.22      0.35       0.10
q7     34   0.23   0.00    0.36      30   0.34   0.00    0.47      0.24      0.17       0.25
q8     34   0.76   1.00    0.41      30   0.69   1.00    0.45      0.72      0.73      −0.16
q9     34   0.70   0.83    0.34      30   0.69   0.83    0.38      0.38      0.55      −0.03
q10    34   0.74   1.00    0.41      30   0.76   1.00    0.38      0.60      0.41       0.06
q11    34   0.64   0.80    0.46      30   0.60   0.80    0.43      0.70      0.66      −0.11
q12    34   0.51   0.40    0.47      30   0.77   1.00    0.38      0.01      0.01       0.60
TABLE XVI
REPEATED MEASURES ANOVA OF F-MEASURE, BY METHOD & QUESTION.

(a) Claros
                     Df    Sum Sq    Mean Sq   F value   Pr(>F)
Between Subjects
  Method               1     0.16     0.16       0.66    0.42
  Residuals           55    13.71     0.25
Within Subjects
  Question             1     6.25     6.25      49.22    <0.01
  Method:Question      1     0.25     0.25       1.98    0.16
  Residuals          625    79.36     0.13

(b) WfMS
                     Df    Sum Sq    Mean Sq   F value   Pr(>F)
Between Subjects
  Method               1     0.22     0.22       0.84    0.3642
  Residuals           57    15.24     0.27
Within Subjects
  Question             1     0.62     0.62       3.59    0.06
  Method:Question      1     0.08     0.08       0.46    0.50
  Residuals          647   110.87     0.17
Table XVI reports results of the repeated measures ANOVA of F-Measure by Method & Question.
The analysis is performed separately for the two systems Claros and WfMS, as the questions being asked
are different. Results indicate a significant within subjects effect of the Question factor for Claros, and a
marginal effect for WfMS. In other words, subjects exhibit different performances on different questions.
However, there is no significant interaction between Question and Method: although the difficulty in
providing answers varied across questions, this is true with and without the availability of stereotypes. In
other words, if looking at performances within subjects, results do not indicate that stereotypes were more
useful for particular questions (e.g., more complex questions): when they were helpful, this happened for
both simple and complex questions.
E. Survey Questionnaire Results
For each of the three main issues assessed through the survey questionnaire, we first perform a one-tailed analysis to check whether, in any experiment, subjects encountered particular problems related, for
instance, to objectives clarity, artifacts comprehension, or time available to perform the tasks. Then, we
further investigate if any significant difference exists between the two methods whenever possible.
The answers to questions Q1 through Q3 address objectives clarity. All subjects—see Table XVII—had
enough time to complete the assigned tasks (Q1) and they found the lab objectives clear (Q2). As far
as clarity of questions (Q3) is concerned, in general a positive answer was observed, except for Exp 2
(median=3).
Then we check whether, in each experiment and overall, a significant difference between the two methods
can be observed. In this case a Mann-Whitney test is used; results are shown in the right-hand part of
Table XVII: no significant difference was found either overall or in any specific experiment.
TABLE XVII
OBJECTIVES CLARITY ANALYSIS.

         H0: Q̃ ≥ 3                                                  H0: Q̃Conallen = Q̃UML
Exp      Q1 median  p       Q2 median  p       Q3 median  p        Q1.p    Q2.p    Q3.p
All      2.00       <0.01   2.00       <0.01   2.00       <0.01    0.40    0.58    0.73
Exp 1    2.00       <0.01   2.00       <0.01   2.00       <0.01    0.65    0.59    1.00
Exp 2    2.00       0.01    2.00       <0.01   3.00       0.02     0.63    0.99    1.00
Exp 3    1.00       <0.01   1.00       <0.01   2.00       <0.01    0.94    0.88    0.60
Exp 4    2.00       <0.01   2.00       0.03    2.00       0.11     0.49    0.95    0.69
Questions related to the comprehension of provided artifacts can be analyzed using the same approach;
results are shown in Table XVIII. In general, subjects provided a non-positive answer to question Q4,
indicating problems in reading diagrams. The only exception is Exp 3, where no problems were reported.
In general, subjects had no problems in reading the source code (Q5), although the answer was not positive
for Exp 1 and Exp 2. All subjects were able to understand Conallen’s stereotypes (Q8).
We also analyzed the effect of Method on artifact comprehension; the results are presented in the
right-hand side of Table XVIII, which reports Mann-Whitney p-values (Qx p) and Cohen’s d values (Qx d).
Considering all the subjects, a significant difference was found between UML and Conallen in terms of
difficulty in reading the diagrams (see Q4 p in Table XVIII); in particular, Conallen’s diagrams were easier
to understand than UML diagrams. When looking at each experiment separately, such a difference was
significant for Exp 3 only, although the direction is the same for all experiments. As expected, Method had
no effect on the ease of code reading. The analysis of differences between UML and Conallen
was not performed for Q8, since this question was asked only to subjects receiving Conallen’s diagrams.
TABLE XVIII
ARTIFACTS COMPREHENSION.

         H0: Q̃ ≥ 3                                                  H0: Q̃Conallen = Q̃UML
Exp      Q4 median  p       Q5 median  p       Q8 median  p        Q4 p    Q4 d     Q5 p    Q5 d
All      3.00       0.18    2.50       <0.01   2.00       <0.01    <0.01   −0.60    0.81     0.02
Exp 1    3.00       0.23    3.00       0.20    2.00       0.01     0.34    −0.44    0.98    −0.13
Exp 2    3.00       0.98    3.00       0.14    2.00       0.04     0.07    −0.48    0.49     0.21
Exp 3    2.00       <0.01   2.00       <0.01   2.00       <0.01    0.02    −0.89    0.60    −0.24
Exp 4    3.00       0.38    2.00       <0.01   1.50       0.02     0.07    −1.04    1.00     0.00
As far as the cognitive effects of Conallen’s stereotypes are concerned, first of all a χ2 test has been
used to check whether there exists a difference in answers to questions Q6 and Q7, depending on the
Method. Then, we observe the “diagrams vs. code usage” odds, which is easier to interpret. The presence
of significant differences is deduced using the Mann-Whitney test.
For both Q6 and Q7 we observe a significant difference depending on the Method, χ2 p-values are
2.3 · 10−6 and 2.5 · 10−7 respectively. The odds values and test results are presented in Table XIX. We
observe a significant difference overall, with a practical difference that can be quantified with an oddsratio of 2.68. A significant difference can also be found when looking at each experiment separately,
with odds-ratios ranging from 1.77 to 3.39. In practice, we observed that, when Conallen’s diagrams are
available, the odds of looking at diagrams instead of code are 2-3 times higher than for cases when a
simple UML documentation is provided.
TABLE XIX
DIAGRAM VS CODE ODDS DIFFERENCE RESULTS.

Exp      Conallen   UML    OR     p-value
All      1.96       0.73   2.68   <0.01
Exp 1    2.06       0.61   3.39   <0.01
Exp 2    2.31       0.76   3.03   <0.01
Exp 3    1.51       0.85   1.77   0.04
Exp 4    1.17       0.57   2.03   0.02
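The relation between the odds and the odds-ratio (OR) columns of Table XIX can be checked with a few lines of Python. The sketch below simply divides the Conallen odds by the UML odds; the dictionary and function names are ours, introduced for illustration. Because the odds printed in the table are rounded to two decimals, the recomputed ratios can differ from the published OR values by a few hundredths.

```python
# Odds of using diagrams rather than code, copied from Table XIX:
# (odds with Conallen diagrams, odds with plain UML diagrams, published OR).
TABLE_XIX = {
    "All":   (1.96, 0.73, 2.68),
    "Exp 1": (2.06, 0.61, 3.39),
    "Exp 2": (2.31, 0.76, 3.03),
    "Exp 3": (1.51, 0.85, 1.77),
    "Exp 4": (1.17, 0.57, 2.03),
}


def odds_ratio(odds_conallen, odds_uml):
    """Odds-ratio: how many times higher the odds are with Conallen diagrams."""
    return odds_conallen / odds_uml


for exp, (conallen, uml, published) in TABLE_XIX.items():
    recomputed = odds_ratio(conallen, uml)
    print(f"{exp}: recomputed OR = {recomputed:.2f}, published OR = {published:.2f}")
```

An OR above 1 in every row corresponds to the observation that diagrams are consulted more, relative to code, when they are stereotyped.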
In addition, we can assess the perceived usefulness of the stereotypes by looking at question Q9.
Descriptive statistics (median) and Mann-Whitney test results are shown in Table XX. The perceived
usefulness was always stated to be high or very high.
TABLE XX
STEREOTYPES PERCEIVED USEFULNESS (Q9).

Exp      median   M-W p-value
All      2.00     <0.01
Exp 1    2.00     0.01
Exp 2    2.00     0.03
Exp 3    1.50     <0.01
Exp 4    2.00     0.02
V. THREATS TO VALIDITY
This section discusses the threats to validity that can affect our results: conclusion, construct, internal
and external validity threats.
Conclusion validity concerns the relationship between the treatment and the outcome. Care was
taken not to violate the assumptions made by statistical tests. Whenever the conditions necessary to use parametric
statistics did not hold, we used non-parametric tests, in particular the Mann-Whitney test for unpaired analyses
and the Wilcoxon test for paired analyses. Results of non-parametric tests were also confirmed by parametric
tests (t-test). The analysis of co-factors was only performed with a parametric test (ANOVA). This means that
such results should be interpreted with particular caution: although ANOVA is fairly robust to deviations
from normality, the distributions of data groups are in some cases highly skewed, as shown in Fig. 3 (e.g., Exp
1–UML, Exp 2–Conallen). On the other hand, we inspected the scatterplots of residuals obtained when
performing the two-way ANOVA tests (see a longer report [12] for the scatterplots), and we could not
observe particular trends in them. In addition, the discussion about interactions is supported by multiple
two-means, non-parametric tests (Mann-Whitney tests), and the Bonferroni correction was applied to the
p-values. The measure chosen to evaluate comprehension, i.e., the F-Measure, allowed us to (i) aggregate
both precision and recall and (ii) evaluate the questionnaire answers in an objective manner, avoiding
subjective scoring. In Section IV-B, we also discussed the results for precision and recall separately;
as noted there, both are consistent with the F-Measure. The comprehension questionnaire covers different
aspects of the system, so that the high number of correct answers indicates a good comprehension level.
Survey questionnaires, mainly intended to get qualitative insights, were designed using standard settings
and scales [16]. This permitted the use of statistical tests—Mann-Whitney and χ2 —for their analysis.
Finally, we dealt with random heterogeneity of subjects by introducing the Ability and Experience factors,
and analyzing their interaction with Method, as well as the three-way interaction among all three factors.
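As an illustration of the Bonferroni correction mentioned above, the following minimal Python sketch multiplies each p-value by the number of comparisons and caps the result at 1. The function name and the p-values are hypothetical and do not come from the experiments; the sketch only shows the general technique, not the exact procedure followed in the analysis.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a family of p-values (multiply by the family size, cap at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values from three pairwise Mann-Whitney tests (NOT the study data).
raw_p = [0.01, 0.04, 0.30]
adjusted_p = bonferroni(raw_p)

alpha = 0.05
for raw, adj in zip(raw_p, adjusted_p):
    verdict = "significant" if adj < alpha else "not significant"
    print(f"raw p = {raw:.2f}, adjusted p = {adj:.2f}: {verdict}")
```

Equivalently, one can leave the p-values unchanged and compare them against alpha divided by the number of comparisons.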
Construct validity threats concern the relationship between theory and observation. The achieved comprehension level was measured by using questionnaires inspired by real change requests and bugs posted
on bug-tracking systems of open source projects. To avoid any subjective evaluation, for each question
the subject was requested to provide a list of items, and the accuracy and completeness of the provided
answer was evaluated by using precision and recall, then aggregated into the F-Measure. We carefully
defined the questions so that they were neither so complicated as to make the tasks impossible to perform
nor so simple as to make it difficult to observe any difference among subjects. In fact, the obtained results
show that, in most cases, subjects were not able to answer all questions correctly. Also, it can be noticed
that for research associates—who performed better than others and exhibited similar performances with
and without stereotypes—the average F-Measure was below 0.75, indicating a low risk that a ceiling effect
had occurred. Nevertheless, we do not know whether, for more complex tasks, research associates would
also have benefited significantly, to a larger extent, from the availability of stereotypes. We are aware that
alternative ways to assess a developer’s comprehension level could have produced different results. For
similar reasons, we designed the experiments so that subjects would have neither so much time that they could
easily perform all tasks, nor too little time (the latter was confirmed by survey questionnaires). Regarding the
levels of the Ability factor, more levels than high and low could have been used. Nevertheless, the analysis
performed with more levels did not yield any different or contrasting result. We are aware, however, that
Ability measures based on factors other than exam grades could have led to different conclusions. As
far as Experience is concerned, the ordinal measure we used can be considered a proxy of the actual
subjects’ experience, therefore it is possible that a more accurate measure could have led to different
results.
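The scoring scheme described above (a list of items per answer, evaluated through precision and recall and then aggregated into the F-Measure) can be sketched in Python as follows. The sketch assumes the standard balanced F-Measure, i.e., the harmonic mean of precision and recall; the function name and the item names are hypothetical and serve only as an example.

```python
def precision_recall_f(answered, expected):
    """Score one answer (a list of items) against the expected set of items."""
    answered, expected = set(answered), set(expected)
    correct = answered & expected
    # Precision: fraction of the provided items that are correct.
    precision = len(correct) / len(answered) if answered else 0.0
    # Recall: fraction of the expected items that were provided.
    recall = len(correct) / len(expected) if expected else 0.0
    # Balanced F-Measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical question: "which files are affected by this change request?"
expected_items = ["Login.jsp", "SessionServlet", "UserBean", "MailForm.jsp"]
answered_items = ["Login.jsp", "UserBean", "Header.jsp"]

p, r, f = precision_recall_f(answered_items, expected_items)
print(f"precision = {p:.2f}, recall = {r:.2f}, F-Measure = {f:.2f}")
```

In this example two of the three provided items are correct and two of the four expected items are found, so an answer that is both incomplete and partly wrong is penalized on both dimensions.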
Internal validity concerns external factors that may affect our dependent variable. A major threat is
related to a number of students who, in Exp 2 and 3, did not participate in both Labs. Paired analyses were
limited to students that participated in both Labs, while unpaired tests were used over all data available,
including students who participated in one Lab only. These subjects were excluded from the analysis of
the Lab factor. Other internal validity threats can be due to the learning effect experienced by subjects
between Labs. Such an effect is mitigated by the chosen experimental design: subjects worked, over
the two Labs, on different systems with different levels of the main factor (UML vs. Conallen). Also,
there is the risk that, during laboratories, subjects might have learned how to comprehend the source
code of a Java Web application and how to read UML diagrams. We limited this effect by means of
a preliminary training phase. ANOVA analyses were used to study the influence of the Lab factor, for
which no significant effect was found. We also investigated the effect of the different comprehension
questions by using repeated measures ANOVA, and found no interaction with the main factor. To avoid
social threats due to evaluation apprehension, students were not evaluated on their performance. Finally,
while subjects knew the study goal, they were not aware of the experimental hypotheses being tested.
External validity concerns the generalization of our findings. This kind of threat is always present when
experimenting with students, which was the case in all our experiments except for Exp 4. The selected
students represent a population of students specifically trained on Web development technologies and
software engineering methods. Also, all students involved in Exp 1 and 3 (graduate students) either had
some professional experience or worked on industrial projects during their thesis. The fourth experiment
was performed with research associates, experts in using UML and in developing Web applications. We
are aware that the latter subjects may be very different from professionals working in industry; however,
their involvement in research projects, and the deadline pressure they were subject to, make the Exp 4
environment closer to industry than the other experiments. Nevertheless, replications with professionals
working in industry—on “production” projects rather than on research projects—are highly desirable. The
experiment objects were two real, though small, Web applications belonging to different domains. This
makes the context quite realistic, although further studies with different types of systems are necessary to
confirm or refute the obtained results.
Last, but not least, we ensure replicability of our studies by providing the whole experimental package,
and of our analyses, by providing raw data of the four experiments.
VI. DISCUSSION
In this section we summarize the main results that emerged from the data analysis, and highlight the
Pieces of Evidence (PoE) collected.
A. Results
The first result that emerges from this series of experiments is that the use of stereotypes does not
always introduce significant benefits. This result contrasts with that obtained by Staron et al. [3]: while
they found that the largest improvements were observed with research associates, in our case subjects
having a low Ability and Experience obtained the highest benefits. As will be discussed in Section VII, the
provided material is different, and this can be the reason for different results. In our context, subjects had
both diagrams and source code available, thus they could find the information necessary to perform the
task either in the source code or in the stereotyped diagrams. This, of course, depends on the capability
of maintainers to browse source code or diagrams: some of them are able to quickly browse complex
source code artifacts, while others are more confident in looking at UML diagrams.
When analyzing the effect of Ability, we found a significant interaction with the Method. When
stereotypes are not available, the difference between high and low Ability subjects is significant, and
the comprehension level for high Ability subjects is, on average, 19% higher. In our interpretation, this
happens because low Ability subjects are less confident in answering the comprehension questions by
looking at the source code directly. When stereotypes are available, as shown in Fig. 6, the gap between
high and low Ability subjects becomes negligible. High Ability subjects already exhibit good performance
without stereotypes, and they are not able to further improve when stereotypes are available. Low Ability
subjects, instead, are able to achieve an improvement when using stereotyped diagrams, which were found
easier to understand than the source code.
When analyzing the effect of subjects’ Experience, we found a significant effect of Experience on
the comprehension level. This means that, as expected, more experienced subjects perform better than
less experienced ones. When considering all three Experience levels (undergraduate, graduate, and research
associates), there is no significant interaction with the Method (see Table X), while a significant interaction
can be found when considering only undergraduate and graduate students. This is highlighted in the interaction
plot of Fig. 4, and can be interpreted as follows:
• The relationship between the performance of undergraduate and graduate students, visible in Fig. 4,
is similar to that found between high and low Ability subjects. Undergraduate students exhibit
lower performance when stereotypes are not available. For them, performing the tasks by looking
at the source code (plus non-stereotyped diagrams) is more difficult than for more experienced
subjects, such as graduate students or research associates. Undergraduate students know the basics
of Web application development in Java; however, they have limited experience in developing
and maintaining non-trivial Web systems. When stereotypes are available, undergraduates obtain a
significant benefit and are able to perform as well as graduate students, also reducing the gap
with research associates.
• Research associates always exhibit higher performance than other subjects, on average 5% better
than graduate students without stereotypes, and 11% better with stereotypes. By carefully looking at
the background of research associates, we found that they are used to developing and maintaining complex
Web applications. On the one hand, they found stereotyped diagrams useful and, when stereotypes were
available, their odds of looking at diagrams were significantly higher (OR=2.03, p-value=0.02). On the
other hand, for research associates the odds of looking at diagrams are always lower than for other
subjects (see the odds in Table XIX), indicating that they tend to follow a more integrated approach,
relying more on source code instead of using diagrams as the main source of information.
• The availability of stereotypes does not help to reduce the gap between graduate students and research
associates. This suggests that Experience introduces a gap (in this case observable between graduate
students and research associates) that the use of stereotypes cannot fill. Research associates are not only
highly skilled at browsing the source code, but also able to benefit effectively from stereotyped diagrams
when these are available.
We noticed that, if analyzing precision and recall of the answers provided to the comprehension questions
separately (instead of aggregating them as F-Measure), their values were quite consistent. We interpret
this as follows: (i) stereotypes help to identify a larger, more complete set of elements related to a
comprehension task or impact task, elements that might be missed when browsing the source code, while
they are clearly visible in stereotyped diagrams; (ii) for similar reasons, stereotypes can limit mistakes
due to misunderstandings occurring when browsing the source code.
Feedback provided by subjects through survey questionnaires suggested, for all levels of Ability and
Experience, that Conallen’s diagrams were easier to understand, and that subjects used diagrams more
when these were stereotyped. Our interpretation is that subjects found them easier because they better
describe the view of a Web application, immediately providing an idea of how it works. Because of this
higher perceived usefulness, the percentage of time spent on diagrams was about three times higher
when stereotypes were available.
B. Pieces of Evidence
On the basis of the above discussion and the experimental results we are able to distill the following
Pieces of Evidence (PoE):
1) PoE 1: the usefulness of stereotypes in comprehension tasks significantly depends on subjects’
Ability and Experience. This happens in a context where the source code is available in addition to
diagrams—which is likely during real maintenance tasks. This was not the case in previous studies
performed on diagrams only [3], where benefits were found for all subjects. We found that the
use of stereotypes reduces the gap between low Experience subjects—who do not have enough
confidence in dealing with complex source code—and high Experience subjects. For subjects with
mid-level Experience (graduate students, for instance) the use of stereotypes does not produce further
benefits, leaving the gap with research associates almost unaltered (or even slightly increased, thanks
to the capability of research associates to benefit effectively from diagrams in addition to the source
code). In much the same way, the use of stereotypes reduces the gap between low Ability and high
Ability subjects, a gap probably due to the fact that low Ability subjects are less skilled at browsing and
understanding the source code. Finally, it is important to remark that low Ability/Experience subjects
could be also considered less skilled in understanding diagrams; nevertheless, according to our
empirical results, they benefit from stereotyped diagrams because stereotypes convey information—
otherwise available only in the source code—in a visual fashion, which is more suitable for them
to use.
2) PoE 2: high Ability/Experience subjects tend to use the code more extensively than low Ability/Experience subjects. They appear to adopt a more “integrated” approach, compared to the
top-down approach naturally supported by stereotyped diagrams. The capability to quickly browse
unfamiliar code requires experience, and varies a lot depending on how skilled a maintainer is. Vice
versa, locating concepts in stereotyped diagrams is (relatively) simpler. When stereotypes are not
available, the source code is the only place where some information related to the View (e.g., related
to page links, Web form fields, HTTP sessions) can be located, and thus low Ability/Experience
subjects are penalized.
3) PoE 3: when stereotypes are available, diagrams are used more extensively, since they contain
additional information, otherwise available only in the source code when standard UML diagrams
are used. This happens in all cases, even for subjects that do not really gain significant benefits from
stereotypes, such as high ability and highly experienced subjects (graduate students and research
associates).
VII. RELATED WORK
This section discusses the related literature concerning experiments aimed at: (i) assessing the use of
graphical notations, in particular in comprehension tasks, (ii) assessing UML models and (iii) evaluating
the influence of subjects’ ability and experience.
A. Experiments aimed at assessing the use of graphical notations
The usefulness of graphical elements to support maintenance and evolution tasks has been experimentally assessed by Bratthall and Wohlin [21], who compared ten different representations aiming at
enriching the design with qualitative information. Graphical elements were used in software architecture
diagrams to represent control relations, software components size and external and internal complexity of
the components. The subjects employed in the experiments were master students (computer science and
electrical engineering) and PhD students. For data collection, as in our study, the authors used a questionnaire.
As happened in Exp 2 of our study, the presence of graphical elements enhanced the understanding of
the architecture and helped subjects evolve and maintain complex software systems.
Kuzniarz, Staron and Wohlin [3], [4] conducted a series of controlled experiments, performed both in
academia and industry, and focused on the use of stereotypes in UML class diagrams for comprehending
Object-Oriented applications in the telecommunication domain. They showed that the use of stereotypes
significantly helps both students and industrial developers to improve comprehension. To the best of our
knowledge, their study is the most similar to ours, although there are important differences:
• application architecture: monolithic object-oriented systems vs. Servlet/JSP Web applications;
• size and complexity: the applications used by Kuzniarz et al. [4] count 14 classes, while the two systems
we used count 78 and 92 files among Java classes and JSPs;
• application domain: telecommunication vs. Web applications for workflow management and Web
mail client;
• stereotypes being evaluated: ad-hoc stereotypes introduced by Staron et al. [3] for the telecommunication domain vs. Conallen’s stereotypes [2]; and
• experiment material: above all, while in the experiments by Staron et al. subjects relied on diagrams only,
in our experiments they also relied on the source code. This is, in our opinion, more realistic for
a software maintenance task. Also, in our setting it is necessary to make the comparison fair;
otherwise, subjects using standard UML diagrams would not receive the same information that subjects
using Conallen diagrams have available.
Similar results, but in a different context, were obtained in two experiments, presented by Hendrix
et al. [22], who investigated the influence of additional graphical information on the source code. The
focus of their work was to understand the effects of the Control Structure Diagram (CSD) on program
comprehensibility. CSD [23] is a graphical notation able to represent some constructs of programming
languages (i.e., sequence, selection, iteration) by means of graphical symbols. The authors asked the subjects
to answer a set of questions regarding the structure and the execution of a software module from a
graphical library, with and without the availability of CSD. Statistical analysis of the data collected from
the two experiments revealed that CSD improved subjects’ performance in program comprehension tasks.
This experiment compared comprehension performance with the availability of graphical design notations,
versus comprehension performances when only the source code was available.
Finally, Lawrence et al. [24] presented the results of an empirical study conducted to understand the
effect of code coverage visualization techniques on test effectiveness. For their experiment, Lawrence
et al. used some code coverage views provided by a commercially available programming environment.
Consistently with our results, their study reveals that graphical visualizations of code coverage information
help end-user programmers and provide insights into the strategies developers use to test code, while the
same does not seem to hold for professional developers.
In a companion paper [20], we reported some partial analyses and results from 3 of the 4 experiments
reported in this work. The present paper provides further details and analysis about the 3 experiments
already presented, in particular new analyses related to the post-experiment survey questionnaires, the
analysis by question, and a discussion of results obtained if considering precision and recall separately.
Also, the present paper adds a new replication with 8 research associates. Even if master and bachelor
students can be considered not far from young developers [25], [26], this new replication is an important
step to better comprehend the effects of stereotypes with highly experienced developers. Last, but not
least, the present paper builds lessons learned and guidelines from the four experiments performed.
B. Experiments aimed at assessing the usefulness of UML models
Arisholm et al. [17] conducted two controlled experiments with bachelor and master students, aimed
at studying the impact of UML documentation in software maintenance tasks. Results indicated that such
documentation improves the functional correctness of changes and the quality of the modified design.
While simple class diagrams, with or without stereotypes, help low ability or low experience subjects, a
complete, thorough UML documentation requires a certain learning curve to become useful. Regarding
the time, the authors concluded that, for complex tasks, the availability of UML documentation “does not
seem to provide any resulting time saving” while for simpler tasks, “the time needed to update the UML
documentation may be substantial compared with the potential benefits”. The design and planning of our
experiments took advantage of the experiences reported by Arisholm et al.
Torchiano [27] conducted a study with graduate students to understand whether the use of static
object diagrams improves the comprehension of software systems. The four systems used as objects
were documented in two ways: with a class diagram or with both a class diagram and an object diagram.
Also in this case the metric “comprehension” was estimated by means of a questionnaire, comprising four
questions for each system. This study revealed that object diagrams have a significant impact (with α set
to 0.17) on comprehension tasks, when compared with UML documentation consisting of class diagrams
only.
UML limitations in aiding program understanding are highlighted in an experiment performed by Tilley
and Huang [28] with 15 academics (PhD students and professors). Subjects analyzed a series of UML
diagrams (provided in UML 1.4) and answered, as in our case, a comprehension questionnaire for a
software system. Qualitative results highlighted that UML diagrams do not provide a sufficient support
in program understanding tasks, since their efficacy is mainly limited by three factors: ill-defined syntax
and semantics, spatial layout and domain knowledge. Possibly, stereotypes may help to reduce the first
and the third limitation.
Lemus et al. [29] showed that composite states improve the understandability of statecharts provided
that subjects had previous experience in using them. In our study we have not considered “the training”
dimension because we believe that using stereotypes is more intuitive and simple than using complex
formalisms and structures such as OCL and composite states.
As happened in our study, the most intuitive notation is not always the one producing better
performances/results—see Exp 1 and Exp 3 of our study, where students without stereotypes did the
comprehension tasks better than students with stereotypes. Purchase et al. [30] compared five different
graphical UML notational variants for class relationships, and found that the less intuitive ones helped
the subjects more, since this forced them to perform the analysis more carefully. The study was conducted in a
mixed context of academia and industry: the subjects were 34 bachelor students and 5 expert programmers.
Unlike in our study, questionnaires were not used to estimate comprehension. Instead, comprehension
was estimated by matching a given textual specification against a set of (correct and incorrect) diagrams and
indicating whether the diagrams correctly matched the specification or not.
Other experimental studies were performed by different authors with the aim of assessing the usefulness
of UML diagrams in comprehension tasks. For example, the role of dynamic UML diagrams in software
comprehension was investigated by Otero and Dolado [31]. The comprehension level and the time required
to perform the comprehension task differed across diagrams and system complexities.
Otero and Dolado also compared UML dynamic diagrams with Open Modeling Language (OML) [32],
concluding that the comprehension of OML dynamic diagrams is faster than for UML.
C. Effect of ability and experience
In the following we discuss experiments considering the influence of subjects’ ability and experience on
software comprehension. The impact of subjects’ background on pair design was investigated by Canfora et
al. [33]. This study was conducted with students attending master courses in software technology having a
different background, i.e., non-computer science and computer-science students, referred to as MUTEGS and
MUTS students, respectively (see footnote 10). The focus of the study was on evaluating how the individual background
affects the knowledge built with pair designing. To test the null hypothesis “the difference in education
between the pairs components does not affect the building of system knowledge”, the authors formed pairs
combining subjects with different ability and experience in different ways. Results for pairs composed of
subjects having a different background were worse than for pairs composed of subjects having a similar
background.
Another study where the ability of the subjects is considered is by Briand et al. [13], who empirically
investigated the impact of OCL on the comprehension and maintenance of UML diagrams. The authors
found that OCL has the potential to improve an engineer’s ability to understand, inspect, and modify a
system modeled with UML. However, substantial training is required to make OCL useful, although for
some tasks, such as defect detection, an interaction between ability and treatment, similar to that observed
in our study, was detected. OCL better helped low ability subjects, who were not able to guess system
functionality from the textual description. As in our study, there is an interaction between the additional
notation (OCL) and the subjects’ ability and, as in our case, the additional notation is more helpful for
low ability subjects, provided that they are properly trained on the notation.
Footnote 10: MUTEGS and MUTS were the names of the software technology master courses students were attending.
VIII. CONCLUSIONS AND FUTURE WORK
This paper reported the results from a series of experiments aimed at investigating the effects of stereotypes on software comprehension, in particular concentrating on the usefulness of Conallen’s stereotypes
for modeling Web applications. The experimentation consisted of a comprehension task, conducted by
making use of both the source code and UML diagrams—stereotyped or not—where the comprehension
level was assessed by computing precision, recall and F-Measure on the answers provided to a questionnaire. The experimentation was carried out in four replications, involving undergraduate students, graduate
students (two replications), and research associates.
Results indicate that subjects having different ability/experience achieved different performance and
different levels of benefit by using stereotyped diagrams. Low ability/experience subjects benefited more
from stereotyped diagrams, which allowed them to reduce their gap with high ability/experience subjects.
The latter did not significantly benefit from the use of stereotypes, being able (and used) to perform
comprehension tasks in an “integrated” fashion, i.e., by looking at diagrams but also by browsing the source
code. Stereotypes did not, however, help to fill the gap between mid experience subjects, e.g., graduate
students, already able to browse the source code, and highly skilled research associates. The latter not
only are able to use the source code, they also get a slight additional benefit from the use of stereotyped
diagrams.
The findings partially contrast with results of similar experiments [3] performed on diagrams only,
where benefits were always found, in particular for professionals. Instead, our results are in agreement
with experiments where additional notation (e.g., constraints written in Object Constraint Language)
produces benefits for low ability subjects [13]. The experiments also showed that stereotypes make diagrams
more likely to be used during a comprehension task, since they provide developers with an immediate
overview of the application interface behavior, and contain information often unavailable in conventional
design diagrams and otherwise available only from the source code.
The obtained results suggest that the adoption of Conallen’s stereotypes can compensate for organizational
volatility and human resource heterogeneity. On the other hand, if the software will be maintained mainly
by highly skilled and experienced people, it might not be necessary to adopt stereotyped notations like
WAE. A project manager should carefully think about the potential benefits in terms of maintainability
and comprehensibility WAE could introduce, also considering that application maintainers may change in
the future.
When trying to balance the costs related to the adoption of this new notation, one should also take
into account the time/cost needed for training developers. In our experience, WAE required a very limited
training (about two hours plus a further two-hour lab), provided that subjects knew UML. Also, very
likely, existing Web applications are not documented with stereotyped diagrams. Re-documenting them
requires an additional effort that, for standard UML diagrams, is at least mitigated by the availability of
reverse engineering features in most of the existing modeling tools. For stereotyped diagrams, either one
should complement reverse-engineered standard UML diagrams by hand (for the Web applications used
in our experiments this took half a day’s work by two people), or use Web application reverse engineering
tools, which are however mostly experimental tools [34].
The experiments involved subjects possessing different ability and experience levels and made use of
applications belonging to different domains. Despite such a prominent variety, further empirical studies are
highly desirable to provide further evidence, or to contradict our results. For example, it would be useful
to perform case studies with industrial developers other than research associates, on the comprehension
and maintenance of larger applications. Also, additional experiments would be useful to investigate whether
the obtained results can be extended to stereotypes dealing with domains other than Web applications,
or to test the usefulness of other Web-application specific modeling languages (e.g., WebML and UWE).
IX. ACKNOWLEDGMENTS
The authors would like to thank all the students and research associates who participated in the
experiments. Without them, this work would not have been possible. The authors would also like to
thank Prof. Maurizio Morisio for his insightful comments on early drafts of this paper.
REFERENCES
[1] J. Rumbaugh, I. Jacobson, and G. Booch, Unified Modeling Language Reference Manual. Addison-Wesley, 2004.
[2] J. Conallen, Building Web Applications with UML, Second Edition. Reading, MA: Addison-Wesley Publishing Company, 2002.
[3] M. Staron, L. Kuzniarz, and C. Wohlin, “Empirical assessment of using stereotypes to improve comprehension of UML models: A set
of experiments,” Journal of Systems and Software, vol. 79, no. 5, pp. 727–742, 2006.
[4] L. Kuzniarz, M. Staron, and C. Wohlin, “An empirical study on using stereotypes to improve understanding of UML models,” in
Proceedings of the International Workshop on Program Comprehension (IWPC), Bari, Italy, 2004, pp. 14–23.
[5] J. Conallen, “Modeling Web application architectures with UML,” Communications of the ACM, vol. 42, no. 10, pp. 63–70, 1999.
[6] S. Ceri, P. Fraternali, A. Bongio, M. Brambilla, S. Comai, and M. Matera, Designing Data-Intensive Web Applications. Morgan
Kaufmann, 2002.
[7] O. M. F. De Troyer and C. J. Leune, “WSDM: a user centered design method for web sites,” in Proceedings of the seventh international
conference on World Wide Web 7. ACM Press, 1998, pp. 85–94.
[8] R. Hennicker and N. Koch, “A UML-based methodology for hypermedia design,” in UML 2000 - The Unified Modeling Language,
Advancing the Standard, Third International Conference, York, UK, October 2-6, 2000, Proceedings, 2000, pp. 410–424.
[9] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering - An Introduction.
Kluwer Academic Publishers, 2000.
[10] N. Juristo and A. Moreno, Basics of Software Engineering Experimentation. Kluwer Academic Publishers, 2001.
[11] D. Brugali and M. Torchiano, Software Development: Case Studies in Java. Addison Wesley, 2005.
[12] F. Ricca, M. Di Penta, M. Torchiano, P. Tonella, and M. Ceccato, “How developers’ experience and ability influence web application
comprehension tasks supported by UML stereotypes: a series of four experiments,” University of Sannio, Italy,
http://www.rcost.unisannio.it/mdipenta/wae-experiments-tr.pdf, Tech. Rep., 2009.
[13] L. C. Briand, Y. Labiche, M. Di Penta, and H. D. Yan-Bondoc, “An experimental investigation of formality in UML-based development,”
IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 833–849, 2005.
[14] L. Kuzniarz, M. Staron, and C. Wohlin, “An empirical study on using stereotypes to improve understanding of UML models,” in
Proceedings of the International Workshop on Program Comprehension. IEEE Computer Society, 2004, pp. 14–23.
[15] W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[16] A. N. Oppenheim, Questionnaire Design, Interviewing and Attitude Measurement. London: Pinter, 1992.
[17] E. Arisholm, L. C. Briand, S. E. Hove, and Y. Labiche, “The impact of UML documentation on software maintenance: An experimental
evaluation,” IEEE Transactions on Software Engineering, vol. 32, no. 6, pp. 365–381, 2006.
[18] A. Agresti, An Introduction to Categorical Data Analysis. John Wiley & Sons, 2007.
[19] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures (fourth edition). Chapman & Hall, 2007.
[20] F. Ricca, M. Di Penta, M. Torchiano, P. Tonella, and M. Ceccato, “The role of experience and ability in comprehension tasks supported
by UML stereotypes,” in Proceedings of the International Conference on Software Engineering. IEEE Computer Society, 2007, pp. 375–384.
[21] L. Bratthall and C. Wohlin, “Is it possible to decorate graphical software design and architecture models with qualitative information?
- An experiment,” IEEE Trans. Softw. Eng., vol. 28, no. 12, pp. 1181–1193, 2002.
[22] S. M. D. Hendrix, J.H. Cross II, “The effectiveness of control structure diagrams in source code comprehension activities,” IEEE Trans.
Softw. Eng., vol. 28, pp. 463–477, 2002.
[23] J. H. Cross, L. A. Barowski, T. D. Hendrix, and J. C. Teate, “Control structure diagrams for Ada 95,” in Proceedings of the
Conference on TRI-Ada ’96, 1996, pp. 143–147.
[24] J. Lawrance, S. Clarke, M. M. Burnett, and G. Rothermel, “How well do professional developers test with code coverage visualizations?
an empirical study,” in 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 2005), 21-24 September
2005, Dallas, TX, USA, 2005, pp. 53–60.
[25] J. Carver, L. Jaccheri, S. Morasca, and F. Shull, “Issues in using students in empirical studies in software engineering education,” in
Ninth International Software Metrics Symposium (METRICS’03), 2003, p. 239.
[26] P. Runeson, “Using students as experiment subjects - an analysis on graduate and freshmen student data,” in Intl. Conf. Empirical
Assessment and Evaluation in Software Eng. (EASE03), 2003, pp. 95–102.
[27] M. Torchiano, “Empirical assessment of UML static object diagrams,” in Proceedings of the International Workshop on Program
Comprehension. IEEE Computer Society, 2004, pp. 226–229.
[28] S. Tilley and S. Huang, “A qualitative assessment of the efficacy of UML diagrams as a form of graphical documentation in aiding
program understanding,” in SIGDOC ’03: Proceedings of the 21st annual international conference on Documentation. New York,
NY, USA: ACM Press, 2003, pp. 184–191.
[29] J. A. Cruz-Lemus, M. Genero, M. E. Manso, and M. Piattini, “Evaluating the effect of composite states on the understandability
of UML statechart diagrams,” in Proceedings of the International Conference on Model Driven Engineering Languages and Systems
(MODELS 2005). Springer, 2005.
[30] H. C. Purchase, L. Colpoys, M. McGill, D. Carrington, and C. Britton, “UML class diagram syntax: an empirical study of
comprehension,” in APVis ’01: Proceedings of the 2001 Asia-Pacific symposium on Information visualisation. Darlinghurst, Australia:
Australian Computer Society, Inc., 2001, pp. 113–120.
[31] M. C. Otero and J. J. Dolado, “An initial experimental assessment of the dynamic modelling in UML,” Empirical Software Engineering,
vol. 7, no. 1, pp. 27–47, 2002.
[32] D. Firesmith, B. Henderson-Sellers, and I. Graham, OPEN modeling language (OML) reference manual. New York, NY, USA:
Cambridge University Press, 1998.
[33] G. Canfora, A. Cimitile, F. Garcia, M. Piattini, and C. A. Visaggio, “Confirming the influence of educational background in pair-design
knowledge through experiments,” in SAC ’05: Proceedings of the 2005 ACM symposium on Applied computing. ACM Press, 2005,
pp. 1478–1484.
[34] G. A. Di Lucca, A. R. Fasolino, and P. Tramontana, “Reverse engineering web applications: the WARE approach,” Journal of Software
Maintenance, vol. 16, no. 1-2, pp. 71–101, 2004.
APPENDIX
Fig. 7. Claros - UML Diagram (View).
[Diagram not reproduced. Elements shown: the pages top_main, index, bottom_main, top_banners, authenticate,
preferences, logout, mailbox, saveprefs, composer, dumppart, deletemessage, downloadatt, readmessage, attachment,
sendMessage, and getmessage, together with the Actor class and the «Session» class HttpSession
(getAttribute( ), setAttribute( )). Associations are stereotyped «use», «new», and «invalidate».
Diagram note: the diagram shows a «use» dependency between the "index" server page and Actor; the same
association holds between every other server page and Actor.]
Fig. 8. Claros - Conallen Diagram (View).
[Diagram not reproduced. Elements shown: «Server Page» classes (top_main, index, bottom_main, top_banners,
authenticate, preferences, logout, mailbox, saveprefs, composer, deletemessage, readmessage, sendMessage,
dumppart, getmessage, downloadatt), their «Client Page» counterparts (e.g., index_client, top_main_client,
preferences_client, composer_client, mailbox_client, readmessage_client, sendMessage_client, showAtt, readframe),
«Form» classes (login, prefs, frmComposer, messages, frmMsg), the «Frameset» readmessage_fset, the «Target»
readmessage_targets, the «JavaScript Object» frmMsgScript, the Actor class, and the «Session» class HttpSession.
Associations are stereotyped «Builds», «includes», «Submit», «Redirect», «Link», «Targeted Link», «use», «new»,
and «invalidate».
Diagram notes: (1) the diagram shows a «use» dependency between the "index" server page and Actor; the same
association holds between every other server page and Actor. (2) The diagram also shows a redirect between the
"authenticate" server page and the "index" server page; the same stereotyped association holds between any other
page and index if the user accesses the page without authentication. (3) The frmMsgScript changes the frmMsg
action to deletemessage or to composer, depending on the button clicked.]
Fig. 9. WfMS - UML Diagram (View).
[Diagram not reproduced. Elements shown: ContexListener, ServletContext, Organization, Catalog, Instantiate, Main,
Start, Complete, Login, Process, Logout, Actor, and HttpSession, connected by «Use» and «New» associations.
Diagram note: the interaction with the Web application starts here.]
Fig. 10. WfMS - Conallen Diagram (View).
[Diagram not reproduced. Elements shown: «Server Page» classes (Instantiate.jsp, Login.jsp, Logout.jsp, Main.jsp,
Process.jsp, Start.jsp, Complete.jsp), «Client Page» classes (Instantiate.html, Main.html, Login.html, Logout.html,
Process.html, Start.html), «Form» classes (form_1 to form_5), and the classes Actor, HttpSession («Session»),
ServletContext, Catalog, Organization, and ContexListener. Associations are stereotyped «Link», «Use», «Build»,
«Submit», «Invalidate», and «New». The legend distinguishes Client page, Server page, and Form.
Diagram note: the interaction with the Web application starts here.]
TABLE XXI
EXAMPLES OF REAL CHANGE-REQUESTS/BUGS FROM ON-LINE WEB-BASED SOURCE CODE REPOSITORIES THAT ARE
SIMILAR TO OUR COMPREHENSION QUESTIONS
Question | URL | System | Id bug/change
q1  | http://tracker.moodle.org/browse/MDL-12772 | Moodle | MDL-12772
q1  | https://bugzilla.mozilla.org/show_bug.cgi?id=145454 | Core Graveyard, GFX | Bug 145454
q2  | http://drupal.org/node/153392 | e-Commerce PayFlow Pro | //
q3  | https://bugzilla.mozilla.org/show_bug.cgi?id=317904 | Core, XUL | Bug 317904
q9  | http://forge.continuent.org/jira/browse/SEQUOIA-1117 | Sequoia | SEQUOIA-1117
q9  | https://bugzilla.mozilla.org/show_bug.cgi?id=421925 | Webtools, Kubla | bug 421925
q12 | http://my.opera.com/tech-nova/blog/2008/01/28/spip-1-9-2d-is-available | SPIP | 1.9.2d
TABLE XXII
COMPREHENSION QUESTIONNAIRE FOR CLAROS.
ID  Question
1   Suppose that you have to set the background color of each Web page using CSS (Cascading Style Sheets). Which classes/pages does this change impact?
2   Suppose that you have to substitute, in the entire application, the form-based communication mechanism between pages with another mechanism (i.e. Applet, ActiveX, ...). Which classes/pages does this change impact?
3   Does the application conform to the Model-View-Controller (MVC) pattern? If yes which class (or classes) implements the controller component?
4   Which page/class contains the form used to insert username and password to handle authentication?
5   Which controller classes are used to retrieve the users from the page referred in the question 4?
6   Which fields are set from the preference form?
7   Suppose you want to add a new preference. Which classes/pages are impacted?
8   Which page is used to read a message? Which are the options available for the user?
9   Suppose that you have to change the way attachments are handled. Which classes are impacted?
10  Suppose that you want to make Claros accessible for systems that do not support Javascript. Which classes should be changed?
11  Which classes contain information about the size and type of message attachments?
12  Suppose you want to modify the links that permit to browse among messages (i.e., going to the previous and to the next message), because you might also want to go to the first and to the last message. Which classes need to be changed?
TABLE XXIII
COMPREHENSION QUESTIONNAIRE FOR WfMS.
ID  Question
1   Suppose that you have to set the background color of each Web page using CSS (Cascading Style Sheets). Which classes/pages does this change impact?
2   Suppose that you have to substitute, in the entire application, the form-based communication mechanism between pages with another mechanism (i.e. Applet, ActiveX, ...). Which classes/pages does this change impact?
3   Does the application conform to the Model-View-Controller (MVC) pattern? If yes which class (or classes) implements the controller component?
4   The description of a process is made up of three main types of elements (activity, participant, and transition) and stored in an XPDL file. Which are the process modeling classes (i.e. the classes used to represent the processes in memory)?
5   Which classes are initialized when the JSP container starts and are destroyed when it shuts down? These classes keep the long lived information and are used by almost all Web pages.
6   Suppose that you have to divide the main.jsp page in three different pages. One containing the workList (i.e. the work items the user have to complete), one containing Processes (the processes the user started) and the last containing the Process Catalog. Which classes/pages does this change impact?
7   Suppose that you have to add the management of new rules for the sequencing of activities. They are: split of thread of control among two branches (that can be executed in parallel) and the join of multiple branches. Which classes/pages does this change impact?
8   Which is the class that manage the work list of pending activities? (i.e the list of activities that the user is required to perform at a given instant)
9   Suppose that you have to add a logger able to store in a file the sequence of activities performed from the users. This sequence should be visible from each page of the View. Which classes/pages does this change impact?
10  Suppose that you have to introduce the capability of using complex data in the processes (for example files); this extension must consider two aspects: the definition of complex data types and the presentation and editing of such data. Which classes/pages does this change impact?
11  Which are the classes/pages in the session scope?
12  Suppose that you have to substitute the way how the session is managed. You have to handle the session including all state values as parameters in every URL of the system. Which classes/pages does this change impact?
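Most questions in the two questionnaires ask for a set of classes/pages, so an answer can be scored with the Precision and Recall measures reported in the tables that follow. The sketch below shows one way to compute them under the usual information-retrieval definitions. The function name, the treatment of empty sets, and the example answer sets are illustrative assumptions, not the scoring procedure or the oracle actually used in the experiments.

```python
def precision_recall(answer, expected):
    """Score one answer (a collection of class/page names) against the expected set.

    precision = |answer ∩ expected| / |answer|
    recall    = |answer ∩ expected| / |expected|
    Returning 0.0 for an empty set is an assumption of this sketch.
    """
    answer, expected = set(answer), set(expected)
    correct = answer & expected
    precision = len(correct) / len(answer) if answer else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical answer sets, for illustration only (not taken from the experiments).
expected = {"index", "authenticate", "Actor"}
answer = {"index", "authenticate", "logout", "mailbox"}
precision, recall = precision_recall(answer, expected)
```

In this hypothetical case two of the four listed items are correct and two of the three expected items are found, so precision is 2/4 and recall is 2/3.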
Fig. 11. Boxplot of Precision across the experiments.
[Plot not reproduced. One pair of boxes (Conallen, UML) per experiment (1 to 4); y-axis: Precision.]
Fig. 12. Boxplot of Recall across the experiments.
[Plot not reproduced. One pair of boxes (Conallen, UML) per experiment (1 to 4); y-axis: Recall.]
TABLE XXIV
PRECISION DESCRIPTIVE STATISTICS AND UNPAIRED ANALYSIS RESULTS.
      |          UML           |        Conallen        | M-W     | t-test  | Effect
Exp   | N   mean  median  σ    | N   mean  median  σ    | p-value | p-value | size
All   | 64  0.70  0.70    0.16 | 62  0.71  0.74    0.14 | 0.37    | 0.32    | 0.09
1     | 13  0.68  0.74    0.16 | 13  0.67  0.64    0.08 | 0.86    | 0.57    | −0.07
2     | 28  0.64  0.64    0.15 | 27  0.71  0.76    0.15 | 0.04    | 0.05    | 0.44
3     | 15  0.78  0.81    0.13 | 14  0.70  0.69    0.15 | 0.95    | 0.94    | −0.58
4     | 8   0.77  0.79    0.16 | 8   0.80  0.82    0.14 | 0.34    | 0.32    | 0.25
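The effect-size column of Table XXIV is not defined in this appendix. Assuming it is Cohen's d computed with a pooled standard deviation (an assumption of this sketch), the values can be approximated from the published means, standard deviations, and group sizes. The function name below is illustrative.

```python
import math


def cohens_d(n_uml, mean_uml, sd_uml, n_con, mean_con, sd_con):
    """Cohen's d (Conallen minus UML) using the pooled standard deviation.

    Assumes the reported effect size follows this definition, which the
    appendix does not state explicitly.
    """
    pooled_var = ((n_uml - 1) * sd_uml ** 2 + (n_con - 1) * sd_con ** 2) / (
        n_uml + n_con - 2
    )
    return (mean_con - mean_uml) / math.sqrt(pooled_var)


# Experiment 2 row of Table XXIV (Precision), using the rounded published values.
d_exp2 = cohens_d(28, 0.64, 0.15, 27, 0.71, 0.15)
```

For Experiment 2 this gives roughly 0.47, close to the reported 0.44. The gap is what one would expect from recomputing on means and standard deviations rounded to two decimals.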
TABLE XXV
RECALL DESCRIPTIVE STATISTICS AND UNPAIRED ANALYSIS RESULTS.
      |          UML           |        Conallen        | M-W     | t-test  | Effect
Exp   | N   mean  median  σ    | N   mean  median  σ    | p-value | p-value | size
All   | 64  0.65  0.68    0.16 | 62  0.68  0.70    0.16 | 0.23    | 0.16    | 0.18
1     | 13  0.65  0.73    0.18 | 13  0.65  0.62    0.12 | 0.72    | 0.52    | −0.02
2     | 28  0.58  0.58    0.15 | 27  0.67  0.72    0.17 | 0.01    | 0.02    | 0.54
3     | 15  0.74  0.76    0.15 | 14  0.70  0.69    0.19 | 0.76    | 0.74    | −0.25
4     | 8   0.70  0.71    0.11 | 8   0.70  0.70    0.14 | 0.32    | 0.48    | 0.02
TABLE XXVI
PRECISION PAIRED ANALYSIS RESULTS.
      |     Difference      | Wilcoxon | t-test  | Effect
Exp   | N   mean    median  | p-value  | p-value | size
All   | 51  0.01    −0.02   | 0.35     | 0.33    | 0.08
1     | 13  −0.01   −0.06   | 0.62     | 0.57    | −0.07
2     | 20  0.07    0.10    | 0.06     | 0.06    | 0.41
3     | 10  −0.10   −0.15   | 0.92     | 0.93    | −0.70
4     | 8   0.04    0.04    | 0.22     | 0.19    | 0.25
TABLE XXVII
RECALL PAIRED ANALYSIS RESULTS.
      |     Difference      | Wilcoxon | t-test  | Effect
Exp   | N   mean    median  | p-value  | p-value | size
All   | 51  0.02    0.01    | 0.35     | 0.26    | 0.12
1     | 13  −0.00   −0.15   | 0.69     | 0.52    | −0.02
2     | 20  0.08    0.04    | 0.06     | 0.03    | 0.48
3     | 10  −0.07   −0.08   | 0.78     | 0.78    | −0.41
4     | 8   0.00    −0.01   | 0.58     | 0.48    | 0.02
Fig. 13. Boxplot of Precision and Recall for different Experience levels: (a) Precision, (b) Recall.
[Plots not reproduced. x-axis: Experience (U, G, RA); y-axes: Precision and Recall.]
TABLE XXVIII
PRECISION AND RECALL: DESCRIPTIVE STATISTICS FOR DIFFERENT EXPERIENCE LEVELS.
(a) Precision
           |          UML           |        Conallen
Experience | N   mean  median  σ    | N   mean  median  σ
U          | 28  0.64  0.64    0.15 | 27  0.71  0.76    0.15
G          | 28  0.74  0.76    0.15 | 27  0.69  0.67    0.12
P          | 8   0.77  0.79    0.16 | 8   0.80  0.82    0.14
(b) Recall
           |          UML           |        Conallen
Experience | N   mean  median  σ    | N   mean  median  σ
U          | 28  0.58  0.58    0.15 | 27  0.67  0.72    0.17
G          | 28  0.70  0.76    0.17 | 27  0.68  0.63    0.16
P          | 8   0.70  0.71    0.11 | 8   0.70  0.70    0.14
TABLE XXIX
ANOVA OF PRECISION AND RECALL BY METHOD & EXPERIENCE.
(a) Precision
                  | Df   | Sum Sq | Mean Sq | F value | Pr(>F)
Method            | 1    | 0.01   | 0.01    | 0.24    | 0.62
Experience        | 2    | 0.17   | 0.08    | 4.04    | 0.02
Method:Experience | 2    | 0.09   | 0.05    | 2.26    | 0.11
Residuals         | 120  | 2.48   | 0.02    |         |
(b) Recall
                  | Df   | Sum Sq | Mean Sq | F value | Pr(>F)
Method            | 1    | 0.03   | 0.03    | 1.06    | 0.31
Experience        | 2    | 0.15   | 0.08    | 2.96    | 0.05
Method:Experience | 2    | 0.09   | 0.05    | 1.84    | 0.16
Residuals         | 120  | 3.05   | 0.03    |         |
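Each F value in an ANOVA table such as Table XXIX is the ratio of the effect mean square to the residual mean square, where a mean square is a sum of squares divided by its degrees of freedom. The sketch below recomputes one F value from the rounded Recall entries (Experience: Sum Sq 0.15 on 2 Df; Residuals: Sum Sq 3.05 on 120 Df). The function name is illustrative, and reading 3.05 as the residual sum of squares is an interpretation of the extracted table.

```python
def f_ratio(ss_effect, df_effect, ss_resid, df_resid):
    """F statistic = (effect sum of squares / df) / (residual sum of squares / df)."""
    ms_effect = ss_effect / df_effect
    ms_resid = ss_resid / df_resid
    return ms_effect / ms_resid


# Experience effect on Recall, using the rounded values printed in Table XXIX.
f_experience = f_ratio(0.15, 2, 3.05, 120)
```

The result is about 2.95, in line with the reported F value of 2.96. Recomputing from two-decimal entries cannot be expected to match more closely.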
Fig. 14. Interaction of Experience and Method (for Precision and Recall): (a) Precision, (b) Recall.
[Plots not reproduced. x-axis: Method (Conallen, UML); y-axes: mean of Precision and mean of Recall; one line per
Experience level (RA, G, U).]
TABLE XXX
PRECISION AND RECALL DESCRIPTIVE STATISTICS FOR DIFFERENT ABILITY LEVELS.
(a) Precision
        |          UML           |        Conallen
Ability | N   mean  median  σ    | N   mean  median  σ
l       | 12  0.62  0.65    0.16 | 18  0.72  0.75    0.07
h       | 29  0.74  0.75    0.14 | 25  0.70  0.67    0.13
(b) Recall
        |          UML           |        Conallen
Ability | N   mean  median  σ    | N   mean  median  σ
l       | 12  0.57  0.58    0.19 | 18  0.70  0.72    0.12
h       | 29  0.71  0.76    0.14 | 25  0.69  0.68    0.16
TABLE XXXI
ANOVA OF PRECISION, RECALL BY METHOD & ABILITY
(a) Precision
               | Df  | Sum Sq | Mean Sq | F value | Pr(>F)
Method         | 1   | 0.00   | 0.00    | 0.01    | 0.9108
Ability        | 1   | 0.03   | 0.03    | 1.83    | 0.1794
Method:Ability | 1   | 0.09   | 0.09    | 5.47    | 0.0219
Residuals      | 80  | 1.35   | 0.02    |         |
(b) Recall
               | Df  | Sum Sq | Mean Sq | F value | Pr(>F)
Method         | 1   | 0.01   | 0.01    | 0.44    | 0.51
Ability        | 1   | 0.06   | 0.06    | 2.44    | 0.12
Method:Ability | 1   | 0.10   | 0.10    | 4.31    | 0.04
Residuals      | 80  | 1.84   | 0.02    |         |
Fig. 15. Precision and Recall: Interaction of Ability and Method: (a) Precision, (b) Recall.
[Plots not reproduced. x-axis: Method (Conallen, UML); y-axes: mean of Precision and mean of Recall; one line per
Ability level (h, l).]
TABLE XXXII
INFLUENCE OF LAB ON PRECISION AND RECALL: P-VALUES OF TWO-WAY ANOVA BY METHOD & LAB.
(a) Precision
Exp   | Method p-value | Lab p-value | Method:Lab p-value
All   | 0.63           | 0.16        | 0.96
Exp 1 | 0.86           | 0.63        | 0.56
Exp 2 | 0.11           | 0.54        | 0.65
Exp 3 | 0.12           | 0.17        | 0.67
Exp 4 | 0.65           | 0.52        | 0.42
(b) Recall
Exp   | Method p-value | Lab p-value | Method:Lab p-value
All   | 0.31           | 0.02        | 0.93
Exp 1 | 0.96           | 0.25        | 0.49
Exp 2 | 0.05           | 0.47        | 0.67
Exp 3 | 0.50           | 0.06        | 0.89
Exp 4 | 0.97           | 0.10        | 0.38
Fig. 16. Residuals of two-way ANOVA analyses by Method & Ability and by Method & Experience:
(a) ANOVA by Method & Ability, (b) ANOVA by Method & Experience.
[Plots not reproduced. x-axis: Fitted Model; y-axis: Residuals.]