
A controlled experiment
on the effects of PSP training:
Detailed description and evaluation
Lutz Prechelt ([email protected])
Barbara Unger ([email protected])
Fakultät für Informatik
Universität Karlsruhe
D-76128 Karlsruhe, Germany
+49/721/608-4068, Fax: +49/721/608-7343
http://wwwipd.ira.uka.de/EIR/
Technical Report 1/1999
April 8, 1999
Abstract
The Personal Software Process (PSP) is a methodology for the systematic and continuous improvement of an individual software engineer’s software production capabilities. Its proponents claim that the PSP methods in particular improve program quality and the capability to estimate development time accurately, while not impairing productivity.
We have performed a controlled experiment to assess these and related claims. The experiment compares the performance of a group of students who had just completed a PSP course to that of a comparable set of students from a “normal” programming course. This report presents in detail the experiment design and setup, the results of the experiment, and our interpretation of the results.
The results indicate that the claims are basically correct, but the improvements may be a lot smaller than expected. However, we found an important additional benefit of the PSP that its proponents do not usually mention: the performance in the PSP group was consistently less variable for most of the many variables we investigated. Less variable performance in a software team greatly reduces the risk in software projects.
Contents

1 Introduction
  1.1 What is the PSP?
  1.2 Experiment overview
  1.3 Related work
  1.4 Why such an experiment?
  1.5 How to use this report

2 Description of the experiment
  2.1 Experiment design
  2.2 Hypotheses
  2.3 Experiment format and conduct
  2.4 Experimental subjects
    2.4.1 Overview
    2.4.2 Education and experience
    2.4.3 The PSP course (experiment group)
    2.4.4 The alternative courses (control group)
  2.5 Task
    2.5.1 Goals for choosing the task
    2.5.2 Task description and consequences
    2.5.3 Task infrastructure provided to the subjects
    2.5.4 The acceptance test
    2.5.5 The “gold” program
  2.6 Internal validity
    2.6.1 Control
    2.6.2 Accuracy of data gathering and processing
  2.7 External validity
    2.7.1 Experience as a software engineer
    2.7.2 Experience with psp use
    2.7.3 Kinds of work conditions or tasks

3 Experiment results and discussion
  3.1 Statistical methods
    3.1.1 One-dimensional statistics
    3.1.2 Two-dimensional statistics
    3.1.3 Presentation of results
  3.2 Group formation
  3.3 Estimation
  3.4 Reliability and robustness
    3.4.1 Black box analysis and white box analysis
    3.4.2 The test inputs: a, m, and z
    3.4.3 Reliability measures
    3.4.4 Inputs with nonempty encodings
    3.4.5 Arbitrary inputs
    3.4.6 Influence of the programming language
    3.4.7 Summary
  3.5 Release maturity
  3.6 Documentation
  3.7 Trivial mistakes
  3.8 Productivity
  3.9 Quality judgement
  3.10 Efficiency
  3.11 Simplicity
  3.12 Analysis of correlations
    3.12.1 How time is spent
    3.12.2 Better documentation saves trivial mistakes
    3.12.3 The urge to finish
  3.13 Subjects’ experiences
  3.14 Mean/median/iqr overview table

4 Conclusion
  4.1 Summary of results
  4.2 Possible reasons
  4.3 Consequences

Appendix

A Experiment materials
  A.1 Experiment procedure
  A.2 Questionnaire – personal information
  A.3 Task description
  A.4 Questionnaire – Estimation
  A.5 Questionnaire – Self-evaluation
  A.6 Versuchsablauf (experiment procedure, in German)
  A.7 Fragebogen – persönliche Angaben (personal information questionnaire, in German)
  A.8 Aufgabenstellung (task description, in German)
  A.9 Fragebogen – Selbsteinschätzung (self-assessment questionnaire, in German)
  A.10 Fragebogen – Eigenbeurteilung (self-appraisal questionnaire, in German)

B Glossary

Bibliography
One item could not be deleted because it was missing.
Apple Macintosh System 7 OS error message
Chapter 1
Introduction
Everybody has opinions. I have data.
Watts S. Humphrey
The present report is the definitive and detailed description and evaluation of a controlled experiment comparing students who received PSP (Personal Software Process) training to students who received other software engineering education.
In the first chapter we will first discuss the general topic of the experiment, then give a broad overview of the
purpose and setup of the experiment, and finally describe related work.
Chapter 2 describes the subjects, setup, and execution of the experiment, relying partially on the original experiment materials as printed in the appendix. It also discusses possible threats to the internal and external validity
of the experiment.
Chapter 3 presents and interprets in detail the results obtained in the experiment and Chapter 4 presents conclusions. The appendix contains the handouts used in the experiment: personal data questionnaire, estimation
questionnaire, task description, work time logging sheet, postmortem questionnaire.
1.1 What is the PSP?
The Personal Software Process (PSP) is a methodology, introduced by Watts Humphrey in 1995 [3], for structuring the work of an individual software engineer. At its core is the notion of an individual’s software
process, that is, the set of procedures used by a single software engineer to do his or her work. The PSP has
several goals:
- Reliable planning capability, i.e., the ability to accurately predict the delivery time of a piece of work performed by a single software engineer.

- Effective quality management, i.e., the ability to avoid introducing defects or other quality-reducing properties into the work products, to detect and remove those that have been introduced anyway, and to improve both capabilities over time. These other quality attributes can be, for instance, the ease of maintenance, reuse, or testing (internal view of the product), or the suitability, flexibility, and ease of use (external view), etc.

- Defining and documenting the software process, i.e., laying down in writing the abstract principles and concrete procedures by which one generally creates software. The purpose of process definition is to make the process easier to trace, understand, communicate, measure, and improve.
- Continuous process improvement, i.e., the capability to continuously identify the relatively weakest points in one’s own software process, develop alternative solutions, evaluate these solutions, and subsequently incorporate the best one into the process.
The PSP may also lead to improved productivity but this is not a primary goal; part of the productivity gains may
be offset by the overhead introduced by the PSP, because the means to the above ends are process definition,
process measurement, and data analysis, which lead to a number of additional tasks.
The PSP methodology is taught by means of a PSP course. In its standard form, this is a 15-week training
program requiring roughly one full day per week. According to the experience of both Watts Humphrey and
ourselves, the PSP can hardly be learned without that course, because under the pressure of real-world working
conditions, programmers will only be able to accept and execute the overhead tasks once they have experienced
their benefits — but the benefits will only be experienced after the overhead tasks have been performed for
quite a while. Hence, the course is needed for providing a pressure-free “playground” for learning about the
usefulness of PSP techniques.
The practical consequence of the PSP course for an individual software engineer is to obtain a personal software
process (psp, in non-capital letters). The course provides a set of example methods that serve as a starting
point for the development of an individual psp. The methods are reasonable default instantiations of the PSP
principles and can be tailored to one’s individual preferences and work conditions during later psp usage.
1.2 Experiment overview
The question asked by this experiment is the following:
What (if any) differences in behavior or capabilities can be found when comparing software engineers that have received PSP training to software engineers that have received an equivalent amount
of “conventional” technical training?
The approach used to answer this question is the following:

- Find participants with similar capabilities and backgrounds, except that one group has had previous PSP training and the other has not.

- Let each participant solve the same non-trivial programming task.

- Observe as many features of behavior (process) and result (product) as possible. Examples: total working time, number of compilations, program reliability when the program was first considered functional (acceptance test), final program reliability, program efficiency, etc.

- Formulate hypotheses describing which differences might be expected between the groups with respect to the features that were observed. Example: PSP-trained participants produce more reliable programs.

- Analyze the data in order to test the hypotheses. Describe additional interesting structure found in the data, if any. Interpret the results.
In our case, the participants were graduate students and the programming task involved designing and implementing a rather uncommon search and encoding algorithm. On average, the task size was effectively more
than 1 person day (between 3 and 50 work hours). A total of 55 persons participated in the experiment between
August 1996 and October 1998.
1.3 Related work
It is one of the most important principles of the PSP methodology to base decisions on objective measurement data (as opposed to intuitive judgement). Consequently, the PSP course (and also later practice of the PSP) builds a collection of data for each participant from which the development of several attributes of process quality and effective developer capability can be seen. Such data has been described and discussed by Humphrey in several articles and reports, e.g. [4]. Most of these data show the development of a certain metric over time, such as the decreasing density of defects inserted in a program. The main and unavoidable drawback of such data is a lack of control: it is impossible to say how much of the effect comes from each of the possible sources, such as the particular programming problem solved at each point in time, maturation that would also occur without PSP training (at least given some other training), details of the measurement that cannot be made objective, and, finally, real and unique PSP/psp benefits.
The purpose of the present experiment is to provide data with a much higher degree of comparability, by measuring psp effects in a controlled fashion.
We know of no other evaluation work specifically targeting the PSP methodology or the PSP course.
1.4 Why such an experiment?
There are basically two reasons why we need this experiment. First, any methodology, even one as convincing as the PSP, should undergo a sound scientific validation, not only to see whether it works, but also to understand the structure of its effects: Which consequences are visible at all? How strong is their influence? How do they interact?
The second reason is more pessimistic: Based on our observations with about a hundred German informatics
students we estimate that only about one third of them will regularly use PSP techniques in their normal work
after the course and will really form a psp. Roughly another third appears to be unable to regularly exercise the
self-control required for building and using a psp. For the rest, we consider the prospects to be unclear; their
PSP future may depend on the kind of environment in which they will work.
Given this estimation it is not clear whether PSP-trained students will be superior to others, even if one is
willing to believe that a PSP education in principle has this effect. The purpose of the experiment is to assess
the average effect as well as look for the results of the above-mentioned dichotomy, if any. If it exists, we may
for instance see a larger variance of performance in the PSP group or maybe even a bimodal distribution having
two peaks instead of just one.
1.5 How to use this report
This report is meant to provide as detailed a documentation of the experiment and its results as possible. This has several consequences:

- You need not read the whole report from front to back. The contents are logically structured and it should be easy to find specific information of interest using the table of contents.

- When you encounter a term whose definition you have skipped, refer to the table of contents or the glossary to find it.
- The main text does not try to describe the tasks or questionnaires in any detail but instead relies on the original experiment materials that are printed in the appendix. Please consult the appendix where necessary.

- The results section is rather detailed. We recommend sticking to the text and referring to the diagrams and their captions only at points of particular interest.
Chapter 2
Description of the experiment
Good judgement comes from experience,
and experience comes from bad judgement.
Anonymous
This chapter will describe the experiment design, the hypotheses to be investigated by the experiment, the
experiment procedure, the motivation and background of the participants, and the task to be solved. In the final
sections we will discuss possible threats to the internal and external validity of the experiment.
2.1 Experiment design
As mentioned before, the general goal of the experiment is to investigate differences in performance or behavior between

- persons who have received PSP training shortly before, and
- “other” people.
The resulting basic experimental design is very simple: it is a two-group, posttest-only, inter-subject design
with a single binary independent variable, namely “subject¹ has received PSP training”. We will call the two
groups P (for “PSP-trained”) and N (for “not PSP-trained”). The set of dependent variables to be used is not at
all simple, though. To avoid unnecessary repetition, these variables will be introduced step by step during the
presentation of the results in Chapter 3.
In order to maximize the power of the experiment in view of the large variations in individual performance that are to be expected, one would ideally want to pair participants with similar expected performance (based on available knowledge about the participants’ background) and put one person of each pair into each group. Unfortunately, this is not possible in the present setting, because group membership is determined by each participant’s previous university career, which cannot be controlled by the experimenter. See also the following section.
¹ We will use the terms “subject” and “participant” interchangeably to refer to the persons who participated in our experiment as experimental subjects.
2.2 Hypotheses
As mentioned in the overview in Section 1.2, the general purpose of the experiment is investigating behavior differences (and their consequences) between PSP-trained and non-PSP-trained subjects. To satisfy such
a broad ambition, we must collect and analyze data as comprehensively as is feasible (from both a technical/organizational and a psychological point of view).
However, to guide the evaluation of this data it is useful to formulate some explicit expectations about the
differences that might occur. These expectations are formulated in this section in the form of hypotheses. The
experiment shall investigate the following hypotheses:
Hypothesis H1: Estimation. Since effort estimation is a major component of the PSP course, we expect that
the deviations of actual from expected development time are smaller in group P compared to group N.
(Note that this hypothesis assumes that the task can be considered to be from a familiar domain, because
otherwise the PSP planning may break down and the results become unpredictable.)
Hypothesis H2: Reliability. Since defect prevention and early defect removal are major goals throughout the
PSP course, we expect the reliability of the program for “normal” inputs to be higher in group P compared to
group N.
Hypothesis H3: Robustness. Since producing defect-free programs is an important goal during the PSP
course, we also expect the reliability of the program for “surprising” sorts of inputs to be higher in group P
compared to group N.
(Note that the term “robustness” is somewhat misleading here because robustness against illegal inputs was
explicitly not a requirement in the experiment.)
Hypothesis H4: Release maturity. We expect that group P will typically deliver programs in a relatively more
mature state compared to group N. When the requirements are invariant, release maturity can be represented by
the fraction of overall development time that comes after the first program release.
(In the experiment, releasing a program will be represented by the request of the participant that an acceptance
test be performed with the program.)
Hypothesis H5: Documentation. Since the PSP course places considerable weight on making a proper design and design review and on product quality in general, we expect that there will be more documentation in the final product in group P compared to group N.
Hypothesis H6: Trivial mistakes. Quality management as taught in the PSP course is based on the principle of taking even trivial defects seriously; hence we expect a lower number of simple-to-correct defects in group P
compared to group N.
Hypothesis H7: Productivity. For programs that are difficult to get right, the PSP focus on early defect
detection saves a lot of testing and debugging effort. The overhead implied by planning and defect logging
usually does not outweigh these savings. What might outweigh the savings, though, is if participants produce
a thorough design documentation that accompanies the program, e.g. in the form of long comments in the
source code. Such behavior is also expected to be more likely for PSP-trained participants. Note that since
all programs were built to conform to the same requirements, productivity can be defined simply as 1 divided
by the total work time. Now for the actual hypothesis: We expect group P to complete the program faster
compared to group N, at least if one subtracts the effort spent for documentation.
Hypothesis H8: Quality judgement. The PSP quality management tracks the density of defects in a program
as seen during development and even as predicted before development starts. We speculate that this might also
lead to more realistic estimates of the defect content and reliability of a final program. Hence, we expect to see
more accurate estimates of final program reliability in group P compared to group N.
Speculative hypothesis H9: Efficiency. PSP quality management suggests preferring simpler solutions over “clever” ones. This might lead to programs that run more efficiently or use less memory, but might also lead to programs that run more slowly or require more memory. We hypothesize that we may find differences in speed and memory consumption between groups P and N, although we cannot say in advance which form these differences will take.
Speculative hypothesis H10: Simplicity. The more careful design phase presumably performed by PSP-trained subjects might lead to simpler and shorter code in group P compared to group N.
Note that this set of hypotheses is quite bold and the odds are not the same for each hypothesis: For estimation
and reliability, for example, we clearly expect to see some advantage of the P group, while for efficiency
expecting any differences is pure speculation. The results for documentation and productivity must be treated
with care, because the experiment instructions first and foremost called for reliability, not for productivity or
maintainability.
All of these hypotheses are expected to hold more strongly if we look at only the better half of the participants in
each group, because then, according to the remark in Section 1.4, the PSP group presumably consists mostly of
subjects that indeed use a psp. In Section 3.2 we will form various subgroups for such comparisons.
As mentioned above, the purpose of the experiment is not limited to formally testing these hypotheses, but includes all other appropriate analyses that might provide insights towards understanding the behavior differences
and their effects. Consequently, we will use the hypotheses somewhat loosely in order not to be distracted from
possibly more important other observations.
2.3 Experiment format and conduct
After some trial runs in August 1996, the experiment started in February 1997 and finished in October 1998.
With a few exceptions, the participants worked during the semester breaks from mid-February to mid-April
or mid-July to mid-October. Subjects announced their participation by email and agreed on an appointment,
usually starting at 9:30 in the morning.
Each subject could choose the programming language to be C, C++, Java, Modula-2 or Sather-K; two subjects
would have preferred Pascal. The compilers used were gcc, g++, JDK, mocka, and sak, respectively. An
account was set up for each subject on a Sun Unix workstation. At most three such accounts were in use
at any given time. The account was set up so as to record activity, in particular each version of the source code that was submitted to the compiler. A subject was told about this monitoring on request. Due to a
mistake made by one experimenter (Prechelt) when adapting the setup of the participant accounts to a change
in the software environment of our workstations (switch to a new version of the Java development kit), the
monitoring mechanism was corrupted and nonfunctional during a significant part of the experiment and many
of these compilation protocols were lost. In order to provide as natural a working environment as possible and
avoid irritating the participants, no direct observation, video recording, or other kind of visible monitoring was
performed.
A subject was allowed to fetch auxiliary tools or data by FTP and install them for the experiment. For
instance, a few subjects brought their own editor or re-used small parts of previously written programs (e.g. file
handling routines). Some, but not all, PSP subjects brought their estimation data and/or PSP tools.
The subject was then given the first three parts of the experiment materials: first the personal information
questionnaire, then the task description, and then the estimation questionnaire (see Appendix A). After filling
in the latter, the subject was left alone and worked according to a schedule he could choose freely. Only three
restrictions were made: First, to always work in this special account; second, to use but one source code file;
and third, to log the working time on a special logging sheet we had provided along with the other materials.
Each of these restrictions was violated by a few (but not many) of the participants.
The subject was told to ask if he encountered technical problems with the setup or if he felt something was
ambiguous in the requirements. Technical problems occurred frequently and were then resolved on the spot.
In order not to bias the results, the time required for resolving the problems (between 5 and 30 minutes per
participant) is included in the work time. Inclusion is required because some subjects may have chosen to
resolve the problem alone. Questions about the requirements were asked by about a dozen participants; in all cases they were referred back to the task description, because the apparent ambiguity was in fact properly resolved in the description they had received.
The participant was asked not to cooperate with other participants working at the same time or with earlier
participants; we have not found any evidence of such cooperation and believe that essentially none has occurred.
The subject was further told to ask for an acceptance test at any time if he felt his program would now work
correctly.
If the acceptance test failed, the subject was encouraged to analyze the problems, correct the program, and try
again. Once the acceptance test was passed (or the subject gave up), the participant was given the postmortem
self-evaluation questionnaire. After finishing and returning the questionnaire, the subject was paid (see Section 2.4.1) and all data was copied from the subject’s experiment account to a safe place. In a few cases, the
subjects asked for (and were granted) some additional time for improving the program after the acceptance
test was passed. Typically this was used for cleaning up unused code sections and inserting comments into the
program.
2.4 Experimental subjects
This section will describe the background of the experimental subjects. Often in this report we refer to a few of
the subjects individually by name: these names are letter/number-combinations in the range from s12 to s102.
2.4.1 Overview
The experiment had 50 participants, 30 of them in the PSP group and 20 in the control group. Our 50 participants fall into the following categories with respect to their motivation: 40 of them were obliged to participate in the experiment as part of a lab course they took (29 from two PSP courses and 11 from a Java course; see the descriptions in Sections 2.4.3 and 2.4.4). Note that the obligation included only participation, not success:
Even those participants that gave up during the experiment passed their course. One PSP participant, s045,
retracted all of his materials from the experiment and will not be included in any of the analyses.
The other 10 participants were volunteers. 8 of them were volunteers from other lab courses in our department. Four of these were highly capable students who expressly came to “prove that PSP-trained people are not better than others” (s020, s023, s025, and s034; interestingly, two of these four, s020 and s034, later participated in the PSP course). 2 participants (s081, s102) are actually “repeaters”: they had already participated
in the experiment in February/March 1997 as non-PSP subjects and voluntarily participated again in June 1998
and March 1998, respectively, after they had taken the PSP course. Quite obviously, these differences need to
be taken into account during data analysis; see Section 3.2 for a description of the actual groups compared in
the analysis.
All participants were motivated towards high performance by the following reward system: They would be
paid DM 50 (approximately 30 US dollars) for successfully participating in the experiment (i.e., passing the
acceptance test). However, each failed acceptance test reduced the payment by DM 10.
2.4.2 Education and experience
The following information excludes the two repeater participants, the one participant from the PSP group who
retracted all of his materials from the experiment, and those participants for which the particular piece of
information was not available (“missing”).
The participants were in their 4th to 17th semester at the university (median 8, see Figure 2.1; please refer to Section 3.1.1 for an explanation of the plot). With the exception of two highly capable fourth-semester students (s023 and s050), all of the students were graduate students (i.e., past the “Vordiplom”).
[Figure 2.1: Distribution of the current semester number of the subjects in the PSP group (P), the non-PSP group (N), and both together (all). X-axis: current semester number (4 to 16). One value missing in each group.]
The participants had programming experience of 3 to 14 years (median 8 years, see Figure 2.2), with an estimated 10(sic!) to 15000(sic!) actual programming work hours beyond the programming exercises performed in the university education (median 600 hours, see Figure 2.3).
[Figure 2.2: Distribution of years of programming experience in the PSP group (P), the non-PSP group (N), and both together (all). X-axis: total programming experience in years (4 to 14). One value missing in each group.]
[Figure 2.3: Distribution of hours of non-university programming experience in the PSP group (P), the non-PSP group (N), and both together (all). There is one point at 15000 in N. X-axis: total programming experience in work hours (0 to 5000). 13 values missing overall.]
During that time, each of them had written an estimated total of 4 to 2000(sic!) KLOC² with a median of 20, see
Figure 2.4. The estimated total number of lines the subjects had written in the language they had chosen for
² One KLOC is a thousand lines of code. For roughly half of the participants this includes comments; for the others it includes only statements.
the experiment was from 0.5 to 100 KLOC (median 5 KLOC, see Figure 2.6). The few extremely high values
that occur in most of these variables show that there are a few quite extraordinary subjects in the sample, in
particular in the non-PSP group.
[Figure 2.4: Distribution of total KLOC written in the PSP group (P), the non-PSP group (N), and both together (all). There is one point at 2000 in N. X-axis: total programming experience in KLOC (0 to 500). 4 values missing overall.]
[Figure 2.5: Distribution of the size of the largest program written in the PSP group (P), the non-PSP group (N), and both together (all). X-axis: size of largest program written in KLOC (0 to 60). 2 values missing overall.]
[Figure 2.6: Distribution of programming experience (in KLOC) in the programming language used during the experiment by each individual subject in the PSP group (P), the non-PSP group (N), and both together (all). X-axis: programming language experience in KLOC (0 to 100). 2 values missing overall.]
Looking at these data, our two main groups appear to be reasonably balanced. The apparently somewhat larger
values in the N group for total experience in KLOC and size of largest program may be spurious, because the
non-PSP participants are more likely to over-estimate their past productivity as we will see in Section 3.3.
There are two subjects (s034 and s043) that appear among the top three 5 times and 3 times, respectively, for
the 5 programming experience measures; both are in the N group.
Note that the distribution of programming languages differs between the N and P group, see Figure 2.7. There
is a relatively larger fraction of Java users in the N group.
[Figure 2.7: Number of participants in the N group (left) and the P group (right) that have used each programming language (C, C++, Java, Modula-2, Sather-K). X-axis: number of participants (0 to 14).]
2.4.3 The PSP course (experiment group)
The PSP methodology is taught by means of a PSP course. In its standard form and as taught in our department,
this is a 15-week training program consisting of 15 lectures of ninety minutes each, 10 programming exercises,
and 5 process exercises. This course requires roughly one full day per week.
Each exercise is submitted to a teaching assistant who carefully checks the correctness of the materials (with
respect to process) and marks any problems s/he finds. The materials are handed back to the students who
have to resubmit them in corrected form until everything is OK. The focus of the course is on the following
topics: working towards (and then according to) a well-defined, written-out software process, learning and
using systematic planning and estimation based on personal historical data, and defect prevention and removal
by means of managed personal reviews, defect logging, defect data analysis, and performance-data-controlled
process changes.
The participants of the P group are from the second and third time we taught the PSP course. For the most part,
we used the agenda defined by Humphrey in his book [3] on page 746. The participants needed to submit all
exercises in good shape (with resubmissions if corrections were necessary) in order to pass the course. Those
few participants who did not pass the course dropped out voluntarily during the semester, nobody was explicitly
expelled.
2.4.4 The alternative courses (control group)
The volunteers of the N group (s014, s017, s020, s023, s025, s028, s031, s034) came from various other lab
courses.
The non-volunteers of the N group all came from an advanced Java course (“component software in Java”);
many of them had previously also participated in a basic Java course we had taught the year before. The course followed a compressed schedule, such that it ran over only 6 weeks of the 13-week semester,
but required a very high time investment during that time. In terms of the total amount of code produced, this
course is quite similar to the PSP course, although it had only 5 larger exercises instead of 10 smaller ones. The
course content was highly technical, covering the then-new Swing GUI classes, localization and internationalization, serialization and persistence, reflection and JavaBeans, and distributed programming (Remote Method
Invocation). The programs were submitted to the course teachers and tested in a black-box fashion. Participants
needed to score 70% of all possible points (based on correctly implemented functionality) in order to pass the
course.
2.5 Task
This section briefly describes the task to be solved in the experiment and explains why we chose it. For details, please refer to the original task description in Appendix A.3.
2.5.1 Goals for choosing the task
The tasks to be used in our experiment should have the following properties:
1. Suitable size. Obviously, the task should not be too small, so that it provides enough interesting data.
A trivial task would have too little generalizability. The task was planned to take about 4 or 5 hours for
a good programmer, so that most participants would be able to finish within one day, when they started
in the morning. (It later turned out that only 28% of the participants were able to finish on the day they
started and 46% took more than two days.)
2. Suitable difficulty. Most if not all of the participants should be able to complete the task successfully.
In particular, it must be possible to solve the task without inventing an algorithm or data structure that
requires high creativity.
Furthermore, the application domain of the task had to be well understandable by all subjects.
On the other hand, it must be possible to make subtle mistakes or ruin the efficiency so that there can be
sufficient differences in the work products among even the successful solutions.
3. Automatic testability. In order to test the quality of the solutions thoroughly and objectively it must be
possible to run a rather large number of tests without human intervention. In particular, the acceptance test
should be automatic and entirely objective.
2.5.2 Task description and consequences
From these requirements, we chose the following task:
Given a list of long “telephone numbers” and a “dictionary” (list of words), encode each of the
telephone numbers by one word or a sequence of multiple words in every possible way according
to a fixed, prescribed letter-to-digit mapping. A single digit may stand for itself in the encoding
between two words under certain circumstances.
Read the phone numbers and the dictionary from two files and print each resulting encoding to
standard output in an exactly prescribed format.
Please see the exact task description in Appendix A.3 for the details and for input/output examples.
The above functionality can be coded in about 150 statements with any programming language that has a reasonable string handling capability. Understanding the requirements exactly and producing an appropriate search
algorithm is not trivial, but certainly within the capabilities of the participants. Various details give enough room
for gross or subtle mistakes, e.g. handling special characters allowed in the phone numbers (slash, dash) or the
words (quote, dash), always producing the correct output format, or handling all cases of digit-insertion correctly. The algorithmic nature of the problem is simple to understand for all subjects, regardless of specific
backgrounds, and the search algorithm gives room for enormous differences in the resource consumption (both
space and time) of the resulting program. The batch job character of the program makes automatic testing
possible and the simple structure of the input data even allows for fully automatic generation of test cases once
a correct “gold” program has been implemented. This allowed for generating new data for each acceptance test
on the fly.
During the evaluation of the experiment it turned out that the differences among the solutions were even larger
than expected.
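To make the flavor of the task concrete, the following Python sketch outlines one possible shape of such a search. It is our illustration, not any participant's solution: the letter-to-digit mapping follows the task description, but the digit-insertion rule is rendered only approximately, and special characters (quotes, dashes, umlauts) are simply ignored.

    # Illustrative sketch only; the authoritative rules are in the task
    # description in Appendix A.3.
    CHAR_TO_DIGIT = {}
    for digit, letters in enumerate(
            ["e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"]):
        for ch in letters:
            CHAR_TO_DIGIT[ch] = str(digit)

    def word_digits(word):
        """Digit string of a word; characters without a mapping
        (quotes, dashes, umlauts) are ignored here."""
        return "".join(CHAR_TO_DIGIT.get(c, "") for c in word.lower())

    def encodings(number, dictionary):
        """Yield every encoding of `number` as a sequence of words.

        Slashes and dashes in the number carry no meaning. A single
        digit may stand for itself only where no word matches at all
        (a simplified rendering of the prescribed rule).
        """
        digits = "".join(c for c in number if c.isdigit())

        def search(pos, partial, last_was_digit):
            if pos == len(digits):
                yield " ".join(partial)
                return
            word_found = False
            for word in dictionary:            # naive scan over all words
                wd = word_digits(word)
                if wd and digits.startswith(wd, pos):
                    word_found = True
                    yield from search(pos + len(wd), partial + [word], False)
            if not word_found and not last_was_digit:
                yield from search(pos + 1, partial + [digits[pos]], True)

        yield from search(0, [], False)

The naive scan over the whole dictionary at every position mirrors the “modestly efficient” first version of the gold program described in Section 2.5.5; it is exactly this part of the solution that gives room for the enormous efficiency differences mentioned above.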
2.5.3 Task infrastructure provided to the subjects
Along with the task description, the following task-specific infrastructure was provided to the participants:

- the miniature dictionary test.w and the small phone number input file test.t used in the example presented in the task description,

- a file test.out containing the correct output for these inputs,

- a large dictionary called woerter2 containing 73220 words.

The same large dictionary was also used during the evaluation of all programs presented in this report, a fact which we told the participants upon request.
2.5.4 The acceptance test
The acceptance test worked as follows: For each test, a new set of 500 phone numbers was created and the
corresponding correct output computed using the gold program. This took only a few seconds. The dictionary
used was a random but fixed subset of 20946 words from the woerter2 dictionary.
Then the candidate program was run with these inputs and the outputs were collected by an evaluation Perl
script. This script matches the outputs of the candidate program to the correct outputs and computes the
reliability of the candidate program. The evaluation script stops the candidate program if it is too slow: a cumulative timeout of 30 seconds per output was applied, plus a 5-minute bonus for loading the dictionary
at the start. This means that, for instance, the 40th output must be produced before 25 minutes of wall clock
time are over or else the program will be stopped and its reliability judged based on the outputs produced so
far. Many programs did indeed run for half an hour or more during the acceptance test; the number of expected
outputs varied from 25 to 248 with a typical range of 40 to 80.
At the end of the acceptance test the following data was printed by the evaluation script: The sorted actual
output of the candidate program, the sorted expected output (i.e., the output of the gold program), a list of
differences in the form of missing correct outputs and additional incorrect outputs, and the resulting reliability
in percent.
The exact notion of reliability will be defined in Section 3.4.3 under the name of output reliability. A minimum
output reliability of 95 percent was required for passing the acceptance test. However, in their final acceptance test, with only two exceptions, all programs either achieved 100 percent or failed entirely.
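The bookkeeping just described can be re-created in outline. The sketch below is a hypothetical Python rendering of ours (the original evaluation script was written in Perl); the reliability function implements only one plausible reading, since the authoritative definition follows in Section 3.4.3.

    # Hypothetical re-creation of the acceptance-test bookkeeping.
    def deadline_seconds(n_outputs):
        """Wall-clock budget before the n-th output is due: a 5-minute
        loading bonus plus a cumulative 30 s per output, so the 40th
        output is due after 5*60 + 40*30 = 1500 s = 25 minutes."""
        return 5 * 60 + 30 * n_outputs

    def output_reliability(expected_lines, actual_lines):
        """Share (in percent) of expected outputs that were produced.
        One plausible reading only; the authoritative definition, which
        also weighs additional incorrect outputs, is in Section 3.4.3."""
        expected, actual = set(expected_lines), set(actual_lines)
        return 100.0 * len(expected & actual) / len(expected)

    PASS_THRESHOLD = 95.0  # minimum output reliability for acceptance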
2.5.5 The “gold” program
Given this style of acceptance test, the aforementioned “gold” program obviously plays a rather important role in this experiment. The gold program was developed by Lutz Prechelt together with the development of the
requirements. The initial requirements turned out to be too simple, so the rules for allowing or forbidding digits
in the encoding were made up and added to the requirements during the implementation of the gold program.
The gold program, called phonewrd, was developed using a psp in July 1996 during three sessions of about six
hours total. The total time splits into 126 minutes of design (including global design, pseudocode development,
and test development), 93 minutes of design review, 72 minutes of coding, 38 minutes of code review, and
19 minutes of compilation. The program ran correctly upon the first attempt and no defect was ever found after
completion — this despite numerous claims of participants that “the acceptance test program is wrong. My
program works correctly!”. 19 defects were originally introduced in the design and pseudocode, 11 of which were found during design review; 6 further defects were introduced during coding and were found during code review or compilation. These values include trivial mistakes such as syntactical errors. The program was written in C with refinements³, which turned out to be a superbly suitable basis for this problem.
The initial program was only modestly efficient (trying 10% of the dictionary for each digit). A few days later it was improved into a more efficient version (called phonewrd2) that tries only 0.01% of the dictionary for each digit and that also worked correctly right from the start. This second version was used throughout the experiment.
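The report does not state how this hundredfold reduction was achieved. One plausible technique (our assumption, not necessarily what phonewrd2 did) is to index the dictionary once, up front, by each word's digit string, reusing word_digits from the sketch in Section 2.5.2:

    # Assumed speedup technique, for illustration only: group the
    # dictionary by digit strings so the search need not scan all words.
    from collections import defaultdict

    def index_dictionary(words):
        by_digits = defaultdict(list)
        for w in words:
            by_digits[word_digits(w)].append(w)
        return by_digits

    # The search then looks up digits[pos:pos+k] in by_digits for each
    # word length k that occurs, instead of scanning the whole dictionary.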
2.6 Internal validity
There are two sources of threats to the internal validity of an experiment⁴: insufficient control of relevant variables, and inaccurate data gathering or processing.
2.6.1 Control
Controlling the independent variable means holding all other influential variables constant and varying only
the one under scrutiny. Controlling the dozens of possibly relevant variables in human-related experiments is
usually done by random sampling of participants into the experiment groups and subsequent averaging over
these groups: variation in any other than the controlled variable is then expected to cancel out.
However, random sampling is difficult for software engineering experiments, because they require such a high
level of knowledge. The problem becomes particularly pronounced if the independent variable is a specific
difference in education: neither can we randomly sample from a large group of possible participants, nor can
we freely assign each of them into a group chosen at random. Instead, we are confined to a small number
of available subjects with the proper background and, worse yet, typically each of them fits into only one of
the groups, because we cannot impose a certain education on the subjects and withhold the other; they choose
themselves what they want to learn.
In the given experiment this means that the preferences that let the subjects choose one course and not the other
could in principle be related to the results observed. We cannot prove that there is no such effect, but based on
our personal knowledge of the individuals in both courses, we submit that we cannot see any severe difference
in their average capabilities. In fact, because they liked us as teachers, several participants from the PSP course
later also took the other course and vice versa.
³ http://wwwipd.ira.uka.de/~prechelt/sw/#crefine
⁴ Definition from [1]: “Internal validity refers to the extent to which we can accurately state that the independent variable produced the observed effect.”
2.6.2 Accuracy of data gathering and processing
Inaccurate data gathering or processing is unlikely as there was very little manual work involved in this respect.
Instead, most data gathering and almost all data processing was automated. The scripts and programs were
carefully developed and tested and their results again scrutinized. Many consistency checks were applied for
detecting possible mistakes. One remaining problem is missing data, which is almost inevitable in any large
scale experiment. The consequences of missing data, if any, will be discussed for each dependent variable in
the results sections below.
2.7 External validity
Three major factors limit the generalizability (external validity) of this experiment: Longer experience as a
software engineer, longer experience with psp use, and other kinds of work conditions or tasks. (You may
perhaps want to skip the rest of this section until you have seen the results and their discussion.)
2.7.1 Experience as a software engineer
One of the most frequent criticisms of controlled experiments performed with student subjects is that the results cannot be transferred to software professionals.
The validity of this criticism depends on the actual task performed in the experiment: if the task requires very specific knowledge, such as the capability to properly use complex tools, esoteric notations, or uncommon processes, then the criticism is probably valid. If, on the other hand, only very general software production abilities are required by the task, a graduate student group performs not much differently from a group of professionals: it is known that experience is a rather weak predictor of performance within a group of professionals, and in fact these same students will soon be professionals themselves.
The task in this experiment is very general, requiring only knowledge that is taught during undergraduate
university education. Hence, we can expect to find relatively little difference in the behavior of our student
subjects compared to professionals. We can imagine only two differences that might be relevant.
First, professionals taking a PSP course after some professional experience may often be much more motivated towards actually using PSP techniques, because previous negative experiences have given them a much clearer conception of the possible benefits than students have. Student backgrounds often involve only relatively small projects with comparatively little schedule pressure and little need to rely on the work of colleagues. This motivation difference, if present, should accentuate the differences between PSP and non-PSP groups for professionals.
Second, some of the less gifted students may later pick a non-programming job, resulting in a sort of clean-up
(smaller variance of performance in the lower part) in a group of professionals compared to a group of students.
The possible consequences of this effect, if it exists, on the difference between PSP and non-PSP groups are
unclear.
2.7.2 Experience with psp use
The present experiment investigates performance and behavior differences shortly after a PSP course. One
should be rather careful when generalizing these results to persons that have been using a psp for some longer
time, say, two years.
It is plausible that in those cases where differences between the PSP group and the non-PSP group were found,
these differences will become more pronounced over time. However, for a PSP-adverse work environment it
is also conceivable that differences wear off over time (because PSP techniques are no longer used) and it is
unclear whether differences may emerge over time where none have been observed in the experiment.
It would definitely be important to run a similar experiment much longer after a PSP (or other) training.
2.7.3 Kinds of work conditions or tasks
As mentioned above, it is plausible that differences due to PSP training may be reduced by a working environment that discourages the sort of data gathering implied by PSP techniques; the level of actual PSP use may simply drop. Conversely, the effects might also become more pronounced, for instance, if the tasks are very difficult to get right, if the work conditions demand accurate communication of technical decisions, or if accurate planning can reduce the stress due to schedule pressure.
Furthermore, some of the PSP participants may have taken the experiment task too lightly and may have underused their psp in comparison to their standard professional working behavior. For instance, some of them did
not bring their PSP tools or PSP estimation data.
All of this is unknown, however, so adequate care must be exercised when applying the results of this experiment to such different situations.
Chapter 3
Experiment results and discussion
I don’t know the key to success,
but the key to failure is to please everybody.
Bill Cosby
This chapter presents and interprets the results of the experiment. The first section explains the means of
statistical analysis and result presentation that we use and explains why they were chosen. Section 3.2 describes
how, exactly, the groups to be compared in the analysis were derived from the raw groups of PSP and non-PSP
subjects. The following sections present the results (objective performance); the analysis is organized along the
hypotheses listed in Section 2.2. (Warning: The amount of detail in the diagrams and captions may overwhelm
you. The main text, however, is short and easy to read.) Two final sections describe findings from an analysis of
correlations between variables and findings from analyzing the answers of the subjects given in the postmortem
questionnaire.
3.1 Statistical methods
In this section we will describe the individual statistical techniques (including the graphical presentations) used
in this work for assessing the results. For each technique, we will describe its purpose, the meaning of its
results, and its caveats and limitations. The analyses and plots were made with S-Plus 3.4 on Solaris, and we will briefly indicate the names of the relevant S-Plus functions in the description as well.
3.1.1 One-dimensional statistics
For most of this report, we will simply compare the values of a single measurement for all of the participants
in the PSP group against the participants in the non-PSP group. The simplest form of such a comparison is
comparing the arithmetic mean of the values in one group against the mean of the values in the other. However,
the mean can be very misleading if the data contains a few values that are very different from the rest. A more
robust basis for a comparison is hence the median, that is, the value chosen such that half of the values in the
group are smaller or equal and the other half are greater or equal. In contrast to the mean, the median is not
influenced by how far away from the rest the most extreme values are located.
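A toy illustration (with made-up numbers) of this robustness argument:

    # One extreme value moves the mean a lot; the median barely moves.
    import statistics
    times = [4, 5, 5, 6, 7]               # hypothetical work times in hours
    statistics.mean(times), statistics.median(times)    # 5.4, 5
    times.append(50)                      # one far-away value
    statistics.mean(times), statistics.median(times)    # about 12.8, 5.5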
Possibly we are not only interested in the average behavior of the groups, but also in the variation (variability,
variance) within each group. In our context, smaller variation is usually preferable, because it means more
predictable software development. One way of assessing variation is the standard deviation. If the underlying
data follows a normal distribution (the Gaussian bell curve), about two thirds of the data (68%) will lie within
plus or minus one standard deviation from the mean. However, if the data does not follow a normal distribution,
the standard deviation is plagued by the same problem as the mean: a few far-away values will influence the
result heavily — the resulting standard deviations can be very misleading. Software engineering data often has
such values and hence the standard deviation is not a reliable measure of variation. Instead, we will often use
the interquartile range and similar measures we will explain now in a graphical context.
A flexible and robust way of comparing two groups of values for both average (statisticians speak of “location”)
and variation (called “spread”) is the boxplot, more fully called box-and-whisker plot (S-Plus: bwplot()).
You can find an example in Figure 3.4 on page 28. The data for the PSP group is shown in the upper part,
the data for the non-PSP group in the lower part. The small circles indicate the individual values, one per
participant. Only their horizontal location is important, the vertical “jittering” was added artificially to allow
for discriminating values that happen to be at the same horizontal position. The width and location of the
rectangle (the “box”) and the T-shaped lines on its left and right (the “whiskers”) are determined from the data
values as follows. The left edge of the box is located such that 25% of the data values are less than or equal
to its position, the right edge is chosen such that 75% of the data values are less than or equal to its position
(which means that 25% are greater than or equal to that value). The position of the left edge is called the 25-percentile
or 25% quantile or first quartile, the right edge is correspondingly called the 75% quantile or third quartile.
Similarly, the left and right whiskers indicate values such that exactly 10% of the values are smaller or equal
(left whisker, 10-percentile) or 10% are larger or equal (right whisker, 90-percentile), respectively. The fat dot
within the box marks the median, which could also be called 50-percentile, 50% quantile, or second quartile.
Note that different percentiles can be the same if there are several identical data values (called ties), so that, for
instance, whiskers can be missing because they coincide with an edge of the box, or the median dot can lie on an edge of the box, etc.
Boxplots allow for easy comparison of both spread and location of several groups of data. One can concentrate
either on the width of the boxes (called the inter-quartile range or iqr) or the width of the whole boxplots
for comparing spread or can concentrate on particular box edges or the median dots for comparing different
aspects of location (namely the location of the lower half, upper half, or middle half of the data points).
Note that for distributions that have only a few distinct values (typically all small integers) and therefore contain
many ties, differences in the width of the box or the location of any of the quartiles can be misleading because
it may change a lot if only a single data value changes. Figure 3.26 on page 41 shows a simple example. The
two distributions are similar, but the boxplots look quite different. A similar caveat applies when the number
of data values plotted is small, e.g. less than ten.
Our boxplots have one additional feature: the letter M in the plot indicates the location of the mean and the
dashed line around it indicates a range of plus or minus one standard error of the mean. The latter quantifies
the uncertainty with which the mean is estimated from the data and decreases with decreasing standard deviation
of the data and with an increasing number of data points. For about 68% of all data samples of the given size
taken from the same population, the sample mean will lie within this standard error band. For symmetric
distributions, the mean is equal to the median. However, in our data, many distributions are skewed, i.e., the
data is less dense on one end of the distribution than on the other. In this case, the mean will lie closer to the
less dense end than the median.
A string such as “2 missing” (e.g. on the left edge of Figure 3.5) indicates that two of the data points in the
sample had missing values and hence are not shown in the plot.
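To make these quantities concrete, the following minimal S-Plus/R-style sketch computes, for one group of values x (hypothetical example data), everything our boxplots display:

    x <- c(2, 3, 3, 4, 5, 5, 6, 8, 14, 21)        # hypothetical data
    quantile(x, c(0.10, 0.25, 0.50, 0.75, 0.90))  # whiskers, box edges, median
    mean(x)                                       # the "M" marker
    sqrt(var(x) / length(x))                      # standard error of the mean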
When comparing two distributions, say in a boxplot, it is often unclear whether observed differences in location
should be considered accidental or real. This question can be assessed by a statistical hypothesis test. A
test computes the probability that the observed differences of, say, the mean will occur when the underlying
distributions in fact have the same mean.
The classical statistical test for comparing the means of two samples is the t-test (S-Plus: t.test()). Unfortunately, this test assumes that the two samples each come from a normal distribution and that these distributions have the same standard deviation. Most of our data, however, has a distribution that is non-normal in many ways. In particular, our distributions are often asymmetric. For such data, the t-test may produce misleading results and should thus not be used. Sometimes asymmetric data can be transformed into normally distributed data by taking e.g. the logarithm, and the t-test will then produce valid results, but this still requires postulating a certain distribution underlying the data, which is often not warranted for software engineering data, because too little is known about its composition.
We will use two replacements for the t-test that do not require assumptions about the actual distribution of the
data: bootstrap resampling of differences of means and the Wilcoxon test.
The Wilcoxon rank sum test, also known as the Mann-Whitney U test (S-Plus: wilcox.test()), compares the medians of two samples and computes the probability that the observed difference would occur if the medians of the underlying distributions were in fact equal. This probability is called the p-value and is usually presented as the test result, as a number between 0 and 1. When the p-value is sufficiently small, i.e., the difference is probably not due to chance alone, we call the difference significant. We will call a difference significant if p < 0.1. The Wilcoxon test does not make any assumptions about the distribution of the data, except that the data must come from a continuous distribution so that there are never any ties. Fortunately, there is a modified test (the Wilcoxon test with Lehmann normal approximation) that can cope with ties as well and we will use this extension where necessary.
Note that a Wilcoxon test may find the estimated median of a sample a to be larger than that of a sample b even
if the actual sample median of a is smaller! Here is an example:
a:  1  2  3  4  5  6  7  8  9 10 21 22 23 24 25 26 27 28 29 30 31
b: 10 11 12 13 14 15 16 17 18 19 20 30 31 32 33 34 35 36 37 38 39
Obviously b should be considered larger than a and in fact the Wilcoxon test will find exactly this with p = 0.0076. (We will write a test result such as this in the following form: Wilcoxon test p = 0.0076.) However, the actual sample median happens to be 21 for a but only 20 for b; for other samples this difference could even be arbitrarily large.
If we have no preconception about which sample must have the larger median, if any, we use a so-called two-sided test. If, as in the example above, we want to assume that, say, either b is larger or both are equal, but want to neglect (in advance) the possibility that a could be larger than b, we will use a one-sided test, which is in fact exactly the same thing, except that the p-value is halved. In this experiment, we will usually use one-sided tests, because our testing is hypothesis-driven.
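As an illustration, the test on the example samples a and b from above might be run like this (a minimal S-Plus/R-style sketch; with ties present, most implementations automatically fall back to the normal approximation mentioned above):

    a <- c(1:10, 21:31)                      # the two samples from the example
    b <- c(10:20, 30:39)
    wilcox.test(a, b)                        # two-sided test
    wilcox.test(a, b, alternative = "less")  # one-sided: assume a is not larger
    # the text reports Wilcoxon test p = 0.0076 for this comparison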
Our second robust replacement for the t-test is bootstrap resampling of differences of means. The idea of bootstrapping is quite simple: simulation. The only assumption required is that the samples seen are representative of the underlying distribution with respect to the statistic that is being tested — this assumption is of course implicit in all statistical tests. With a computer we can now generate lots of further samples that correspond to the two given ones, by sampling with replacement (S-Plus: sample(x,replace=T)). This process is called resampling. For instance, resampling of the above sample a might yield
a: 1 3 3 4 4 6 7 8 9 10 21 22 23 24 25 25 25 27 27 30 30
Such a resample can (and usually will) have a different mean than the original one and by drawing hundreds or
thousands of such resamples ar from a and br from b we can compute the so-called bootstrap distribution of all
the differences “mean of ar minus mean of br ”. Now we can compute what fraction of these differences is, say,
greater than zero. Let’s assume we have computed 1000 resamples of both a and b and found that only 4 of the
differences were greater than zero (which is a realistic value; see Figure 3.1 for an actual example). Then 4/1000
or 0.004 is the p-value for the hypothesis that the mean of the distribution underlying a is actually larger than
the mean of the distribution underlying b. From this bootstrap-based test, we can clearly reject the hypothesis.1
Instead of p-values, we can also read arbitrary confidence intervals from the bootstrap distribution. In the example, 90% of all bootstrap differences are left of the value −4.0, hence a left 90% confidence interval for the size of the difference would be (4.0, ∞); in other words: the difference is 4 or larger with a probability of 0.9.
Figure 3.1: Bootstrap distribution of the difference of means of the two samples a and b as given in the text. (Histogram; x-axis: "mean(a)-mean(b)", from -20 to 5; y-axis: "Count".)
When we use Bootstrap, we will usually compute 1000 pairs of resamples and indicate the resulting p-value with two digits. For the example above we would write: mean bootstrap p = 0.00. Occasionally we will use 10000 resamples and then indicate three digits (mean bootstrap p = 0.004).
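A minimal sketch of this procedure for the samples a and b from above (S-Plus/R-style; 1000 resample pairs):

    n.boot <- 1000
    diffs <- numeric(n.boot)
    for (k in 1:n.boot) {
      ar <- sample(a, replace = T)      # resample a with replacement
      br <- sample(b, replace = T)      # resample b with replacement
      diffs[k] <- mean(ar) - mean(br)
    }
    sum(diffs > 0) / n.boot             # fraction above zero: the p-value
    quantile(diffs, 0.90)               # 90% of the differences lie at or below this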
Sometimes we would like to compare not only means and medians, but also the variability (spread) of two
samples. The conventional method of doing this is the F-test, which compares the standard deviations. It is
related to the t-test and, like the latter, assumes the two samples to come from a normal distribution. Unlike the
t-test, which is quite robust against deviations from normality and whose results deteriorate only slowly with
increasingly pathological inputs, the F-test is very sensitive to data with deviations from normality. Therefore,
the F-test is completely unsuitable for our purposes. Instead we resort to resampling again and compare a
robust measure of spread, namely the inter-quartile range mentioned above. We generate pairs of resamples
and compute the differences of their interquartile ranges. This way, we compute a Bootstrap resampling of
differences of inter-quartile ranges in order to arrive at a test for inequality of variability. Note that the interquartile range nicely corresponds to the box width in our boxplots. Much like for the test on means, we write
the result of a bootstrap-based test on inter-quartile ranges (iqr) like this: iqr bootstrap p = 0.15.
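The same resampling scheme carries over directly to the inter-quartile range; a sketch, reusing a, b, and n.boot from the previous sketch:

    iqr <- function(x) diff(quantile(x, c(0.25, 0.75)))   # the box width
    diffs.iqr <- numeric(n.boot)
    for (k in 1:n.boot) {
      diffs.iqr[k] <- iqr(sample(a, replace = T)) - iqr(sample(b, replace = T))
    }
    sum(diffs.iqr > 0) / n.boot   # bootstrap p-value for inequality of spread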
3.1.2 Two-dimensional statistics
Sometimes we do not only want to compare groups of individual measurements, but rather want to understand
whether or how one variable depends on another. Such a question is usually best assessed by the familiar
x-y-coordinate plot (S-Plus: xyplot()); see Figure 3.38 on page 46. Here, each participant is represented
by one point and we possibly make two (or more) such plots side by side for comparing different groups of
participants.
We will often plot some sort of trend line together with the raw data in an x-y plot in order to show the
relationship between the two variables plotted. If the assumed relationship is linear, the trend line is a straight
line. The most common form of this is the least squares linear regression line (S-Plus: lsfit()), a line
chosen such that the sum of the squares of the so-called residual errors (residuals) is minimized. The residual
error of a point is the vertical distance from the point to the regression line. While the least squares regression
1 By the way, the t-test would produce the more pessimistic p-value 0.010 when comparing a and b — an error that is not relevant in this case of a very large difference, but that is important in other cases. If the deviations of the samples from normality are large, the t-test becomes insensitive and can often not detect the differences (statisticians say it has small power).
line is very simple to compute, it is sensitive to individual far-away points, much like the mean in the onedimensional case. Therefore, we will sometimes add two other sorts of trend lines in order to avoid spurious or
misleading results. These other trend lines are less common and more difficult to compute, but are often more
appropriate because they are more robust against a small number of far-away points. The first of these robust
regression lines is the least distance regression line or L1 regression line (S-Plus: l1fit()). It minimizes
the absolute values of the residual errors instead of their squares, which reduces the influence of the largest
residuals. The L1 regression is computed by linear programming. The second kind of robust regression is the
trimmed least squares regression (S-Plus: ltsfit()), which computes a standard least squares regression
after leaving out those, say, 10% of all points that would result in the largest residual errors. The operation of
leaving out a certain fraction of extreme points is called trimming. Since the identity of these worst points
depends on the regression line chosen and the regression line depends on which points are left out, the trimmed
regression is extremely difficult to compute — it is implemented by a genetic algorithm. The fraction that is
trimmed away is indicated in each case (usually either 10%, 20% or 50%).
If a cloud of points lies close to a line, we can trust the standard least squares regression line. If however there
are severe outliers or the correlation is generally not very high, we may assess the presence or absence of a
certain trend by looking at all three kinds of trend lines at once: if they all point in the same general direction,
the trend is most probably real, but if one of them points up while the other two point down, we should not trust
any conclusions about a trend — except possibly the conclusion that there is no trend at all.
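To make the construction concrete, here is a hypothetical sketch of overlaying all three trend lines on one plot. It uses the S-Plus function names given above; l1fit() and ltsfit() may not exist under these names in other S dialects, and the component names of their results may differ:

    x <- runif(30, 0, 10)               # hypothetical example data
    y <- 2 + 0.5 * x + rnorm(30)
    plot(x, y)
    abline(lsfit(x, y))                 # least squares line
    cf1 <- l1fit(x, y)$coefficients     # robust L1 line
    abline(cf1[1], cf1[2])
    cft <- ltsfit(x, y)$coefficients    # robust trimmed least squares line
    abline(cft[1], cft[2])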
In addition to a trend line, we may also compute (and plot, with dotted lines) a confidence interval for either
the trend line itself or for the data from the underlying distribution; the latter is called a prediction interval. A
confidence interval indicates the range within which the regression line will be with a given probability if one
encounters many different samples of the same size from the same distribution. We typically plot confidence
intervals that cover 90% probability. The prediction interval not only includes the regression line but also the
individual data points, that is, it describes the variability of the individual points instead of the variability of
their mean.
If we do not want to impose a linear relationship onto the two variables plotted, we may use a loess local
regression line instead (S-Plus: loess()). Here, for computing the y-coordinate of each x-coordinate of the
trend line, only a fraction of the data points will be used, and these will be weighted such that the weight of
each point diminishes rapidly with increasing distance of its x-coordinate. The parameter that controls which
fraction of the overall x-axis range will be used at each point of the loess line is called the span. The larger the
span, the more similar the loess line will be to a straight line, small spans allow for much local variation of the
loess line. Dotted lines may be present that indicate a 90% confidence interval around the loess line (S-Plus:
predict(se.fit=T)).
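A sketch of adding such a loess line, with its 90% confidence band, to the plot from the previous sketch (the se.fit argument follows the S-Plus call quoted above; other S dialects may name it differently):

    lo <- loess(y ~ x, span = 0.75)     # larger span: smoother line
    xs <- sort(x)
    pr <- predict(lo, data.frame(x = xs), se.fit = T)
    lines(xs, pr$fit)                               # the loess trend line
    lines(xs, pr$fit + 1.645 * pr$se.fit, lty = 2)  # upper 90% confidence bound
    lines(xs, pr$fit - 1.645 * pr$se.fit, lty = 2)  # lower 90% confidence bound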
A final way of robustly assessing relationships between two variables is plotting the rank of the values instead
of the values themselves. In this case, the smallest value of the whole set is mapped to 1, the second smallest
to 2, and so on. In the presence of ties, the average rank is assigned to the whole set of tied values. Ranking
ignores the size of differences between values and represents only how many other values lie in between. This
is sometimes useful for variables with very odd distributions, such as reliabilities which tend to be either very
close to 100 percent or very close to zero, with gaping holes between these extremes: If there is a trend, we
might miss it because the values of interest are too close to each other in a plot that covers the whole range, but
after ranking we can clearly see all differences (and non-differences as well).
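This is exactly what the rank() function computes; for instance, for a reliability-like variable with one tie:

    rank(c(99.9, 0.1, 99.9, 55.0))   # yields 3.5 1.0 3.5 2.0; ties share rank 3.5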
The correlation (more precisely: linear correlation) quantifies how well the relationship of the x and y values in a set of pairs (xi, yi) can be described by a straight line. The correlation, usually called r, is a value between −1 and 1. A value of 1 indicates that the pairs all lie on a straight line (when plotted in a standard cartesian coordinate system). −1 means just the same, except that the line has negative slope (i.e. is going down to the right) instead of positive slope. For smaller values of r, there are increasing amounts of fluctuation of the points (xi, yi) around the best possible line through the point cloud (the regression line). If r = 0, the points do not show any trend at all, just as if they were statistically independent. The notion of correlation can be extended to higher
dimensions (e.g. describing a set of triples by a plane). The square of the correlation, r², describes the amount of variance explained by the regression line. For instance, if r² = 0.8, the overall variance of the yi is five times as large as the variance of the yi against the regression line, or, put differently, 80 percent of the overall variance is “explained” by the model of the data that is described by the regression line.
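As a small worked check of this interpretation (using x and y from the sketches above; the identity holds for simple least squares regression with an intercept):

    r <- cor(x, y)                  # linear correlation
    r^2                             # fraction of variance "explained"
    res <- lsfit(x, y)$residuals
    1 - var(res) / var(y)           # the same value, computed from residuals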
3.1.3 Presentation of results
Since there are quite a number of comparisons of one-dimensional statistics, we use a somewhat standardized format
for presenting them. Generally, for each comparison three tests will be computed: A Wilcoxon rank sum test,
a bootstrap test for differences of means, and a bootstrap test for differences of interquartile range.
If the results indicate that the differences are not significant, they will often not be mentioned at all. Typically,
results will be presented quantitatively in the caption of the figure containing the corresponding plot. The test
and its p-value will be indicated in the notation mentioned above and will be preceded by the name of the group
that exhibited the better performance. Example:
The P group has more commented lines (Praw : mean bootstrap p = 0.05, Wilcoxon test p = 0.05).
The main text will usually mention the results only qualitatively. We use the following expressions in that case:
If the p-value is below 0.1, we make a simple statement that there is a difference (e.g. “a is smaller than b”). If
the p-value is between 0.1 and about 0.2, we call it a tendency (“a tends to be smaller than b”). For still larger
p-values, we either consider the difference very small (“a is slightly smaller than b”, “a is hardly smaller than
b”) or not present (“a and b are not different”). When the difference is very substantial, it will be called large
(“a is much smaller than b”) and sometimes such statements are backed up by some form of confidence interval
for the difference. Where it appears to be warranted, all of these language rules are followed only loosely.
Generally, the main text contains the main results and their interpretation, while the captions contain the details.
Some numeric results are printed in a large table in Section 3.14 on page 55.
3.2 Group formation
Before we start with the actual data analysis we need to overcome some complications which limit the comparability of the groups:
1. Some participants did not finish the task; they gave up either before they even tried an acceptance test
(s028, s031, s045, s069) or after one to three failed attempts (s012, s014, s024, s039). It is entirely
unclear how one can sensibly compare the performance on finished tasks with that on unfinished ones.
2. The 8 non-PSP volunteer subjects are rather atypical. On the one hand, all 3 failed participants in the non-PSP group are from the volunteer subgroup. On the other hand, three of the most capable participants of the experiment are among the other 5 volunteers, all of whom are quite extreme in some aspect of their behavior, as shown in Figure 3.2. We see that of the five successful volunteers two completed extremely fast, one extremely slow, and the other two are also faster than average although they compile extremely frequently.
Furthermore, it is conceivable that some of the subjects in the PSP group do not, in fact, use the PSP. We would
like to compare subpopulations such that the possible group differences are not blurred by such participants.
We use the following approach:
Figure 3.2: Total time for completing the task versus number of compilations per working hour for the non-PSP volunteers and all other participants. 22 subjects are missing in the plot, because their number of compilations is not available. (Scatter plots in two panels, "volunteer" and "other"; x-axis: "total working hours"; y-axis: "compilation frequency [1/hour]".)
From each group remove that half of the participants which had the least performance and compare
only the rest.
Such an operation is called trimming. This way we can expect “cleaned up” group properties and can also
remove the subjects that gave up. Fortunately, the fractions of participants who gave up are nearly the same in the N group (3 out of 19, 16 percent) and the P group (5 out of 29, 17 percent), so that we do not introduce any bias
by this operation. Hence, we define our final analysis groups by removing another 6 participants from the N
group (ending up with 10 out of 19, or 53%) and another 9 from the P group (ending up with 15 out of 29, or
52%).
The next question is which measure to take for determining performance. We avoid this problem by using
several measures and defining one pair of groups by each. During the analysis we will compare all of these
group pairs.
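The trimming operation itself is simple; a hypothetical helper (our own illustration, not part of the original analysis scripts), where score is oriented so that larger values mean better performance:

    trim.group <- function(ids, score, keep.frac = 0.5) {
      keep <- ceiling(length(ids) * keep.frac)
      ids[rev(order(score))][1:keep]      # keep the better half
    }
    trim.group(c("s001", "s002", "s003", "s004"), c(0.9, 0.2, 0.7, 0.4))
    # yields "s001" "s003"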
Raw groups: We define the group Nraw as all of the non-PSP participants minus those that gave up during
the experiment (s014, s028, s031). We define the group Praw as all of the PSP participants (except the two
volunteers s081 and s102 that had previously participated as non-PSP subjects) minus all those that gave up
during the experiment (s012, s024, s039, s045, s069).
Groups trimmed by product reliability: The single most important performance measure as indicated in
the task description given to the participants is the reliability of the delivered program. We choose the output
reliability on the z3 test here (see Section 3.4.2), because it is the most “tame” of all reliability measures and we
expect it to be least likely to produce artifacts. We define the group Nrel as all members of group Nraw except
those 6 with the lowest z3 output reliability (in order: s034, s043, s020, s023, s053, s037). We define the group
Prel as all members of group Praw except those 9 with the lowest z3 output reliability (in order: s036, s075,
s030, s027, s090, s018, s060, s066, s093). Ties were broken based on z2 output reliability, then m3 output
reliability, and finally the output reliability of the first acceptance test.
Groups trimmed by time to completion: Another obvious performance measure is the number of work hours
required until passing the acceptance test. We define the group Nt as all members of group Nraw except those
6 with the longest work time in the experiment (in order: s053, s017, s065, s056, s068, s037). We define the
group Pt as all members of group Praw except those 9 with the longest work time in the experiment (in order:
s060, s027, s048, s087, s090, s021, s057, s018, s051).
Groups trimmed by quality discipline: Next we would like to sort out those participants that appear to be less disciplined with respect to achieving high software quality than the others. To judge quality discipline, first of
disciplined with respect to achieving high software quality than the others. To judge quality discipline, first of
all we consider the number of acceptance tests required, because a high number suggests premature “product
release” and hence an undisciplined process. (Remember that program correctness was the highest-priority
requirement for the experiment.) This coarse measure produces many ties. But there are more aspects of
quality discipline. In particular, for the PSP group we are interested in whether actual PSP quality management
techniques are used and to what degree. Hence, the next criterion is the presence and completeness of a defect
log. There were only 10 defect logs made overall, all by members of the PSP group.2 We break the remaining
ties by the completeness of the handwritten time log that was required from all participants; both defect log and
time log were judged on a four-point ordinal scale (missing, dubious or highly incomplete, slightly inaccurate,
apparently accurate). Still remaining ties are broken by the output reliability achieved in the first acceptance
test. We define the group Ndpl as all members of group Nraw except those 6 with the lowest quality discipline
according to the above ordering (in order: s065, s068, s037, s056, s053, s050). We define the group Pdpl as
all members of group Praw except those 9 with the lowest quality discipline (in order: s051, s060, s090, s057,
s075, s087, s048, s054, s036).
Groups trimmed by product size: The cost of a program is determined not only by its production cost,
but also by its maintainability, which is a function of modularity, clarity of structure, design documentation,
program length, and other measures. The only one of these that can be objectively measured is program length.
Hence we use the number of statement lines of code as a substitute maintainability measure and regard longer
programs as harder to maintain. Note that this criterion may fail, because some participants may write overly
compact “clever” code that is hard to understand. However, we expect that the opposite case of long-winded
and fussy code is more frequent.
We define the group Nsz as all members of group Nraw except those 6 with the highest number of statement
lines in their resulting program (in order: s017, s065, s023, s037, s053, s034). We define the group Psz as all
members of group Praw except those 9 with the highest number of statement lines in their resulting program
(in order: s090, s057, s018, s027, s015, s072, s084, s054, s087).
Groups trimmed by product efficiency: Finally, one would also prefer more efficient programs over less
efficient ones — there were rather huge differences in this respect. We measure efficiency by number of
minutes of CPU time required for the test set z3 (see Section 3.4.4). We define the group Neff as all members
of group Nraw except those 6 with the highest CPU time requirement for test z3 (in order: s068, s062, s023,
s047, s059, s034). We define the group Peff as all members of group Praw except those 9 with the highest
CPU time requirement for test z3 (in order: s072, s093, s018, s060, s048, s054, s021, s063, s090).
Note that many of the participants are excluded by more than one of the criteria. In particular, s090 is removed
all five times and s018, s037, s053, and s060 are excluded four times.
3.3 Estimation
Can the PSP participants really estimate the time required for a task more accurately? As we see in Figure 3.3, the answer is probably yes, although for the given task the difference is not overly consistent. The
mis-estimations tend to be less variable in the P group compared to the N group.
Looking at absolute instead of relative deviations from the estimated time and discriminating pessimistic from
optimistic estimates, we again find smaller variability in the P group (Figure 3.4), although the difference is
not significant. The worst underestimation is much more limited in the P group (about three times the median).
The difference in mean or median is not significant.
The estimation questionnaire asked not only for a point estimate of the expected work time but also for a 90% confidence interval. Figure 3.5 compares the actual misestimation to the size of this estimation interval.
2 We think that about three other PSP participants may have created a log but did not deliver it along with their work products.
Figure 3.3: twork/testim − 1: Time misestimation for various pairs of subgroups. The variability of the estimation accuracy tends to be smaller for the P groups (Peff: iqr bootstrap p = 0.12) and the error means and medians tend to be smaller as well (Praw: mean bootstrap p = 0.18; Peff: mean bootstrap p = 0.033, Wilcoxon test p = 0.089). (Boxplots for all six group pairs; x-axis: "work time estimation error [percent]".)
Figure 3.4: twork − testim: Actual work time minus estimated work time. The variability in the P group tends to be smaller (Praw: iqr bootstrap p = 0.28) and more symmetric. (Boxplots for Praw and Nraw; x-axis: "work time under-estimation [hours]".)
Figure 3.5: (twork − testim)/(textr − testim): Fraction of the estimated time interval actually used, in percent. textr is tlow for those few estimates that were too high and thigh for estimates that were too low. Mean, median, and variability all tend to be smaller in the P group (Praw: mean bootstrap p = 0.20, Wilcoxon test p = 0.25, iqr bootstrap p = 0.16). All differences are more pronounced in the reliable subgroups (Prel: mean bootstrap p = 0.03, Wilcoxon test p = 0.11, iqr bootstrap p = 0.06). The means difference is also a bit more pronounced for the disciplined subgroups (Pdpl: mean bootstrap p = 0.15). (Boxplots for all six group pairs; several values missing per subgroup; x-axis: "fraction of estimation interval used [percent]".)

Figure 3.6: (thigh − tlow)/testim: Relative size of the estimation interval compared to the estimated time. The variability of the relative estimation interval size tends to be smaller in the P group (Praw: iqr bootstrap p = 0.19). (Boxplots for Praw and Nraw; 2 values missing in Praw, 6 in Nraw; x-axis: "estimation interval size [percent of estimated time]".)
If the intervals were good, about 90% of all points should be in the range 0…100 (remember that whisker-to-whisker the boxplot covers 80% of the values). As we see, three quarters of all participants hit the wall quite badly with their estimates. However, both the average and the variability are smaller for the P group. Interestingly, the difference is most pronounced if one ignores the subjects with the less reliable programs, because the bad estimators in Prel happen to produce the worse programs, too, while no such correlation exists in Nrel. Many N participants did not even indicate an estimation interval (hence the 6 missing values), presumably because they had no idea how large it might be. The variability in the size of the estimation interval itself (relative to the estimated time) is also smaller in the P group (Figure 3.6).
Figure 3.7: prodestim/prodactual: Quotient of estimated and actual productivity. Mean, median, and variability are all significantly smaller in the P group (Praw: mean bootstrap p = 0.000, Wilcoxon test p = 0.001, iqr bootstrap p = 0.009). (Boxplots for Praw and Nraw; 1 value missing; x-axis: "estimated/actual productivity [percent]".)
If one compares the estimated productivity (in lines of code per hour) to the actual productivity (using the
participants’ own definition of LOC), roughly half of the P group is outside the reasonable range of plus or
minus a factor of two around their estimation; see Figure 3.7. In contrast, half of the N group achieved less
than one tenth of their expected productivity. With a probability of 0.9, the difference of the means is greater
than 120 and the difference of the inter-quartile ranges greater than 96. These values clearly show that many N
participants have rather little knowledge about their own productivity.
Figure 3.8: ttestestim/Nerrtest: Estimated time per defect found for the test phase. One P subject is missing because his estimated number of defects was zero. Mean and median are smaller in the N group (Praw: mean bootstrap p = 0.00, Wilcoxon test p = 0.00), the variability also tends to be smaller (iqr bootstrap p = 0.11). (Boxplots for Praw and Nraw; x-axis: "estimated test time per test defect [minutes]".)
One possible source of these misestimations is shown in Figure 3.8: The N participants are far too optimistic
about how long it takes to find and remove each defect in the test phase. With a probability of over 80%, their
mean estimated time per defect (computed by dividing the estimated time for the test phase by the estimated
number of errors to be found in the test phase) is at most half as large as in the P group.
As one last test of the estimation capabilities we can compare what fraction of the estimated total(!) testing time was spent after the first acceptance test; it turns out that many participants spent more than the planned testing time even after the program was supposed to be fully tested and correct. The results in Figure 3.9 show less of this effect for the P group and also once again slightly smaller variability (Praw: iqr bootstrap p = 0.30). In summary, the P group is clearly better at estimating the total working time, the productivity, and the time for fixing defects.
Figure 3.9: (twork − taccept1)/ttestestim: Remaining work time after the first acceptance test divided by estimated time for the testing phase. This is a rather conservative comparison of actual and expected testing time — much testing time is spent before the first acceptance test. There is one value for N at 2323%. The mean tends to be smaller in the P group (Praw: mean bootstrap p = 0.19). (Boxplots for Praw and Nraw; 1 value missing; x-axis: "conservative actual/planned testing time [percent]".)
3.4 Reliability and robustness
In this section we will assess the reliability of the delivered programs. Producing a defect-free and hence reliable
program was stated as the most important goal in the experiment instructions. Therefore, this comparison
should be considered the most important of all.
We will assess the reliability by functional testing; Section 3.4.1 explains why. In Sections 3.4.2 and 3.4.3
we describe the inputs used for the tests and the reliability measures we apply for describing the test results.
Sections 3.4.4 and 3.4.5 then present the results of the reliability and robustness tests, respectively. Finally,
Section 3.4.6 will assess whether the programming language used by the participants had an influence on
reliability. The final subsection summarizes the results.
3.4.1 Black box analysis and white box analysis
The single most important goal set forth for the experiment participants was “write a correct program”. There
are basically two ways for judging the correctness of a program. The white box method analyzes the program
text and determines whether it contains any defects that might produce incorrect behavior. The black box
method executes the program on a variety of inputs and records how often the program actually fails.
Both methods have their disadvantages: While the white box method might in principle prove that a program is entirely correct, it cannot directly assess the program’s reliability in the presence of defects; the practical consequences of any single defect remain unclear and one can hardly compare one defect to another. Furthermore,
finding all defects is very difficult. The black box method, on the other hand, requires that a concrete set of
inputs be chosen. Unless the input space can be executed exhaustively (which is usually infeasible even for
the most trivial programs), the black box method can never prove that a program is entirely correct, but always
returns a quantitative estimate of the program’s reliability.
With respect to effort, the white box method is enormously costly. In contrast, in our case the black box method
is quite cheap, because we can generate inputs randomly and can generate the corresponding correct outputs
using the gold program (see Section 2.5.5).
Another advantage of the black box method is that the relevance of the results is clearer, because the relation
of the chosen input distribution to the intended application domain is more easily understood. We will thus
exclusively use black box analysis to determine the correctness and reliability of the participants’ programs.
3.4.2 The test inputs: a, m, and z
The inputs used both for the acceptance tests during the experiment and also during the experiment evaluation
were always generated at random according to a specific probability distribution.
For our purposes, the inputs (called “telephone numbers”) are arbitrary strings of at most 50 characters, where the characters are chosen from the set C = {/, -, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0}.
For each acceptance test, a new set of 500 inputs was generated according to probability distribution A described
below.
For reliability determination during experiment evaluation we used four sets of inputs m2, m3, m4, and m5 containing 10², 10³, 10⁴, and 10⁵ examples, respectively, all generated according to probability distribution M described below. We furthermore used four sets of inputs z2, z3, z4, and z5 containing 10², 10³, 10⁴, and 10⁵ examples, respectively, all generated according to probability distribution Z described below.
Distribution M: The length of the strings is uniformly distributed over the integer range 1…50. Each character from C is randomly chosen with probability 1/12 at each position of each string independently.

Distribution Z: Like M, except that strings without a digit (i.e., consisting only of dashes and slashes) never occur. To obtain a sample of values from Z, take a sample from M and remove all elements not containing a digit.

Distribution A: Like Z, except that the distribution of input lengths is no longer uniform. Instead, it is as follows. Consider the random variable z = ⌊√u + v⌋, where u is uniformly random on 0…36 and v is uniformly random on 0…44.5; u and v are both continuous and independent. Lengths are distributed like z restricted to the range 1…50.

This means that M is a straightforward general distribution of legal input values, Z avoids the possibly difficult case of “telephone numbers” that would produce an empty output encoding, and A additionally makes very short and very long numbers less frequent than those with moderate length.
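For illustration, the three generators might look as follows (a hypothetical R/S-style sketch; the generator actually used in the experiment is not part of this report):

    C <- c("/", "-", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9")
    digits <- C[3:12]
    has.digit <- function(s) {
      chars <- substring(s, 1:nchar(s), 1:nchar(s))
      any(!is.na(match(chars, digits)))
    }
    rand.string <- function(len) paste(sample(C, len, replace = T), collapse = "")
    gen.M <- function() rand.string(sample(1:50, 1))   # uniform length 1...50
    gen.Z <- function() {                              # reject digit-free strings
      repeat { s <- gen.M(); if (has.digit(s)) return(s) }
    }
    gen.A <- function() {                    # length ~ floor(sqrt(u) + v)
      repeat {
        len <- floor(sqrt(runif(1, 0, 36)) + runif(1, 0, 44.5))
        if (len < 1 || len > 50) next
        s <- rand.string(len)
        if (has.digit(s)) return(s)
      }
    }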
3.4.3 Reliability measures
Given that a program makes mistakes, how is its reliability actually computed? We will use two different compound
measures of reliability, called input reliability and output reliability.
Given a particular input, a program may produce no, one, or several outputs. There are several possibilities for
the correctness of the actual program outputs Oact for a set I of inputs. Consider the following statements:
1. Each output o explicitly refers to exactly one input i(o).

2. Each input i must produce a certain set of expected outputs O(i).

3. We call the set of expected outputs Ocorr := ∪i∈I O(i).

4. We call the expected number |Ocorr| of outputs E.

5. An actual output x ∈ Oact produced by the program may be a correct output. We call the set of correct outputs c := Oact ∩ Ocorr.

6. An actual output may be spurious, i.e., wrong. We call the set of wrong outputs w := Oact \ Ocorr.

7. An expected output may be missing. We call the set of missing outputs m := Ocorr \ Oact.

8. Any single input may produce exactly the corresponding set of expected outputs or it may fail, i.e., produce at least one missing or wrong output. Failures include outputs for “phantasized” inputs, i.e., outputs which refer to inputs not in I. We call the set of failed inputs f := {i : ∃o ∈ Oact : i = i(o) ∧ (o ∉ O(i) ∨ i ∉ I)}.
Now we define the following two reliability measures: the output reliability reloutI := |c| / (E + |w|) and the input reliability relinI := 1 − |f| / |I|. We will usually express these values in percent.
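A sketch of the two measures, assuming each output is represented as one element of a character vector (a hypothetical representation; the actual evaluation tooling is not shown here):

    rel.out <- function(o.act, o.corr) {
      c.set <- intersect(o.act, o.corr)   # correct outputs c
      w.set <- setdiff(o.act, o.corr)     # wrong (spurious) outputs w
      E <- length(o.corr)                 # expected number of outputs E
      length(c.set) / (E + length(w.set))
    }
    rel.in <- function(f.set, inputs)     # f.set: the failed inputs f
      1 - length(f.set) / length(inputs)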
3.4.4 Inputs with nonempty encodings
For the first round of tests, we consider inputs from distribution Z, that is the random but fixed test sets z2, z3,
z4, and z5. The digit stands for the size of the test set such that zi has 10ⁱ inputs, i.e., z2 has 100 inputs (10²), z3 has 1000 inputs (10³), etc. Some of the programs of successful participants were very slow, hence only z2
and z3 were administered to all programs. Some extremely slow programs were not used with z4 (s034, s072,
s093) and these plus also some modestly slow programs were not used with z5 (s020, s021, s023, s027, s034,
s036, s047, s048, s054, s059, s060, s062, s063, s065, s068, s072, s090, s093).
Figure 3.10: reloutz3: Output reliability on the z3 test set. The mean results are better in the N group for some subgroups (Nsz: mean bootstrap p = 0.01; Neff: mean bootstrap p = 0.02), but the differences are not significant for the median (Nsz: Wilcoxon test p = 0.20; Neff: Wilcoxon test p = 0.28). Other differences in median, mean, or iqr are not significant (Nraw: mean bootstrap p = 0.33, Wilcoxon test p = 0.32, iqr bootstrap p = 0.22). (Boxplots for all six group pairs; x-axis: "output reliability on z3 [percent]".)
As we see in Figure 3.10, the reliability is generally quite high: the median reliability in all subgroups is 100%
and even the 0.25 quantile is usually higher than 99%. Apparently the N group has a slight advantage, which
we can see more closely in the magnification shown in Figure 3.11: N is better for the Nsz and Neff subgroups,
which only means that the largest and least efficient P programs were all among the very reliable ones. Overall,
the difference is insignificant. As we see, there are a few programs that achieve below 100 percent reliability
but are quite close to it. The median reliability is 100 percent in all subgroups, so a majority of the programs is
perfect in this test.
Figure 3.11: 98% to 100% range of output reliability on the z3 test set. (Boxplots for all six group pairs; x-axis: "output reliability on z3 [percent]", 98.0 to 100.0.)
The exact reliability of all programs with a reliability of more than 0 but less than 100 depends a lot on the
specific set of inputs chosen. We can reduce this influence by arranging the programs equidistantly according
to the reliability ordering instead of by the reliability values themselves, that is, we look at the rank ordering.
The results are shown in Figure 3.12. We find that the advantage of the N group depends on only 10 programs
that fall between the minimum and the maximum reliability. The difference is again not significant.
Figure 3.12: rank(reloutz3): Rank order of output reliability on the z3 test set. The differences in mean, median, or iqr are not significant (Nraw: mean bootstrap p = 0.30, Wilcoxon test p = 0.31, iqr bootstrap p = 0.21). (Boxplots for Praw and Nraw; x-axis: "output reliability rank on z3".)
The reliability results are similar for the other test sets as we can see for z2 in Figure 3.13, for z4 in Figure 3.14,
and for z5 in Figure 3.15. The comparisons for the test sets with 10000 or 100000 examples are distorted by
the fact that due to time constraints not all programs participated in them, which may bias the results.
Figure 3.13: reloutz2: Output reliability on the z2 test set. The differences in mean, median, or iqr are not significant (Nraw: mean bootstrap p = 0.24, Wilcoxon test p = 0.44, iqr bootstrap p = 0.40). (Boxplots for Praw and Nraw; x-axis: "output reliability on z2 [percent]".)

Figure 3.14: reloutz4: Output reliability on the z4 test set. The differences in median and iqr are not significant (Nraw: Wilcoxon test p = 0.49, iqr bootstrap p = 0.46), but the difference in mean is (Nraw: mean bootstrap p = 0.08). Note that some programs were not measured, which may bias the results. (Boxplots for Praw and Nraw; some values missing; x-axis: "output reliability on z4 [percent]".)

Figure 3.15: reloutz5: Output reliability on the z5 test set. The differences in mean, median, or iqr are not significant (Nraw: mean bootstrap p = 0.18, Wilcoxon test p = 0.25, iqr bootstrap p = 0.43). (Boxplots for Praw and Nraw; 10 values missing in Praw, 8 in Nraw; x-axis: "output reliability on z5 [percent]".)
Summing up, we find a slight tendency that the reliability in the N group is better, but the differences are not
significant.
3.4.5 Arbitrary inputs
We now proceed to the more difficult inputs from the M distribution. These contain requests that never occurred
in any of the acceptance tests: inputs without digits (but containing slashes and dashes), which should result
in an empty encoding. A programmer would not intuitively expect such inputs because the domain metaphor
calls the inputs “telephone numbers”.
As for the Z inputs, we have test sets m2, m3, m4, and m5, where mi has 10ⁱ inputs. Again, some of the programs of successful participants were very slow, hence only m2 and m3 were administered to all programs.
One extremely slow program was not used with m4 (s072) and several were not used with m5 (s023, s060,
s065, s068, s072, s093).
Plotting the distributions for the various subgroups for the 1000-example test produces rather striking results:
The P groups are all much better than the N groups (Figure 3.16). However, this plot is visually quite misleading. It is dominated by the fact that there is a huge gap in the reliability values between one group with 10.16%
and the next larger value at 89.47%. Nevertheless, the difference in means is at least 25 and the difference
in iqr at least 58, both with a confidence of 0.8. For the more disciplined subgroups, these values are 28 for
means and 64 for iqr. It is quite satisfying to see that this result, where we find a clear advantage for the P
groups, is also one where the more disciplined PSP users are more successful than the P group average. This
is an indication that the PSP techniques indeed help developers consider program designs more carefully and result in writing more robust programs.
Figure 3.16: reloutm3: Output reliability on the m3 test set. All N subgroups are less reliable than the corresponding P subgroups (Praw: mean bootstrap p = 0.01, Wilcoxon test p = 0.00) and the variability tends to be larger in N as well (Praw: iqr bootstrap p = 0.12). The difference is most pronounced in the more disciplined subgroup (Pdpl: mean bootstrap p = 0.00, Wilcoxon test p = 0.00, iqr bootstrap p = 0.05). (Boxplots for all six group pairs; x-axis: "output reliability on m3 [percent]".)
A more appropriate and less misleading representation is again the plot that is based on ranks instead of raw
values, as is shown in Figure 3.17. We see that all P subgroups have a large advantage over the corresponding
N subgroups. The difference from Nraw to Praw is at least 7 ranks with a confidence of 0.9; for Ndpl to Pdpl it
is at least 10 ranks.
The other three test sets show a similar result; their rank ordering is shown in Figures 3.18, 3.19, and 3.20.
Where do these huge differences between the inputs from distribution M and those from Z come from? The reason is that many programs crash when they encounter the first input that does not contain any digit, because the processing in many programs assumes there is at least one word or digit in the output encoding. Most notably, many of the Java programs exit with an “illegal array index” or a similar exception.
3.4.6 Influence of the programming language
Since there are relatively more Java users in the N group (see Figure 2.7), we should check whether using or
not using Java is the major factor for the above difference. Since the number of non-Java users is too small in
the N group, we simply compare only the Java users in both groups. The result for the m3 output reliability
ranks is shown in Figure 3.21.
Figure 3.17: rank(reloutm3): Rank order of output reliability on the m3 test set. All P groups rank significantly higher than the corresponding N groups (e.g. Praw: mean bootstrap p = 0.001, Wilcoxon test p = 0.003). The difference is most pronounced in the disciplined and reliable subgroups (Pdpl: mean bootstrap p = 0.000, Wilcoxon test p = 0.001). (Boxplots for all six group pairs; x-axis: "output reliability rank on m3".)
Figure 3.18: rank(reloutm2): Rank order of output reliability on the m2 test set. (Boxplots for Praw and Nraw; x-axis: "output reliability rank on m2".)
Figure 3.19: rank(reloutm4): Rank order of output reliability on the m4 test set. (Boxplots for Praw and Nraw; 1 value missing; x-axis: "output reliability rank on m4".)
Figure 3.20: rank(reloutm5): Rank order of output reliability on the m5 test set. (Boxplots for Praw and Nraw; 3 values missing in each group; x-axis: "output reliability rank on m5".)
Figure 3.21: rank(reloutm3)|Java: Rank order of output reliability on the m3 test set, where only the ranks assigned to Java programs are considered. All P groups rank significantly higher than the corresponding N groups (Praw: mean bootstrap p = 0.02, Wilcoxon test p = 0.04). The difference is most pronounced in the disciplined and reliable subgroups (Pdpl: mean bootstrap p = 0.01, Wilcoxon test p = 0.02). (Boxplots for all six group pairs; x-axis: "Java program output reliability rank on m3".)
As we see, all PSP Java subgroups still have a large advantage over the corresponding non-PSP Java subgroups,
but the differences are not larger than for all programs. Specifically, the 0.9-confidence minimum rank difference from Nraw to Praw has decreased from at least 7 ranks to at least 4. For Ndpl to Pdpl it has decreased from
at least 10 ranks to at least 8. The values are directly comparable, because we have just left the rank numbers
of the non-Java programs out of the comparison instead of computing the ranking for the Java programs only.
However, most of the decrease comes from the growth of the confidence intervals that is due to the smaller
number of data points in this comparison. Nevertheless, these results indicate that the validity of the experiment is not endangered by the fact that more Java programs crash and the fraction of Java users is higher in the
N groups. Our conclusion still holds that the P groups produced the more robust programs.
For the more “tame” z4 data, there is still a slight advantage for the P group, but the difference is not significant
(raw reliability is shown in Figure 3.22, ranks thereof in Figure 3.23). A still smaller difference is found for the
z3 data.
Figure 3.22: reloutz4|Java: Output reliability on the z4 test set for the Java programs only. The P group tends to be better (Praw: mean bootstrap p = 0.29, Wilcoxon test p = 0.17). (Boxplots for Praw and Nraw; 2 values missing; x-axis: "Java program output reliability on z4", 94 to 100.)
Figure 3.23: rank(reloutz4)|Java: Rank order of output reliability on the z4 test set for the Java programs only. The P group tends to be better (Praw: mean bootstrap p = 0.15, Wilcoxon test p = 0.17). (Boxplots for Praw and Nraw; 2 values missing; x-axis: "Java program output reliability rank on z4".)
Summing up these two results, we find that the tendency of Java programs to crash should not be attributed to
the language, but rather to its careless use in the N group.
3.4.7 Summary
For “normal” inputs, the two groups show hardly any difference in reliability. However, there is a significant
advantage for the PSP group when it comes to unexpected kinds of inputs. This difference becomes even
more pronounced if we consider only that half of the participants of each group with the higher quality discipline. Together, these observations suggest that using PSP techniques does indeed improve the correctness of
programs.
3.5 Release maturity
Do non-PSP programmers have a stronger tendency to deliver software that is not yet reliable? In the context
of this experiment there are three ways of assessing this question: the success in the first acceptance test
(quantified by either the acceptance rate or the average reliability achieved), the overall number of acceptance
tests required, and the fraction of working time that took place after the first acceptance test. We will look at
all of these.
Figure 3.24: reloutaccept1: Output reliability achieved in the first acceptance test attempted by each participant. The P group tends to be better for Peff, the N groups elsewhere, but none of the differences are significant, with one exception: In the N group, the smaller half of the programs was more reliable than in the P group (Nsz: mean bootstrap p = 0.05, Wilcoxon test p = 0.06, iqr bootstrap p = 0.06). (Boxplots for all six group pairs; x-axis: "output reliability in first acc. test [percent]".)
Figure 3.24 shows the output reliability of the first delivered program, that is, the reliability in the first acceptance test, whether that test was successful or not. As we see, there is clearly no advantage for the P group. The slight advantage of several N subgroups is not significant, except for one: the N subgroup with the smallest programs (Nsz) is significantly more successful in the first acceptance test than the rest of the group, while no such effect is present in Psz, thus producing a significant difference between Nsz and Psz. The success rate gives a similar result: there were 7 out of 16 successful first acceptance tests in the N group (44%) and 8 out of 24 (33%) in the P group. There is an advantage for the N group, but the difference is not significant (Fisher exact p = 0.53). The cumulative success rate in subsequent acceptance tests is also quite similar for both groups, as shown in Figure 3.25.
The amount of work that has to be expended after the first attempt at an acceptance test also shows no clear group differences: both groups spent an average of 17% of their work time after the first test (Figure 3.27) and, except for two outliers in the N group, also a similar absolute amount of time (Figure 3.28). About a quarter of all subjects spent more than 5 hours completing the program they thought was already correct.
Summing up, there is no visible difference in release maturity between the two groups in this experiment.
Figure 3.25: Fraction of programs that have been accepted in the two groups, depending on the number of acceptance tests performed. The plot is cumulative: once a program was accepted, it is counted in all subsequent acceptance tests as well. [Plot: fraction of programs accepted (percent) versus number of acceptance tests (0 to 6), separate curves for P and N.]
Figure 3.26: Naccept: Number of acceptance tests performed by each participant. The differences are not significant. [Plot: number of acceptance tests, 1 to 6.]
Figure 3.27: (twork - taccept1)/twork: Fraction of total working time spent after the first acceptance test (1 missing). The differences are not significant. [Plot: worktime after first acc. test, 0 to 50 percent.]
Figure 3.28: twork - taccept1: Number of working hours spent after the first acceptance test (1 missing). [Plot: worktime after first acc. test, 0 to 25 hours.]
3.6 Documentation
It is obviously impossible to judge the quality of design or program documentation objectively, except by means of another controlled experiment for evaluating that quality. Such an experiment, however, would require multiple subjects for each of the programs from the present experiment, and we obviously cannot afford such an effort.
However, the deliverables in this experiment were program code only, and many programmers' source code is notorious for having hardly any comments. Hence, we may use the amount of commenting in the source code as a substitute for the quality of documentation, because the differences will mostly be of the none-versus-some kind rather than differences in content quality.
For objectively assessing the amount of commenting, we consider the lines (called “lines of code”, LOC)
making up the program source code and partition the set of lines into the following subsets:
1. Empty LOC are all lines containing only white space and also all lines containing only comment consisting of non-letters and non-digits (such as lines of asterisks etc.).
2. Pure statement LOC are lines containing meaningful syntactical elements of the programming language,
but no comment.
3. Commented LOC (LOCcommented ) are lines containing both meaningful syntactical elements of the programming language and comment.
4. Comment LOC (LOCcomment ) are lines containing only comment that are not empty lines. Note that this
may include statements that were commented out of the program.
Furthermore, we call the union of pure statement LOC and commented LOC statement LOC (LOCstmt ).
We now express the amount of documentation in both absolute numbers of lines and as a percentage relative to
the number of statement LOC. Note that the latter can in principle be larger than 100.
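To make this partition concrete, the following sketch classifies source lines into the four categories (our own illustration in Python, not the counting tool used for the experiment; it assumes Java-style // and /* ... */ comments and handles block comments only approximately):

    import re

    def classify_loc(lines):
        # Counts per category: empty, pure statement, commented, comment LOC.
        counts = {"empty": 0, "stmt": 0, "commented": 0, "comment": 0}
        in_block = False  # inside a /* ... */ block comment
        for line in lines:
            text = line.strip()
            if in_block:
                code, comment = "", text
                if "*/" in text:
                    in_block = False
                    code = text.split("*/", 1)[1].strip()
            else:
                code, comment = text, ""
                if "//" in text:
                    code, comment = text.split("//", 1)
                elif "/*" in text:
                    code, comment = text.split("/*", 1)
                    in_block = "*/" not in comment
                code = code.strip()
            # a comment containing no letters or digits (e.g. a line of
            # asterisks) counts as empty, as defined above
            has_comment = bool(re.search(r"[A-Za-z0-9]", comment))
            if code and has_comment:
                counts["commented"] += 1
            elif code:
                counts["stmt"] += 1
            elif has_comment:
                counts["comment"] += 1
            else:
                counts["empty"] += 1
        return counts

LOCstmt is then counts["stmt"] + counts["commented"], and the relative amount of comment lines plotted in Figure 3.29 is 100 * counts["comment"] / LOCstmt.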
The relative amount of comment lines (see Figure 3.29) shows no difference in location between the full groups, but two of the N subgroups are inferior: in the fast subgroup (Nt) as well as the more disciplined subgroup (Ndpl), the N subjects appear to sacrifice documentation for speed. No such effect is present in the corresponding P subgroups.
The relative amount of commented statement lines (see Figure 3.30) is larger in the full P group than in the N group and also in several subgroups (in particular again in the faster subgroup). However, the P group median of 2 percent[3] is still quite small: only very few (difficult?) statements were commented.
For the absolute numbers of comment lines (Figure 3.31) or commented lines (Figure 3.32), the results are
roughly the same as for the percentages.
Summing up, although the PSP participants do not provide more source code documentation overall, they appear to have a somewhat higher willingness to explain a few specific (presumably difficult) statements. In particular, the faster P subjects do not comment less than average, whereas commenting less is typical for the faster non-PSP participants.
[3] The raw data was rounded to integral percent values.

Figure 3.29: LOCcomment/LOCstmt: Fraction of comment lines in delivered programs (measured in percent of the number of statement LOC), by subgroup. The amount is smaller for N in the fast and the disciplined subgroups (Pt: mean bootstrap p = 0.04, Wilcoxon test p = 0.05; Pdpl: mean bootstrap p = 0.05, Wilcoxon test p = 0.08). [Plot: relative amount of comment LOC, 0 to 60 percent of stmtLOC.]

Figure 3.30: LOCcommented/LOCstmt: Fraction of commented statement LOC in delivered programs (measured in percent of the number of total statement LOC). Several P groups have significantly more commented lines, in particular the full group (Praw: mean bootstrap p = 0.05, Wilcoxon test p = 0.05) and again the faster subgroup (Pt: mean bootstrap p = 0.10, Wilcoxon test p = 0.06). [Plot: fraction of commented stmt LOC, 0 to 25 percent.]

Figure 3.31: LOCcomment: Absolute number of pure comment lines in delivered programs. There is significantly more comment in the P groups for the more disciplined subgroup (Pdpl: mean bootstrap p = 0.04, Wilcoxon test p = 0.07) and the faster subgroup (Pt: mean bootstrap p = 0.01, Wilcoxon test p = 0.03); the difference is not significant for the full group (Praw: mean bootstrap p = 0.33, Wilcoxon test p = 0.23). [Plot: number of comment LOC, 0 to 250.]

Figure 3.32: LOCcommented: Absolute number of commented statement LOC in delivered programs. [Plot: number of commented stmt LOC, 0 to 80.]

3.7 Trivial mistakes

We expect the P group to be more careful with respect to avoiding simple, easy-to-correct mistakes, because the PSP course shows that the resulting defects may sometimes be quite harmful. While the experiment provides
no direct insight into the frequency of simple programming errors in general, we have observations (at least for
many of the participants) of the number of compilations they executed and the number of these compilations
that went through without a compiler error message (“clean compiles”). A high fraction of clean compiles is a
good indicator of making few trivial mistakes and an overall small number of compilations can be interpreted
similarly (although this is less reliable).
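As an illustration, the indicators used in the following figures can be derived per participant as follows (a minimal sketch with hypothetical field names; the experiment's raw data files are not structured like this):

    def compile_indicators(n_comp, n_comp_ok, work_hours, loc_stmt):
        # n_comp: total compilations; n_comp_ok: compilations without a
        # compiler error message. Both are missing (None) for many subjects.
        if n_comp is None or n_comp_ok is None:
            return None
        return {
            "clean_fraction_pct": 100.0 * n_comp_ok / n_comp,  # Figure 3.33
            "comp_per_hour": n_comp / work_hours,              # Figure 3.35
            "comp_per_stmt": n_comp / loc_stmt,                # Figure 3.36
            "stmt_per_clean": loc_stmt / n_comp_ok
                              if n_comp_ok else float("inf"),  # Figure 3.37
        }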
Figure 3.33: NcompOK/Ncomp: The fraction of compilations without a compiler error message. This data is missing for many participants (11 per group). For the rest, the P group tends to be more successful (Praw: mean bootstrap p = 0.14, Wilcoxon test p = 0.12). [Plot: fraction of successful compilations, 0 to 100 percent.]
The fraction of compilations that were successful, i.e., that did not produce an error message from the compiler, is shown in Figure 3.33. The values are indeed somewhat higher for the P group, except for one outlier with only 25% success. The median success rate of 82% for the P group is quite good, in particular given the much smaller number of overall compilations shown in Figure 3.34.
Let us now normalize the number of compilations by the total work time or by the program size. The compilation frequency (number of compilations per hour) is clearly higher in the N group, also suggesting less care for
trivial mistakes (Figure 3.35).
Figure 3.34: Ncomp: Total number of compilations. This data is missing for many participants (10 per group). For the rest, the P group compiles less often (Praw: mean bootstrap p = 0.05, Wilcoxon test p = 0.06). [Plot: total number of compilations, 0 to 400.]

Figure 3.35: Ncomp/twork: Number of compilations per work hour. This data is missing for many participants (10 per group). For the rest, the P group compiles much less frequently (Praw: mean bootstrap p = 0.00, Wilcoxon test p = 0.01). [Plot: number of compilations per work hour, 0 to 30.]

Figure 3.36: Ncomp/LOCstmt: Number of compilations per resulting program statement. This data is missing for many participants (10 per group). For the rest, the P group compiles much less frequently (Praw: mean bootstrap p = 0.00, Wilcoxon test p = 0.04). [Plot: number of compilations per statement, 0.0 to 1.2.]

The average compilation density (number of compilations per program statement) as shown in Figure 3.36 is about twice as high in the N group compared to the P group. The small absolute values of this statistic indicate
that subjects in both groups either preferred a rather incremental development style or had lots of compilations
during the test phase.
Figure 3.37: LOCstmt/NcompOK: Number of resulting program statements per successful compilation. There is one outlier at 68 in the P group. This data is missing for many participants (11 per group). For the rest, the P group produces many more statements between any two clean compiles (Praw: mean bootstrap p = 0.00, Wilcoxon test p = 0.03). [Plot: number of statements per successful compilation, 0 to 25.]
The same proportion holds for the number of successful compilations per program statement (see Section 3.6). Looking at the inverse of this value, the number of final program statements written per successful compilation (Figure 3.37), we find that many of the N participants obviously did quite a lot of trial-and-error changes (presumably in the later stages of the development). The P group median and mean are roughly 50% and 300% larger, respectively, than those of the N group. One P subject achieved the remarkable value of 68 LOC
per clean compile and three quarters of the P subjects are as good or better than the best quarter of the N
subjects. The maximum value in the N group is only 5 lines! On the other hand, even in the P group half of the
participants achieve less than 4 lines.
Figure 3.38: Number of compilations per work hour (Ncomp/twork) by fraction of successful compilations (NcompOK/Ncomp). This data is missing for many participants. The plot includes a linear regression trend line with its 80% confidence band. [Plot: compilations per hour versus compilation success in percent, separate panels for Nraw and Praw.]
The plot of compilation frequency (number of compilations per hour) versus compilation success (fraction of clean compiles) in Figure 3.38 indicates that the compilation frequency of the N group is dramatically higher for subjects with low compilation success. The P subjects seem to choose their compilation frequency much more independently of their success rate. Note that the effect in the N group is much too large to be explained by the additional recompilations of unsuccessful program versions alone. As we will see now, the reason is probably a generally much less careful development style among these subjects.
Figure 3.39: Output reliability in the first acceptance test (relout_accept1) depending on the compilation success (NcompOK/Ncomp) in percent. This data is missing for many participants. The continuous red line is a normal least squares linear regression line, the dashed magenta line is a robust L1 least distances regression line, and the long-dashed green line is a resistant 10%-trimmed least squares regression line. [Plot: reliability versus compilation success in percent, separate panels for Nraw and Praw.]
Interestingly, there is a dependency of the reliability achieved in the first acceptance test on the compilation
success. Figure 3.39 shows a scatter plot of these two values, decorated with various trend lines. The plot shows
that for the N group all of these trend lines indicate decidedly increasing reliability for increased compilation
success. In contrast, the P group shows no consistent trend at all. Some of this effect is also visible in similar
plots based on the reliability of the final program (not shown here).
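The three trend lines of Figure 3.39 can be reproduced in principle as follows (a sketch using numpy/scipy, not the authors' original analysis code; the trimmed fit is approximated here by refitting after discarding the largest ordinary-least-squares residuals):

    import numpy as np
    from scipy.optimize import minimize

    def trend_lines(x, y, trim=0.10):
        x, y = np.asarray(x, float), np.asarray(y, float)
        X = np.column_stack([np.ones_like(x), x])  # intercept + slope
        # ordinary least squares
        ols = np.linalg.lstsq(X, y, rcond=None)[0]
        # robust L1 (least absolute distances) regression
        l1 = minimize(lambda b: np.abs(y - X @ b).sum(), ols,
                      method="Nelder-Mead").x
        # resistant trimmed least squares: drop the trim fraction of
        # points with the largest OLS residuals, then refit
        keep = np.abs(y - X @ ols).argsort()[: int(len(x) * (1 - trim))]
        trimmed = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        return {"ols": ols, "l1": l1, "trimmed": trimmed}

When all three lines agree on a trend, as they do for the N group, the trend is unlikely to be an artifact of a few extreme points.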
Summing up, as far as the data is available we find more care for the seemingly trivial details of program
development in the P group compared to the N group. Furthermore, we find some indications that lack of such
care, as exhibited by some members of the N group, may directly relate to problems later in development (here:
the first acceptance test).
3.8 Productivity
Although improving the productivity is not a major goal of the PSP course, the PSP education may increase
the productivity as a side effect of defect prevention and earlier defect detection. In this experiment, there are
basically two possible views of productivity we can assess. First, as all programs solve the same task, we can
simply define productivity as the inverse of the total working time required. Second, we may also use the more
common view of productivity as the number of LOC written per working hour; see Section 3.6.
Figure 3.40: 1/twork: Productivity measured as the inverse of total working time (number of programs per hour, one could say). The productivity is higher in the N group (Nraw: mean bootstrap p = 0.06, Wilcoxon test p = 0.10, and even more pronouncedly in the subgroups Nsz, Ndpl, and Nt), but is more stable (less variable) in the P group (Praw: iqr bootstrap p = 0.05). [Plot: productivity, 0.05 to 0.30 per hour.]

Figure 3.41: twork: Total number of working hours (which can be interpreted as cycle time). The N group has a lower median (Nraw: Wilcoxon test p = 0.10), but a slightly higher mean due to some very slow subjects (Praw: mean bootstrap p = 0.28). [Plot: total working time, 10 to 60 hours.]
We find that the average productivity as measured by inverse working time alone is actually lower in the P group (Figure 3.40). However, the variability is once again much smaller in the P group. This may mean that in real software development the productivity of a team of PSP participants might actually be similar to or even better than that of a team of non-PSP participants, because the unpredictability of productivity is a major risk factor for the cycle time in team software development. This is nicely illustrated by the cycle times of the individuals (Figure 3.41): if you pick one subject from each group, chances are the N subject is faster (lower median). However, if you pick a whole team, chances are the P group is faster (lower mean). This seemingly paradoxical effect occurs because all of the really bad cycle times, which could put a project at risk, occurred in the N group (despite it being the smaller group!).
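A tiny numerical illustration of this median-versus-mean argument (with invented working times that merely mimic the shape of Figure 3.41, not actual experiment data):

    from statistics import mean, median

    # hypothetical working times in hours; "N" has the lower median but
    # a heavy tail of very slow subjects, as in Figure 3.41
    n_times = [6, 7, 8, 9, 35, 55]
    p_times = [9, 10, 11, 12, 13, 14]

    print(median(n_times), median(p_times))  # 8.5 vs 11.5: one random N subject is usually faster
    print(mean(n_times), mean(p_times))      # 20.0 vs 11.5: a whole N team is slower on average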
If we take program length into account and measure productivity by the number of statements written per hour as shown in Figure 3.42, the difference almost disappears (Nraw: mean bootstrap p = 0.30, Wilcoxon test p = 0.42), but the larger variance of the N group remains. As a consequence of this higher variance, the faster N subgroup is significantly more productive than the faster P subgroup (Nt: mean bootstrap p = 0.09, Wilcoxon test p = 0.12). Conversely, P would be slightly faster if we looked at the slower half.
Figure 3.42: LOCstmt/twork: Productivity measured as the number of statement LOC written per hour. The differences of mean and median are not significant, but the variability is smaller in the P group (Praw: iqr bootstrap p = 0.04). [Plot: productivity, 10 to 60 statementLOC/hour.]

Figure 3.43: (LOCstmt + LOCcomment)/twork: Productivity measured as the number of total LOC (including comments) written per hour. The differences of mean and median are not significant (Nraw: mean bootstrap p = 0.42, Wilcoxon test p = 0.49), but the variability tends to be lower in the P group (Praw: iqr bootstrap p = 0.14). [Plot: productivity, 20 to 80 LOC/hour.]
If we count comments as LOC as well (in addition to the statements), the difference disappears entirely, but the
lower variability remains; see Figure 3.43.
Summing up, productivity tended to be somewhat lower (rather than higher) in the P group, but also more stable
(less variable), hence inducing much less development risk.
3.9 Quality judgement
The PSP course teaches estimation not only of development times, but also of an expected number of defects.
Therefore, it is plausible that the PSP participants make a more accurate judgement of the quality of their own delivered program.
To assess this effect, two questions in the postmortem questionnaire asked for the expected number of defects
remaining in the delivered program and the estimated frequency of program failures. We cannot determine the
actual number of defects, but will compare the reliability estimation to an actual value.
Figure 3.44: ndefects_estim: The subjects' postmortem estimation of the number of defects in the delivered program. The differences are not significant. [Plot: estimated number of remaining defects, 0 to 5.]
Figure 3.45: relin_estim: Subjects' postmortem estimation of the input reliability of the delivered program. The differences of mean and median are not significant (mean bootstrap p = 0.25, Wilcoxon test p = 0.22), but the variability is clearly smaller in the P group (iqr bootstrap p = 0.03). [Plot: estimated input reliability, 97 to 100 percent.]

Figure 3.46: relin_z3/relin_estim: Quotient of actual input reliability for the "tame" z3 test set and subjects' postmortem estimate of input reliability. Median and variability tend to be smaller in the P group (Praw: mean bootstrap p = 0.33, Wilcoxon test p = 0.14, iqr bootstrap p = 0.17). [Plot: quotient of actual and estimated reliability, 0.90 to 1.02.]

First of all, the P group on average was not more optimistic about their programs' remaining defect content than the N group (Figure 3.44). As for the reliability (Figure 3.45), the P group tended to be slightly more optimistic, and rightly so: a plot of the quotient of estimated and actual input reliability shows that more than half of
the P group’s estimates are within a factor of 0.0015 (1.5 per thousand) of the actual value, while a few more of
the N participants were somewhat too pessimistic about their programs’ reliability (Figure 3.46), resulting in a
somewhat lower variability of the estimation accuracy for the P group.
Figure 3.47: relin_m3/relin_estim: Quotient of input reliability for the "difficult" m3 test set and subjects' postmortem estimate of input reliability. Note that the huge difference in box size is misleading; it is due to the large empty gap between 0.92 and 0.99. Still, the P group estimates are significantly better (Praw: mean bootstrap p = 0.01, Wilcoxon test p = 0.03) and less variable (Praw: iqr bootstrap p = 0.07). [Plot: quotient of actual robustness and estimated reliability, 0.90 to 1.02.]
In contrast, most estimates were too optimistic when judged against the robustness test set m3, and much more so for the N group than for the P group (Figure 3.47), due to the lack of robustness of many N programs. Some participants in both groups estimated their program to contain one or even several defects, but still be at least 99.9% reliable.
Summing up, the quality judgement is indeed both more accurate and less variable in the P group.
3.10 Efficiency
Without any firm expectation of what we would find, we are also interested in comparing the efficiency of the programs created by our two groups. We assess efficiency in terms of run time as well as memory consumption. We can
use only the z2 and z3 test runs, as all others may be biased, either due to programs that crashed prematurely
or due to missing values for the most inefficient programs; see Sections 3.4.4 and 3.4.5.
Figure 3.48: mem_z3: Memory consumption on the z3 test set compared for the P group (top row) versus the N group (bottom row), separately for the Java programs (right diagram) and the programs written in other languages (left diagram). The differences are significant neither for the Java programs nor for the programs in other languages. [Plot: memory consumption for z3 test set, 20000 to 60000 Kbytes.]
Obviously for these measures it would be misleading to ignore the programming language of the programs,
because Java programs as such have (at least as of JDK 1.2) both much higher run time and higher memory
consumption than equivalent programs in any of the other languages. We could try and put all of the latter into
one bin and the Java programs into another and compare these subgroups separately as shown in Figure 3.48.
Unfortunately, there are only 3 non-Java programs for the N group (2 C++ programs and 1 C program) and we
cannot draw useful conclusions from such a small set. Therefore, in the following we will compare only the
Java programs from the N group to the Java programs from the P group.
As we see in Figure 3.48, there is a slight trend that P group memory consumption might be smaller, but the difference is insignificant. A very similar trend occurs for the z2 test. Note that the average program uses about three times as much memory as the most economical one.
Figure 3.49: Memory consumption mem_z3 in Megabytes (vertical) depending on total subject working hours twork (horizontal). The continuous red line is a standard least squares linear regression line, the dashed magenta line is a robust L1 least distances regression line, and the long-dashed green line is a resistant 20%-trimmed least squares regression line. [Plot: memory consumption by work hours, separate panels for Nraw and Praw.]
There is a trend that the P group Java programs consume less memory when the participant worked on them
longer. No such trend is visible for the N group; see Figure 3.49. Unless this effect is spurious, it may mean
that PSP participants tended to actively work towards lower memory consumption and invested time to achieve
it, whereas the memory requirements of the non-PSP participants “just happened”.
For the run times, there is no salient group difference whatsoever, neither for the z3 test runs (Figure 3.50) nor for the z2 test runs.
Figure 3.50: trun_z3: Run time (in minutes) for the Java programs on the z3 test set, shown for all subgroups (1 Praw/Nraw up to 6 Psz/Nsz). There is one outlier in the N Java group at 342 and one in the P Java group at 371. The overall group differences are not significant. [Plot: run time of z3 test set, 0 to 200 minutes.]
Summing up, no differences in program efficiency between the two groups could be found, neither for run time nor for memory consumption.
3.11 Simplicity
As still no good general measures of software complexity (or simplicity) are available, we compare the groups on two of the simplest and most useful alternative measures instead, namely program length (see the definition of LOC in Section 3.6) and the number of decisions in the program (McCabe cyclomatic complexity). Both of these measures mostly reflect size, which usually is the dominant aspect of program complexity.
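The decision count used below (see Figure 3.52) is a simple keyword count. A sketch of such a counter (our reconstruction; the exact keyword set of the original tool is only partially specified in the report):

    import re

    def count_decisions(source):
        # Approximates McCabe cyclomatic complexity by counting decision
        # keywords; counting break is correct in switch statements but
        # unnecessary elsewhere (cf. Figure 3.52). Keywords occurring in
        # strings or comments are counted too -- a deliberate simplification.
        keywords = {"if", "while", "for", "case", "catch", "break"}
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
        return sum(1 for t in tokens if t in keywords)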
As we see in Figure 3.51, the programs are noticeably longer in the P group. Figure 3.52 shows that the median
number of decisions in the program is larger in the P group as well, but the means are the same. The variance
once again tends to be larger in the N group, however.
It is not at all clear how to interpret these results. More statement lines per decision (as in the P group) may
result in less dense, easier-to-read code but could also be the result of fussy, inelegant coding style. Summing
up, the results are inconclusive with respect to simplicity.
Figure 3.51: LOCstmt: Total number of statement lines of code in the delivered programs. The N programs are significantly smaller (Nraw: mean bootstrap p = 0.09, Wilcoxon test p = 0.04). Similar differences occur in the subgroups as well, least pronounced for the efficient programs (Nraw: mean bootstrap p = 0.41, Wilcoxon test p = 0.20). [Plot: number of statement LOC, 100 to 500.]

Figure 3.52: ndecisions: Total number of program decisions in the source program. This value is an approximation to the McCabe cyclomatic complexity and was determined by a simple keyword count: it is the number of if, while, catch, break keywords etc. Note that counting break is correct in switch statements but unnecessary elsewhere. The N group tends to have a smaller median (Nraw: Wilcoxon test p = 0.13), but the same mean (Praw: mean bootstrap p = 0.49) and slightly higher variability (Praw: iqr bootstrap p = 0.27). [Plot: number of static program decisions, 20 to 100.]

3.12 Analysis of correlations

All of the above analyses were more or less hypothesis-driven: we searched for specific expected effects. However, we also performed an entirely data-driven analysis in order to possibly find further, surprising effects.
We computed all pairwise correlations and pairwise rank correlations of 34 different variables measured in the experiment and tabulated the results, ordered by the absolute size of the correlation. This process found 103 correlations with an absolute value greater than 0.4. We investigated all of them and sorted out the many obvious ones as well as the spurious ones. As an example of an obvious correlation, the high correlation between input and output reliabilities is uninteresting, the high correlation between programming experience in years and the size of the largest program ever written is quite expected, etc. On the other hand, we found, for example, a rank correlation of 0.59 between the number of compilations and the run time of the m3 reliability test. This is spurious and misleading, because the m3 run time ranks reflect two dissimilar subpopulations, namely the successful and the crashed runs. Much of what this correlation reflects is just noise.
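Such a scan is easy to reproduce (a sketch assuming the 34 variables are available as columns of a pandas data frame; the column names themselves are not shown here):

    import pandas as pd

    def correlation_scan(df, threshold=0.4):
        # All pairwise Pearson and Spearman (rank) correlations whose
        # absolute value exceeds the threshold, largest first.
        hits = []
        for method in ("pearson", "spearman"):
            corr = df.corr(method=method)
            cols = list(corr.columns)
            for i, a in enumerate(cols):
                for b in cols[i + 1:]:
                    r = corr.loc[a, b]
                    if abs(r) > threshold:
                        hits.append((round(abs(r), 2), method, a, b))
        return sorted(hits, reverse=True)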
However, after sorting out the garbage, we found a few pearls as well.
3.12.1 How time is spent
The length of the program written is a predictor of the total working time (and similarly of the time to the first acceptance test). See Figure 3.53 for a plot of the relationship. Put differently, the people who took longer to solve the task typically spent their time actually writing and debugging more statements.
Figure 3.53: Total working time twork depending on the length LOCstmt of the final program. The relationship is closer for the P group. [Plot: total working time (10 to 60 hours, vertical) versus number of statement LOC (100 to 500, horizontal), separate panels for Nraw and Praw.]

Since most of the subjects' time was spent removing defects from the programs, this is an impressive indicator that there is something like a constant defect density in programs; one should therefore strive for compact programs, at least for this programming task.
An alternative explanation would be to assume that the less capable programmers (who would take longer in
any case) are also the ones writing the longer programs, because they are not able to produce compact code. A
mix of both explanations is probably the most accurate model for understanding this data.
In any case, the relationship is closer in the P group (correlation r = 0.57) than in the N group (r = 0.51), which represents another instance of more stable, predictable performance due to PSP training.
3.12.2 Better documentation saves trivial mistakes
One rather interesting relationship was found in the answers in the postmortem questionnaire. The subjective evaluation of the quality of the description of procedures embedded in the program correlates with the fraction of compilations that were successful (r = −0.57). The better the subjects judged their own descriptions, the higher was their fraction of compilations without compiler error messages, which increases from about 49% for participants with “bad” descriptions to about 84% for “very good” ones. However, the trend is not statistically significant and may hence be spurious. The same effect, only weaker, is found for the design descriptions. The difference between the groups is unclear, because there are only five valid data points in the N group for this measure.
3.12.3 The urge to finish
We also found an interesting relationship between four correlations. There are obviously correlations between the estimated working time and both the actual working time and the time to the first acceptance test. We can build one-parameter linear models to predict the latter two variables from the estimated time: twork = ctot · testim and taccept1 = cacc · testim. We compute the parameters separately for the N and P groups. Comparing the optimal ctot parameters for these models, we find the following:
    twork = 1.50 · testim    (explains 82% of the variance in P)
    twork = 2.49 · testim    (explains 54% of the variance in N)
Figure 3.54: The fraction of successful compilations NcompOK/Ncomp depending on the self-evaluation of the quality of the procedure descriptions (documentation) in the code, on a scale from 1 (very good) to 5 (bad). [Plot: fraction of successful compilations (40 to 80 percent) versus self-evaluation of procedure documentation (1 to 5).]
These values just reflect the fact (already discussed in Section 3.3) that the estimations are better and their errors
are less variable in the P group. However, we also find the following for cacc :
    taccept1 = 1.20 · testim    (explains 86% of the variance in P)
    taccept1 = 1.65 · testim    (explains 52% of the variance in N)
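For reference, the one-parameter fit and its explained variance can be computed as follows (a sketch; we use the conventional definition of explained variance relative to the mean, which may differ in detail from the authors' computation):

    import numpy as np

    def fit_through_origin(t_estim, t_actual):
        # least squares estimate of c in t_actual = c * t_estim,
        # plus the fraction of variance explained (R^2)
        x = np.asarray(t_estim, float)
        y = np.asarray(t_actual, float)
        c = (x * y).sum() / (x * x).sum()
        resid = y - c * x
        r2 = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
        return c, r2

Applying this separately per group to (testim, twork) and to (testim, taccept1) yields the four parameters and variance fractions quoted above.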
The new insight here is that, for the P group, the quality of the model for taccept1 is higher than that for twork, whereas there is no such effect in the N group. A possible interpretation is as follows: the P group feels more urged to finish within the predicted time than the N group and has a stronger inclination to pick the time for the first acceptance test close to the predicted total work time. This is obviously not a good idea: calling for an acceptance test should be driven by product reliability, not by elapsed working time.
3.13 Subjects’ experiences
In this section we analyze what the participants reported in their postmortem questionnaire.
The first section of that questionnaire revolved around the quality of the delivered program, concerning its
documentation, complexity, extensibility, and reliability. There are hardly any differences between the groups
in the answers to most of these questions. An exception is the documentation of procedures and difficult
individual statements. The participants in the P group judged their own description of the procedures better than the N group participants judged theirs, as shown in Figure 3.55.
Figure 3.55: Self-evaluation of the quality of the documentation of individual difficult statements in the program, on a scale from 1 (“very good”) to 5 (“bad”). Both mean and median are better in the P group (mean bootstrap p = 0.06, Wilcoxon test p = 0.06). [Plot: documentation of difficult statements, 1 to 5.]
A similar, slightly weaker difference is found for the description of the procedures. Although the difference for
statement documentation is statistically significant, it may be accidental, because it is not backed up by similar
differences in any of the other questions about documentation quality (program usage, program purpose, design
ideas, data types, variables). The questions about unnecessary program complexity in data types, control flow,
or algorithm and about understandability and extensibility of the program also show no clear group differences.
Asked for the number of defects they believed were still in their program, a majority of participants from both
groups believed there were none. The rest of the distributions are also very similar. Ironically, a large fraction of
participants answered “no defects” here but then in the immediately following question admitted they expected
their program to fail sometimes. This happened with 31% of the answers in the N group and 37% in the P
group. See Section 3.9 for a discussion of the estimated reliability.
The second section of the postmortem questionnaire asked where the participants saw their greatest weaknesses and which means they considered suitable for improvement.
For the question “where could you most reduce the number of defects?”, the most frequent answer was “design”.
The profile of the answers looks quite similar in the two groups. Asked how defects could best be reduced, the
most frequent answer was “design review” in both groups. However, the second most frequent answer, “code
reviews”, was given much more often in group N.
As for increasing productivity, again the most frequent answer was “more design”, and it was given about as often in one group as in the other.
3.14 Mean/median/iqr overview table
The following multipage table summarizes some of the results again in numeric form. It contains data for the following variables:
1. amount of work time mis-estimation (see also Figure 3.3 on page 28)
2. reliability on the “tame” z inputs (see also e.g. Figure 3.11 on page 34)
3. reliability on the difficult m inputs (“robustness”, see also e.g. Figure 3.16 on page 36)
4. reliability in the first acceptance test (“release maturity”, see also Figure 3.24 on page 40)
5. comment density in the delivered program (see also Figure 3.29 on page 43)
6. productivity (inverse of total work time, see also Figure 3.40 on page 47)
7. length of the delivered program (see also Figure 3.51 on page 52)
For each of these variables, the data for all subgroups (as described in Section 3.2 on page 25) is presented. For
each subgroup, three different statistics are shown: mean, median, and interquartile range (iqr, as described in
Section 3.1.1 on page 20). For each statistic, the value of both the P group and the N group is shown, the ratio
of these values, and an 80-percent confidence interval for this ratio (computed by bootstrap resampling with 1000 iterations, which will typically result in 2-digit accuracy).
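The interval computation can be sketched as follows (a percentile bootstrap for the ratio of a statistic between the two independent groups; this is our reading of the procedure, not the authors' published code):

    import numpy as np

    def bootstrap_ratio_ci(p_data, n_data, stat=np.mean,
                           iters=1000, conf=0.80, seed=1):
        # 80% percentile-bootstrap confidence interval for stat(P)/stat(N),
        # resampling both groups independently with replacement
        rng = np.random.default_rng(seed)
        p = np.asarray(p_data, float)
        n = np.asarray(n_data, float)
        ratios = np.empty(iters)
        for i in range(iters):
            ps = rng.choice(p, size=len(p), replace=True)
            ns = rng.choice(n, size=len(n), replace=True)
            denom = stat(ns)
            ratios[i] = stat(ps) / denom if denom != 0 else np.inf
        lo, hi = np.percentile(ratios, [50 * (1 - conf), 50 * (1 + conf)])
        return lo, hi

With stat=np.mean this yields the mean rows; the iqr rows would use e.g. stat=lambda a: np.subtract(*np.percentile(a, [75, 25])). When the N group statistic can resample to zero, the ratio is unbounded, which would explain the open-ended intervals in the table.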
The rightmost column indicates which group was “better”, where smaller iqr, mis-estimation, and program length are considered better, as are larger reliability, comment density, and productivity. The name of the better group is replaced by a dash if the ratio is within the range 0.99 to 1.01, is printed in a small font if the confidence interval includes 1.00, and is printed in boldface if it does not. The latter is equivalent to statistical significance on a 0.1 level. Note that this test uses a ratio rather than a difference criterion and so will sometimes give results different from the tests presented in the main text and captions of the previous sections. Overall, however, the conclusions are the same.
groups      statistic  P        N        ratio P/N  80% confidence interval  better

VARIABLE |twork/testim - 1| (time estimation inaccuracy):
Praw/Nraw   mean       98.4     163      0.605      0.374 ... 1.08           P
            median     62.6     74.5     0.84       0.495 ... 1.76           P
            iqr        103      129      0.799      0.307 ... 1.69           P
Prel/Nrel   mean       89.3     145      0.615      0.36 ... 1.14            P
            median     61.7     76.1     0.811      0.407 ... 1.53           P
            iqr        58.5     119      0.492      0.16 ... 1.59            P
Pt/Nt       mean       60.4     39.8     1.52       0.978 ... 2.29           N
            median     43.9     33.5     1.31       0.621 ... 2.24           N
            iqr        46.4     47.4     0.98       0.57 ... 3.27            P
Pdpl/Ndpl   mean       55.9     101      0.552      0.298 ... 1.54           P
            median     46       47.2     0.974      0.575 ... 1.84           P
            iqr        46.4     49.8     0.933      0.456 ... 2.49           P
Peff/Neff   mean       75.3     225      0.335      0.195 ... 0.687          P
            median     61.7     112      0.553      0.215 ... 1.25           P
            iqr        72.9     199      0.366      0.0947 ... 0.916         P
Psz/Nsz     mean       76.7     74.5     1.03       0.611 ... 1.88           N
            median     43.9     63.2     0.694      0.44 ... 2.04            P
            iqr        109      55.4     1.96       0.38 ... 3.13            N

VARIABLE relout_z3 (reliability for "tame" inputs):
Praw/Nraw   mean       90.4     93.6     0.966      0.862 ... 1.08           N
            median     100      100      1          1 ... 1                  —
            iqr        1.15     0.476    2.41       0 ... ∞                  N
Prel/Nrel   mean       100      100      1          1 ... 1                  —
            median     100      100      1          1 ... 1                  —
            iqr        0        0        —          —                        —
Pt/Nt       mean       85       89.7     0.948      0.799 ... 1.15           N
            median     100      100      1          1 ... 1.01               P
            iqr        0.38     0.76     0.5        0 ... 176                P
Pdpl/Ndpl   mean       98.2     89.7     1.09       0.973 ... 1.25           P
            median     100      100      1          1 ... 1.01               P
            iqr        0.38     0.76     0.5        0 ... ∞                  P
Peff/Neff   mean       84.9     99.8     0.851      0.732 ... 0.969          N
            median     100      100      1          0.996 ... 1              —
            iqr        1.33     0.286    4.66       0 ... ∞                  N
Psz/Nsz     mean       85       99.8     0.851      0.733 ... 0.97           N
            median     100      100      1          1 ... 1                  —
            iqr        0.953    0        ∞          0.655 ... ∞              N

VARIABLE relout_m3 (reliability for difficult inputs):
Praw/Nraw   mean       83.2     47.8     1.74       1.3 ... 2.49             P
            median     98.4     10.2     9.69       1.04 ... 9.76            P
            iqr        4.78     88.3     0.0541     0.018 ... 1.02           P
Prel/Nrel   mean       87.4     45.6     1.92       1.38 ... 3.14            P
            median     98.4     10.2     9.69       1.02 ... 9.84            P
            iqr        1.58     88.1     0.0179     0.012 ... 1              P
Pt/Nt       mean       79.8     52.6     1.52       1.07 ... 2.33            P
            median     98.4     50.5     1.95       1 ... 9.84               P
            iqr        9.2      88.3     0.104      0.0179 ... 1.36          P
Pdpl/Ndpl   mean       92.9     52.6     1.76       1.29 ... 2.7             P
            median     100      50.5     1.98       1.02 ... 9.84            P
            iqr        1.58     88.3     0.0179     0.0166 ... 0.231         P
Peff/Neff   mean       85.7     62.5     1.37       1.03 ... 1.94            P
            median     98.4     94.4     1.04       1 ... 9.69               P
            iqr        4.73     88.3     0.0536     0.0179 ... 1.23          P
Psz/Nsz     mean       85.2     53.7     1.59       1.17 ... 2.41            P
            median     98.4     50.5     1.95       1 ... 9.69               P
            iqr        5.02     88.3     0.0569     0.0157 ... 0.98          P

VARIABLE relout_accept1 (release maturity):
Praw/Nraw   mean       51.7     59.4     0.87       0.628 ... 1.23           N
            median     58.8     78.2     0.752      0.109 ... 1.92           N
            iqr        100      95.9     1.04       0.908 ... 2.22           N
Prel/Nrel   mean       56.6     70.2     0.806      0.541 ... 1.15           N
            median     93.4     90.7     1.03       0.0337 ... 1.33          P
            iqr        99.1     47       2.11       0.92 ... 5.35            N
Pt/Nt       mean       69.3     82.8     0.837      0.638 ... 1.1            N
            median     93.4     100      0.934      0.841 ... 1.03           N
            iqr        65.3     14       4.66       0.365 ... ∞              N
Pdpl/Ndpl   mean       73.8     80.3     0.919      0.717 ... 1.16           N
            median     100      100      1          0.922 ... 1.23           —
            iqr        39       23.4     1.67       0.186 ... ∞              N
Peff/Neff   mean       66       51.8     1.27       0.88 ... 2.05            P
            median     92.2     58.4     1.58       0.841 ... 65.2           P
            iqr        73.3     99.6     0.736      0.246 ... 1.69           P
Psz/Nsz     mean       50       76.9     0.651      0.457 ... 0.91           N
            median     67.6     100      0.676      0.0337 ... 1.03          N
            iqr        96.7     45.4     2.13       1.06 ... ∞               N

VARIABLE LOCcomment/LOCstmt (comment density):
Praw/Nraw   mean       19.2     19.6     0.984      0.744 ... 1.33           N
            median     15.5     18       0.861      0.64 ... 1.44            N
            iqr        14.2     20.8     0.687      0.351 ... 1.16           P
Prel/Nrel   mean       21.2     22.6     0.938      0.674 ... 1.45           N
            median     16       21       0.762      0.536 ... 1.68           N
            iqr        14       23.8     0.589      0.256 ... 1.23           P
Pt/Nt       mean       17.7     11.3     1.56       1.11 ... 2.45            P
            median     16       9.5      1.68       0.941 ... 4              P
            iqr        14       16.2     0.862      0.412 ... 1.94           P
Pdpl/Ndpl   mean       19       12.6     1.51       1.07 ... 2.29            P
            median     17       12.5     1.36       0.75 ... 4.25            P
            iqr        17       18       0.944      0.515 ... 1.85           P
Peff/Neff   mean       22.9     21.3     1.07       0.754 ... 1.64           P
            median     22       18       1.22       0.65 ... 2.14            P
            iqr        18       21.2     0.847      0.405 ... 1.68           P
Psz/Nsz     mean       15.4     17.1     0.901      0.63 ... 1.44            N
            median     15       12.5     1.2        0.591 ... 2.17           P
            iqr        8.5      19.5     0.436      0.228 ... 0.955          P

VARIABLE 1/twork (productivity):
Praw/Nraw   mean       0.0984   0.14     0.704      0.546 ... 0.929          N
            median     0.0867   0.134    0.646      0.48 ... 1.01            N
            iqr        0.0683   0.139    0.493      0.272 ... 0.865          P
Prel/Nrel   mean       0.0991   0.142    0.698      0.518 ... 0.966          N
            median     0.0893   0.148    0.602      0.408 ... 1.38           N
            iqr        0.0547   0.148    0.369      0.206 ... 0.743          P
Pt/Nt       mean       0.127    0.198    0.641      0.524 ... 0.787          N
            median     0.103    0.181    0.571      0.463 ... 0.849          N
            iqr        0.0465   0.113    0.412      0.252 ... 1.37           P
Pdpl/Ndpl   mean       0.119    0.179    0.665      0.511 ... 0.873          N
            median     0.102    0.151    0.675      0.441 ... 0.849          N
            iqr        0.0503   0.114    0.443      0.263 ... 1.92           P
Peff/Neff   mean       0.111    0.134    0.834      0.575 ... 1.27           N
            median     0.0917   0.0959   0.956      0.459 ... 2.22           N
            iqr        0.059    0.174    0.34       0.173 ... 0.949          P
Psz/Nsz     mean       0.114    0.182    0.625      0.471 ... 0.831          N
            median     0.099    0.181    0.548      0.403 ... 0.823          N
            iqr        0.0665   0.129    0.514      0.279 ... 1.12           P

VARIABLE LOCstmt (program length):
Praw/Nraw   mean       290      246      1.18       1 ... 1.4                N
            median     290      215      1.35       1.04 ... 1.69            N
            iqr        119      118      1.01       0.495 ... 1.94           —
Prel/Nrel   mean       288      235      1.23       0.985 ... 1.56           N
            median     302      199      1.52       1.12 ... 1.86            N
            iqr        106      74.5     1.43       0.359 ... 2.63           N
Pt/Nt       mean       255      204      1.25       1.06 ... 1.46            N
            median     237      176      1.35       1.1 ... 1.48             N
            iqr        106      52       2.04       0.537 ... 4.33           N
Pdpl/Ndpl   mean       276      236      1.17       0.937 ... 1.45           N
            median     244      189      1.29       1.04 ... 1.82            N
            iqr        133      76.8     1.73       0.465 ... 3.19           N
Peff/Neff   mean       277      267      1.04       0.849 ... 1.29           N
            median     243      222      1.09       0.788 ... 1.46           N
            iqr        106      166      0.638      0.298 ... 2.14           P
Psz/Nsz     mean       235      175      1.34       1.21 ... 1.49            N
            median     228      166      1.37       1.15 ... 1.49            N
            iqr        55.5     52       1.07       0.601 ... 3.59           N

(Confidence bounds of ∞ occur in iqr rows where the iqr of the resampled N group can be zero, so that the bootstrapped ratio is unbounded.)
Chapter 4
Conclusion
It is a capital mistake to theorize before one has data.
Sherlock Holmes
4.1 Summary of results
Put briefly, our experiment comparing PSP-trained and non-PSP-trained subjects has found the following significant differences between these groups:
1. The estimations of total required working time are much better in the PSP group; the same holds for
related measures such as estimated productivity measured in lines of code per hour. The PSP group
is also better when estimating the average time for finding and fixing a defect and when judging the
reliability of the program they had produced.
2. The productivity tends to be lower in the PSP group. However, it is also more constant (less variable)
than in the non-PSP group.
3. The program reliability for unexpected (but legal) inputs is better in the PSP group, in particular when we compare the more disciplined halves of each group. For more standard inputs, the differences are not significant, but there is a slight tendency for the non-PSP group to be better.
4. The PSP group exhibits more care towards avoiding simple mistakes (such as compiler errors) and we
find for the non-PSP group that the frequency of such simple mistakes correlates with decreasing program
reliability in the first acceptance test, i.e., with a tendency to deliver unreliable first program versions.
5. For many of the observed variables we find that the variability in the PSP group is smaller than in the
non-PSP group. Hence a team of PSP programmers would be easier to manage because its performance
is more predictable.
4.2 Possible reasons
We should mention that we believe the PSP methodology is useful and the PSP course is a valid method of teaching its ideas as well as an initial set of basic PSP techniques. We believed so before this experiment and we still do. From this perspective, the results of the experiment are disappointing: in some areas we found no
advantage for the PSP group and in others the difference appears to be smaller than expected. Why did we not
find further and more pronounced differences? We will discuss several possible explanations.
PSP is useless: The first and foremost possibility is obviously that the PSP might not be very useful. This would leave the results as they stand and not call for further interpretation. When taking this perspective one
would conclude that our results generalize to most programming situations and that learning and using the PSP
methodology is a waste of time. However, this point of view makes it almost impossible to explain the very
positive results from within PSP courses (our own, those of Humphrey and the SEI, or those of many other
successful instructors). We thus reject this explanation.
PSP is useless for this task: The PSP methodology might be of little help for the specific task to be solved in
this experiment and could still be highly useful in other situations. This explanation is sort of an excuse. What
could be so special about our task? Admittedly, the experimental task was quite different from the tasks in
the PSP course. However, a similar statement is true for the control group as well. Most importantly, the PSP
claims to be general, so even severe peculiarities in the task should not invalidate the PSP advantages. Hence,
we reject this explanation as well.
Bad teaching: When Watts Humphrey first heard about these results and that we did not find dramatic differences, he immediately said “I suggest that you look at your teaching methods.” However, we have no reason to
believe that our PSP teaching is unsuccessful. First, the subjective experience of the students is so positive that
both of the PSP courses ranked first among all course evaluations in our department for the respective semester.
Second, the objective results of our students during the course are quite similar to those published by Humphrey
for his own courses [4]. For instance, the average test defect density decreases from about 20 to 40 test defects
found per KLOC in the first three programming exercises to under 10 in the tenth exercise. We believe that our
teaching is definitely OK.
Incorrect PSP use: The second sentence in Humphrey's above-mentioned statement was “The students may, in fact, not be properly using the PSP.” Such improper use could happen in two ways: either they could not use the PSP at all (at least in parts), or they could use it in an incorrect way. We consider the second explanation unlikely for our students.
Lack of PSP use: This leaves the one explanation which we think is correct. Most of our PSP participants used
little or none of the PSP techniques in their actual experiment work and probably also had no well-developed
PSP mindset. This impression is corroborated by statements that some of our participants made after the
experiment, for instance
“I should really have performed a code review, but I didn’t.”
“I started to code too early although I had learned in the PSP course that this is a bad idea.”
“I have not used any of the PSP techniques and that made me suffer.”
We cannot quantify the adherence to PSP techniques well. We did not ask for such information in the postmortem questionnaire, because we felt that the resulting information would be unreliable. However, the small
fraction of PSP participants who created time/defect logs (10 out of 29, 34%) suggests that the adherence was
generally quite low.
In this context, we made an interesting observation: out of the 5 PSP participants who gave up on the experiment, as many as 4 (or 80%) had created a time/defect log! The fraction is so much higher than average that the difference is probably not accidental (Fisher exact p = 0.13). This suggests that the least capable programmers have a much higher tendency to cling to PSP techniques, which is corroborated by the average total working times of the successful PSP participants: those with no time/defect logs had an average of 12 hours, those with logs had an average of 17 hours; the difference is significant (mean bootstrap p = 0.02). Presumably, adopting PSP techniques helps more, the less capable a programmer is. If this is the case, it explains why the variability is consistently lower in the PSP group for almost all variables compared to the non-PSP group.
4.3 Consequences
We conclude that the PSP is indeed a helpful methodology and the cost of learning it is probably worthwhile.
However, one must not overestimate the influence of the PSP on the average performance of the software
engineers. Our results show that the improvements observed over time within each single PSP course must not be generalized, because they are too optimistic as a prediction of real-life software practice. On the other hand, we
found a very important benefit from PSP use that has been completely neglected so far: The variability across
individual software engineers is reduced by PSP training. Small variability is highly desirable in practical
software organizations, because projects profit most from improvements of their worst parts.
We recommend adopting the PSP, and we consider it to be as much a risk reduction technique as a productivity or quality improvement technique.
Acknowledgements
Thanks to Oliver Gramberg for his contributions to the PSP course, to Holger Hopp, Thomas Schoch, Christian
Nester, and Georg Grütter for guinea-pigging the experimental setup, and to Watts Humphrey for inventing the
PSP in the first place. Thanks most of all to our participants (most of whom, although not volunteers, liked the experiment and gave their best).
Appendix A
Experiment materials
I found in taking the PSP course that the intuition and experience
of someone else I respected a lot were sometimes wrong;
the “someone else” was me.
I now believe that data on actual practice
is a far better guide for what processes work than anyone’s intuition.
Jim Alstad
This appendix contains the original materials given to the subjects (except for somewhat different page breaks).
The materials were handed out in parts:
1. a questionnaire about personal background information
2. the description of the task and the work time logging sheet
3. a questionnaire which asked for an estimation of expected working time for the task and some other
attributes,
4. a postmortem questionnaire (handed out after delivery of the program) for grading quality attributes of
one’s own program and suggesting improvement areas for one’s own process.
There are two versions of the materials in this report: The original German version as given to the subjects
(Sections A.6 to A.10) and an English translation (Sections A.1 to A.5). Note that in the translation of the task
description the relative weight of the parts of a sentence or paragraph could not always be perfectly maintained.
The page breaks differ from the original materials in both the German and the English version.
Dear participant,
thank you very much for agreeing to participate in our experiment. The purpose of this experiment is to compare different methods for producing defect-free programs. Thus, if you have had training in PSP or Cleanroom, remember what these methods suggest for the production of defect-free software.
It is not the purpose of the experiment to evaluate the performance of the individual participants. Therefore,
your name will not be recorded on the experiment materials; we record only a subject number and a group
designation.
A.1 Experiment procedure
The experiment will be conducted as follows: each participant receives a programming task that is to be solved and handed in as error-free as possible. Of course it is nice if you also complete the task quickly, but that is of less importance.
The experiment sequence is as follows:
1. fill in the general questionnaire,
2. read the task description,
3. fill in the task questionnaire,
4. perform the task and pass the acceptance test,
5. fill in the postmortem questionnaire.
Prepare yourself for at least half a day, more probably a whole day, for the experiment!
If you solve the task correctly you will receive a copy-card valued at DM 50. We will test your solution by a well-defined and objective acceptance test, in which you have to achieve a correctness of at least 95 percent. This
acceptance test simulates the acceptance testing of a real customer — if your program fails in the acceptance
test, a contractual penalty must be paid. This penalty is DM 10 in our case. We will execute the acceptance test
whenever you ask for it; if you fail, the value on your copy-card will drop to DM 40, then DM 30 etc.
Make sure your program is correct. The programming task is not very difficult; you can surely manage to write
a correct program if you try hard enough. Passing the acceptance test on the first try is well within your capabilities,
so accept the challenge and take home the full DM 50.
Please read the instructions for the questionnaires and the task carefully. When questions arise, always
ask them immediately.
At some points during the experiment you need to record the time of day. Please be exact and do not forget
it. If you find you forgot anyway, estimate the time retrospectively and mark your entry as an estimate by
prepending “ca.”.
A.2 Questionnaire – personal information

Participant: _____ (insert number)        Group: _____ (will be added later)
In order to maintain anonymity in the experiment data, you will be identified by a number only. Please write
this number in the top left corner of each questionnaire and also in your program (e.g. in a header).
1. Your semester number? _____ semester
2. How many years of programming experience do you have? _____ year(s)
3. How much programming experience do you have in KLOC? _____ KLOC
4. In how many different programming languages have you written more than one program? _____ languages
5. How long (in work hours) have you programmed overall outside of your university education? _____ hours
6. How long (in KLOC) is the biggest program (or your part of a program) you have written so far? _____ KLOC
7. In which programming language will you solve the task? _____ programming language
8. How much programming experience do you have in the language chosen in question 7? _____ KLOC
9. Have you programmed a lot recently?
   [ ] Yes, I took a lab course with programming exercises.
   [ ] Yes, for other reasons.
   [ ] Well, it wasn’t so much.
   [ ] No.
10. Do you know how to use Emacs?   [ ] yes   [ ] a bit   [ ] no
11. How do you define “Lines of Code” (LOC)?
   [ ] Each line counts as a LOC, even if it is empty or contains comment only.
   [ ] All code lines and all comment lines count as LOC, empty lines do not.
   [ ] All code lines count as LOC, empty lines and comment lines do not.
   [ ] My own definition: _____
Now carefully read the task description. It is important that you understand it thoroughly. Then fill in the
field for the time of day at the bottom of page 7.
A.3 Task description
Attention: Please follow these instructions with utmost accuracy, or else both you and we will end up frustrated…
First read through the whole of it in order to get an overview; concentrate on the details only upon a second
reading.
The following mapping from letters to digits is given:
E    JNQ  RWX  DSY  FT   AM   CIV  BKU  LOP  GHZ
e    jnq  rwx  dsy  ft   am   civ  bku  lop  ghz
0    1    2    3    4    5    6    7    8    9
We want to use this mapping for encoding telephone numbers by words, so that it becomes easier to remember
the numbers.
Functional requirements
Your task is to write a program that finds, for a given phone number, all possible encodings by words, and prints
them. A phone number is an arbitrary(!) string of dashes - , slashes / and digits. The dashes and slashes
will not be encoded. The words are taken from a dictionary which is given as an alphabetically sorted ASCII
file (one word per line).
Exactly those encodings that are possible from this dictionary and that match the phone number exactly
shall be printed. Thus, possibly nothing is printed at all. The words in the dictionary contain letters (upper or
lower case, but the difference is ignored in the sorting), dashes - and double quotes " . For the encoding only the
letters are used, but the words must be printed in exactly the form given in the dictionary. Leading non-letters
do not occur in the dictionary.
Encodings of phone numbers can consist of a single word or of multiple words separated by spaces. The
encodings are built word by word from left to right. If and only if at a particular point no word at all from the
dictionary can be inserted, a single digit from the phone number can be copied to the encoding instead. Two
subsequent digits are never allowed, though. To put it differently: In a partial encoding that currently covers k
digits, digit k + 1 is encoded by itself if and only if, first, digit k was not encoded by a digit and, second, there
is no word in the dictionary that can be used in the encoding starting at digit k + 1.
Your program must work on a series of phone numbers; for each encoding that it finds, it must print the phone
number followed by a colon, a single(!) space, and the encoding on one line; trailing spaces are not allowed.
All remaining ambiguities in this specification are resolved by the following example. (Ambiguities that still
remain even then are intended degrees of freedom.)
Dictionary (in file test.w):
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
Phone number list (in file test.t):
112
5624-82
4824
0721/608-4067
10/783--5
1078-913-5
381482
04824
Program start command (must have this form!):
telefonkode test.w test.t
Corresponding correct program output (on screen):
5624-82: mir Tor
5624-82: Mix Tor
4824: Torf
4824: fort
4824: Tor 4
10/783--5: neu o"d 5
10/783--5: je bo"s 5
10/783--5: je Bo" da
381482: so 1 Tor
04824: 0 Torf
04824: 0 fort
04824: 0 Tor 4
Any other output would be wrong (except for different ordering of the lines).
Wrong outputs for the above example would be e.g.
562482: Mix Tor, because the formatting of the phone number is incorrect,
10/783--5: je bos 5, because the formatting of the second word is incorrect,
4824: 4 Ort, because in place of the first digit the words Torf, fort, Tor could be used,
1078-913-5: je Bo" 9 1 da , since there are two subsequent digits in the encoding,
04824: 0 Tor , because the encoding does not cover the whole phone number, and
5624-82: mir Torf , because the encoding is longer than the phone number.
The above data are available to you in the files test.w (dictionary), test.t (telephone numbers) and
test.out (program output).
Quantitative requirements
Length of the individual words in the dictionary: 50 characters maximum.
Number of words in the dictionary: 75000 maximum.
Length of the phone numbers: 50 characters maximum.
Number of entries in the phone number file: unlimited.
Quality requirements
Work as carefully as you would as a professional software engineer and deliver a correspondingly high-grade
program. Specifically, thoroughly comment your source code (design ideas etc.).
The focus during program construction shall be on correctness. Generate exactly the right output format
right from the start. Do not generate additional output. We will automatically test your program with hundreds
of thousands of phone numbers and it should not make a single mistake, if possible — in particular it must not
crash. Take yourself as much time as is required to ensure correctness.
Your program must be run-time efficient insofar as it analyzes only a very small fraction of all dictionary
entries in each word-appending step. It should also be memory efficient in that it does not use 75000
times 50 bytes for storing the dictionary if the dictionary contains many much shorter words. The dictionary
must be read into main memory entirely, but you must not do the same with the phone number file, as that may
be arbitrarily large.
Your program need not be robust against incorrect formats of the dictionary file or the phone number file.
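As an illustration only (this sketch is ours, not part of the original handout, and all identifiers in it are invented), one way to meet both the correctness and the efficiency requirements is to index the dictionary by the digit encoding of each word, so that each word-appending step becomes a hash lookup instead of a scan over all entries. The sketch also implements the digit-copy rule stated under "Functional requirements":

import java.util.*;

public class EncodingSketch {
    // Letter -> digit, per the mapping given at the beginning of the task.
    private static final Map<Character, Character> DIGIT = new HashMap<>();
    static {
        String[] groups = {"e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"};
        for (int d = 0; d < groups.length; d++)
            for (char c : groups[d].toCharArray())
                DIGIT.put(c, (char) ('0' + d));
    }

    // Words grouped by their digit encoding; dashes and quotes contribute nothing.
    private final Map<String, List<String>> byEncoding = new HashMap<>();

    EncodingSketch(List<String> dictionary) {
        for (String word : dictionary) {
            StringBuilder enc = new StringBuilder();
            for (char c : word.toCharArray()) {
                Character d = DIGIT.get(Character.toLowerCase(c));
                if (d != null) enc.append(d);
            }
            byEncoding.computeIfAbsent(enc.toString(), k -> new ArrayList<>()).add(word);
        }
    }

    // Backtracking over the digit string (the phone number without '-' and '/').
    // A digit may stand for itself only if no word fits at this position and the
    // previous position was not itself encoded as a digit.
    void encode(String phone, String digits, int pos, List<String> partial, boolean lastWasDigit) {
        if (pos == digits.length()) {
            System.out.println(phone + ": " + String.join(" ", partial));
            return;
        }
        boolean wordFits = false;
        for (int end = pos + 1; end <= digits.length(); end++) {
            for (String w : byEncoding.getOrDefault(digits.substring(pos, end),
                                                    Collections.<String>emptyList())) {
                wordFits = true;
                partial.add(w);
                encode(phone, digits, end, partial, false);
                partial.remove(partial.size() - 1);
            }
        }
        if (!wordFits && !lastWasDigit) {
            partial.add(String.valueOf(digits.charAt(pos)));
            encode(phone, digits, pos + 1, partial, true);
            partial.remove(partial.size() - 1);
        }
    }

    public static void main(String[] args) {
        // The example dictionary is inlined here for brevity; a real solution would,
        // as required, read the dictionary file into memory and stream the phone
        // number file line by line.
        List<String> dict = Arrays.asList("an", "blau", "Bo\"", "Boot", "bo\"s", "da",
            "Fee", "fern", "Fest", "fort", "je", "jemand", "mir", "Mix", "Mixer",
            "Name", "neu", "o\"d", "Ort", "so", "Tor", "Torf", "Wasser");
        EncodingSketch enc = new EncodingSketch(dict);
        String[] phones = {"112", "5624-82", "4824", "0721/608-4067",
                           "10/783--5", "1078-913-5", "381482", "04824"};
        for (String phone : phones)
            enc.encode(phone, phone.replace("-", "").replace("/", ""), 0,
                       new ArrayList<>(), false);
    }
}

Run on the example dictionary and phone list above, this sketch should reproduce the expected output (possibly in a different line order). It is, of course, only one possible solution, not the experiment's reference ("gold") program.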
Other constraints
Please use only a single source program file, not several source modules, and give that file the name
telefonkode.c (or whatever suffix is used in your programming language); otherwise the evaluation will become
too difficult for us. If you need to quickly try something out, you may of course use a separate file for that
purpose.
A large dictionary, which you can use for testing purposes, is available as the file woerter2.
For compiling your program please use the command make telefonkode; a suitable Makefile is present in
your account. Java users can use this command only in the window entitled “for make and run”.
What’s next?
You have now read the task description. If you have any questions about it, ask them right now. The same is
true for all other questions: ask them immediately. When you want to perform an acceptance test, just tell us.
If the task is clear now, fill in the time of day and read the following pages.
Time of day:
From now on you should log any interruptions of your work. Note start and end time of each interruption. The
reason is not mandatory, but may be useful for reconstructing missing times. Interruptions are, for instance,
questions or going to the restroom.
Start of interruption        End of interruption        Duration of interruption        Reason
A.4 Questionnaire – Estimation

Participant: _____ (insert number)        Group: _____ (will be added later)
Imagine that your company is asked to implement the “phone number memorization utility”. Your boss has to
prepare a cost estimate and asks you how much work that will be (since you are also the programmer planned to
implement the program). He asks you to estimate the effort in terms of LOC as well as time units. Furthermore,
you shall estimate the number of defects as a characterization of complexity and difficulty. In addition, provide
a range in which you expect the actual values to lie with a probability of 90 percent.
1. Estimate how long you will need for solving the task (starting with the time stamp you just wrote and
   ending with the complete, working, well-tested program).
   time in minutes: _____        time range [min–max]: _____
2. Estimate your productivity measured in LOC/hour.
   LOC/hour: _____        productivity range [min–max]: _____
3. How many defects will you find during compilation and how many defects will you find during test?
   a) during compilation: _____ (number of defects)
   b) during test: _____ (number of defects)
How long will designing, coding, compiling, and testing your program take? These intervals are defined
as follows: Coding time is the time required for transforming a design into code; compiling time is the time
from the first start of the compiler until the first successful compilation; and testing time is the time required
for thoroughly testing the compiled program. Often program development is not linear: if, for example, you
code only partially, then compile, then code further, then compile, then test, etc., you have to add up the times
for the individual coding, compilation, and testing phases.
4. How long will you take for design and coding? _____ minutes
5. How long will you take for compiling your program? _____ minutes
6. How long will you take for testing your program? _____ minutes
Time of day:
Now that you have answered the questions, please throw this questionnaire in the box. Then start programming.
Now please start programming!
A.5 Questionnaire – Self-evaluation

Participant: _____ (insert number)        Group: _____ (will be added later)
Now your implementation is completely finished. Please write down the time of day here.
Time of day:
Evaluate your finished program according to the following criteria:
Scale: (1) very good   (2) good   (3) OK   (4) weak   (5) bad
The documentation (should be informative and precise but not verbose)
Program usage: How well can a new user handle the program; will s/he know how
to call it?
Program purpose: Can somebody who does not know the program find out from
the documentation what the program does?
Design ideas: How well can design decisions be understood?
Purpose of procedures: How well documented is the purpose of individual procedures?
Purpose of data types: How well documented is the purpose of the program’s data
types?
Purpose of variables: How easily can the purpose of variables be recognized?
Code in procedures: How well documented are difficult code sequences in procedures?
The complexity (should be as low as possible)
Data structures: How complex are the data structures that are used?
Control flow: How easily can control flow be followed?
Algorithms: How understandable/complex are the algorithms that are used?
Evaluate the overall quality
Understandability: How readable is the whole program, e.g. with respect to changes
to be made by other programmers?
Extensibility: How well-suited is the program for later changes and extensions?
Correctness: How well does the program satisfy its requirements?
How many defects do you expect have remained in the program despite the tests? _____
How reliable do you expect your program to be? 1 failure for every _____ test cases.
Where do you expect the largest potential for improvement if you want to improve the speed or quality of your
program development:
[ ] defect reduction
    [ ] reducing design defects
    [ ] reducing implementation defects
    [ ] reducing defects during translation of design into code
    How could you reduce your defects?
    [ ] perform design reviews
    [ ] perform code reviews
    [ ] use formal methods
    [ ] use a different programming language (which, or what needs to be changed in the current one? _____)
    [ ] use tools (which? _____)
[ ] achieve higher productivity by
    [ ] better programming languages (which, or what needs to be changed in the current one? _____)
    [ ] more design (more precise designs, more detailed designs)
    [ ] leaving out unimportant process steps, e.g. defect protocols or overly precise designs
    [ ] more tool use, e.g. _____
    [ ] more programming practice
    [ ] a better programming education
That’s all. Thanks for participating!
One more request: Please do not talk about the experiment to other persons,
unless they have already participated themselves. In particular, do not describe
the task. If somebody knows the task, s/he will unavoidably start thinking about
possible solutions, which might massively distort the results if that person later
participates in the experiment. In that case, several months of our preparation
work would be wasted.
Liebe Versuchsteilnehmerin, lieber Versuchsteilnehmer,
vielen Dank, daß Sie sich bereiterklärt haben, an unserem Experiment teilzunehmen. Dieses Experiment soll
dazu dienen, verschiedene Methoden zur Herstellung fehlerfreier Programme zu vergleichen. Wenn Sie also
einmal PSP oder Cleanroom gelernt haben, erinnern Sie sich daran, was diese Methoden für die Herstellung
fehlerfreier Software vorschlagen.
Das Experiment dient nicht dazu, die einzelnen Versuchspersonen in ihren Leistungen zu bewerten. Deshalb
werden auf den Unterlagen keine Namen festgehalten, sondern nur Nummern und eine Gruppenbezeichnung.
A.6 Versuchsablauf
Das Experiment wird folgendermaßen ablaufen: Jede Versuchsperson bekommt eine Programmieraufgabe
gestellt, die möglichst fehlerfrei gelöst und abgegeben werden soll. Natürlich ist es schön, wenn das auch noch
schnell geht, aber die Zeit spielt eine untergeordnete Rolle. Der Ablauf des Versuches ist wie folgt:
1. den allgemeinen Fragebogen ausfüllen,
2. die Aufgabenbeschreibung lesen,
3. den Fragebogen zur Aufgabe ausfüllen,
4. die Aufgabe lösen und den Akzeptanztest bestehen,
5. den Schlußfragebogen ausfüllen.
Stellen Sie sich darauf ein, daß das Experiment mindestens einen halben, eher einen ganzen Tag in Anspruch
nimmt!
Wenn Sie die Aufgabe korrekt lösen, erhalten Sie als Ausgleich eine Copy-Card im Wert von bis zu DM 50.
Wir prüfen Ihre Lösung durch einen wohldefinierten und objektiven Akzeptanztest, bei dem Sie eine Korrektheit von 95 Prozent erreichen müssen. Dieser Akzeptanztest simuliert den Abnahmetest durch einen echten
Softwarekunden — wird der Abnahmetest nicht bestanden, wird eine Vertragsstrafe fällig! Diese beträgt bei
uns DM 10. Wir führen den Akzeptanztest immer dann durch, wenn Sie danach fragen; beim Durchfallen sinkt
also der Wert Ihrer Copy-Card erst auf DM 40, dann auf DM 30 und so weiter.
Achten Sie also auf die Korrektheit Ihres Programms. Die Aufgabe ist nicht übermäßig schwierig; Sie können
es mit Sicherheit schaffen, ein korrektes Programm zu schreiben, wenn Sie sich genug Mühe geben. Den
Akzeptanztest auf Anhieb zu bestehen, liegt durchaus im Rahmen Ihrer Möglichkeiten, also nehmen Sie die
Herausforderung an und kassieren Sie die DM 50 komplett!
Bitte lesen Sie die Anweisungen zu den Fragebögen und der Aufgabe genau durch. Falls Fragen auftauchen, klären Sie sie immer sofort.
An einigen Stellen im Experiment müssen Sie die Uhrzeit protokollieren. Bitte seien Sie möglichst exakt und
vergessen Sie es nicht. Falls es doch einmal dazu kommen sollte, schätzen Sie die Uhrzeit nachträglich und
markieren sie durch vorangestelltes „ca.“ als geschätzt.
A.7 Fragebogen – persönliche Angaben

Versuchsperson: _____ (Hier Nummer eintragen)        Gruppe: _____ (Wird später eingetragen)
Damit die Versuchsdaten anonym bleiben, bekommen Sie nur eine Nummer. Bitte schreiben Sie diese Nummer
oben rechts auf jeden Fragebogen und auch in Ihr Programm (z. B. in den Header).
1. In welchem Semester sind Sie? _____ Semester
2. Wieviele Jahre Programmiererfahrung haben Sie? _____ Jahr(e)
3. Wieviel Programmiererfahrung haben Sie in KLOC? _____ KLOC
4. In wie vielen verschiedenen Programmiersprachen haben Sie schon mehr als ein Programm geschrieben? _____ Programmiersprachen
5. Wie lange (in Arbeitsstunden) haben Sie insgesamt außerhalb Ihrer unmittelbaren Studienausbildung programmiert? _____ Stunden
6. Wie lang (in KLOC) ist das größte Programm (bzw. Ihr Anteil an einem Programm), das Sie bisher geschrieben haben? _____ KLOC
7. In welcher Programmiersprache werden Sie die Aufgabe lösen? _____ Programmiersprache
8. Wieviel Programmiererfahrung haben Sie in der (in Frage 7) gewählten Sprache? _____ KLOC
9. Haben Sie in der letzten Zeit viel programmiert?
   [ ] Ja, ich habe ein Praktikum mit Programmierübungen gemacht.
   [ ] Ja, ich habe aus anderen Gründen viel programmiert.
   [ ] Naja, soviel war’s nicht.
   [ ] Nein.
10. Kennen Sie sich mit dem Emacs aus?   [ ] ja   [ ] wenig   [ ] nein
11. Wie definieren Sie das Maß „Lines of Code“ (LOC)?
   [ ] Zu LOC zählt jede Zeile, auch wenn sie nur einen Kommentar enthält oder eine Leerzeile ist.
   [ ] Als LOC zählen alle Code-Zeilen und alle Kommentarzeilen, Leerzeilen jedoch nicht.
   [ ] Als LOC zählen nur Code-Zeilen, Leerzeilen und Kommentarzeilen jedoch nicht.
   [ ] Eigene Definition: _____
Lesen Sie sich bitte die Aufgabenstellung gut durch. Es ist wichtig, daß Sie diese verstanden haben. Danach
füllen Sie unten auf Seite 7 das Feld für die Uhrzeit aus.
A.8 Aufgabenstellung
Achtung: Befolgen Sie diese Anleitung bitte supergenau, sonst gibt es nur Frust auf beiden Seiten…
Lesen Sie erst einmal alles komplett durch, um ein Gesamtbild der Aufgabe zu bekommen, und konzentrieren
Sie sich erst beim anschließenden zweiten Lesen auf die Details.
Gegeben sei die folgende Abbildung der Buchstaben des Alphabets auf Ziffern:
E    JNQ  RWX  DSY  FT   AM   CIV  BKU  LOP  GHZ
e    jnq  rwx  dsy  ft   am   civ  bku  lop  ghz
0    1    2    3    4    5    6    7    8    9
Wir wollen diese Abbildung benutzen, um Telefonnummern durch Wörter zu kodieren, damit wir sie uns leichter merken können.
Funktionale Anforderungen
Ihre Aufgabe ist es, ein Programm zu schreiben, das zu einer eingegebenen Telefonnummer alle möglichen
Kodierungen durch Wörter findet und ausgibt. Eine Telefonnummer ist eine beliebige(!) Kette aus Querstrichen
- , Schrägstrichen / und Ziffern. Die Quer- und Schrägstriche werden nicht kodiert. Die Wörter sind einem
Wörterbuch zu entnehmen, das als alphabetisch sortierte ASCII-Datei vorliegt (ein Wort pro Zeile).
Nur genau jede laut diesem Wörterbuch mögliche Kodierung, die exakt auf die Telefonnummer paßt, soll ausgegeben werden. Gegebenenfalls wird also gar nichts ausgegeben. Die Wörter im Wörterbuch können Buchstaben (Groß- und Kleinbuchstaben, die aber gleich sortiert werden), Querstriche - und Anführungszeichen
" enthalten. Für die Kodierung werden nur die Buchstaben beachtet, jedoch sollen Wörter genau in der Form
ausgegeben werden, wie sie im Wörterbuch stehen. Führende Sonderzeichen kommen im Wörterbuch nicht
vor.
Kodierungen von Telefonnummern können aus einem einzelnen Wort oder aus mehreren Wörtern hintereinander (durch Leerzeichen getrennt) bestehen. Die Kodierungen werden von vorne nach hinten wortweise aufgebaut. Wenn und nur wenn an einer Stelle überhaupt kein Wort aus dem Wörterbuch eingesetzt werden kann,
darf stattdessen auch eine einzelne Ziffer aus der Telefonnummer direkt übernommen werden, niemals jedoch
zwei Ziffern hintereinander. Mit anderen Worten: In einer teilfertigen Kodierung, die bisher k Ziffern abdeckt,
steht für Ziffer k + 1 dann und nur dann die Ziffer selbst, wenn erstens für Ziffer k keine Ziffer steht und
zweitens kein Wort im Wörterbuch steht, das sich ab Ziffer k + 1 beginnend einsetzen läßt.
Ihr Programm soll gleich eine ganze Liste von Telefonnummern abarbeiten und für jede gefundene Kodierung die Telefonnummer gefolgt von Doppelpunkt, einem(!) Leerzeichen und der Kodierung auf einer Zeile
ausgeben; nachfolgende Leerzeichen sind nicht erlaubt.
Alle verbleibenden Mehrdeutigkeiten in der Spezifikation werden von folgendem Beispiel ausgeräumt (übrige
Mehrdeutigkeiten sind gewollte Freiheitsgrade).
Wörterbuch (in der Datei test.w):
an
blau
Bo"
Boot
bo"s
da
Fee
fern
Fest
fort
je
jemand
mir
Mix
Mixer
Name
neu
o"d
Ort
so
Tor
Torf
Wasser
Telefonnummernliste (in der Datei test.t):
112
5624-82
4824
0721/608-4067
10/783--5
1078-913-5
381482
04824
Programmaufruf (muß unbedingt diese Form haben!):
telefonkode test.w test.t
Zugehörige korrekte Programmausgabe (auf den Bildschirm):
5624-82: mir Tor
5624-82: Mix Tor
4824: Torf
4824: fort
4824: Tor 4
10/783--5: neu o"d 5
10/783--5: je bo"s 5
10/783--5: je Bo" da
381482: so 1 Tor
04824: 0 Torf
04824: 0 fort
04824: 0 Tor 4
Jede abweichende Ausgabe wäre falsch (ausgenommen andere Reihenfolge der Zeilen).
Falsche Ausgaben im obigen Beispiel wären also z.B.
562482: Mix Tor, weil das Format der Telefonnummer verkehrt ist.
10/783--5: je bos 5, weil das Format des zweiten Worts verkehrt ist.
4824: 4 Ort, weil bei der ersten Ziffer die Wörter Torf, fort, Tor eingesetzt werden können,
1078-913-5: je Bo" 9 1 da , da in der Kodierung zwei Ziffern aufeinanderfolgen,
04824: 0 Tor , weil die Kodierung nicht die ganze Telefonnummer abdeckt und
5624-82: mir Torf , weil die Kodierung länger ist als die Telefonnummer.
Die obigen Daten stehen Ihnen in den Dateien test.w (Wörterbuch), test.t (Telefonnummern) und
test.out (Programmausgaben) zur Verfügung.
Quantitative Anforderungen
Länge der einzelnen Wörter im Wörterbuch: maximal 50 Zeichen.
Anzahl Wörter im Wörterbuch: maximal 75000.
Länge der Telefonnummern: maximal 50 Zeichen.
Anzahl Einträge in der Telefonnummerndatei: beliebig.
Qualitätsanforderungen
Arbeiten Sie so sorgfältig, wie Sie es auch als professioneller Softwareingenieur tun würden, und liefern Sie ein
entsprechend hochwertiges Programm ab. Kommentieren Sie insbesondere Ihren Quellkode gründlich (Entwurfsideen etc.).
Das Hauptaugenmerk bei der Erstellung des Programms soll auf der Korrektheit liegen. Erzeugen Sie
von Anfang an genau das richtige Ausgabeformat. Machen Sie keinerlei zusätzliche Ausgaben. Wir werden Ihr
Programm maschinell mit mehreren Hunderttausend Telefonnummern testen und es sollte wenn irgend möglich
keinen einzigen Fehler machen — insbesondere keinesfalls abstürzen. Nehmen Sie sich soviel Zeit wie nötig,
um die Korrektheit sicherzustellen.
Ihr Programm muß laufzeiteffizient sein insofern, daß bei jedem Wortanfügungsschritt nur ein sehr
kleiner Teil der Wörterbucheinträge betrachtet wird. Es sollte auch speichereffizient sein, indem nicht
75000 mal 50 Bytes für das Wörterbuch benutzt werden, wenn viele viel kürzere Wörter darin vorkommen.
Das Wörterbuch muß komplett in den Hauptspeicher eingelesen werden, die Telefonnummerndatei darf dies
jedoch keinesfalls, da sie beliebig groß werden kann.
Ihr Programm braucht nicht robust zu sein gegen fehlerhafte Formate von Wörterbuch oder Telefonnummernliste.
Sonstige Randbedingungen
Bitte benutzen Sie die ganze Zeit nur eine Programmdatei, also nicht mehrere Module, und geben Sie ihr den
Namen telefonkode.c (oder welchen Suffix auch immer Ihre Programmiersprache braucht). Andernfalls wird
die Auswertung für uns zu kompliziert. Wenn Sie zwischendurch kurz etwas ausprobieren müssen, können Sie
dafür natürlich auch eine andere Datei benutzen.
Ein großes Wörterbuch, das Sie zu Testzwecken verwenden können, steht Ihnen als woerter2 zur Verfügung.
Zum Übersetzen Ihres Programms benutzen Sie bitte das Kommando make telefonkode, einen passenden
Makefile haben wir für Sie vorbereitet. Java-Benutzer müssen beachten, daß dieses Kommando aus technischen
Gründen ausschließlich in dem mit „Nur hier Java“ betitelten Fenster funktioniert.
Wie geht’s weiter?
Sie haben nun die Aufgabenstellung gelesen. Wenn Sie dazu noch Fragen haben, stellen Sie sie bitte jetzt
sofort. Dies gilt auch für alle anderen Fragen. Sofort alle Unklarheiten beseitigen. Wenn Sie irgendwann einen
Akzeptanztest durchführen wollen, sagen Sie uns einfach Bescheid. Wenn die Aufgabenstellung klar ist, tragen
Sie nun bitte die Uhrzeit ein und lesen noch die nächsten Seiten.
Uhrzeit:
Ab diesem Zeitpunkt sollen Sie jede Unterbrechung mitprotokollieren. Schreiben Sie immer bei Beginn und
Ende einer Unterbrechung die Uhrzeit auf. Der Grund der Unterbrechung ist nicht unbedingt erforderlich, ist
aber nützlich, um evtl. fehlende Uhrzeiten nachzuvollziehen. Als Unterbrechungen zählen z. B. Fragen, Gang
auf die Toilette.
Beginn der Unterbrechung        Ende der Unterbrechung        Dauer der Unterbrechung        Grund
A.9 Fragebogen – Selbsteinschätzung

Versuchsperson: _____ (Hier Nummer eintragen)        Gruppe: _____ (Wird später eingetragen)
Stellen Sie sich vor, Ihre Firma bekommt den Auftrag „Programmieren einer Telefonnummer-Merkhilfe“. Ihr
Chef muß einen Kostenvoranschlag machen und will von Ihnen wissen, wieviel Arbeit das ist (da Sie auch
der Programmierer sein werden, der das implementieren soll). Sie sollen den Aufwand sowohl in LOC als
auch in Zeiteinheiten angeben. Darüberhinaus soll die Abschätzung der Fehler eine ungefähre Einschätzung
der Komplexität und des Schwierigkeitsgrades ermöglichen. Geben Sie jeweils zusätzlich einen Bereich an, in
dem das Ergebnis Ihrer Einschätzung nach mit einer Wahrscheinlichkeit von 90% liegen wird.
1. Was schätzen Sie: Wie lange werden Sie ungefähr für das Lösen der Aufgabe brauchen (gerechnet ab
   dem letzten Uhrzeit-Protokollieren bis zum fertigen, lauffähigen und ausgetesteten Programm)?
   Zeit in Minuten: _____        Zeitbereich [von–bis]: _____
2. Wie hoch schätzen Sie Ihre Produktivität in LOC/Stunde?
   LOC/Stunde: _____        Produktivitätsbereich [von–bis]: _____
3. Wieviele Fehler werden Sie beim Übersetzen (Kompilieren) beseitigen, und wieviele Fehler werden Sie
   im Test finden?
   a) beim Übersetzen: _____ (Fehleranzahl)
   b) beim Testen: _____ (Fehleranzahl)
Wie lange werden Sie jeweils zum Entwerfen und Kodieren, Kompilieren und Testen Ihres Programmes
brauchen? Die einzelnen Zeiten sind wie folgt definiert: Kodierzeit ist die Zeit, die man benötigt, um einen
Entwurf in Kode umzusetzen; das Übersetzen dauert von der ersten Benutzung des Compilers bis zum ersten
fehlerfreien Übersetzen; und das Testen ist die Zeit, die benötigt wird, das übersetzte Programm auszutesten.
Bitte beachten Sie: Ein Programm wird meist nicht linear bearbeitet (also nicht etwa zuerst nur kodieren,
übersetzen und dann testen), sondern z. B. teilweise kodieren, übersetzen, weiter kodieren, nochmals übersetzen
und dann erst austesten. Falls das Ihre übliche Arbeitsweise ist, müssen Sie die Werte der einzelnen Kodier-,
Übersetzungs- und Testphasen jeweils aufaddieren.
4. Wie lange werden Sie zum Entwerfen und Kodieren brauchen? _____ Minuten
5. Wie lange werden Sie zum Übersetzen Ihres Programmes brauchen? _____ Minuten
6. Wie lange werden Sie zum Testen Ihres Programmes brauchen? _____ Minuten
aktuelle Uhrzeit:
Nachdem Sie die Fragen beantwortet haben, werfen Sie nun diesen Fragebogen in die Box. Bitte fangen
Sie dann an zu programmieren.
Bitte programmieren Sie jetzt die Aufgabe!
A.10 Fragebogen – Eigenbeurteilung

Versuchsperson: _____ (Hier Nummer eintragen)        Gruppe: _____ (Wird später eingetragen)
Bitte tragen Sie, nachdem Sie die Implementierung vollständig abgeschlossen haben, hier die Uhrzeit ein.
Uhrzeit:
Nachdem Sie nun Ihr Programm fertig haben, versuchen Sie es an folgenden Kriterien zu beurteilen:
Bewertungsskala: (1) sehr gut   (2) gut   (3) ausreichend   (4) schwach   (5) schlecht
Im Bereich der Dokumentation (sollte informativ und genau, aber nicht ausschweifend sein)
Programmverwendung: Wie gut kann ein fremder Nutzer mit dem Programm umgehen; weiß er, wie er es aufrufen muß?
Programmzweck: Kann jemand, der das Programm nicht kennt, an der Dokumentation erkennen, was das Programm macht?
Entwurfsideen: Wie gut kann man Entwurfsentscheidungen nachvollziehen?
Zweck der Prozeduren: Wie gut ist der Zweck der einzelnen Prozeduren dokumentiert?
Zweck der Datentypen: Wie gut ist dokumentiert, wozu die einzelnen Datentypen
dienen?
Zweck der Variablen: Wie gut erkennt man den Zweck von Variablen?
Kode in Prozeduren: Wie gut sind einzelne schwierige Kodestücke dokumentiert?
Im Bereich der Komplexität (sollte so gering wie möglich sein)
Datenstrukturen: Wie komplex sind die verwendeten Datenstrukturen?
Kontrollfluß: Wie gut ist der Kontrollfluß zu verfolgen?
Algorithmen: Wie verständlich/komplex sind die verwendeten Algorithmen?
Beurteilen Sie die Gesamtqualität
Verständlichkeit: Wie verständlich ist das Gesamtprogramm, z. B. wenn es von
fremden Programmierern inspiziert werden muss?
Erweiterbarkeit: Wie gut ist das Programm für Änderungen und Erweiterungen
geeignet?
Korrektheit: Wie gut entspricht das Programm den gestellten Anforderungen?
Wieviele Fehler, vermuten Sie, sind trotz Test im Programm verblieben? _____
Wie zuverlässig schätzen Sie Ihr Programm ein? 1 Versagen pro _____ Testfälle.
Wo erwarten Sie Ihr größtes Verbesserungspotential, wenn Sie schneller und/oder bessere Software entwickeln
möchten:
[ ] Fehlerreduktion
    [ ] Entwurfsfehler verringern
    [ ] Implementierungsfehler verringern
    [ ] Fehler bei der Umsetzung von Entwurf in Implementierung verringern
    Wie könnten Sie die Fehler reduzieren?
    [ ] Entwurfsdurchsichten machen
    [ ] Kodedurchsichten machen
    [ ] formale Methoden einsetzen
    [ ] andere Programmiersprache einsetzen (welche, bzw. was muß an der aktuellen geändert werden? _____)
    [ ] Werkzeuge einsetzen (welche? _____)
[ ] höhere Produktivität anstreben durch
    [ ] bessere Programmiersprachen (welche, bzw. was muß an der aktuellen geändert werden? _____)
    [ ] mehr Entwurf (genauere Entwürfe, ausführlichere Entwürfe)
    [ ] unwichtige Zwischenschritte weglassen, z. B. Fehlerprotokolle oder zu genaue Entwürfe
    [ ] mehr Werkzeuge einsetzen, z. B. _____
    [ ] mehr Übung im Programmieren
    [ ] eine bessere Programmierausbildung
Das war’s. Danke fürs Mitmachen!
Wir haben noch eine dringende Bitte an Sie: Sprechen Sie nicht mit anderen
Personen über das Experiment, wenn diese nicht selbst schon an dem Experiment teilgenommen haben. Beschreiben Sie insbesondere niemandem die Art
der Problemstellung der Programmieraufgabe. Wer sie kennt, fängt unweigerlich an, über mögliche Lösungen nachzudenken, was unsere Experimentergebnisse total verfälschen würde, wenn die betreffende Person später am Experiment teilnimmt. Dann hätten wir monatelange Vorbereitungen umsonst
gemacht.
Appendix B
Glossary
This glossary contains only the most important terms. Please refer to the main text for the exact definitions or
descriptions, as (and if) indicated in the short glossary description. The definitions of related terms are often
found at nearby points in the text.
bootstrap resampling: a general statistical technique based on stochastic simulation (see Section 3.1.1 on page 20)

iqr: the inter-quartile range, a specific measure of variability of data (see Section 3.1.1 on page 20)

iqr bootstrap p = 0.1: a p-value computed by a bootstrap resampling of differences of interquartile ranges (see Section 3.1.1 on page 20)

LOC: lines of code. Either refers to a subject’s own definition of the term (estimation) or to a common standard definition as indicated (measurement)

mean bootstrap p = 0.1: a p-value computed by a bootstrap resampling of differences of means (see Section 3.1.1 on page 20)

median: a value such that 50% of the values are smaller and 50% are larger (see Section 3.1.1 on page 20)

M: in boxplots: the symbol that marks the arithmetic mean

N: control group of non-PSP-trained subjects (see Section 2.4.1 on page 11). Usually refers to Nraw as well as its subgroups.

Nraw: control group with unsuccessful participants removed (see Section 3.2 on page 25)

P: experiment group of PSP-trained subjects (see Section 2.4.1 on page 11). Usually refers to Praw as well as its subgroups.

p-value: the probability (computed by a statistical test) that a certain observation is only due to chance (see Section 3.1.1 on page 20)

participant: a person who has participated in the experiment, also called subject

Praw: experiment group with unsuccessful participants removed (see Section 3.2 on page 25)

prodestim: productivity (in lines of code written per work hour) as estimated by the subject

PSP: the Personal Software Process methodology (see Section 1.1 on page 4)

psp: an individual personal software process (see Section 1.1 on page 4)

quantile: a value just larger than a certain fraction of all values (see Section 3.1.1 on page 20)

rank: the order number of a value when the sample is sorted (see Section 3.1.2 on page 23)

regression line: a trend line computed by a mathematical procedure (see Section 3.1.2 on page 23)

significant: a difference is called significant if its p-value is too small to be incidental (see Section 3.1.3 on page 25)

subject: a person who has participated in the experiment, also called participant

testim: total working time as estimated by the subject

twork: actual total working time of the subject

variability: or variation or variance or spread: how much the values in a sample differ from each other (see Section 3.1.1 on page 20)

Wilcoxon test p = 0.1: a p-value computed by a Wilcoxon rank sum test (see Section 3.1.1 on page 20)
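To make the bootstrap entries above concrete, here is a minimal Java sketch of the generic pooled-resampling scheme; the report's exact procedure is described in Section 3.1.1 and may differ in detail, and the method names and example data below are invented for this sketch. For an iqr bootstrap p, the mean would simply be replaced by the inter-quartile range.

import java.util.Random;

public class BootstrapSketch {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // One-sided p-value for the observed difference mean(a) - mean(b): pool both
    // groups (null hypothesis: one common population), redraw two resamples of
    // the original sizes many times, and report the fraction of trials whose
    // resampled difference is at least as large as the observed one.
    static double meanBootstrapP(double[] a, double[] b, int trials, Random rnd) {
        double observed = mean(a) - mean(b);
        double[] pooled = new double[a.length + b.length];
        System.arraycopy(a, 0, pooled, 0, a.length);
        System.arraycopy(b, 0, pooled, a.length, b.length);
        int atLeastAsLarge = 0;
        for (int t = 0; t < trials; t++) {
            double[] ra = new double[a.length];
            double[] rb = new double[b.length];
            for (int i = 0; i < ra.length; i++) ra[i] = pooled[rnd.nextInt(pooled.length)];
            for (int i = 0; i < rb.length; i++) rb[i] = pooled[rnd.nextInt(pooled.length)];
            if (mean(ra) - mean(rb) >= observed) atLeastAsLarge++;
        }
        return (double) atLeastAsLarge / trials;
    }

    public static void main(String[] args) {
        double[] groupN = {280, 330, 450, 520, 610};  // made-up working times (minutes)
        double[] groupP = {300, 340, 360, 410, 420};  // made-up working times (minutes)
        System.out.println(meanBootstrapP(groupN, groupP, 10000, new Random(1)));
    }
}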
Bibliography
[1] Larry B. Christensen. Experimental Methodology. Allyn and Bacon, Needham Heights, MA, 6th edition,
1994.
[2] Bradley Efron and Robert Tibshirani. An introduction to the Bootstrap. Monographs on statistics and
applied probability 57. Chapman and Hall, New York, London, 1993.
[3] Watts Humphrey. A Discipline for Software Engineering. SEI Series in Software Engineering. Addison
Wesley, Reading, MA, 1995.
[4] Watts S. Humphrey. Using a defined and measured personal software process. IEEE Software, 13(3):77–88,
May 1996.
[5] Julian L. Simon. Resampling: The New Statistics. Duxbury Press, Belmont, CA, 1992. http://www.statistics.com.