A Replicated Study on Correlating Agile Team Velocity Measured in Function and Story Points

Hennie Huijgens
Delft University of Technology
The Netherlands
[email protected]

Rini van Solingen
Delft University of Technology and Prowareness
The Netherlands
[email protected]
ABSTRACT
With the rapid growth of agile development methods for software engineering, more and more organizations measure the size of iterations, releases, and projects in both function points and story points. In 2011 Santana et al. performed a case study on the relation between function points and story points, based on data collected in a Brazilian Government Agency. In this paper we replicate this study, using data collected in a Dutch banking organization.
Based on a statistical correlation test we find that a comparison between function points and story points as measured in our repository indicates a moderate negative linear relation, whereas Santana et al. concluded a strong positive linear relation between both size metrics in their case study. Based on the outcome of our study we conclude that it appears too early to make generic claims on the relation between function points and story points; in fact, FSM theory seems to support the view that such a relationship is a spurious one.
Categories and Subject Descriptors
D.2.8 [Metrics]: Product metrics, Process metrics
General Terms
Measurement.
Keywords
Story Point, Function Point, Function Point Analysis.
1. INTRODUCTION
The importance of effort estimation in the software industry is well
known. Project overruns are frequent, and managers state that
accurate estimation is perceived as a problem [1]. Software size
can be an important component of a cost or effort estimate [2].
Two ways of size measurement are often used in practice. Function points (FP) are an industry standard for expressing the functional size of software products. Story points (SP), on the other
hand, are used to determine velocity of Scrum teams [3]. In this
paper, we explore the relationship between these two measures.
With the rapid growth of agile development methods for software engineering, an increased interest can be found in the literature on effort estimation of product features in agile settings, such as SP [4] [5] [6] [7] [2]. In particular, articles on agile delivery methods describe the background of estimating the relative effort of software products, and the coherence between SP and size measures such as FP [4] [5] [8] [9] [10] [11] [12]. Yet not much of the present literature offers a thorough quantitative analysis of any statistical correlation between the two metrics.

To make things more complicated, the limited set of quantitative studies on both FP and SP are not in sync with regard to their outcomes. In 2003 Fuqua [13] reported on a controlled experiment on the use of FP in combination with XP. For 100 user stories he tried to estimate the effort in FP and also in the number of ideal engineering days, an SP-like measure often used within XP projects. The outcome of the experiment was that the FP procedure used was unable to estimate the required effort, and that FP supports the measurement of the load factor insufficiently.

More recently, in 2011, a quantitative analysis of the correlation between SP and FP was performed by Santana et al. [7]. That case study demonstrated a strong positive correlation between SP and FP, both collected on 19 iterations of a project realized by a Brazilian public agency. Together with a number of qualitative research studies that argued that FP and SP are from different planets and cannot be compared [5] [2] [11], a rather confusing picture of size metrics in software engineering emerges.

Given the growing importance of reliable, standardized, and quantified tools for measurement and analysis of software engineering in agile environments, we propose to replicate the study performed by Santana et al. [7] with our own research data. As described in an earlier study [14], we collected a repository with primary data of more than 350 finalized software engineering projects. A subset of this repository represents 26 small software releases that were performed during one year by two software development teams within a Finance & Risk department of a large banking company in The Netherlands. 14 of these releases were performed according to an agile development method (Scrum); for those releases the size of each release was measured in FP and estimated in SP.
Similar to Santana et al. [7] the research objective of our study is
to quantitatively analyze the relationship between SP and FP,
based on empirical data from a real-life case study in which two sets
of releases are measured in these two size metrics.
The contribution of this research is to examine whether FP
are compatible with SP on agile projects.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WETSoM'14, June 3, 2014, Hyderabad, India.
Copyright 2014 ACM 978-1-4503-2854-8/14/06...$15.00.
http://dx.doi.org/10.1145/2593868.2593874
In the remainder of this paper we discuss in Section 2 the background of size metrics in both traditional software development environments (e.g. FP) and agile delivery environments (e.g. SP), in Section 3 the problem statement, in Section 4 the original study, in Section 5 the replication design, in Section 6 the evaluation results, in Section 7 a discussion of the results, in Section 8 threats to validity, in Section 9 related work, and in Section 10 conclusions and future work.
2. BACKGROUND
Function Point Analysis (FPA) is an established industry standard
for expressing the functional size of software products [15]. FPA
measures the functional size of software engineering activities by
looking at relevant functions for users and (logical) data sets. The
unit of measurement is the function point (FP). The functional
size of an information system or a project is thus expressed in a
number of FP. FPA is used for both new construction [16] [17] and maintenance activities (although it is not valid for corrective maintenance and for part of adaptive maintenance) [18]. Counting guidelines, which describe how to perform an FPA, are improved, monitored, and controlled by organizations of FPA users such as IFPUG [16] and NESMA [17]. The various methods differ in practice, though not much from one another [19]. Both methods are certified and recorded as ISO standards for Functional Size Measurement (FSM) [16] [17]. (The IFPUG method consists of two parts: Unadjusted Function Points and Adjusted Function Points. ISO has not recognized the full IFPUG method; it has accepted only the unadjusted part.)
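As a hypothetical illustration of how FPA yields an absolute size (the counts and complexity ratings below are invented for the example; the weights are the standard IFPUG/NESMA ones for unadjusted FP): a small enhancement touching one internal logical file of low complexity (7 FP), two external inputs of average complexity (4 FP each), and one external output of average complexity (5 FP) counts 7 + 2 × 4 + 5 = 20 unadjusted FP, regardless of which team performs the work.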
SP, on the other hand, are a rough estimate, based on experience and analogy, of the relative 'size and complexity' of each user story compared to other user stories in the same project [20]. SP are closely related to the concept of velocity. The velocity of a Scrum
team is determined by the number of SP that the team can complete using a standardized definition of done in a single iteration
[4]. SP are defined within a software development team and are
used upfront by the team members to avoid taking up too much
work and afterwards to measure the actual velocity of successive
iterations. The basic idea is that over time these teams become
proficient at assigning SP to the software features they are asked
to develop [2] (although proficiency does not guarantee accurate estimates). Product features are represented by user stories and
size and complexity of a story is represented in SP [5]. In contrast
to a function point, which is an absolute size metric that is assumed to be counted according to pre-arranged, standardized
counting guidelines, an SP is relative. No formalized, standardized
FSM guidelines are available for counting SP. Instead, a software
development team uses mutually agreed values for expressing the
relative size and complexity of one or more user stories. This
relative size and complexity is related to the estimated size and
complexity of previous, already completed user stories. A team of
software developers for example says “given that this previous
feature was 20 SP, this new one is 40 SP, because it is more complex and harder to test”. Another software development team may
decide that stories of a similar weight have a relative size and complexity of 8 SP, and that the still to be built stories are
estimated at 13 SP. Although the stories might be comparable
from the point of view of functionality, different teams show different
outcomes of relative size and complexity estimation; yet they all
call them SP. As such they are neither absolute nor comparable
across teams [21].
The value of SP, however, lies in the fact that an SP is a commonly shared estimate. All development team members within one team deliver their input in the determination of its size. This is in contrast to traditional size metrics, like FP, where size is measured by one or two experts, based on pre-defined guidelines. In practice SP are used to estimate the amount of work and to measure the velocity within teams that work in an agile way. The development team members in such teams are mutually responsible for the planning of their work. FP, on the other hand, are, besides estimating purposes, often used for analyzing and benchmarking organization-wide portfolios of software engineering activities [2].

Table 1. Data for two variables in three samples

Santana et al.
Period     FP      SP
Feb 09     64      540
Mar 09     41      437
Apr 09     67      787
May 09     51      593
Jun 09     65      474
Jul 09     130     648
Aug 09     156     787
Sep 09     159     758
Oct 09     106     535
Nov 09     91      480
Dec 09     54      262
Jan 10     45      373
Feb 10     43      312
Mar 10     71      506
Apr 10     49      358
May 10     90      819
Jun 10     74      742
Jul 10     66      469
Aug 10     71      652
Max        159     819
Mean       78.58   554.32
Median     67      535
Min        41      262
St.Dev.    35.76   171.01

Bank Data A
Period     FP      SP
A Feb 12   14      26
A Mar 12   21      57
A Apr 12   13      75
A May 12   48      48
A Jun 12   41      29
A Jul 12   32      45
A Aug 12   55      27
Max        55      75
Mean       32.00   43.86
Median     32      45
Min        13      26
St.Dev.    16.69   18.19

Bank Data B
Period     FP      SP
B Jan 12   5       43
B Feb 12   18      35
B Mar 12   24      25
B Apr 12   20      51
B May 12   25      45
B Jun 12   41      21
B Jul 12   57      14
Max        57      51
Mean       27.14   33.43
Median     24      35
Min        5       14
St.Dev.    16.95   13.78
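For reproducibility, the monthly values for the two bank teams can be transcribed from Table 1 into R, the statistical tool used later in this paper. A minimal sketch; the vector names mirror the labels used in the R output in Section 5, and the summary calls can be checked against the table's statistics:

# FP and SP per month, team A (Feb 2012 - Aug 2012), transcribed from Table 1
Bank_Data_FP_A <- c(14, 21, 13, 48, 41, 32, 55)
Bank_Data_SP_A <- c(26, 57, 75, 48, 29, 45, 27)
# FP and SP per month, team B (Jan 2012 - Jul 2012)
Bank_Data_FP_B <- c(5, 18, 24, 20, 25, 41, 57)
Bank_Data_SP_B <- c(43, 35, 25, 51, 45, 21, 14)

summary(Bank_Data_FP_A)  # min 13, median 32, mean 32.00, max 55
sd(Bank_Data_SP_A)       # 18.19, matching St.Dev. in Table 1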
3. PROBLEM STATEMENT
Most of the studies performed on agile size metrics are not based
on research of historical data and are written from a theoretical or
qualitative point of view. A quick scan we performed on related
work indicates a lack of quantitative studies that are based on
primary data, collected in companies that deliver software solutions according to an agile approach. Only recently, in 2011, a
case study was performed comparing SP and FP that were based
on a set of 19 iterations performed within a Brazilian Government
Agency where both SP and FP had been counted [7]. Contrary to the majority of theoretical studies, which argued that SP and FP cannot be compared [23] [5] [2], the case study
resulted in a positive correlation (a Spearman’s rank correlation of
0.7 with p-value < 0.01) between the 19 iterations measured in SP
and FP. This remarkable discrepancy between theory and practice
encouraged us to perform complementary quantitative research on this subject by replicating the study done by Santana et al. [7] with our own set of primary historic data.

[Figure 1. Scatter plot representing FP (x-axis, Function Points, 0-180) versus SP (y-axis, Story Points, 0-900) for the three samples: Santana et al., Bank Data A, and Bank Data B.]
However, in addition to the theoretical differences between FP
and SP, and the assumed impossibility of comparing them [5] [2]
[11], we emphasize that analyzing any correlation between FP and
SP might be seen as spurious in a way. A fundamental issue behind comparing both metrics is that FP are a software size
measure based on pre-defined measurement guidelines that cover
functional requirements only, yet SP do not correspond to a software size, and not even to actual effort, since they are about estimation (and not measurement), and they cover both functional and
non-functional requirements. The backgrounds of these theoretical differences are described in specific studies on FSM theory in agile settings (e.g. [24]); yet the fact remains that these theoretical differences have up to now been poorly tested in quantitative research. Since the paper by Santana et al. [7] is the only available
study that quantifies this subject (resulting in a conclusion that, carefully worded, implies that automated conversion of FP to SP might one day be possible), we argue that, in spite of the assumed
spurious character of such an analysis, our study might help to gain better knowledge of this subject in both the empirical and practical
software engineering world.
4. THE ORIGINAL STUDY
Driven by rules and regulations of the Brazilian government, a
Brazilian Government Agency, called ATI, was forced to follow
an instruction that states that measurement is an integral part of
the outsourcing strategy. In practice this led to a situation in which size was counted in both FP (driven by rules and regulations) and SP (the development team already worked according to Scrum). IFPUG guidelines [16] were used to measure FP. SP of
all demands of a sprint were estimated during planning poker
meetings in which ATI and its supplier participated. Because ATI wanted to reduce the initial measurement work at the beginning of each sprint (counting both SP and FP), a plan existed to create a method for converting between both metrics, once
the statistical correlation proved to be strong enough. The case
study that is part of the paper of Santana et al. [7] is based on the
analysis performed to derive this conversion method, although
the analysis to prove a linear regression based on a strong statistical correlation was not finished when the paper was published.

For 19 iterations, performed from February 2009 to August 2010, both SP and FP were counted (see Table 1) and analyzed. These were velocity measures of the team: they captured the amount of software actually delivered (output), measured in FP and estimated in SP. In the left part of Table 1 the columns FP and SP represent the amount of FP and SP collected in each month within the ATI organization.

Santana et al. [7] concluded that the value of ρ ≈ 0.7173 from the Spearman's rank correlation test indicates a strong positive correlation. Due to the low p-value of 0.0005989, a high level of confidence was attached to this result. For visual verification of the strength of linear correlation, a scatter plot was constructed (see Figure 1). The scatter plot shows points growing in a linear pattern, which supported the Spearman's rank correlation test [7].

Based on the outcomes of these tests, Santana et al. [7] argued that, despite the strong differences in the size definitions of FP and SP, the strong positive correlation between both size metrics suggests that FP, in that particular case, could be related to the initial value of SP found after planning poker sessions. Santana et al. state that the result of their study cannot be generalized, and propose to replicate the method of assessment within other organizations. For further research the suggestion is made to find a more or less generic conversion method for FP and SP based on the correlation found in the study.
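For reference, both the original study and our replication rely on Spearman's rank correlation. Assuming no tied ranks, the standard formulation is ρ = 1 − 6S / (n(n² − 1)), where S = Σ d_i², d_i is the difference between the FP rank and the SP rank of iteration i, and n is the number of iterations; S is also the statistic reported by R in Section 5. For example, for our dataset A (Section 5), S = 76 and n = 7 give ρ = 1 − 456/336 ≈ −0.357.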
5. REPLICATION DESIGN
Encouraged by the results from the study of Santana et al. [7] and
especially triggered by its suggestion to seek a generalized conversion method for FP and SP, we replicate the research with
our own set of primary historic data [14]. The method used for data collection was based on quantification of both duration and
cost of a repository of 26 finalized software engineering releases
that were collected over time by members of a measurement team
within two Finance & Risk departments in a banking company in
The Netherlands. For 14 of the inventoried releases (7 delivered by each team), size was measured in FP and estimated in SP. FPA was
performed by an experienced measurement specialist, based on
unadjusted FP according to NESMA guidelines [17], and reviewed by a certified FPA specialist (CFPA NESMA). The 14 analyzed software releases were performed during the period from February 2012 to August 2012 within two software engineering teams, in this study referred to as A and B. All releases
that were performed during that period were incorporated in the
research sample; no delivery activities were left out. Both sets of
releases were performed within an IT-department that delivered
solutions at group level of a large bank. All releases were performed on two Finance & Risk systems. One release was always mapped to one system (single application mapping). For each
system a fixed team of experienced software developers was in
place. Both teams worked separately from each other, each at their
own location and for different business departments.
During the 13 months covered by the collected data, both teams delivered a set of software solutions on a monthly basis, consisting of new developments, enhancements on existing software, and maintenance. The heartbeat was steady: once a month a 'Go Live moment' was scheduled, and during each month two sprints (or iterations) of two weeks each were performed. Both development teams transformed during the measurement period from a more or less plan-driven delivery model (waterfall) towards an agile approach (Scrum). Data was collected during a three-month period at the end of the software engineering period that was in scope for this study. The implemented size of the analyzed small software releases was measured in a number of FP, based on the set of functional requirements that was prepared for each separate release. During the last 8 months of the measurement period both software engineering teams worked according to Scrum. For all releases that were performed according to a Scrum delivery approach (January 2012 and onwards) the delivered size was estimated in a number of SP as well. The SP were counted by the release team itself, based on what was delivered: the velocity. No data, neither measured in FP nor estimated in SP, was excluded from the dataset. Exclusion of outliers was not applicable for the analyzed release data.

We replicate Santana's study with the data from our own repository. Because SP are a relative measure, we analyze the data of both teams separately (referred to as Bank_Data_A and Bank_Data_B), instead of analyzing our data as one set. Following Santana's research we subsequently perform three tests on each research dataset (A and B), each consisting of 7 releases measured in both FP and SP:

1. Test of the normality of the FP by using the Shapiro-Wilk normality test;
2. Test of the normality of the SP by using the Shapiro-Wilk normality test;
3. Test of the statistical correlation between FP and SP by using the Spearman's rank correlation test.

In line with the research performed by Santana et al. we used the statistical tool R to perform the statistical tests. The results of the Shapiro-Wilk normality tests were:

Shapiro-Wilk normality test

data: Bank_Data_FP_A
W = 0.9235, p-value = 0.4974

data: Bank_Data_SP_A
W = 0.9017, p-value = 0.3411

data: Bank_Data_FP_B
W = 0.9323, p-value = 0.5702

data: Bank_Data_SP_B
W = 0.9479, p-value = 0.7102

Unlike in Santana et al. [7], the normality tests of both the SP variable and the FP variable of datasets A and B indicate that the variables can be considered normal. Usually this would lead to the use of a parametric method to test the statistical correlation; however, to stay in sync with the method used by Santana et al. [7], we use both the Spearman's rank correlation test and the Pearson's correlation test for this purpose. The results of the Spearman's rank correlation test on our set of data were:

Spearman's rank correlation rho

data: Bank_Data_SP_A and Bank_Data_FP_A
S = 76, p-value = 0.4444
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
-0.3571429

data: Bank_Data_SP_B and Bank_Data_FP_B
S = 90, p-value = 0.1667
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
-0.6071429

In contrast to the outcome of the Spearman's rank correlation test on Santana's data, we conclude that the values of ρ ≈ -0.3571 and ρ ≈ -0.6071 for our datasets indicate a weak, respectively moderate, downhill (negative) linear relation. Given the high p-values of 0.4444 and 0.1667, neither correlation is statistically significant, so only limited confidence can be placed in these estimates. For visual verification of the strength of linear correlation, we included our data in the scatter plot in Figure 1. The scatter plot shows points slightly decreasing in a linear pattern, which supports the Spearman's rank correlation test.

Since both tested variables (FP and SP) are considered normal, we use, besides the non-parametric method above, a parametric method to test the statistical correlation as well. The results of the Pearson's correlation test are:

Pearson's product-moment correlation

data: Bank_Data_SP_A and Bank_Data_FP_A
t = -1.2154, df = 5, p-value = 0.2785
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9051144 0.4302074
sample estimates:
       cor
-0.4775694

data: Bank_Data_SP_B and Bank_Data_FP_B
t = -2.8498, df = 5, p-value = 0.03583
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.96692886 -0.08263097
sample estimates:
       cor
-0.7867338

The outcome of the Pearson's correlation test on our datasets indicates a weak downhill (negative) linear relationship between FP and SP for dataset A, and a strong downhill (negative) linear relationship for dataset B. Given the p-values of 0.2785 and 0.03583 respectively, only the correlation for dataset B is statistically significant at the 5% level. Although minor differences occur between the two correlation test methods used, and taking into account that we only used a limited set of research data for our tests, the overall outcome justifies the conclusion that a comparison between FP and SP as measured in our repository indicates in one case a weak to moderate negative linear relation and in the other a strong negative linear relation.
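For readers who want to re-run these tests, the following R sketch reproduces them, reusing the vectors transcribed from Table 1 in Section 2 (the vector names mirror the labels in the R output above):

# Non-parametric (Spearman) and parametric (Pearson) correlation tests,
# run per team because SP are only comparable within one team.
cor.test(Bank_Data_SP_A, Bank_Data_FP_A, method = "spearman")  # rho ~ -0.357
cor.test(Bank_Data_SP_B, Bank_Data_FP_B, method = "spearman")  # rho ~ -0.607
cor.test(Bank_Data_SP_A, Bank_Data_FP_A, method = "pearson")   # cor ~ -0.478
cor.test(Bank_Data_SP_B, Bank_Data_FP_B, method = "pearson")   # cor ~ -0.787

# Shapiro-Wilk normality tests on each variable.
shapiro.test(Bank_Data_FP_A); shapiro.test(Bank_Data_SP_A)
shapiro.test(Bank_Data_FP_B); shapiro.test(Bank_Data_SP_B)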
6. EVALUATION RESULTS
Based on the outcome of our research we argue that it appears still too early to make generic claims on the relation between FP and SP; in fact, FSM theory seems to support the view that such a relationship is a spurious one. The case study performed by Santana et al. [7] revealed that a comparison of 19 software development iterations, all measured in both FP and SP, showed a strong positive linear relation. In contrast, our research, in which we replicated Santana's study with our own data on 14 software development iterations from two development teams, also measured in both FP and SP, showed a weak, respectively strong, negative linear relation between both size metrics. Summarized, we conclude that both studies revealed a linear relation between FP and SP; yet in the original study this relation proved to be a strong positive one, while in our two cases the relation was weak negative for the first team and strong negative for the second.
7. DISCUSSION
The results from our study support the often-heard saying that SP
cannot be (or should not be) compared with functional size
measurements such as FP. FP are assumed to be objective functional size measurements, based on ISO/IEC standardized guidelines [16] [17]. They are widely known and used within the industry [15] and cover functional requirements only. SP on the other
hand, are at best reliable within the scope of one software development team; since no commonly shared and formalized guidelines exist for the measurement of SP, results from these estimations cannot be compared across teams or companies
[22] [23] [11] [25]. Moreover, SP cover both functional and non-functional requirements, whereas FP are about functional requirements only. Taking all these considerations into account, we argue that they are a good premise for revisiting and replicating Santana et al.'s study. Looking at the remark made by Santana et al. [7],
p.188, the outcome of their research was seen as a surprise too:
“Even being used to the same goal, FP and SP presents strong
theoretical differences. Whereas the results of this study it is still
surprising, once was observed a correlation between the functional size that is obtained accurately with impersonal method of
sizing, and SP obtained purely from the experience of the team”.
So, what can we do with our results? Santana et al. [7] stated the
goal of their research to be to motivate how companies can find
their ratio between FP and SP. Unless the theory on the assumed
spuriousness of a study on correlation between both FP and SP,
our research does not completely prove this is an unfeasible idea.
We used a limited set of research data, and it still could be valid
for one company to find a reliable ratio between both size metrics
within the scope of the company itself, assuming that within that
company all software developers determine SP in the same way. Our research shows that, looking across a wider scope of software development teams from different companies in different countries, a reliable ratio between FP and SP is not found; on the contrary, we found major differences in the outcomes of the correlation tests. Based on these outcomes, and due to the lack of standardized guidelines for SP measurement, our expectation is that even within the boundaries of one single company different software development teams will show different ratios between both metrics. We expect that future research, on larger amounts of research data from different companies, may test this expectation.

In our study we did not find evidence for the promising idea for future research or application that Santana et al. [7] proposed: the description of a method that can be used by companies that are facing the problem of how to assess the relationship between FP and SP, and finally find a first-degree equation (y = Ax + B), where y refers to the number of FP, x is the amount of SP, and A and B are constants (see the sketch below). However, we emphasize that we based our study on a limited, quite small set of data. We support the idea stated by Cohn [23] that SP are meant to describe the size of a feature or user story compared to another within the scope of a development team. The goal of SP is the estimation of work to do and communicating this to the business [2] [5] [8]. Due to their subjective and relative nature we expect that SP cannot be used for purposes that exceed a single project, such as portfolio management, program management, and internal and external benchmarking of software engineering activities. For those purposes FP have proven to be suitable [5] [2] [10]. Therefore, we strongly support the idea that companies that work in either a traditional or an agile way collect traditional size measures (such as FP) for portfolio management, program management, and benchmarking purposes; and that companies that work according to an agile method additionally collect size measures in an agile size metric (such as SP) for estimation purposes.
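To illustrate what fitting such a conversion equation would involve, here is a minimal R sketch, reusing the vectors transcribed from Table 1 earlier; the fit_A and fit_B names are ours. lm() estimates y = Ax + B by least squares; with our data the slopes come out negative and the fits poor, consistent with the correlations reported above:

# Fit the first-degree conversion equation y = Ax + B per team,
# with y the number of FP and x the number of SP (Santana et al.'s proposal).
fit_A <- lm(Bank_Data_FP_A ~ Bank_Data_SP_A)
fit_B <- lm(Bank_Data_FP_B ~ Bank_Data_SP_B)
coef(fit_A)               # intercept B and slope A for team A
coef(fit_B)               # intercept B and slope A for team B
summary(fit_A)$r.squared  # ~0.23 for team A: a poor fit
summary(fit_B)$r.squared  # ~0.62 for team B, but with a negative slope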
8. THREATS TO VALIDITY
Since our study is a replication of the research by Santana et al. [7], our study suffers from some of the same threats to validity. This holds in particular for the threats to construct and conclusion validity. Furthermore, based on our relatively high p-values, similar to Santana et al. [7] we emphasize that "the number of samples are still small to reach any definitive conclusion on this study".

With respect to internal validity we refer to the method(s) used for counting FP by the two organizations involved. Within the case study performed in the research by Santana et al. [7] the IFPUG guidelines [16] were used, while the Dutch company that was subject of our study performed Function Point Analysis according to NESMA guidelines [17]. Practice shows that both guidelines are quite compatible [19]; because of that, no differences in the outcomes of the correlation tests are to be expected.

Santana et al.'s [7] remark that a lack of knowledge of which statistical method is most appropriate to use affects the threats to construct validity. As we followed the approach described in the case study of Santana's paper, this threat might be applicable to us too, although by using the same methods a comparison of the outcomes of both studies seems valid to us.

Another threat to internal validity mentioned by Santana et al. [7] is applicable to us too: a lack of consolidated theoretical concepts about a possible correlation between the approaches might have contributed to weakening several factors in the study, such as selection of the wrong method or pooling of demands.
The way of working with regard to data collection and measurement within the banking company that was in scope of our study is representative of a large organization that performs software
engineering activities on Finance & Risk systems, supported by
internal teams. We expect that in comparable companies (e.g.
relatively large companies that perform software engineering activities on systems for internal use, supported by internal teams)
the outcome of analysis will be comparable too.
9. RELATED WORK
Practice shows that the use of both traditional FP (in this paper we refer to FP according to IFPUG and NESMA guidelines only; other FSM methods, such as the more recent COSMIC ISO 19761 method, are not within the scope of our study) and the more contemporary SP in the same software engineering environment can be challenging. An unambiguous definition of an SP is not easy to find, maybe due to the fact that measurement and analysis are simply not mentioned in agile methods. Jones [9] states it quite
clearly: “There is much more quantitative data available on the
results of projects using the CMMI than for projects using the
agile methods. In part this is because the CMMI includes measurement as a key practice. Also, while agile projects do estimate
and measure, some of the agile metrics are unique and have no
benchmarks available as of 2007. For example, some agile projects use SP, some use running tested features, some use ideal
time, and some measure with use case points” [9].
Despite the differences between FP and SP, they are more and more used in combination, although FP has the largest following in comparison to Object Points, Feature Points, Use Case Points, and SP [25] [26]. Especially within organizations that use Scrum, 50% use SP for estimating [26]. In such cases FP are mostly used at a portfolio level, to prioritize projects in the portfolio backlog and to analyze and benchmark a company's overall IT performance against internal and external peer groups. FP do not interfere with agile estimation techniques; they do give a better handle on the initial estimate and on the size of the scope delivered.
Agile methodologies seem to be mainly founded on functional analysis and not supported by historical data. The practice of collecting data from projects is not required by most agile methods [21]. And with regard to standardization, each agile methodology and each team applying a certain agile approach uses its own definitions; adequate measurements need to be based on standards and require inputs with a validated quality level [21]. Chemuturi
[25] states that “there is no universally accepted definition of what
constitutes a SP. There is also no defined methodology to which
software estimators can refer when estimating in SP” [25]. And
Fehlmann and Santillo [11] are even clearer, arguing that
“SP are not a prediction function, since they don't identify the
controls needed for a transfer function – in Six Sigma terms, the
difference between predicted size and implemented size is unpredictable, thus out of tolerance; hence SP are not a measurement
method” [11].
Furthermore FP are one of the metrics that can be presented at
portfolio level to the management [10]. Bhalerao and Ingle [12]
state that “agile methods do not hold big upfront for early estimation of size, cost and duration due to uncertainty in requirements.
It has been observed that agile methods mostly rely on an expert
opinion and historical data of project for estimation of cost, size
and duration. It has been observed that these methods do not consider the vital factors affecting the cost, size and duration of project for estimation. In absence of historical data and experts, existing agile estimation methods such as analogy, planning poker
become unpredictable. Therefore, there is a strong need to devise
simple algorithmic method that incorporates the factors affecting
the cost, size and duration of project" [12]. In short, many companies that work in both a traditional way and an agile way do collect traditional size measures (like FP) in combination with agile
size measures (like SP) [27] [21].
Yet a look at some definitions of SP teaches us that a certain need for new size measurements, besides the more traditional FP, is present. Product features are represented in User Stories and
size of a story is represented in SP [4] [5]. An SP is a (variable) number of days needed to realize a User Story [21]. The value of an
SP correlates with the required implementation effort of a User
Story [7].
Cohn [23] argues that SP are meaningless in isolation and thus relative size measures: "One way to measure the size of a User Story or a feature is to use SP. In contrast to FP, SP are unique to one team.
The numbers do not actually mean anything; they are only a way
to describe the size of one feature/user story compared to another.
In other words, SP are relative. It is an estimate that includes the
complexity of the feature, the effort needed to develop the feature
and the risk in developing it” [23]. This characterization is mentioned in many studies on agile software development; cost per SP
cannot be standardized across the industry [5]; SP are regarded to
be indicators of business value [5]; and in a number of studies it is
mentioned that SP speak more directly to the business value added
in each iteration than more traditional measures [2] [5] [8], or that
SP “while having little value outside of a specific software
development group for benchmarking or comparison studies, offer
a great deal of external value for communicating productivity and
quality and provide an excellent tool for negotiating features with
management” [2].
SP are about valuing and measuring User Stories, although standardization is missing. What does this imply when we compare SP with the more traditional measure for size, FP? Some statements underpin the use of FP by stating that the best standard metric to compare productivity across projects is FP [5]. As SP are not translatable between projects, the size of a project also has to be measured in FP [5]. FP create a context for software measurement based on the software's business value [2]. Yet other studies argue the opposite, stating that FP are not suitable for effort estimates in agile projects, due to their too fine granularity and their insufficient support of feedback and change [8].

10. CONCLUDING REMARKS
Based on the outcome of our study we conclude that it is still far too early to make generic claims on the relation between FP and SP; in fact, FSM theory seems to support the view that such a relationship is a spurious one. Although SP show a (positive or negative) linear correlation when compared with FP, comparison across different software development teams proved to be unreliable.
Therefore we see no basis for the suggested automated conversion
between FP and SP.
ACKNOWLEDGMENTS
We thank Georgios Gousios, Arie van Deursen, and all other reviewers for their valuable and guiding contributions.
REFERENCES
[1] K. Moløkken and M. Jørgensen, "A Review of Surveys on Software Effort Estimation," in IEEE Proceedings of the 2003 International Symposium on Empirical Software Engineering (ISESE'03), 2003.
[2] A. F. Minkiewicz, "The Evolution of Software Size: A Search for Value," Software Engineering Technology, vol. March/April, pp. 23-26, 2009.
[3] K. Schwaber, "SCRUM Development Process," in Business Object Design and Implementation, Springer-Verlag London Limited, 1997, pp. 117-134.
[4] J. Sutherland, G. Schoonheim and M. Rijk, "Fully Distributed Scrum: Replicating Local Productivity and Quality with Offshore Teams," in IEEE 42nd Hawaii International Conference on System Sciences, 2009.
[5] J. Sutherland, G. Schoonheim, E. Rustenburg and M. Rijk, "Fully Distributed Scrum: The Secret Sauce for Hyperproductive Offshored Development Teams," in IEEE Agile 2008 Conference, 2008.
[6] T. Sulaiman, B. Barton and T. Blackburn, "AgileEVM – Earned Value Management in Scrum Projects," in IEEE AGILE 2006 Conference (AGILE'06), 2006.
[7] C. Santana, F. Leoneo, A. Vasconcelos and C. Gusmão, "Using Function Points in Agile Projects," in Agile Processes in Software Engineering and Extreme Programming, Lecture Notes in Business Information Processing Volume 77, Berlin Heidelberg, Springer-Verlag, 2011, pp. 176-191.
[8] A. Schmietendorf, M. Kunz and R. Dumke, "Effort estimation for Agile Software Development Projects," in 5th Software Measurement European Forum, Milan, 2008.
[9] C. Jones, "Development Practices for Small Software Applications," Software Productivity Research, 2008.
[10] A. Tengshe and S. Noble, "Establishing the Agile PMO: Managing variability across Projects and Portfolios," in IEEE Agile Conference, 2007.
[11] T. Fehlmann and L. Santillo, "From Story Points to COSMIC Function Points in Agile Software Development – A Six Sigma perspective," in Metrikon - Software Metrik Kongress, 2010.
[12] S. Bhalerao and M. Ingle, "Incorporating Vital Factors in Agile Estimation through Algorithmic Method," International Journal of Computer Science and Applications - Technomathematics Research Foundation, vol. 6, no. 1, pp. 85-97, 2009.
[13] A. Fuqua, "Using Function Points in XP - Considerations," in Extreme Programming and Agile Processes in Software Engineering, Lecture Notes in Computer Science Volume 2675, Springer-Verlag Berlin Heidelberg, 2003, pp. 340-342.
[14] H. Huijgens and R. van Solingen, "Measurement of Best-in-Class Software Releases," in IEEE Joint Conference of the 23rd International Workshop on Software Measurement and the Eighth International Conference on Software Process and Product Measurement (IWSM-MENSURA), Ankara, Turkey, 2013.
[15] C. F. Kemerer and B. S. Porter, "Improving the Reliability of Function Point Measurement: An Empirical Study," IEEE Transactions on Software Engineering, vol. 18, no. 11, pp. 1011-1024, 1992.
[16] IFPUG, IFPUG FSM Method: ISO/IEC 20926 - Software and systems engineering – Software measurement – IFPUG functional size measurement method, New York: International Function Point User Group (IFPUG), 2009.
[17] NESMA, NESMA functional size measurement method conform ISO/IEC 24570, version 2.2, Netherlands Software Measurement User Association (NESMA), 2004.
[18] IFPUG, ISO/IEC 14764:2006 Software Engineering - Software Life Cycle Processes - Maintenance (with four subtypes of maintenance), International Function Point User Group (IFPUG), 2006.
[19] NESMA, "FPA according to NESMA and IFPUG; the actual situation (in Dutch)," Netherlands Software Metrics User Association (NESMA; www.nesma.nl), 2008.
[20] S. Trudel and L. Buglione, "Guideline for Sizing Agile Projects with COSMIC," in IEEE International Workshop on Software Measurement (IWSM/MetriKon), Stuttgart, Germany, 2010.
[21] L. Buglione and A. Abran, "Improving Estimations in Agile Projects: Issues and Avenues," in Software Measurement European Forum (SMEF), Rome, Italy, 2007.
[22] A. Abran, Software Metrics and Software Metrology, Wiley-IEEE Computer Society Press, 2010.
[23] M. Cohn, Agile Estimating and Planning, Upper Saddle River, NJ: Pearson Education, 2006.
[24] L. Buglione and A. Abran, "Improving the User Story Agile Technique Using the INVEST Criteria," in Joint Conference of the 23rd International Workshop on Software Measurement (IWSM) and the Eighth International Conference on Software Process and Product Measurement (Mensura), Ankara, Turkey, 2013.
[25] M. Chemuturi, Software Estimation Best Practices, Tools & Techniques: A Complete Guide for Software Project Estimators, J. Ross Publishing, 2009.
[26] A. S. C. Marçal, B. de Freitas, F. Furtado Soares, M. Furtado, T. Maciel and A. Belchior, "Blending Scrum practices and CMMI project management process areas," Innovations in Systems and Software Engineering, vol. 4, pp. 17-29, 2008.
[27] N. Abbas, A. Gravell and G. Wills, "The Impact of Organization, Project and Governance Variables on Software Quality and Project Success," in IEEE Agile Conference, 2010.