A Replicated Study on Correlating Agile Team Velocity Measured in Function and Story Points

Hennie Huijgens, Delft University of Technology, The Netherlands, [email protected]
Rini van Solingen, Delft University of Technology and Prowareness, The Netherlands, [email protected]

ABSTRACT
Since the rapid growth of agile development methods for software engineering, more and more organizations measure the size of iterations, releases, and projects in both function points and story points. In 2011 Santana et al. performed a case study on the relation between function points and story points, based on data collected in a Brazilian government agency. In this paper we replicate this study, using data collected in a Dutch banking organization. Based on a statistical correlation test we find that a comparison between function points and story points as measured in our repository indicates a moderate negative linear relation, where Santana et al. concluded a strong positive linear relation between both size metrics in their case study. Based on the outcome of our study we conclude that it appears too early to make generic claims on the relation between function points and story points; in fact, FSM theory seems to underpin that such a relationship is a spurious one.

Categories and Subject Descriptors
D.2.8 [Metrics]: Product metrics, Process metrics

General Terms
Measurement.

Keywords
Story Point, Function Point, Function Point Analysis.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WETSoM'14, June 3, 2014, Hyderabad, India.
Copyright 2014 ACM 978-1-4503-2854-8/14/06...$15.00.
http://dx.doi.org/10.1145/2593868.2593874

1. INTRODUCTION
The importance of effort estimation in the software industry is well known. Project overruns are frequent, and managers state that accurate estimation is perceived as a problem [1]. Software size can be an important component of a cost or effort estimate [2]. Two ways of size measurement are often used in practice. Function points (FP) are an industry standard for expressing the functional size of software products. Story points (SP), on the other hand, are used to determine the velocity of Scrum teams [3]. In this paper, we explore the relationship between these two measures.

Since the rapid growth of agile development methods for software engineering, an increased interest can be found in literature on effort estimation of product features in agile environments, such as SP [4] [5] [6] [7] [2]. In particular, articles on agile delivery methods describe the backgrounds of estimating the relative effort of software products, and the coherence between SP and size measures such as FP [4] [5] [8] [9] [10] [11] [12]. Yet little of the present literature offers a thorough quantitative analysis of any statistical correlation between both metrics.

To make things more complicated, the limited set of quantitative studies on both FP and SP are not in sync with regard to their outcomes. In 2003 Fuqua [13] reported on a controlled experiment on the use of FP in combination with XP. For 100 user stories he tried to estimate the effort in FP and also in the number of ideal engineering days, an SP-like measure often used within XP projects. The outcome of the experiment was that the FP procedure used was unable to estimate the required effort, and that FP supports the measurement of the load factor insufficiently. More recently, in 2011, a quantitative analysis of the correlation between SP and FP was performed by Santana et al. [7]. That case study demonstrated a strong positive correlation between SP and FP, both collected on 19 iterations from a project realized by a Brazilian public agency. Together with a number of qualitative research studies that argued that FP and SP are from different planets and cannot be compared [5] [2] [11], a rather confusing picture of size metrics in software engineering emerges.

Given the growing importance of reliable, standardized, and quantifying tools for measurement and analysis of software engineering in agile environments, we propose to replicate the study performed by Santana et al. [7] with our own research data. As described in an earlier study [14], we collected a repository with primary data of more than 350 finalized software engineering projects. A subset of this repository represents 26 small software releases that were performed during one year by two software development teams within a Finance & Risk department of a large banking company in The Netherlands. 14 of these iterations were performed according to an agile development method (Scrum); for those iterations the size of each release was measured in FP and estimated in SP.

Similar to Santana et al. [7], the research objective of our study is to quantitatively analyze the relationship between SP and FP, based on empirical data from a real-life case study where two sets of releases are measured in these two size metrics. The contribution of this research is to examine whether FP are compatible with SP on agile projects.

In this paper we subsequently discuss in Section 2 the background of size metrics in both traditional software development environments (e.g. FP) and agile delivery environments (e.g. SP), in Section 3 the problem statement, in Section 4 the original study, in Section 5 the replication design, in Section 6 evaluation results, in Section 7 a discussion of the results, in Section 8 threats to validity, in Section 9 related work, and in Section 10 conclusions and future work.

2. BACKGROUND
Function Point Analysis (FPA) is an established industry standard for expressing the functional size of software products [15]. FPA measures the functional size of software engineering activities by looking at relevant functions for users and (logical) data sets. The unit of measurement is the function point (FP).
The functional size of an information system or a project is thus expressed in a number of FP. FPA is used for both new construction [16] [17] and maintenance activities (though not valid for corrective and part of adaptive activities) [18]. Counting guidelines, which describe how to perform an FPA, are improved, monitored, and controlled by organizations of FPA users such as IFPUG [16] and NESMA [17]. The various methods differ in practice, though not much from one another [19]. Both methods are certified and recorded as an ISO standard for Functional Size Measurement (FSM) [16] [17].¹

¹ The IFPUG method consists of two parts: Unadjusted Function Points and Adjusted Function Points. ISO has not recognized the full IFPUG method; it has accepted only the unadjusted part.

SP, on the other hand, are a rough estimate, made by experience and analogy, of the relative 'size and complexity' of each user story compared to other user stories in the same project [20]. SP are closely related to the concept of velocity. The velocity of a Scrum team is determined by the number of SP that the team can complete, using a standardized definition of done, in a single iteration [4]. SP are defined within a software development team and are used upfront by the team members to avoid taking up too much work, and afterwards to measure the actual velocity of successive iterations. The basic idea is that over time these teams become proficient at assigning SP to the software features they are asked to develop [2] (although this is not equivalent to good and accurate results). Product features are represented by user stories, and the size and complexity of a story is represented in SP [5].

In contrast to a function point, which is an absolute size metric that is assumed to be counted according to pre-arranged, standardized counting guidelines, an SP is relative. No formalized, standardized FSM guidelines are available for counting SP. Instead, a software development team uses mutually agreed values for expressing the relative size and complexity of one or more user stories. This relative size and complexity is related to the estimated size and complexity of previous, already completed user stories. A team of software developers for example says: "given that this previous feature was 20 SP, this new one is 40 SP, because it is more complex and harder to test". Another software development team may decide that similar stories have a relative size and complexity of 8 SP, and that the stories still to be built are estimated at 13 SP. Although the stories might be comparable from the point of view of functionality, different teams show different outcomes of relative size and complexity estimation; yet they all call them SP. As such they are not at all absolute, nor comparable across teams [21].

The value of SP, however, lies in the fact that they are a commonly shared estimate. All development team members within one team deliver their input in the determination of its size. This is in contrast to traditional size metrics, like FP, where size is measured by one or two experts, based on pre-defined guidelines. In practice SP are used to estimate the amount of work and to measure the velocity within teams that work in an agile way. The development team members in such teams are mutually responsible for the planning of their work.

Table 1. Data for two variables in three samples

Santana et al.
Period    FP      SP
Feb 09    64      540
Mar 09    41      437
Apr 09    67      787
May 09    51      593
Jun 09    65      474
Jul 09    130     648
Aug 09    156     787
Sep 09    159     758
Oct 09    106     535
Nov 09    91      480
Dec 09    54      262
Jan 10    45      373
Feb 10    43      312
Mar 10    71      506
Apr 10    49      358
May 10    90      819
Jun 10    74      742
Jul 10    66      469
Aug 10    71      652
Max       159     819
Mean      78.58   554.32
Median    67      535
Min       41      262
St.Dev.   35.76   171.01

Bank Data A
Period    FP      SP
Feb 12    14      26
Mar 12    21      57
Apr 12    13      75
May 12    48      48
Jun 12    41      29
Jul 12    32      45
Aug 12    55      27
Max       55      75
Mean      32.00   43.86
Median    32      45
Min       13      26
St.Dev.   16.69   18.19

Bank Data B
Period    FP      SP
Jan 12    5       43
Feb 12    18      35
Mar 12    24      25
Apr 12    20      51
May 12    25      45
Jun 12    41      21
Jul 12    57      14
Max       57      51
Mean      27.14   33.43
Median    24      35
Min       5       14
St.Dev.   16.95   13.78
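As a minimal illustration (our addition; R is the tool used for the statistical tests in Section 5), the bank data in Table 1 can be restated as vectors, with names chosen to match the test output shown in Section 5, and the summary rows of the table can then be checked directly:

# Release data for teams A and B, taken from Table 1.
Bank_Data_FP_A <- c(14, 21, 13, 48, 41, 32, 55)
Bank_Data_SP_A <- c(26, 57, 75, 48, 29, 45, 27)
Bank_Data_FP_B <- c(5, 18, 24, 20, 25, 41, 57)
Bank_Data_SP_B <- c(43, 35, 25, 51, 45, 21, 14)

# Reproduce the summary rows of Table 1, e.g. for team A:
mean(Bank_Data_FP_A)    # 32.00
median(Bank_Data_SP_A)  # 45
sd(Bank_Data_FP_A)      # 16.69
sd(Bank_Data_SP_A)      # 18.19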
FP, on the other hand, are used not only for estimating purposes, but also for analyzing and benchmarking organization-wide portfolios of software engineering activities [2].

3. PROBLEM STATEMENT
Most of the studies performed on agile size metrics are not based on research of historic data, and are written from a theoretical or qualitative point of view. A quick scan we performed on related work indicates a lack of quantitative studies based on primary data, collected in companies that deliver software solutions according to an agile approach. Only recently, in 2011, a case study was performed comparing SP and FP on a set of 19 iterations performed within a Brazilian government agency where both SP and FP had been counted [7]. Contrary to the majority of theoretical studies, which argued that SP and FP cannot be used in comparison [23] [5] [2], the case study resulted in a positive correlation (a Spearman's rank correlation of 0.7 with p-value < 0.01) between the 19 iterations measured in SP and FP. This remarkable discrepancy between theory and practice encouraged us to perform complementary quantitative research on this subject, by replicating the study done by Santana et al. [7] with our own set of primary historic data.

However, in addition to the theoretical differences between FP and SP, and the assumed impossibility of comparing them [5] [2] [11], we emphasize that analyzing any correlation between FP and SP might be seen as spurious in a way. A fundamental issue behind comparing both metrics is that FP are a software size measure based on pre-defined measurement guidelines that cover functional requirements only, whereas SP do not correspond to a software size, and not even to actual effort, since they are about estimation (and not measurement), and they cover both functional and non-functional requirements. The backgrounds of these theoretical differences are described in specific studies on FSM theory in agile settings (e.g. [24]); yet up to now these theoretical differences have been poorly tested in quantitative research. Since the paper by Santana et al. [7] is the only available study that quantifies this subject (with a conclusion that, most carefully worded, implies that automated conversion of FP to SP might ever be possible), we argue that, in spite of the assumed spurious character of such an analysis, our study might help to gain better knowledge on this subject in both the empirical and the practical software engineering world.

[Figure 1. Scatter plot of Function Points (x-axis) versus Story Points (y-axis) for the three datasets: Santana et al., Bank Data A, and Bank Data B.]

4. THE ORIGINAL STUDY
Driven by rules and regulations of the Brazilian government, a Brazilian government agency called ATI was forced to follow an instruction stating that measurement is an integral part of the outsourcing strategy. In practice this led to a situation where size was counted in both FP (driven by the rules and regulations) and SP (the development team already worked according to Scrum). IFPUG guidelines [16] were used to measure FP. The SP of all demands of a sprint were estimated during planning poker meetings in which ATI and its supplier participated. Because ATI planned to reduce the initial measurement work at the beginning of each sprint (counting both SP and FP), a plan existed to create a method for converting between both metrics, once the statistical correlation proved to be strong enough. The case study that is part of the paper of Santana et al. [7] is based on the analysis performed to derive this conversion method, although the analysis to prove a linear regression based on strong statistical correlation was not finished when the paper was published.

For 19 iterations, performed from February 2009 to August 2010, both SP and FP were counted (see Table 1) and analyzed. These were velocity measures of the team; they measured the amount of software actually delivered (output), measured in FP and estimated in SP. In the Santana et al. part of Table 1, the columns FP and SP represent the amount of FP and SP collected in each month within the ATI organization.

Santana et al. [7] concluded that the value of ρ ≈ 0.7173 from the Spearman's rank correlation test means a strong positive correlation. Due to the low p-value of 0.0005989, a large confidence interval was concluded. For visual verification of the strength of the linear correlation, a scatter plot was constructed (see Figure 1). The scatter plot shows points growing in a linear pattern, which supported the Spearman's rank correlation test [7]. Based on the outcomes of these tests, Santana et al. [7] argued that, despite the strong differences in the size definitions of FP and SP, a strong positive correlation between both size metrics suggests that FP, in that particular case, could be related to the initial value of SP found after planning poker sessions. Santana et al. state that the result of their study cannot be generalized, and propose to replicate the method of assessment within other organizations. For further research the suggestion is made to find a more or less generic conversion method for FP and SP, based on the correlation found in the study.
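For reference, the statistic behind the ρ values in both studies is Spearman's rank correlation coefficient, which for n paired observations without ties is the standard textbook formula (not spelled out in either paper):

ρ = 1 - 6S / (n(n² - 1)), with S = Σ dᵢ²

where dᵢ is the difference between the FP rank and the SP rank of iteration i. The S statistic that R reports is exactly this sum of squared rank differences; for example, n = 7 and S = 76 (dataset A in Section 5) give ρ = 1 - 456/336 ≈ -0.357, matching the output reported there.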
5. REPLICATION DESIGN
Encouraged by the results from the study of Santana et al. [7], and especially triggered by its implication to seek generalization of a conversion method for FP and SP, we replicate the research with our own set of primary historic data [14]. The method used for data collection was based on quantification of both duration and cost of a repository of 26 finalized software engineering releases, collected over time by members of a measurement team within two Finance & Risk departments in a banking company in The Netherlands. For 14 of the inventoried releases (7 delivered by each team), size was measured in FP and estimated in SP. FPA was performed by an experienced measurement specialist, based on unadjusted FP according to NESMA guidelines [17], and reviewed by a certified FPA specialist (CFPA NESMA).

The 14 analyzed software releases were performed during the period from February 2012 to August 2012 within two software engineering teams, in this study referred to as A and B. All releases performed during that period were incorporated in the research sample; no delivery activities were left out. Both sets of releases were performed within an IT department that delivered solutions at group level of a large bank. All releases were performed on two Finance & Risk systems. One release was always mapped on one system (single application mapping). For each system a fixed team of experienced software developers was in place. Both teams worked separately from each other, each at their own location and for different business departments.

During the 13 months covered by the collected data, both teams delivered a set of software solutions on a monthly basis, consisting of new developments, enhancements on existing software, and maintenance. The heartbeat was steady: once a month a 'Go Live' moment was scheduled, and during each month two sprints (or iterations) of two weeks each were performed. Both development teams transformed during the measurement period from a more or less plan-driven delivery model (waterfall) towards an agile approach (Scrum). Data was collected during a data collection period of three months at the end of the software engineering period in scope for this study. The implemented size of the analyzed small software releases was measured in FP, based on the set of functional requirements prepared for each separate release. During the last 8 months of the measurement period both software engineering teams worked according to Scrum. For all releases performed according to a Scrum delivery approach (2012-01 and on), the delivered size was also estimated in SP. The SP were counted by the release team itself, based on what was actually delivered: the velocity. No data, whether measured in FP or estimated in SP, was excluded from the dataset. Exclusion of outliers was not applicable for the analyzed release data.

We replicate Santana's study with the data from our own repository. Because SP are a relative measure, we analyze the data of both teams separately (referred to as Bank_Data_A and Bank_Data_B), instead of analyzing our data as one set. Following Santana's research, we subsequently perform three tests on each of the two sets of 7 releases measured in both FP and SP (research datasets A and B):
1. a test of the normality of the FP variable, using the Shapiro-Wilk normality test;
2. a test of the normality of the SP variable, using the Shapiro-Wilk normality test;
3. a test of the statistical correlation between FP and SP, using the Spearman's rank correlation test.

In line with the research performed by Santana et al., we used the statistical tool R to perform the statistical tests. The results of the Shapiro-Wilk normality tests were:

Shapiro-Wilk normality test
data: Bank_Data_FP_A   W = 0.9235, p-value = 0.4974
data: Bank_Data_SP_A   W = 0.9017, p-value = 0.3411
data: Bank_Data_FP_B   W = 0.9323, p-value = 0.5702
data: Bank_Data_SP_B   W = 0.9479, p-value = 0.7102

Unlike in Santana et al. [7], both the SP variable and the FP variable of datasets A and B are considered normal. Usually this would lead to the use of a parametric method to test the statistical correlation; however, to stay in sync with the method used by Santana et al. [7], we use both the Spearman's rank correlation test and the Pearson's correlation test for this purpose. The results of the Spearman's rank correlation test on our set of data were:

Spearman's rank correlation rho
data: Bank_Data_SP_A and Bank_Data_FP_A
S = 76, p-value = 0.4444
alternative hypothesis: true rho is not equal to 0
sample estimates: rho = -0.3571429

data: Bank_Data_SP_B and Bank_Data_FP_B
S = 90, p-value = 0.1667
alternative hypothesis: true rho is not equal to 0
sample estimates: rho = -0.6071429

In contradiction to the outcome of the Spearman's rank correlation test on Santana's data, we conclude that the values of ρ ≈ -0.3571 and ρ ≈ -0.6071 for our datasets indicate a weak, respectively moderate, downhill (negative) linear relation. Due to the high p-values of 0.4444 and 0.1667, a rather small confidence interval can be concluded for both datasets.
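A sketch of how these outputs are obtained in R, using the Table 1 vectors defined earlier (our reconstruction; the original analysis scripts are not part of either paper):

# Steps 1 and 2: Shapiro-Wilk normality tests.
shapiro.test(Bank_Data_FP_A)  # W = 0.9235, p-value = 0.4974
shapiro.test(Bank_Data_SP_A)  # W = 0.9017, p-value = 0.3411
shapiro.test(Bank_Data_FP_B)  # W = 0.9323, p-value = 0.5702
shapiro.test(Bank_Data_SP_B)  # W = 0.9479, p-value = 0.7102

# Step 3: Spearman's rank correlation; S is the sum of squared rank differences.
cor.test(Bank_Data_SP_A, Bank_Data_FP_A, method = "spearman")  # S = 76, rho = -0.357
cor.test(Bank_Data_SP_B, Bank_Data_FP_B, method = "spearman")  # S = 90, rho = -0.607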
For visual verification of the strength of the linear correlation, we included our data in the scatter plot in Figure 1. The scatter plot shows points slightly decreasing in a linear pattern, which supports the outcome of the Spearman's rank correlation test.

Because both tested variables (FP and SP) are considered normal, we use, besides the non-parametric method above, a parametric method as well to test the statistical correlation. The results of the Pearson's correlation test are:

Pearson's product-moment correlation
data: Bank_Data_SP_A and Bank_Data_FP_A
t = -1.2154, df = 5, p-value = 0.2785
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.9051144 to 0.4302074
sample estimates: cor = -0.4775694

data: Bank_Data_SP_B and Bank_Data_FP_B
t = -2.8498, df = 5, p-value = 0.03583
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.96692886 to -0.08263097
sample estimates: cor = -0.7867338

The outcome of the Pearson's correlation test on our datasets indicates a weak downhill (negative) linear relationship between FP and SP for dataset A, and a strong downhill (negative) linear relationship for dataset B. Due to the p-values of respectively 0.2785 and 0.03583, a strong confidence interval is concluded. Although minor differences occur between the two correlation test methods used, and taking into account that we only used a limited set of research data for our tests, the overall outcome justifies the conclusion that a comparison between FP and SP as measured in our repository indicates in one case a weak to moderate negative linear relation, and in the other a strong negative linear relation.
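The parametric test reported above can be reproduced the same way (again a sketch against the Table 1 vectors):

# Pearson's product-moment correlation on the same data.
cor.test(Bank_Data_SP_A, Bank_Data_FP_A, method = "pearson")  # cor = -0.478, p-value = 0.2785
cor.test(Bank_Data_SP_B, Bank_Data_FP_B, method = "pearson")  # cor = -0.787, p-value = 0.03583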
6. EVALUATION RESULTS
Based on the outcome of our research we argue that it appears still too early to make generic claims on the relation between FP and SP; in fact, FSM theory seems to underpin that such a relationship is a spurious one. The case study performed by Santana et al. [7] revealed that a comparison of 19 software development iterations, all measured in both FP and SP, showed a strong positive linear relation. On the contrary, our research, in which we replicated Santana's study with our own data of 14 software development iterations from two development teams, also measured in both FP and SP, showed a weak, respectively strong, negative linear relation between both size metrics. Summarized, we conclude that both studies revealed a linear relation between FP and SP; yet in one case this relation proved to be a strong positive one, while in the other two cases the nature of the relation was weak negative in the first and strong negative in the second.

7. DISCUSSION
The results from our study support the often-heard saying that SP cannot be (or should not be) compared with functional size measurements such as FP. FP are assumed to be objective functional size measurements, based on ISO/IEC standardized guidelines [16] [17]. They are widely known and used within the industry [15] and cover functional requirements only. SP, on the other hand, are at best reliable within the scope of one software development team; since no commonly shared and formalized guidelines exist for the measurement of SP, results from these estimations cannot be compared with other teams or companies [22] [23] [11] [25]. Besides that, SP cover both functional and non-functional requirements, where FP are about functional requirements only.

Taking all considerations into account, we argue that these are a good premise for revising and replicating Santana et al.'s study. Looking at the remark made by Santana et al. [7], p. 188, the outcome of their research was seen as a surprise there too: "Even being used to the same goal, FP and SP presents strong theoretical differences. Whereas the results of this study it is still surprising, once was observed a correlation between the functional size that is obtained accurately with impersonal method of sizing, and SP obtained purely from the experience of the team".

So, what can we do with our results? Santana et al. [7] stated the goal of their research to be to motivate how companies can find their own ratio between FP and SP. Notwithstanding the theory on the assumed spuriousness of a study on the correlation between FP and SP, our research does not completely prove this to be an unfeasible idea. We used a limited set of research data, and it could still be valid for a single company to find a reliable ratio between both size metrics within the scope of the company itself, assuming that within that company all software developers determine SP in the same way. Our research does show that, when looking at a wider scope of software development teams from different companies in different countries, a reliable ratio between FP and SP is not found; on the contrary, we found major differences in the outcomes of the correlation tests. Based on these outcomes, and due to the lack of standardized guidelines for SP measurement, our expectation is that even within the boundaries of one single company, different software development teams will show different ratios between both metrics. We expect that future research, on larger amounts of research data from different companies, might confirm this expectation.

In our study we did not find evidence for the promising idea for future research or application that Santana et al. [7] proposed: the description of a method that can be used by companies facing the problem of how to assess the relationship between FP and SP, finally resulting in a first-degree equation (y = Ax + B), where y refers to the number of FP, x is the amount of SP, and A and B are constants. However, we emphasize that we based our study on a limited, quite small set of data.
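To make concrete what such a conversion method would amount to (our illustration, not a method proposed or validated by either study), fitting the first-degree equation per team in R is a one-liner; the point of our results is precisely that the fitted coefficients are not stable across teams:

# Hypothetical conversion y = Ax + B (FP from SP), fitted per team.
fit_A <- lm(Bank_Data_FP_A ~ Bank_Data_SP_A)
fit_B <- lm(Bank_Data_FP_B ~ Bank_Data_SP_B)
coef(fit_A)  # roughly FP = 51.2 - 0.44 * SP for team A
coef(fit_B)  # roughly FP = 59.5 - 0.97 * SP for team B
# predict(fit_A, ...) would then convert an SP estimate to FP, but with the
# weak, negative correlations found in Section 5 such a conversion is not meaningful.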
We support the idea stated by Cohn [23] that SP are meant to describe the size of a feature or user story compared to another within the scope of a development team. The goal of SP lies in the estimation of work to do and in communicating this to the business [2] [5] [8]. Due to their subjective and relative nature, we expect that SP cannot be used for purposes that exceed a single project, such as portfolio management, program management, and internal and external benchmarking of software engineering activities. For those purposes FP have proven to be suitable [5] [2] [10]. Therefore, we strongly support the idea that companies that work in either a traditional or an agile way collect traditional size measures (such as FP) for portfolio management, program management, and benchmarking purposes, and that companies that work according to an agile method additionally collect size measures in an agile size metric (such as SP) for estimation purposes.

8. THREATS TO VALIDITY
Since our study is a replication of research by Santana et al. [7], our study suffers from some of the same threats to validity. This holds in particular for the threats to construct and conclusion validity. Furthermore, given our relatively high p-values, similar to Santana et al. [7] we emphasize that "the number of samples are still small to reach any definitive conclusion on this study".

With respect to internal validity we point to the method(s) used for counting FP by the two organizations involved. Within the case study performed in the research by Santana et al. [7] the IFPUG guidelines [16] were used, while the Dutch company that was the subject of our study performed Function Point Analysis according to NESMA guidelines [17]. Practice shows that both guidelines are quite compatible [19]; because of that, no differences in the outcomes of the correlation tests are to be expected.

Santana et al.'s [7] remark that a lack of knowledge of which statistical method is most appropriate to use affects the threats to construct validity. As we followed the approach described in the case study of Santana's paper, this threat might apply to us too, although by using the same methods, a comparison of the outcomes of both studies seems valid to us.

Another threat to internal validity mentioned by Santana et al. [7] applies to us as well: a lack of consolidated theoretical concepts about a possible correlation between the approaches might have contributed to weakening several factors in the study, such as selection of the wrong method or pooling of demands.

The way of working with regard to data collection and measurement within the banking company in scope of our study is representative for a large organization that performs software engineering activities on Finance & Risk systems, supported by internal teams. We expect that in comparable companies (e.g. relatively large companies that perform software engineering activities on systems for internal use, supported by internal teams) the outcome of analysis will be comparable too.
9. RELATED WORK
Practice shows that the use of both traditional FP² and the more contemporary SP in the same software engineering environment can be challenging. An unambiguous definition of an SP is not easy to find, maybe due to the fact that measurement and analysis are simply not mentioned in agile methods. Jones [9] states it quite clearly: "There is much more quantitative data available on the results of projects using the CMMI than for projects using the agile methods. In part this is because the CMMI includes measurement as a key practice. Also, while agile projects do estimate and measure, some of the agile metrics are unique and have no benchmarks available as of 2007. For example, some agile projects use SP, some use running tested features, some use ideal time, and some measure with use case points" [9].

² In the context of this paper we refer to FP according to IFPUG and NESMA guidelines only. Other FSM methods, such as the more recent COSMIC ISO 19761 method, are not within the scope of our study.

Agile methodologies seem to be mainly founded on functional analysis and not supported by historical data. The practice of collecting data from projects is not required by most agile methods [21]. And with regard to standardization, each agile methodology and each team applying a certain agile approach uses its own definitions; adequate measurements need to be based on standards and require inputs with a validated quality level [21]. Chemuturi [25] states that "there is no universally accepted definition of what constitutes a SP. There is also no defined methodology to which software estimators can refer when estimating in SP" [25]. And Fehlmann and Santillo [11] are even more explicit, arguing that "SP are not a prediction function, since they don't identify the controls needed for a transfer function – in Six Sigma terms, the difference between predicted size and implemented size is unpredictable, thus out of tolerance; hence SP are not a measurement method" [11].
Yet looking at some definitions of SP teaches us that a certain need for new size measurements, besides the more traditional FP, is present. Product features are represented in user stories, and the size of a story is represented in SP [4] [5]. An SP is a (variable) number of days needed to realize a user story [21]. The value of an SP correlates with the required implementation effort of a user story [7].

Cohn [23] argues that SP are meaningless and thus relative size measures: "One way to measure the size of a User Story or a feature is to use SP. In contrast to FP, SP are unique to one team. The numbers do not actually mean anything; they are only a way to describe the size of one feature/user story compared to another. In other words, SP are relative. It is an estimate that includes the complexity of the feature, the effort needed to develop the feature and the risk in developing it" [23]. This characterization is mentioned in many studies on agile software development: cost per SP cannot be standardized across the industry [5]; SP are regarded to be indicators of business value [5]; and a number of studies mention that SP speak more directly to the business value added in each iteration than more traditional measures [2] [5] [8], or that SP "while having little value outside of a specific software development group for benchmarking or comparison studies, offer a great deal of external value for communicating productivity and quality and provide an excellent tool for negotiating features with management" [2].

SP are thus about valuing and measuring user stories, although standardization is missing. What does this imply when we compare SP with the more traditional measure for size, FP? Some statements underpin the use of FP by stating that the best standard metric to compare productivity across projects is FP [5]. As SP are not translatable between projects, the size of a project also has to be measured in FP [5]. FP create a context for software measurement based on the software's business value [2]. Yet other studies argue the opposite, stating that FP are not suitable for effort estimates in agile projects, due to their too fine granularity and their insufficient support of feedback and change [8].

Despite the differences between FP and SP, they are more and more often used in combination, although FP has the largest following in comparison to Object Points, Feature Points, Use Case Points, and SP [25] [26]. Especially within organizations that use Scrum, 50% use SP for estimating [26]. In such cases FP are mostly used at a portfolio level, to prioritize projects in the portfolio backlog and to analyze and benchmark a company's overall IT performance against internal and external peer groups. FP do not interfere with agile estimation techniques; they do give a better handle on the initial estimate and on the size of the scope delivered. Furthermore, FP are one of the metrics that can be presented at portfolio level to the management [10].

Bhalerao and Ingle [12] state that "agile methods do not hold big upfront for early estimation of size, cost and duration due to uncertainty in requirements. It has been observed that agile methods mostly rely on an expert opinion and historical data of project for estimation of cost, size and duration. It has been observed that these methods do not consider the vital factors affecting the cost, size and duration of project for estimation. In absence of historical data and experts, existing agile estimation methods such as analogy, planning poker become unpredictable. Therefore, there is a strong need to devise simple algorithmic method that incorporates the factors affecting the cost, size and duration of project" [12].

In short, many companies that work in both a traditional and an agile way collect traditional size measures (like FP) in combination with agile size measures (like SP) [27] [21].

10. CONCLUDING REMARKS
Based on the outcome of our study we conclude that it is still far too early to make generic claims on the relation between FP and SP; in fact, FSM theory seems to underpin that such a relationship is a spurious one. Although SP show a (positive or negative) linear correlation when compared with FP, comparison across different software development teams proved to be unreliable. Therefore we see no basis for the suggested automated conversion between FP and SP.

ACKNOWLEDGMENTS
We thank Georgios Gousios, Arie van Deursen, and all other reviewers for their valuable and guiding contributions.

REFERENCES
[1] K. Moløkken and M. Jørgensen, "A Review of Surveys on Software Effort Estimation," in Proceedings of the 2003 International Symposium on Empirical Software Engineering (ISESE'03), IEEE, 2003.
[2] A. F. Minkiewicz, "The Evolution of Software Size: A Search for Value," Software Engineering Technology, March/April, pp. 23-26, 2009.
[3] K. Schwaber, "SCRUM Development Process," in Business Object Design and Implementation, Springer-Verlag London, 1997, pp. 117-134.
[4] J. Sutherland, G. Schoonheim and M. Rijk, "Fully Distributed Scrum: Replicating Local Productivity and Quality with Offshore Teams," in 42nd Hawaii International Conference on System Sciences, IEEE, 2009.
[5] J. Sutherland, G. Schoonheim, E. Rustenburg and M. Rijk, "Fully Distributed Scrum: The Secret Sauce for Hyperproductive Offshored Development Teams," in Agile 2008 Conference, IEEE, 2008.
[6] T. Sulaiman, B. Barton and T. Blackburn, "AgileEVM – Earned Value Management in Scrum Projects," in AGILE 2006 Conference (AGILE'06), IEEE, 2006.
[7] C. Santana, F. Leoneo, A. Vasconcelos and C. Gusmão, "Using Function Points in Agile Projects," in Agile Processes in Software Engineering and Extreme Programming, Lecture Notes in Business Information Processing, Volume 77, Springer-Verlag, Berlin Heidelberg, 2011, pp. 176-191.
[8] A. Schmietendorf, M. Kunz and R. Dumke, "Effort estimation for Agile Software Development Projects," in 5th Software Measurement European Forum, Milan, 2008.
[9] C. Jones, "Development Practices for Small Software Applications," Software Productivity Research, 2008.
[10] A. Tengshe and S. Noble, "Establishing the Agile PMO: Managing variability across Projects and Portfolios," in Agile Conference, IEEE, 2007.
[11] T. Fehlmann and L. Santillo, "From Story Points to COSMIC Function Points in Agile Software Development – A Six Sigma perspective," in Metrikon – Software Metrik Kongress, 2010.
[12] S. Bhalerao and M. Ingle, "Incorporating Vital Factors in Agile Estimation through Algorithmic Method," International Journal of Computer Science and Applications, Technomathematics Research Foundation, vol. 6, no. 1, pp. 85-97, 2009.
[13] A. Fuqua, "Using Function Points in XP – Considerations," in Extreme Programming and Agile Processes in Software Engineering, Lecture Notes in Computer Science, Volume 2675, Springer-Verlag, Berlin Heidelberg, 2003, pp. 340-342.
[14] H. Huijgens and R. van Solingen, "Measurement of Best-in-Class Software Releases," in Joint Conference of the 23rd International Workshop on Software Measurement and the Eighth International Conference on Software Process and Product Measurement (IWSM-MENSURA), Ankara, Turkey, IEEE, 2013.
[15] C. F. Kemerer and B. S. Porter, "Improving the Reliability of Function Point Measurement: An Empirical Study," IEEE Transactions on Software Engineering, vol. 18, no. 11, pp. 1011-1024, 1992.
[16] IFPUG, IFPUG FSM Method: ISO/IEC 20926 – Software and systems engineering – Software measurement – IFPUG functional size measurement method, New York: International Function Point User Group (IFPUG), 2009.
[17] NESMA, NESMA functional size measurement method conform ISO/IEC 24570, version 2.2, Netherlands Software Measurement User Association (NESMA), 2004.
[18] IFPUG, ISO/IEC 14764:2006 Software Engineering – Software Life Cycle Processes – Maintenance (with four subtypes of maintenance), International Function Point User Group (IFPUG), 2006.
[19] NESMA, "FPA according to NESMA and IFPUG; the actual situation" (in Dutch), Netherlands Software Metrics User Association (NESMA; www.nesma.nl), 2008.
[20] S. Trudel and L. Buglione, "Guideline for Sizing Agile Projects with COSMIC," in International Workshop on Software Measurement (IWSM/MetriKon), Stuttgart, Germany, 2010.
[21] L. Buglione and A. Abran, "Improving Estimations in Agile Projects: Issues and Avenues," in Software Measurement European Forum (SMEF), Rome, Italy, 2007.
[22] A. Abran, Software Metrics and Software Metrology, Wiley-IEEE Computer Society Press, 2010.
[23] M. Cohn, Agile Estimating and Planning, Upper Saddle River, NJ: Pearson Education, 2006.
[24] L. Buglione and A. Abran, "Improving the User Story Agile Technique Using the INVEST Criteria," in Joint Conference of the 23rd International Workshop on Software Measurement (IWSM) and the Eighth International Conference on Software Process and Product Measurement (Mensura), Ankara, Turkey, 2013.
[25] M. Chemuturi, Software Estimation Best Practices, Tools & Techniques: A Complete Guide for Software Project Estimators, J. Ross Publishing, 2009.
[26] A. S. C. Marçal, B. de Freitas, F. Furtado Soares, M. Furtado, T. Maciel and A. Belchior, "Blending Scrum practices and CMMI project management process areas," Innovations in Systems and Software Engineering, vol. 4, pp. 17-29, Springer, 2008.
[27] N. Abbas, A. Gravell and G. Wills, "The Impact of Organization, Project and Governance Variables on Software Quality and Project Success," in Agile Conference, IEEE, 2010.