
Defect Prediction: Accomplishments and Future Challenges
Yasutaka Kamei
Principles of Software Languages Group (POSL)
Kyushu University, Fukuoka, Japan
Email: [email protected]

Emad Shihab
Dept. of Computer Science and Software Engineering
Concordia University, Montréal, Canada
Email: [email protected]
Abstract—As software systems play an increasingly important
role in our lives, their complexity continues to increase. The
increased complexity of software systems makes the assurance
of their quality very difficult. Therefore, a significant amount of
recent research focuses on the prioritization of software quality
assurance efforts. One line of work that has been receiving an
increasing amount of attention for over 40 years is software
defect prediction, where predictions are made to determine where
future defects might appear. Since its inception, there have been many
studies and many accomplishments in the area of software defect
prediction. At the same time, there remain many challenges that
face the field of software defect prediction. The paper aims to
accomplish four things. First, we provide a brief overview of
software defect prediction and its various components. Second,
we revisit the challenges of software prediction models as they
were seen in the year 2000, in order to reflect on our accomplishments since then. Third, we highlight our accomplishments
and current trends, as well as discuss the game changers that
had a significant impact on software defect prediction. Fourth,
we highlight some key challenges that lie ahead in the near (and
not so near) future in order for us as a research community to
tackle these future challenges.
I. I NTRODUCTION
If you know your enemies and know yourself, you will not
be imperiled in a hundred battles [89]. This is a quote by
Sun Tzu (c. 6th century BCE), a Chinese general, military
strategist, and author of The Art of War, an immensely
influential ancient Chinese book on military strategy. This
quote captures one of the principles of empirical software
engineering. To know your enemies (i.e., defects) and yourself
(i.e., software systems) and win battles (i.e., lead a project to
a successful conclusion), one needs to draw on the large body
of research on Software Quality Assurance (SQA).
SQA can be broadly defined as the set of activities that ensure
software meets a specific quality level [16].
As software systems continue to play an increasingly important role in our lives, their complexity continues to increase, making SQA efforts very difficult. At the same time,
SQA efforts are of paramount importance.
Therefore, to ensure high software quality, software defect prediction models, which describe the relationship between various software metrics (e.g., SLOC and McCabe’s Cyclomatic
complexity) and software defects, have been proposed [57, 95].
Traditionally, software defect prediction models are used in
two ways: (1) to predict where defects might appear in the
future and allocate SQA resources to defect-prone artifacts
(e.g., subsystems and files) [58] and (2) to understand the
effect of factors on the likelihood of finding a defect and
derive practical guidelines for future software development
projects [9, 45].
Due to its importance, defect prediction work has been
at the focus of researchers for over 40 years. Akiyama [3]
first attempted to build defect prediction models using size-based metrics and regression modelling techniques in 1971.
Since then, there have been a plethora of studies and many
accomplishments in the software defect prediction area [23].
At the same time, there remain many challenges that face
software defect prediction. Hence, we believe that it is a
perfect time to write a Future of Software Engineering (FoSE)
paper on the topic of software defect prediction.
The paper is written from a budding university researcher's
point of view and aims to accomplish four things. First, we
provide a brief overview of software defect prediction and
its various components. Second, we revisit the challenges of
software prediction models as they were seen in the year 2000,
in order to reflect on our accomplishments since then. Third,
we highlight the accomplishments and current trends, as well
as discuss the game changers that had a significant impact
on the area of software defect prediction. Fourth, we highlight
some key challenges that lie ahead in the near (and not so
near) future in order for us as a research community to tackle
these future challenges.
Target Readers. The paper is intended for researchers and
practitioners, especially masters and PhD students and young
researchers. As mentioned earlier, the paper is meant to
provide background on the area, reflect on accomplishments
and present key challenges so that the reader can quickly grasp
and be able to contribute to the software defect prediction area.
Paper Organization. The paper is organized as follows.
Section II overviews the area of defect prediction models.
Section III revisits the challenges that existed in the year
2000. Section IV describes current research trends and presents
game changers, which dramatically impacted the
field of defect prediction. Section V highlights some key
challenges for the future of defect prediction. Section VI draws
conclusions.
Fig. 1. Overview of Software Defect Prediction [78]: data is collected from defect, source code and other repositories (e.g., email), metrics are extracted, a model is built and evaluated, and the output is a list of software artifacts for additional SQA.
II. BACKGROUND
As mentioned earlier, the main goal of most software defect
prediction studies is (1) to predict where defects might appear
in the future and (2) to quantify the relative importance of
various factors (i.e., independent variables) used to build a
model. Ideally, these predictions will correctly classify software artifacts (e.g., subsystems and files) as being defect-prone
or not and in some cases may also predict the number of
defects, defect density (the number of defects / SLOC) and/or
the likelihood of a software artifact including a defect [78].
Figure 1 shows an overview of the software defect prediction process. First, data is collected from software repositories,
which archive historical data related to the development of the
software project, such as source code and defect repositories.
Then, various metrics that are used as independent variables
(e.g., SLOC) and a dependent variable (e.g., the number of defects) are extracted to build a model. The relationship between
the independent variables and dependent variable is modeled
using statistical techniques and machine learning techniques.
Finally, the performance of the built model is measured using
several criteria such as precision, recall and AUC (Area Under
the Curve) of ROC (the Receiver Operating Characteristic). We
briefly explain each of the four aforementioned steps next.
A. Data
In order to build software defect prediction models, we need a number of metrics that make up the independent and dependent variables. Large software projects often store their development history and other information, such as communication, in software repositories. Although the main reason for using these repositories is to keep track of and record development history, researchers and practitioners realize that this repository data can be used to extract software metrics. For example, prior work used the data stored in the source control repository to count the number of changes made to a file [95] and the complexity of changes [24], and used this data to predict files that are likely to have future defects.
Software defect prediction work generally leverages various
types of data from different repositories, such as (1) source
code repositories, which store and record the source code
and development history of a project, (2) defect repositories,
which track the bug/defect reports or feature requests filed
for a project and their resolution progress and (3) mailing list
repositories, which track the communication and discussions
between development teams. Other repositories can also be
leveraged for defect prediction.
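To make the idea of mining a source code repository concrete, the following sketch counts how many times each file appears in the commit history of a Git repository, a simple change-count metric in the spirit of the work cited above; the script and the repository path are illustrative assumptions, not the exact tooling used in those studies.

```python
import subprocess
from collections import Counter

def count_changes_per_file(repo_path):
    """Count how often each file appears in the commit history of a Git repository."""
    # --name-only lists the files touched by each commit; --pretty=format: suppresses headers.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True).stdout
    return Counter(line for line in log.splitlines() if line.strip())

if __name__ == "__main__":
    # Hypothetical usage: any local Git clone works; "." is just the current directory.
    for path, n in count_changes_per_file(".").most_common(10):
        print(f"{n:5d} changes  {path}")
```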
B. Metrics
When used in software defect prediction research, metrics
are considered to be independent variables, which means that
they are used to perform the prediction (i.e., the predictors).
Also, metrics can represent the dependent variables, which
means they are the metrics being predicted (i.e., these can
be pre- or post-release defects). Previous defect prediction
studies used a wide variety of independent variables (e.g.,
process [24, 25, 37, 57, 59, 67], organizational [9, 61] or code
metrics [30, 38, 51, 60, 85, 95]) to perform their predictions.
Moreover, several different metrics were used to represent
the dependent variable as well. For example, previous work
predicted different types of defects (e.g., pre-release [59], post-release [92, 95], or both [82, 83]).
C. Model Building
Various techniques, such as linear discriminant analysis [30,
65, 71], decision trees [39], Naive Bayes [54], Support Vector
Machines (SVM) [13, 36] and random forest [20], are used
to build defect prediction models. Each of the aforementioned
techniques has its own benefits (e.g., some provide models that are robust to noisy data and/or provide more explainable models).
Generally speaking, most defect prediction studies divide
the data into two sets: a training set and a test set. The training
set is used to train the prediction model, whereas the testing set
is used to evaluate the performance of the prediction model.
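As a minimal sketch of this train/test procedure, the snippet below fits a logistic regression classifier with scikit-learn on synthetic data; the metric names, data and thresholds are assumptions made purely for illustration and are not taken from any of the studies discussed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic module-level metrics: [SLOC, number of prior changes] (illustrative only).
X = rng.integers(1, 1000, size=(500, 2)).astype(float)
# Synthetic labels: larger, more frequently changed modules are more often defective.
y = (X[:, 0] * 0.002 + X[:, 1] * 0.003 + rng.normal(0, 1, 500) > 2.5).astype(int)

# Split into a training set (to fit the model) and a test set (to evaluate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```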
D. Performance Evaluation
Once a prediction model is built, its performance needs to
be evaluated. Performance is generally measured in two ways:
predictive power and explanative power.
Predictive Power. Predictive power measures the accuracy
of the model in predicting the software artifacts that have defects. Measures such as precision, recall, f-measure and AUCROC, which plots the the false positive rate on the x-axis and
true positive rate on the y-axis over all possible classification
thresholds, are commonly-used in defect prediction studies.
Explanative Power. In addition to measuring the predictive
power, explanative power is also used in defect prediction
studies. Explanative power measures how well the variability
in the data is explained by the model. Often the R2 or deviance
explained measures are used to quantify the explanative power.
Explanative power is particularly useful since it enables us
to measure the variability explained by each independent
variable in the model, providing us with a ranking as to which
independent variable is most “useful”.
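The snippet below sketches how both kinds of measures might be computed for a fitted classifier: predictive power via precision, recall, F-measure and AUC-ROC, and explanative power via a deviance-based pseudo-R²; the synthetic data and the particular pseudo-R² (McFadden's) are illustrative assumptions, not the only possible definitions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, log_loss)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))                                   # illustrative metrics
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
prob = model.predict_proba(X_te)[:, 1]

# Predictive power: precision, recall, F-measure and AUC-ROC on the test set.
print("precision", precision_score(y_te, pred))
print("recall   ", recall_score(y_te, pred))
print("f-measure", f1_score(y_te, pred))
print("AUC-ROC  ", roc_auc_score(y_te, prob))

# Explanative power: deviance explained (McFadden's pseudo-R^2) on the training data.
null_deviance = log_loss(y_tr, np.full(len(y_tr), y_tr.mean()))
model_deviance = log_loss(y_tr, model.predict_proba(X_tr)[:, 1])
print("deviance explained", 1 - model_deviance / null_deviance)
```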
III. PAST CHALLENGES: THE EARLY 2000S
In order to grasp the level of accomplishments, it is necessary to look back and examine the challenges that the
software defect prediction community faced in the past. In
particular, we look back to the early 2000s (2000-2002), when
Fenton et al. [16] published the seminal survey on software
defect prediction. To enhance readability, we organize this
section along the four dimensions used in Section II, i.e.,
data, metrics, modelling and performance evaluation. Figure
2 shows the overview of past challenges, current trends and
future challenges in defect prediction studies.
A. Data
Past Challenge 1: Lack of Availability and Openness. In
the year 2000, one of the main challenges facing all data-driven approaches (including software defect prediction) was
the lack of data availability. Software defect prediction was
done only within several select, and perhaps forward-thinking,
companies using their industrial data. However, since software
companies almost never want to disclose the quality of their
software, researchers could rarely obtain the datasets used in
previous studies, which was a major challenge.
Validating this challenge, a survey of software defect prediction papers published between the years 2000-2010 [78]
showed that 13 out of 14 defect prediction papers published
between the years 2000-2002 conducted their studies using
datasets that were collected from industrial projects.1 Obviously,
these datasets were not shared. The other paper that used open
source software data (the Apache Web server project) [12]
never made their datasets publicly available. This shows that
1 Those 14 defect prediction papers were selected by (1) searching for papers
related to software defect prediction in the top software engineering venues
(e.g., TSE, ICSE and FSE), (2) reading through the titles and abstracts of each
paper to classify whether or not the papers are actually related to software
defect prediction (details can be found in the Appendix of [78]).
back in the early 2000s, data availability and openness was a
real challenge facing the defect prediction community.
Past Challenge 2: Lack of Variety in Types of Repositories.
In addition to the lack of availability of data, the variety of
the data was very limited as well. In the year 2000, most
papers used data from source code and defect repositories,
because those repositories provided the most basic information
for building defect prediction models. For example, all 14
defect prediction studies between the years 2000 and 2002
used source code and defect repositories only [78]. Clearly,
this shows that not only was data scarce, it was also very
difficult to come up with different types of data.
Past Challenge 3: Lack of Variety of Granularity. In
addition to data availability and variety, the granularity of the
data was another issue that faced software defect prediction
work in the early 2000s. The prediction unit (granularity)
heavily depends on the data that is collected from the software
repositories and used to build the defect prediction models.
There are different levels of granularity such as the subsystem [93], file [11] or function [37] level.
In the early 2000s, the majority of studies were performed
at the subsystem or file levels. Only one paper [90] of the 14
defect prediction studies between 2000 and 2002 performed its
prediction at the function level [78]. The main reason for most
studies performing their prediction at high levels of granularity
is that repository data is often given at the file level and can be
easily abstracted to the subsystem level. Although performing
predictions at the subsystem and file levels may lead to better
performance results [76, 95], the usefulness of the defect
prediction studies becomes less significant (i.e., since more
code would need to be inspected at high levels of abstraction).
Software defect prediction studies in the early 2000s
suffered from a lack of data availability, variety and
granularity.
B. Metrics
Due to the limitations on data in the early 2000s, there were
several implications on the metrics and the type of metrics that
were used in software defect prediction models.
Past Challenge 4: Lack of Variety of Independent Variables
—Size-Based Metrics. Size-based metrics (i.e., product metrics) are metrics that are directly related to or derived from the
source code (e.g., complexity or size). In the early 2000s, a
large body of defect prediction studies used product metrics to
predict defects. The main idea behind using product metrics is
that, for example, complex code is more likely to have defects.
However, as Fenton and Neil mentioned [17] in their future of
software engineering paper in 2000, while size-based metrics
are correlated with the number of defects, they are poor predictors
of defects (i.e., there is no linear relationship between defect
density and size-based metrics). Furthermore, several studies
found that size-based metrics, especially code complexity
metrics, tend to be highly correlated with each other and with
the simple measure of Lines of Code (LOC) [17].
Fig. 2. Overview of Past Challenges, Current Trends and Future Challenges in Defect Prediction Studies (PC1: Past Challenge 1, CT1: Current Trend 1, and FC1: Future Challenge 1).
Past Challenge 5: Lack of Variety of Dependent Variables
—Post-Release Defects. Generally speaking, the majority
of defect prediction studies predicted post-release defects. In
fact, between 2000 and 2002, 13 of 14 defect prediction
studies used post-release defects as a dependent variable [78].
Although the number of post-release defects is important and
measures the quality of the released software, it is not the
only criterion of software quality. The fact that so many studies
focused on the prediction of post-release defects is a limitation
of software defect prediction work.
In the early 2000s, software defect prediction mainly
leveraged size-based metrics as independent variables
and post-release defects as a dependent variable.
C. Model building
Past Challenge 6: Specializing Only in the Training Data.
In the early 2000s, linear regression and logistic regression
models were often used as modeling techniques due to their
simplicity. Defect prediction models assume that the distributions of the metrics in the training and testing datasets are
similar [86]. However, in practice, the distribution of metrics
can vary among releases, which may cause simple modeling
techniques such as linear and logistic regression models to
specialize only in the training data and perform poorly in their
predictions on the testing data.
Past Challenge 7: Building Models for Projects in the
Initial Development Phases.
In the early 2000s, most software defect prediction studies
trained their models on data from the same project, usually
from early releases. Then, the trained models were used to
predict defects in future releases. In practice however, training
data may not be available for projects in the initial development phases or for legacy systems that have not archived
historical data. How to deal with projects that did not have
prior project data was an open challenge in the early 2000s.
In the early 2000s, software defect prediction studies had
to deal with the lack of project data and the fact that
metrics may not have similar distributions.
D. Performance Evaluation
Past Challenge 8: Lack of Practical Performance Measures. Once the prediction models are built, one of the
key questions is how well do they perform. In the early
2000s, many defect prediction studies empirically evaluated
their performance using standard statistical measures such as
precision, recall and model fit (measured in R2 or deviance
explained). Such standard statistical measures are fundamental
criteria to know how well defect prediction models predict
and explain defects. However, in some cases other (and more
practical criteria) need to be considered. For example, how
much effort is needed to address a predicted defect or the
impact of a defect may need to be taken into consideration.
Past Challenge 9: Lack of Transparency/Repeatability. In
the early 2000s, due to the lack of availability and openness of
the datasets, other studies were not able to repeat the findings
of prior studies. Such a lack of repeatability prevents the critique
of current and new research [77], [18]. Hence, comparing
the performance of a technique with prior techniques was
nearly impossible in the early 2000s.
In the early 2000s, defect prediction studies lacked practical performance evaluation measures and transparency.
IV. CURRENT TRENDS
The challenges faced in the early 2000s were the focus of
the research that followed. Solutions to many of the challenges
mentioned in Section III were proposed and major advancements were accomplished. In this section, we highlight some
of the current trends in the area of software defect prediction.
Furthermore, we discuss what we call game changers, which had a significant impact on the accomplishments in
the software defect prediction area. Similar to the previous
sections, we organize the trends along the four dimensions,
data, metrics, models and performance evaluation.
A. Data
Current Trend 1: Availability and Openness. Once the
software defect prediction community realized that data availability and openness is a key factor to its success, many defect
prediction studies started sharing not only their data, but even
their analysis scripts. For example, in 2004, NASA MDP
(metrics data program) shared 14 datasets that were measured
during the software development of NASA projects through
the PROMISE repository (we will discuss the PROMISE
repository later in this section) [53]. The NASA datasets
were some of the first public datasets from industrial software
projects to be shared in the defect prediction domain. Similarly, Zimmermann et al. [95], D’Ambros et al. [11], Jureczko
et al. [27], Kamei et al. [31] and many others shared their
open source data. In fact, many conferences now have special
tracks to facilitate the sharing of datasets and artifacts. For
example, the ESEC/FSE conference [15] now has a replication
package track that encourages authors to share their artifacts.
The MSR conference now has a dedicated data showcase track
that focuses on the sharing of datasets that can be useful for
the rest of the community.
Reflecting back, the current trend of data sharing and
openness has in many ways helped alleviate the challenge
that existed in the early 2000s. That said, not all data is
being shared; our community needs to continue to nourish
such efforts in order to make data availability and openness a
non-issue.
Current Trend 2: Types of Repositories. In addition to
using the traditional repositories such as defect and code repositories, recent studies have also explored other types of repositories. For example, Meneely and Williams [49] leveraged
the vulnerabilities database (e.g., the National Vulnerability
Database and the RHSR security metrics database) to examine
the effect of the “too many cooks in the kitchen” phenomenon
on software security vulnerabilities. Other studies such as the
study by Lee et al. [41] leveraged Mylyn repositories, which
capture developer interaction events. McIntosh et al. [45]
leveraged Gerrit data, which facilitates a traceable code review
process for git-based software projects, to study the impact
of modern code review practices on software quality. Other
types of repositories are also being used, especially as software
development teams become more appreciative of the power of
data analytics.
Game Changer 1: OSS projects. The Open Source
Initiative, founded in 1998, is an organization dedicated
to promoting open source software (OSS); it describes open
source software as software that can be freely used,
changed, and shared (in modified or unmodified form)
by anyone [66]. Nowadays, countless active OSS projects
are available online, supported by a wide range of
communities.
The proliferation of OSS projects is considered a game
changer because it opened up the development history
of many long-lived, widely-deployed software systems.
The number of papers using OSS projects is rapidly and
steadily growing over time [78].
Another way to access rich data sources is to cooperate
with commercial organizations (e.g., ABB Inc. [43] and
Avaya [55]). However, in many cases, commercial organizations are not willing to give access to the details of
their data sources due to confidentiality reasons. Academic
projects (e.g., course projects) have also been used; however, they tend to not be as rich since they do not have real
customers, and are often developed by small groups. In
short, OSS projects provide rich, extensive, and readily
available software repositories, which had a significant
impact on the software defect prediction field.
Reflecting back, the current trends show strong growth in
the different types of repositories being used. We believe that
exploring new repositories will have a significant impact on
the future of software defect prediction since it will facilitate
better and more effective models and allow us to explore the
impact of different types of phenomena on software quality.
Current Trend 3: Variety of Granularity. Due to the
benefits of performing predictions at a finer granularity, recent
defect prediction studies have focused on more fine-grained
levels, such as method level [19, 26, 37] and change level
predictions [4, 31, 36]. For example, Giger et al. [19] empirically investigate whether or not defect prediction models at the
method-level work well. Another example of work that aims to
perform predictions at a fine granularity is the work on change-level prediction, which aims to predict defect-introducing
changes (i.e., commits). The advantage of predicting defect
introducing changes, compared to subsystems or files is that
a change is often much smaller, can be easily assigned and
contains complete information about a single change (which
is important if a fix spans multiple files for example).
Similar to change-level prediction, Hassan and Holt [25]
propose heuristics to create the Top Ten List of subsystems that
are most susceptible to defects. Their approach dynamically
updates the list to reflect the progress of software development
when developers modify source code. Kim et al. [37] built on
the work by Hassan and Holt to propose BugCache, which
caches locations that are likely to have defects based on
locality of defects (e.g., defects are more likely to occur
in recently added/changed entities). When source code is
modified, the cached locations are dynamically updated to
reflect how recent defect occurrences affect which locations
are cached.
Current trends reflect the realization that the practical value
of the predictions decreases as the abstraction level increases
(i.e., since more code would need to be inspected at high
levels of abstraction). Recently, studies have focused more on
performing predictions at a finer level of granularity, e.g., at
the method-level and change-level.
B. Metrics
Current Trend 4: Variety of Independent Variables. In
addition to using size-based metrics (i.e., product metrics),
recent software defect prediction work has used process metrics, which
measure the change activity that occurred during the development of a new release, to build highly accurate defect
prediction models [24, 25, 37, 57, 59, 67]. The idea behind
using process metrics in defect prediction is that the process
used to develop the code may lead to defects, hence the process
metrics may be a good indicator of defects. For example, if a
piece of code is changed many times or by many people, this
may indicate that it is more likely to be defect-prone. One of
the main advantages of process metrics is that they are
independent of the programming language, making them
easier to apply across languages than product metrics.
Although much of the current research used process metrics
to predict defects, other metrics have been proposed in the
literature. For example, studies explored design/UML metrics [7, 14, 64] (which capture the design of the software
system), social metrics [5] (which combine dependency data
from the code and from the contributions of developers),
organizational metrics [55] (which capture the geographic
distribution of the development organization, e.g., the number
of sites that modified a software artifact) and ownership metrics [6] (which measure the level of ownership a developer
has over a software artifact). In addition, there are studies that use
textual information (e.g., identifiers and comments in source
code) as independent variables [36, 87].
Reflecting back, we see a very clear evolution in the way
software defect prediction work uses metrics. Initially, most
metrics were designed to help improve the prediction accuracy.
However, more recently, studies have used defect prediction to
examine the impact of certain phenomena, e.g., ownership, on
code quality. Such studies have pushed the envelope in metric
design and contributed significantly to the body of empirical
work on software quality in general.
Current Trend 5: Variety of Dependent Variables. In the
early 2000s, most studies used post-release defects as a dependent variable [7, 12]. More recently, however, many studies
started to focus on different types of dependent variables that
span much more than post-release defects [10, 80, 82, 88]. For
example, Shihab et al. [82] focused on predicting defects that
break pre-existing functionality (breakage defects) and defects
in files that had relatively few pre-release changes (surprise defects).
Game Changer 2: The PROMISE repository. The
PROMISE repository is a research data repository for
software engineering research datasets and offers free
and long-term storage for research datasets [53]. To date,
the PROMISE repository contains more than 45 datasets
related to defect prediction. The PROMISE repository
started, as version 1 in 2004, by sharing samples from the
Metrics Data Program, which was run by NASA in 2002 to
collect static code measures. The PROMISE
repository is currently in version 4 and it stores more than
one terabyte of data.
The PROMISE repository is a game changer, because it
facilitated the sharing of data sets across researchers. This
dramatically sped up the progress of defect prediction
research. The repository is popular among software defect
prediction researchers.
Meneely and Williams [49] built models to predict
software security vulnerabilities. Garcia et al. [88] predicted
blocking defects, which block other defects from being fixed.
Furthermore, researchers have proposed to perform predictions
at the change-level, focusing on predicting defect-inducing
changes [36, 56]. The prediction can be conducted at the time
when the change is submitted for unit test (private view) or
integration. Such immediate feedback ensures that the design
decisions are still fresh in the minds of developers [31].
Reflecting back, it seems that many of the recent studies
have realized that not all defects are equal and that there are
important defects that are not post-release defects. The recent
trends show that different types of dependent variables are being considered, which take into account different stakeholders
and different timing (e.g., releases vs. commits). For example,
the prediction of blocking defects clearly shows that helping
the developers is the main goal, which differs from traditional
defect prediction studies that mainly focused on the customer
as the main stakeholder. The prediction of defect-inducing
changes shows that predictions can be made early on, in
contrast to traditional studies, which have the drawback that
predictions are made too late in the development cycle.
C. Model building
Current Trend 6: Treating Uncertainty in Model Inputs
or Outputs. As we mentioned in Section III, many defect
prediction models assume that the distributions of the metrics
in the training and testing datasets are similar [86]. However, in
practice, the distribution of metrics can vary among releases.
To deal with projects that do not have similar distributions
in their training and testing datasets, recent defect prediction
studies have used ensemble techniques. One of the frequently used ensemble techniques is the random forest classifier, which
consists of a number of tree-structured classifiers [20, 29, 47].
A new object is classified by running its input vector down each tree in the forest. Each tree
casts a vote for a class by providing a classification.
The forest selects the classification that has the most votes
over all trees in the forest.
The main advantages of random forest classifiers are that
they generally outperform simple decision tree algorithms in
terms of prediction accuracy and that they are more resistant
to noise in the data [44].
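A minimal sketch of this ensemble idea, training a random forest on synthetic, deliberately noisy labels (the data, noise level and parameters are assumptions for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1, 600) > 1).astype(int)
# Inject 10% label noise to mimic mislabelled modules.
noise = rng.random(600) < 0.1
y[noise] = 1 - y[noise]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# Each of the 100 trees votes on a class; the forest returns the majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_tr, y_tr)
print("Random forest test accuracy:", forest.score(X_te, y_te))
```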
Current Trend 7: Building Cross-Project Defect Prediction
Models for Projects with Limited Data.
The majority of studies in the early 2000s focused on
within-project defect prediction. This means that they used
data from the same project to build their prediction models.
However, one major drawback with such an approach that
was highlighted by previous studies [7, 94] is the fact that
within-project defect prediction requires sufficient historical
data, for example, data from past releases. Having sufficient
historical data can be a challenge, especially for newer projects
and legacy projects. Hence, more recently, defect prediction
studies have worked on cross-project defect prediction models,
i.e., models that are trained using historical data from other
projects [28, 50, 62, 69, 70, 86, 91, 94], in order to make
defect prediction models available for projects in the initial
development phases, or for legacy systems that have not
archived historical data [7].
In the beginning, cross-project defect prediction studies
showed that the performance of cross-project predictions is
lower than that of within-project predictions. However, recent
work has shown that cross-project models can achieve performance similar to that of within-project models. For example, in
their award-winning work, Zhang et al. [91] propose a context-aware rank transformation method to preprocess predictors
and address the variations in their distributions, and showed
that their cross-project models achieve performance that rivals
within-project models. Recent work [69, 70] has also focused
on secure data sharing to address the privacy concerns of data
owners in cross-project models. For example, Peters et al. [69]
studied approaches that prevent the disclosure of sensitive
metric values without a significant performance degradation
of cross-project models.
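The sketch below illustrates the basic cross-project setup (train on one project, test on another) together with a simple percentile-rank preprocessing step that is only loosely inspired by, and not identical to, the transformation of Zhang et al. [91]; the two synthetic "projects" and their metric scales are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def to_ranks(X):
    """Replace each metric by its within-project percentile rank to reduce distribution shift."""
    return np.column_stack([rankdata(col) / len(col) for col in X.T])

rng = np.random.default_rng(3)
# Two synthetic "projects" whose metric scales differ by an order of magnitude.
X_a = rng.exponential(scale=50, size=(400, 2))
y_a = (X_a[:, 0] + rng.normal(0, 20, 400) > 60).astype(int)
X_b = rng.exponential(scale=500, size=(400, 2))
y_b = (X_b[:, 0] + rng.normal(0, 200, 400) > 600).astype(int)

# Cross-project: train on project A, test on project B, after rank transformation.
model = LogisticRegression().fit(to_ranks(X_a), y_a)
prob = model.predict_proba(to_ranks(X_b))[:, 1]
print("Cross-project AUC:", roc_auc_score(y_b, prob))
```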
Reflecting back, thanks to cross-project defect prediction
models, the recent defect prediction models are now available
for projects in their initial development phases and for legacy
systems that have not archived historical data. At this point, the
research community has come a long way with cross-project
defect prediction, however, many questions still remain. For
example, the lack of availability of industrial data leaves
open the question of whether models built using data from
open source projects would apply to industrial projects. At
this stage, cross-project defect prediction remains as an open
research area.
D. Performance evaluation
Current Trend 8: Practical Performance Measures. In
the early 2000s, most studies used traditional performance
evaluation techniques such as precision and recall. More recently, defect prediction studies have focused on more practical performance evaluations.
Game Changer 3: SZZ algorithm. Śliwerski et al. proposed the SZZ algorithm [84] that extracts whether or
not a change introduces a defect from Version Control
Systems (VCSs). The SZZ algorithm links each defect
fix to the source code change introducing the original
defect by combining information from the version archive
(such as CVS) with the defect tracking system (such as
Bugzilla).
The SZZ algorithm is a game changer, because it provided
a new data source for defect prediction studies. Without
the SZZ algorithm, we would not be able to determine
when a change induces a defect and conduct empirical
studies on defect prediction models. When developers
submit a revision to add functionality and/or fix defects
to the VCS of their project, they enter comments
(e.g., fix defect #1000) related to their revision in the log
message. However, there are no comments indicating that
a change induces a defect, because developers do not
intentionally introduce defects and are unaware when they
do. At the time of writing this paper (September 2015),
the paper [84] has been cited more than 400 times
according to Google Scholar† .
†
https://goo.gl/lUiGbR
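As a rough sketch of the SZZ idea (not the exact algorithm of [84]), the snippet below takes a commit known to fix a defect, extracts the lines it deleted, and blames those lines in the parent commit to recover the changes that last touched them, which would be flagged as candidate defect-introducing changes; the repository path and commit hash in the usage line are hypothetical.

```python
import re
import subprocess

def candidate_bug_introducing_commits(repo, fix_commit):
    """Blame the lines removed by a fix commit to find the changes that last touched them."""
    candidates = set()
    # Unified diff of the fix against its parent; deleted lines are the ones the fix replaced.
    diff = subprocess.run(
        ["git", "-C", repo, "diff", "-U0", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True).stdout
    path = None
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            path = line[len("--- a/"):]
        elif line.startswith("--- "):
            path = None  # e.g. /dev/null for newly added files
        elif line.startswith("@@") and path:
            # Hunk header: "@@ -start,count +start,count @@" (old-file line numbers first).
            m = re.match(r"@@ -(\d+)(?:,(\d+))?", line)
            start, count = int(m.group(1)), int(m.group(2) or 1)
            if count == 0:
                continue  # pure addition: nothing was deleted, nothing to blame
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "-l",
                 "-L", f"{start},{start + count - 1}", f"{fix_commit}^", "--", path],
                capture_output=True, text=True, check=True).stdout
            # First token of each blame line is the (possibly '^'-prefixed) commit hash.
            candidates.update(l.split()[0] for l in blame.splitlines() if l)
    return candidates

# Hypothetical usage: a local clone and a commit known to fix a defect.
# print(candidate_bug_introducing_commits("/path/to/repo", "abc1234"))
```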
To consider evaluation in more practical settings, recent work has considered the effort required to
address the predicted software artifacts [29, 48, 52, 72, 73]. For
example, recent work by Kamei et al. [29] evaluates common
defect prediction findings (e.g., process metrics vs. product
metrics and package-level prediction vs. file-level prediction)
when effort is considered. Mende and Koschke [48] compared
strategies to include the effort treatment into defect prediction
models. Menzies et al. [52] argue that recent studies have
not been able to improve defect prediction results since their
performance is measured as a tradeoff between the probability
of false alarms and probability of detection. Therefore, they
suggest changing the standard goal to consider effort, i.e., to
finding the smallest set of modules that contain most of the
defects.
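To illustrate what such an effort-aware evaluation might look like, the sketch below ranks modules by predicted defect density (predicted risk divided by LOC, with LOC as a proxy for inspection effort) and reports the fraction of defective modules found within a fixed effort budget; the 20% budget, the synthetic data and the use of LOC as effort are assumptions in the spirit of the studies cited above.

```python
import numpy as np

def recall_at_effort(risk, loc, defective, effort_budget=0.2):
    """Fraction of defective modules found when inspecting modules, ordered by
    predicted risk per line, until `effort_budget` of the total LOC is spent."""
    order = np.argsort(-(risk / loc))          # highest predicted defect density first
    spent = np.cumsum(loc[order]) / loc.sum()  # cumulative effort proxy: LOC inspected
    inspected = order[spent <= effort_budget]
    return defective[inspected].sum() / max(defective.sum(), 1)

rng = np.random.default_rng(4)
loc = rng.integers(10, 2000, 300).astype(float)        # illustrative module sizes
defective = (rng.random(300) < 0.15).astype(int)       # illustrative defect labels
risk = 0.7 * defective + rng.random(300) * 0.5         # a model's (noisy) risk scores
print("Recall at 20% effort:", recall_at_effort(risk, loc, defective))
```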
Furthermore, recent defect prediction studies have also conducted interviews to better understand defect prediction
models and derive practical guidelines for developing high
quality software. For example, Shihab et al. [82] asked a
highly experienced quality manager in the project for their
opinion about the prediction results. Based on these opinions,
the authors conclude that the defect prediction models should
not only predict defect locations, but also detect patterns of
changes that are suggestive of a particular type of defect and
recommend appropriate remedies.
Reflecting back, we see that more recent studies have
focused on the practical value of software defect prediction
and on trying to evaluate their models in realistic settings. We
strongly support such research and see this trend continuing
to grow since software defect prediction models are starting
to be used in industry (e.g., [79]).
Current Trend 9: Transparency/Repeatability. As mentioned earlier, the software engineering community as a whole
has realized the value of making studies as transparent as
possible. Recent defect prediction studies have maintained
such transparency, which has generated lively discussion
(e.g., critiques) of their results and led to
new findings. For example, Shepperd et al. [77] derive some
comments on the NASA software defect datasets (e.g., the
dataset contains several erroneous and implausible entries) and
share cleaned NASA defect datasets. Then, Ghotra et al. [18]
revisit the findings of Lessmann et al. [42] that used original
NASA datasets. In contrast to prior results [42], Ghotra et
al. show that there are statistically significant differences in
the performance of defect prediction models that are trained
using different classification techniques. Such value is made
possible thanks to the fact that the NASA projects made their
datasets available.
Reflecting back, we see a very healthy and progressive trend
where software defect prediction studies are becoming more
transparent. Such changes will only make our findings more
practical and will encourage us to advance the science (not
only the engineering) behind the area of software defect prediction since it allows us to repeat and scrutinize assumptions
of prior studies.
V. CHALLENGES FOR THE NEAR (AND NOT SO NEAR) FUTURE
The field of software defect prediction has made many accomplishments in recent years. However, many challenges
remain and (we believe) will pop up in the future due to
changes in technology, data and the increasingly important
role software systems continue to play.
Future Challenge 1: Commercial vs. OSS Data. As Section
IV shows, many researchers make use of datasets collected
from open source software projects. The main reason for using
data from open source projects is that these projects archive
long development histories and make their data
publicly available. However, the generality of our findings
and techniques to non-open-source software projects (e.g.,
commercial projects) has not been studied in depth; in part due to
the lack of availability of data.
To solve this challenge, we need to build more partnerships
with industry and gain access to their repositories.
The authors have had some success starting projects with industry
and our experience shows that starting such partnerships is
easier than one might think. Especially with the rise of big
data and data analytics in general, companies are starting to
realize that there is value in mining their data. That said,
the industrial partners need to see some value in what the
researchers are doing with their data, otherwise they may lose
interest.
On the positive side, many industrial projects have already
shifted to modern software development environments (e.g., Git
and Gerrit). As a result, our tools can be applied to their
projects without much effort. In short, the most important
thing is to have the will to start a collaboration with industrial
projects.
Game Changer 4: Statistical and Machine Learning
Tools (Weka and R).
The majority of defect prediction studies rely heavily on
statistical and machine learning (ML) techniques. Some of
the most popular statistical and ML tools used in software
defect prediction studies are R and WEKA. WEKA [22]
is a tool developed by the Machine Learning Group at
the University of Waikato. Weka is a collection of machine
learning algorithms for data mining tasks that can be
applied directly to a dataset. R [1] is an open source
language and environment for statistical analysis.
Weka and R are game changers, because both of them
provide a wide variety of data pre-processing, statistical
(linear and nonlinear modelling, classical statistical tests
and classification) and graphical techniques.
They are also open source and are highly extensible.
Therefore, Weka and R are commonly used in defect
prediction studies [78].
As we continue to demonstrate the value of data
in software repositories and the benefits of MSR techniques
for helping practitioners in their daily activities, practitioners
are more likely to contact us and consider using our techniques
in practice.
Future Challenge 2: Making our Approaches More Proactive. When it comes to software defect prediction, many of
our techniques thus far have been reactive in nature. What that
means is that we observe the software development process
and then use this data to predict what will happen post-release. However, in many cases practitioners would like to
have predictions happen much sooner, e.g., before or as soon
as they commit their changes. Doing so would make our
approaches more proactive in a sense, since it will allow us
to perform our predictions as the development is happening
rather than waiting till it has completed.
Several studies have already started to work in this area, using metrics from the design stage [76] and performing change-level defect predictions [31, 36]. However, there remains much
work to do in this area. We may be able to devise tools that
not only predict risky areas or changes, but also generate tests
(and possibly fixes) for these risky areas and changes. We can
also devise techniques that proactively warn developers, even
before they modify the code, that they are working with risky
code that has had specific types of defects in the past.
Future Challenge 3: Considering New Markets. The
majority of software defect prediction studies used code and/or
process metrics to perform their predictions. To date, this
has served the community well and has helped us advance
the state-of-the-art in software defect prediction. However,
comparing with the year 2000, our environment has changed
dramatically. Therefore, we need to tackle the challenges that
new environments raise.
One example of a new market that we should tackle is the mobile
application field. We use our smart phones every day
on the go and update applications from online stores
(e.g., Google Play and App Store). Mobile applications play
a significant role in our daily life and these applications have
different characteristics compared to conventional applications
that we studied in the past. These mobile application stores
allow us to gain valuable user data that, until today, has not
been leveraged in the area of software defect prediction. A few
studies have leveraged this data to improve testing [32, 33].
Moreover, this data can be leveraged to help us understand
what impacts users and in what ways they are impacted.
Such knowledge can help us build better and more accurate
models. For example, we may be able to build models
that predict specific types of defects (e.g., energy defects and
performance defects) using the complaints of users [10, 34].
We anticipate the use of user data in software defect prediction
models to be an area of significant growth in the future.
Future Challenge 4: Keeping Up with the Fast Pace of
Development. In today’s fast changing business environment, the recent trend of software development is to reduce
the release cycle to days or even hours [74]. For example,
the Firefox project changed their release process to a rapid
release model (i.e., a development model with a shorter release
cycle) and released over 1,000 improvements and performance
enhancements in version 5.0 within 3 months [35]. IMVU, which
is an online social entertainment website, deploys new code
fifty times a day on average [2].
We need to think about how we can integrate our research
into continuous integration. For example, O’Hearn suggested
that commenting on code changes at review time makes a large
difference in helping developers, as opposed to producing a
defect list from batch-mode analysis, because such commenting
does not ask them to make a context switch to understand and
act on an analysis report [68].
The majority of quality assurance research focused on defect
prediction models that identify defect-prone modules (i.e., files
or packages) at the release level, akin to batch-mode analysis [21, 24,
43, 58]. Those studies use the datasets collected from previous
releases to build a model and derive the list of defect-prone
modules. Such models require practitioners to remember the
rationale and all the design decisions of the changes to be able
to evaluate if the change introduced a defect.
To solve the problem that O’Hearn pointed out, we can
focus on Just-In-Time (JIT) Quality Assurance [31, 36, 56],
which performs predictions at the change level. JIT defect
prediction models aim to be an earlier step of continuous quality control because they can be invoked as soon as a developer
commits code to their private workspace or to the team's workspace.
There remains much work to do in this area. We still
need to evaluate how to integrate JIT models into the actual
continuous integration process. For example, we can devise
approaches that suggest how much effort developers should spend
to find and fix defects based on the predicted probability
(e.g., if a JIT model predicts that a change includes
defects with 80% probability, the developer should work on
the change for an additional 30 minutes to find the defects).
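As a sketch of what change-level (JIT) prediction might look like, the snippet below builds a classifier over a few commonly used change metrics (lines added/deleted, files touched, author experience) and outputs a defect-inducing probability per commit; the metric set and the synthetic data are assumptions loosely based on the JIT studies cited above [31, 36], not their exact models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000
# Illustrative change-level metrics: lines added, lines deleted, files touched, author experience.
X = np.column_stack([
    rng.poisson(40, n),        # lines added
    rng.poisson(15, n),        # lines deleted
    rng.poisson(3, n),         # number of files modified
    rng.integers(1, 200, n),   # prior commits by the author
]).astype(float)
# Synthetic labels: larger, more scattered changes by less experienced authors are riskier.
score = 0.01 * X[:, 0] + 0.1 * X[:, 2] - 0.005 * X[:, 3] + rng.normal(0, 0.5, n)
y = (score > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)
clf = RandomForestClassifier(n_estimators=100, random_state=5).fit(X_tr, y_tr)
# Probability that a commit is defect-inducing, available as soon as the change is made.
print("P(defect-inducing) for first 5 test commits:", clf.predict_proba(X_te)[:5, 1].round(2))
```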
Fig. 3. Adding a Repository in Commit Guru
Fig. 4. Commit Statistics of the Linux Kernel Project in Commit Guru
Future Challenge 5: Knowing How to Fix the Defects. The
main purpose of defect prediction models thus far has
been two-fold: (1) to predict where defects might appear in
the future and allocate SQA resources to defect-prone artifacts
(e.g., subsystems and files) and (2) understand the effect of
factors on the likelihood of finding a defect. Although such
models have proven to be useful in certain contexts to derive
practical guidelines for future software development projects,
how to fix the defects that are flagged remains an open
question.
Therefore, we envision that future defect prediction models
will not only predict where the defects are, they will also
provide information on how to best fix these defects. In the
future, we need to understand what kinds of defects occur and
why they occur. Such information can be leveraged
by developers to locate and fix the predicted defect. One area
of research that might be useful here is the area of automated
program repair [40, 46, 63], which produces candidate patches.
Automated program repair can be combined with defect prediction to automatically generate patches for the predicted
defects.
Future Challenge 6: Making our Models More Accessible.
Recently, software engineering research papers are providing
replication packages (e.g., including dataset, tool/script and
readme files) with their studies. The main reason for doing so
is (1) to ensure the validity and transparency of the experiments and (2) to allow others to replicate their studies and/or
compare the performance of new models with the original
models using the same datasets. The general belief is that we
always want to improve the accuracy of our approaches.
However, in practice, we need to consider how easy our
proposed models and replication packages are for other
researchers and practitioners to use. Some replication packages
are not easy to reuse (e.g., the README files are inaccurate or
additional setup is needed to reproduce the environment). In
such cases, potential users may give up on these techniques
due to the time investment needed.
Furthermore, it would be great to provide tools that can be
easily accessed via REST APIs, along with the original scripts,
for people who want to integrate our tools into their projects.
For example, Commit Guru [75] provides a language agnostic
analytics and prediction tool that identifies and predicts risky
software commits (Figure 3 and Figure 4). It is publicly
available via the web. The tool simply requires a URL of
the Git repository that you want to analyze (Figure 3). Its
source code is freely available under the MIT license.2 In
short, we need to make our models and techniques simple
and extendable, and where applicable, provide tools so that
others can easily use our techniques.
Future Challenge 7: Focusing on Effort. Since the year
2000 [17], it has been argued that we need to evaluate our
prediction models in practical settings (e.g., how much code
review effort do defect prediction models save?) instead
of only improving precision and recall. Recent studies have tried to
tackle such problems when considering effort [8, 29, 48, 72].
Such studies use LOC or churn as a proxy for effort. However,
our previous studies [81] show that using a combination of
LOC, code and complexity metrics provides a better prediction
of effort than using LOC alone. In the future, researchers
need to examine the best way to measure effort
in effort-aware defect prediction models. Such research can
have a significant impact on the future applicability of defect
prediction work.
VI. CONCLUSION
The field of software defect prediction has been well researched since it was first proposed. As our paper showed,
there have been many papers that explored different types
of data and their implications, proposed various metrics,
examined the applicability of different modeling techniques
and evaluation criteria.
The field of software defect prediction has made a series of
accomplishments. As our paper highlighted, there have been
many papers that addressed challenges related to data (e.g., the
recent trend of data sharing and openness have in many ways
helped alleviate the challenge that existed in the early 2000s),
metrics (e.g., studies have used defect prediction to examine
the impact of certain phenomena, e.g., ownership, on code
quality), model building (e.g., thanks to cross-project defect
prediction models, the recent defect prediction models are
now available for projects in the initial development phases)
and performance evaluation (e.g., we see a very healthy and
progressive trend where software defect prediction studies are
becoming more accurate and transparent).
2 It can be downloaded at https://github.com/CommitAnalyzingService/
CAS Web (front-end) and https://github.com/CommitAnalyzingService/CAS
CodeRepoAnalyzer (back-end).
At the same time, particular initiatives and works have had
a profound impact on the area of software defect prediction,
which we listed as game changers. For example, OSS projects
providing rich, extensive, and readily available software repositories; the PROMISE repository providing SE researchers
with a common platform to share datasets; the SZZ algorithm
automatically extracting whether or not a change introduces a
defect from VCSs, which in turn dramatically accelerated the
speed of research, especially for JIT defect prediction; tools
such as Weka and R providing a wide variety of data pre-processing capabilities and making common statistical and ML
techniques readily available.
That said, there remain many future challenges for the
field that we also highlight. For example, we need to examine
the generality of our findings and the applicability of
our techniques to non-open-source software projects (e.g.,
commercial projects). We also need to consider new markets
(e.g., mobile applications and energy consumption) in the
domain of software quality assurance, as well as many other
future challenges that may impact the future of software defect
prediction.
This paper only provides the authors' perspective based on
their preferences and experiences. The aim of the paper is
to provide the readers with an understanding of prior defect
prediction studies and highlight some key challenges for the
future of defect prediction.
ACKNOWLEDGMENT
We would like to thank Ahmed E. Hassan, Thomas Zimmermann and Massimiliano Di Penta (the co-chairs of Leaders of
Tomorrow: Future of Software Engineering) for giving us the
opportunity to write our vision paper on the topic of software
defect prediction. We would also like to thank the reviewers
and Dr. Naoyasu Ubayashi for their constructive and fruitful
feedback. The first author was partially supported by JSPS
KAKENHI Grant Numbers 15H05306 and 25540026.
REFERENCES
[1] The R project. http://www.r-project.org/.
[2] B. Adams. On Software Release Engineering. ICSE 2012
technical briefing.
[3] F. Akiyama. An example of software system debugging.
In IFIP Congress (1), pages 353–359, 1971.
[4] L. Aversano, L. Cerulo, and C. Del Grosso. Learning
from bug-introducing changes to prevent fault prone
code. In Proc. Int’l Workshop on Principles of Software
Evolution (IWPSE’07), pages 19–26, 2007.
[5] C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu. Putting it all together: Using socio-technical
networks to predict failures. In Proc. Int’l Symposium
on Software Reliability Engineering (ISSRE’09), pages
109–119, 2009.
[6] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don't touch my code!: examining the effects of ownership on software quality. In Proc. European Softw. Eng. Conf. and Symposium on the Foundations of Softw. Eng. (ESEC/FSE'11), pages 4–14, 2011.
[7] L. C. Briand, W. L. Melo, and J. Wust. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans. Softw. Eng., 28:706–720, 2002.
[8] G. Canfora, A. D. Lucia, M. D. Penta, R. Oliveto, A. Panichella, and S. Panichella. Defect prediction as a multiobjective optimization problem. Softw. Test., Verif. Reliab., 25(4):426–459, 2015.
[9] M. Cataldo, A. Mockus, J. A. Roberts, and J. D. Herbsleb. Software dependencies, work dependencies, and their impact on failures. IEEE Trans. Softw. Eng., 99(6):864–878, 2009.
[10] T.-H. Chen, M. Nagappan, E. Shihab, and A. E. Hassan. An empirical study of dormant bugs. In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'14), pages 82–91, 2014.
[11] M. D'Ambros, M. Lanza, and R. Robbes. An extensive comparison of bug prediction approaches. In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'10), pages 31–41, 2010.
[12] G. Denaro and M. Pezzè. An empirical evaluation of fault-proneness models. In Proc. Int'l Conf. on Softw. Eng. (ICSE'02), pages 241–251, 2002.
[13] K. O. Elish and M. O. Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81:649–660, 2008.
[14] A. Erika and C. Cruz. Exploratory study of a UML metric for fault prediction. In Proc. Int'l Conf. on Softw. Eng. (ICSE'10) - Volume 2, pages 361–364, 2010.
[15] ESEC/FSE 2015. Research sessions. http://esec-fse15.dei.polimi.it/research-program.html.
[16] N. E. Fenton and M. Neil. A critique of software defect prediction models. IEEE Trans. Softw. Eng., 25:675–689, 1999.
[17] N. E. Fenton and M. Neil. Software metrics: Roadmap. In Proc. Conf. on The Future of Software Engineering (FoSE), pages 357–370, 2000.
[18] B. Ghotra, S. McIntosh, and A. E. Hassan. Revisiting the impact of classification techniques on the performance of defect prediction models. In Proc. Int'l Conf. on Softw. Eng. (ICSE'15), pages 789–800, 2015.
[19] E. Giger, M. D'Ambros, M. Pinzger, and H. C. Gall. Method-level bug prediction. In Proc. Int'l Symposium on Empirical Softw. Eng. and Measurement (ESEM'12), pages 171–180, 2012.
[20] L. Guo, Y. Ma, B. Cukic, and H. Singh. Robust prediction of fault-proneness by random forests. In Proc. Int'l Symposium on Software Reliability Engineering (ISSRE'04), pages 417–428, 2004.
[21] T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans. Softw. Eng., 31:897–910, 2005.
[22] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.
[23] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng., 38(6):1276–1304, 2012.
[24] A. E. Hassan. Predicting faults using the complexity of code changes. In Proc. Int'l Conf. on Softw. Eng. (ICSE'09), pages 78–88, 2009.
[25] A. E. Hassan and R. C. Holt. The top ten list: Dynamic fault prediction. In Proc. Int'l Conf. on Software Maintenance (ICSM'05), pages 263–272, 2005.
[26] H. Hata, O. Mizuno, and T. Kikuno. Bug prediction based on fine-grained module histories. In Proc. Int'l Conf. on Softw. Eng. (ICSE'12), pages 200–210, 2012.
[27] M. Jureczko and L. Madeyski. Towards identifying software project clusters with regard to defect prediction. In Proc. Int'l Conf. on Predictor Models in Softw. Eng. (PROMISE'10), pages 9:1–9:10, 2010.
[28] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan. Studying just-in-time defect prediction using cross-project models. Empirical Softw. Eng., pages 1–35, 2015.
[29] Y. Kamei, S. Matsumoto, A. Monden, K. Matsumoto, B. Adams, and A. E. Hassan. Revisiting common bug prediction findings using effort-aware models. In Proc. Int'l Conf. on Software Maintenance (ICSM'10), pages 1–10, 2010.
[30] Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, and K. Matsumoto. The effects of over and under sampling on fault-prone module detection. In Proc. Int'l Symposium on Empirical Softw. Eng. and Measurement (ESEM'07), pages 196–204, 2007.
[31] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng., 39(6):757–773, 2013.
[32] H. Khalid, M. Nagappan, and A. E. Hassan. Examining the relationship between Findbugs warnings and end user ratings: A case study on 10,000 Android apps. IEEE Software, In Press.
[33] H. Khalid, M. Nagappan, E. Shihab, and A. E. Hassan. Prioritizing the devices to test your app on: A case study of Android game apps. In Proc. Int'l Symposium on Foundations of Software Engineering (FSE'14), pages 610–620, 2014.
[34] H. Khalid, M. Nagappan, E. Shihab, and A. E. Hassan. What do mobile app users complain about? IEEE Software, 32(3):70–77, 2015.
[35] F. Khomh, B. Adams, T. Dhaliwal, and Y. Zou. Understanding the impact of rapid releases on software quality — the case of Firefox. Empirical Softw. Eng., 20:336–373, 2015.
[36] S. Kim, E. J. Whitehead, Jr., and Y. Zhang. Classifying software changes: Clean or buggy? IEEE Trans. Softw. Eng., 34(2):181–196, 2008.
[37] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and
A. Zeller. Predicting faults from cached history. In Proc.
Int’l Conf. on Softw. Eng. (ICSE’07), pages 489–498,
2007.
[38] A. G. Koru and H. Liu. Building defect prediction models
in practice. IEEE Software, 22:23–29, 2005.
[39] A. G. Koru and H. Liu. An investigation of the effect of
module size on defect prediction using static measures.
In Proc. Int’l Workshop on Predictor Models in Softw
Eng. (PROMISE’05), pages 1–5, 2005.
[40] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer.
A systematic study of automated program repair: Fixing
55 out of 105 bugs for $8 each. In Proc. Int’l Conf. on
Softw. Eng. (ICSE’12), pages 3–13, 2012.
[41] T. Lee, J. Nam, D. Han, S. Kim, and H. P. In. Micro interaction metrics for defect prediction. In Proc. European
Softw. Eng. Conf. and Symposium on the Foundations of
Softw. Eng. (ESEC/FSE’11), pages 311–321, 2011.
[42] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch.
Benchmarking classification models for software defect
prediction: A proposed framework and novel findings.
IEEE Trans. Softw. Eng., 34:485–496, 2008.
[43] P. L. Li, J. Herbsleb, M. Shaw, and B. Robinson. Experiences and results from initiating field defect prediction
and product test prioritization efforts at ABB Inc. In Proc. Int'l Conf. on Softw. Eng. (ICSE'06), pages 413–
422, 2006.
[44] L. Marks, Y. Zou, and A. E. Hassan. Studying the
fix-time for bugs in large open source projects. In
Proc. Int’l Conf. on Predictive Models in Softw. Eng.
(PROMISE’11), pages 11:1–11:8, 2011.
[45] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan.
The impact of code review coverage and code review
participation on software quality: A case study of the Qt,
VTK, and ITK projects. In Proc. Int’l Working Conf. on
Mining Software Repositories (MSR’14), pages 192–201,
2014.
[46] S. Mechtaev, J. Yi, and A. Roychoudhury. DirectFix:
Looking for simple program repairs. In Proc. Int’l Conf.
on Softw. Eng. (ICSE’15), pages 448–458, 2015.
[47] T. Mende and R. Koschke. Revisiting the evaluation
of defect prediction models. In Proc. Int’l Conf. on
Predictor Models in Softw. Eng. (PROMISE’09), pages
1–10, 2009.
[48] T. Mende and R. Koschke. Effort-aware defect prediction
models. In Proc. European Conf. on Software Maintenance and Reengineering (CSMR’10), pages 109–118,
2010.
[49] A. Meneely and L. Williams. Strengthening the empirical
analysis of the relationship between Linus’ Law and
software security. In Proc. Int’l Symposium on Empirical
Softw. Eng. and Measurement (ESEM’10), pages 9:1–
9:10, 2010.
[50] T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, and T. Zimmermann. Local versus global lessons for defect prediction and effort estimation. IEEE Trans. Softw. Eng., 39(6):822–834, 2013.
[51] T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng., 33:2–13, 2007.
[52] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17:375–407, 2010.
[53] T. Menzies, M. Rees-Jones, R. Krishna, and C. Pape. The PROMISE repository of empirical software engineering data, 2015.
[54] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang. Implications of ceiling effects in defect predictors. In Proc. Int'l Workshop on Predictor Models in Softw. Eng. (PROMISE'08), pages 47–54, 2008.
[55] A. Mockus. Organizational volatility and its effects on software defects. In Proc. Int'l Symposium on Foundations of Softw. Eng. (FSE'10), pages 117–126, 2010.
[56] A. Mockus and D. M. Weiss. Predicting risk of software changes. Bell Labs Technical Journal, 5(2):169–180, 2000.
[57] R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proc. Int'l Conf. on Softw. Eng. (ICSE'08), pages 181–190, 2008.
[58] J. C. Munson and T. M. Khoshgoftaar. The detection of fault-prone programs. IEEE Trans. Softw. Eng., 18(5):423–433, 1992.
[59] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In Proc. Int'l Conf. on Softw. Eng. (ICSE'05), pages 284–292, 2005.
[60] N. Nagappan and T. Ball. Using software dependencies and churn metrics to predict field failures: An empirical case study. In Proc. Int'l Symposium on Empirical Softw. Eng. and Measurement (ESEM'07), pages 364–373, 2007.
[61] N. Nagappan, B. Murphy, and V. Basili. The influence of organizational structure on software quality: an empirical case study. In Proc. Int'l Conf. on Softw. Eng. (ICSE'08), pages 521–530, 2008.
[62] J. Nam, S. J. Pan, and S. Kim. Transfer defect learning. In Proc. Int'l Conf. on Softw. Eng. (ICSE'13), pages 382–391, 2013.
[63] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra. SemFix: Program repair via semantic analysis. In Proc. Int'l Conf. on Softw. Eng. (ICSE'13), pages 772–781, 2013.
[64] A. Nugroho, M. Chaudron, and E. Arisholm. Assessing UML design metrics for predicting fault-prone classes in a Java system. In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'10), pages 21–30, 2010.
[65] N. Ohlsson and H. Alberg. Predicting fault-prone software modules in telephone switches. IEEE Trans. Softw. Eng., 22(12):886–894, 1996.
[66] Open Source Initiative. Welcome to the open source initiative. http://opensource.org/.
[67] T. J. Ostrand, E. J. Weyuker, and R. M. Bell. Predicting
the location and number of faults in large software
systems. IEEE Trans. Softw. Eng., 31(4):340–355, 2005.
[68] P. O'Hearn. Moving fast with software verification. http://www0.cs.ucl.ac.uk/staff/p.ohearn/Talks/Peter-CAV.key.
[69] F. Peters, T. Menzies, L. Gong, and H. Zhang. Balancing
privacy and utility in cross-company defect prediction.
IEEE Trans. Softw. Eng., 39(8), 2013.
[70] F. Peters, T. Menzies, and L. Layman. LACE2: better
privacy-preserving data sharing for cross project defect
prediction. In Proc. Int’l Conf. on Software Engineering
(ICSE’15), pages 801–811, 2015.
[71] M. Pighin and R. Zamolo. A predictive metric based on
discriminant statistical analysis. In Proc. Int’l Conf. on
Softw. Eng. (ICSE’97), pages 262–270, 1997.
[72] F. Rahman, D. Posnett, and P. Devanbu. Recalling
the “imprecision” of cross-project defect prediction. In
Proc. Int’l Symposium on the Foundations of Softw. Eng.
(FSE’12), pages 61:1–61:11, 2012.
[73] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu. BugCache for inspections: Hit or miss? In
Proc. European Softw. Eng. Conf. and Symposium on the
Foundations of Softw. Eng. (ESEC/FSE’11), pages 322–
331, 2011.
[74] RELENG2015. 3rd International Workshop on Release Engineering 2015. http://releng.polymtl.ca/RELENG2015/html/index.html.
[75] C. Rosen, B. Grawi, and E. Shihab. Commit Guru:
Analytics and risk prediction of software commits. In
Proc. European Softw. Eng. Conf. and Symposium on the
Foundations of Softw. Eng. (ESEC/FSE’15), pages 966–
969, 2015.
[76] A. Schröter, T. Zimmermann, and A. Zeller. Predicting
component failures at design time. In Proc. Int’l Symposium on Empirical Softw. Eng. (ISESE’06), pages 18–27,
2006.
[77] M. Shepperd, Q. Song, Z. Sun, and C. Mair. Data quality:
Some comments on the NASA software defect datasets.
IEEE Trans. Softw. Eng., 39(9):1208–1215, 2013.
[78] E. Shihab. An Exploration of Challenges Limiting Pragmatic Software Defect Prediction. PhD thesis, Queen’s
University, 2012.
[79] E. Shihab, A. E. Hassan, B. Adams, and Z. M. Jiang. An
industrial study on the risk of software changes. In Proc.
Int’l Symposium on Foundations of Softw. Eng. (FSE’12),
pages 62:1–62:11, 2012.
[80] E. Shihab, A. Ihara, Y. Kamei, W. M. Ibrahim, M. Ohira,
B. Adams, A. E. Hassan, and K. Matsumoto. Studying
re-opened bugs in open source software. Empirical Softw.
Eng., 18(5):1005–1042, 2013.
[81] E. Shihab, Y. Kamei, B. Adams, and A. E. Hassan. Is
lines of code a good measure of effort in effort-aware
models? Inf. Softw. Technol., 55(11):1981–1993, 2013.
[82] E. Shihab, A. Mockus, Y. Kamei, B. Adams, and A. E. Hassan. High-impact defects: a study of breakage and surprise defects. In Proc. European Softw. Eng. Conf. and Symposium on the Foundations of Softw. Eng. (ESEC/FSE'11), pages 300–310, 2011.
[83] Y. Shin, R. Bell, T. Ostrand, and E. Weyuker. Does calling structure information improve the accuracy of fault prediction? In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'09), pages 61–70, 2009.
[84] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In Proc. Int'l Workshop on Mining Software Repositories (MSR'05), pages 1–5, 2005.
[85] R. Subramanyam and M. S. Krishnan. Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects. IEEE Trans. Softw. Eng., 29(4):297–310, 2003.
[86] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Softw. Eng., 14:540–578, 2009.
[87] B. Ujhazi, R. Ferenc, D. Poshyvanyk, and T. Gyimothy. New conceptual coupling and cohesion metrics for object-oriented systems. In Proc. Int'l Working Conf. on Source Code Analysis and Manipulation (SCAM'10), pages 33–42, 2010.
[88] H. Valdivia Garcia and E. Shihab. Characterizing and predicting blocking bugs in open source projects. In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'14), pages 72–81, 2014.
[89] Wikiquote. Sun Tzu. https://en.wikiquote.org/wiki/Sun_Tzu.
[90] W. E. Wong, J. R. Horgan, M. Syring, W. Zage, and D. Zage. Applying design metrics to predict fault-proneness: a case study on a large-scale software system. Software: Practice and Experience, 30(14), 2000.
[91] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou. Towards building a universal defect prediction model. In Proc. Int'l Working Conf. on Mining Software Repositories (MSR'14), pages 182–191, 2014.
[92] J. Zheng, L. Williams, N. Nagappan, W. Snipes, J. P. Hudepohl, and M. A. Vouk. On the value of static analysis for fault detection in software. IEEE Trans. Softw. Eng., 32:240–253, 2006.
[93] T. Zimmermann and N. Nagappan. Predicting subsystem failures using dependency graph complexities. In Proc. Int'l Symposium on Software Reliability Engineering (ISSRE'07), pages 227–236, 2007.
[94] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Proc. European Softw. Eng. Conf. and Symposium on the Foundations of Softw. Eng. (ESEC/FSE'09), pages 91–100, 2009.
[95] T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In Proc. Int'l Workshop on Predictor Models in Softw. Eng. (PROMISE'07), pages 1–7, 2007.