Automated Fault Prediction
The Ins, The Outs, The Ups, The Downs
Elaine Weyuker
June 11, 2015
Goal: to determine which files of a large software system with multiple releases are likely to contain the largest numbers of bugs in the next release.
● Help testers prioritize testing efforts.
● Help developers decide when to do design and code reviews and what to reimplement.
● Help managers allocate resources.
● Verified that bugs were non-uniformly distributed among files.
● Identified properties likely to affect fault-proneness, then built a statistical model and ultimately a tool to make predictions.
● Size of file (KLOCs)
● Number of changes to the file in the previous 2 releases
● Number of bugs in the file in the last release
● Age of file (number of releases in the system)
● Language the file is written in
● All of the systems we've studied to date use a configuration management system that integrates version control and change management functionality, including bug history.
● Data is automatically extracted from the associated data repository and passed to the prediction engine.
Used Negative Binomial Regression
Also considered machine learning algorithms
including:
◦ Recursive Partitioning
◦ Random Forests
◦ BART (Bayesian Additive Regression Trees)
● Consists of two parts.
● The back end extracts the data needed to make the predictions.
● The front end makes the predictions and displays them.
Extracts necessary data from the repository.
Predicts how many bugs will be in each file in
the next release of the system.
Sorts the files in decreasing order of the number
of predicted bugs.
Displays results to user.
● Evaluation measure: the percentage of actual bugs that occurred in the N% of the files predicted to have the largest numbers of bugs (N = 20); a sketch of this measure follows.
● Also considered other measures that are less sensitive to the specific value of N.
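A minimal sketch of this evaluation measure, assuming predictions and actual fault counts are available as plain dictionaries (all names here are hypothetical, not part of the tool):

```python
def percent_faults_in_top_n(predicted, actual, n=0.20):
    """predicted: {filename: predicted fault count for the next release}
       actual:    {filename: actual fault count in that release}"""
    # Rank files in decreasing order of predicted faults, as the tool does.
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    top = ranked[:max(1, int(len(ranked) * n))]
    total = sum(actual.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(actual.get(f, 0) for f in top) / total
```

On System NP, for example, this measure came out at 83%: the 20% of files predicted most fault-prone contained 83% of the actual faults.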
System  Years Followed  Releases  LOC    % Faults in Top 20%
NP      4               17        538K   83%
WN      2               9         438K   83%
VT      2.25            9         329K   75%
TS      9+              35        442K   81%
TW      9+              35        384K   93%
TE      7               27        327K   76%
IC      4               18        1520K  91%
AR      4               18        281K   87%
IN      4               18        2116K  93%
The Tool
[Diagram: the release to be predicted and user-supplied parameters, together with the version management / fault database (previous releases), feed the statistical analysis in the prediction engine, which outputs fault-proneness predictions.]
Example session:
1. User enters the system name.
2. Available releases are found in the version management database; the user chooses the releases to analyze.
3. User selects 4 file types.
4. User specifies that all problems reported in the System Test phase are faults.
5. User asks for fault predictions for release "Bluestone2008.1".
6. User confirms the configuration and enters a filename to save it.
7. User clicks the Save & Run button to start the prediction process.
8. Initial prediction view for Bluestone2008.1: all files are listed in decreasing order of predicted faults.
9. The listing is then restricted to eC files, and finally to 10% of the eC files.
The prediction tool is fully operational:
◦ 750 lines of Python for the interface
◦ 2150 lines of C (75K bytes compiled) for the prediction engine
The current version's back end (written in C) is specific to the internal AT&T configuration management system, but it can be adapted to other configuration management systems: all that is needed is a source of the data required by the prediction model.
Variations of the Fault Prediction Model
● Developers
◦ Counts
◦ Individuals
● Amount of Code Change
● Calling Structure

Overview
1. Standard model
2. Developer counts
3. Individual developers
4. Line-level change metrics
5. Calling structure
The Standard Model
Underlying statistical model
◦ Negative binomial regression
Output (dependent) variable
◦ Predicted fault count in each file of release n
Predictor (independent) variables
◦ KLOC (n)
◦ Previous faults (n-1)
◦ Previous changes (n-1, n-2)
◦ File age (number of releases)
◦ File type (C, C++, java, sql, make, sh, perl, ...)
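A sketch of how the Standard Model could be fit with off-the-shelf tools, here statsmodels' negative binomial GLM. The column names and the input file are hypothetical; the real tool extracts these attributes from the configuration management repository, and the dispersion parameter is left at statsmodels' default rather than estimated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per file: size, history, age, type, and next-release faults.
df = pd.read_csv("file_attributes.csv")          # hypothetical input

X = pd.DataFrame({
    "log_kloc":     np.log(df["kloc"]),          # size of file
    "prev_faults":  df["faults_prev"],           # faults in release n-1
    "prev_changes": df["changes_prev2"],         # changes in n-1 and n-2
    "file_age":     df["age_releases"],          # releases in the system
})
# File type (C, C++, java, ...) enters as dummy variables.
X = X.join(pd.get_dummies(df["file_type"], prefix="lang",
                          drop_first=True).astype(float))
X = sm.add_constant(X)

model = sm.GLM(df["faults_next"], X,
               family=sm.families.NegativeBinomial()).fit()
predictions = model.predict(X)   # predicted fault count per file
```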
Developer counts
● How many different people have worked on the file in the most recent previous release?
● How many different people have worked on the file in all previous releases? (This is a cumulative count.)
● How many people who changed the file were working on it for the first time?
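A sketch of these three counts, computed from a hypothetical change log in which changes[release][filename] is the set of developer ids who changed that file in that release:

```python
def developer_counts(changes, releases, filename):
    """releases: release ids in chronological order; releases[-1]
    is the most recent previous release."""
    recent = changes[releases[-1]].get(filename, set())
    earlier = set()
    for r in releases[:-1]:
        earlier |= changes[r].get(filename, set())
    return {
        "devs_prev_release": len(recent),            # last release only
        "devs_cumulative":   len(recent | earlier),  # all previous releases
        "new_devs":          len(recent - earlier),  # first-time changers
    }
```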
[Charts: faults per file in releases of System BTS, comparing the Standard Model against models augmented with: developers changing the file in the previous release; new developers changing the file in the previous release; and total developers touching the file in all previous releases.]
Summary
None of the developer-count attributes uniformly increases prediction accuracy. In all cases, adding a developer-count attribute to the standard model sometimes yields less accurate predictions than the standard model alone, and the benefit is never major.
Code Change
The standard model includes a count of the number of
changes made in the previous two releases. It does not
take into account how much code was changed.
We will now look at the impact on predictive accuracy of adding fine-grained information about change size to the model.
Measures of Code Change
● Number of changes made to a file during a previous release
● Number of lines added
● Number of lines deleted
● Number of lines modified
● Relative size of change (line changes / LOC)
● Changed / not changed (a binary variable)
A sketch of these measures follows the list.
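The sketch diffs two versions of a file with Python's difflib. The real tool reads the counts from the configuration management repository; here "modified" is approximated as the larger side of each replace hunk, which is an assumption, not the tool's definition:

```python
import difflib

def change_measures(old_lines, new_lines):
    adds = deletes = mods = 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, old_lines, new_lines).get_opcodes():
        if tag == "insert":
            adds += j2 - j1
        elif tag == "delete":
            deletes += i2 - i1
        elif tag == "replace":                 # lines rewritten in place
            mods += max(i2 - i1, j2 - j1)
    loc = max(len(new_lines), 1)
    return {
        "adds": adds, "deletes": deletes, "mods": mods,
        "relative_churn": (adds + deletes + mods) / loc,  # changes / LOC
        "changed": bool(adds or deletes or mods),         # binary variable
    }
```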
Two Subject Systems
Both followed for 18 releases over a 5-year lifespan.
IC: large provisioning system
◦ 6 languages: Java (60%), C, C++, SQL, SQL-C, SQL-C++
◦ 3000+ files, 1.5M LOC
◦ Average of 395 faults/release
AR: utility / data aggregation system
◦ >10 languages: Java (77%), Perl, xml, sh, ...
◦ 800 files, 280K LOC
◦ Average of 90 faults/release
[Charts: distribution of files, averaged over all releases; System IC faults per file, by release; System AR faults per file, by release.]
Prediction Models with Line-level Change Counts
● Univariate models
● Base model: log(KLOC), file age, file type
● Augmented models:
◦ Previous changes
◦ Previous {adds / deletes / mods}
◦ Previous {adds / deletes / mods} / LOC (relative churn)
◦ Previous developers
[Charts: fault-percentile averages for univariate predictor models, System IC; base model and added variables, Systems IC and AR. Base model: KLOC, file age (number of releases), file type (C, C++, java, sql, make, sh, perl, ...).]
Summary
● Change information is important for fault prediction.
● {Adds + Deletes + Mods} improves the accuracy of a model that doesn't include any change information, BUT a simple count of prior changes slightly outperforms {Adds + Deletes + Mods}.
● Changed / not changed (a simple binary variable) is nearly as good as either when added to a model without change information.
● Lines added is the most effective single change predictor; lines deleted is the least effective.
● Relative change is no better than absolute change for predicting total fault count.
Individual Developers
How can we measure the effect that a single
developer has on the faultiness of a file?
If developer d modifies k files in release N:
● how many of those files have bugs in release N+1?
● how many bugs are in those files in release N+1?
The BuggyFile Ratio
If d modifies k files in release N, and if b of them
have bugs in release N+1, the buggyfile ratio for
d is b/k
System IC has 107 programmers.
Over 15 releases, their buggyfile ratios vary
between 0 and 1
The average is about 0.4
[Charts: average buggyfile ratio over all programmers; buggyfile ratio for two programmers; buggyfile ratio for more typical cases.]
The Bug Ratio
If d modifies k files in release N, and if there are B
bugs in those files in release N+1, the bug ratio
for d is B/k
The bug ratio can vary between 0 and B, so unlike the buggyfile ratio it can exceed 1. Over 15 releases, we've seen a maximum bug ratio of about 8; the average is about 1.5.
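Both ratios reduce to a few lines of code. A sketch under hypothetical inputs: modified_files is the set of files developer d changed in release N, and faults_next[f] is the fault count of file f in release N+1:

```python
def developer_ratios(modified_files, faults_next):
    k = len(modified_files)
    if k == 0:
        return None                  # d made no changes in release N
    buggy = sum(1 for f in modified_files if faults_next.get(f, 0) > 0)
    bugs = sum(faults_next.get(f, 0) for f in modified_files)
    return {
        "buggyfile_ratio": buggy / k,  # b/k: fraction of files that turn buggy
        "bug_ratio":       bugs / k,   # B/k: faults per file touched
    }
```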
[Charts: bug ratio and buggyfile ratio.]
Problems with these definitions
● A file can be changed by more than one developer.
● A file may be changed in release N and a fault detected in N+1, but that change may not have caused that fault.
● A programmer might change many files in identical, trivial ways (interface, variable name, ...).
● The "best" programmers might be assigned to work on the most difficult files.
● For most programmers, the bug ratios vary widely from release to release.
Some final thoughts
• Is individual programmer bug-proneness helpful
for prediction?
• Is this information useful for helping a project
succeed?
• Are there better ways to measure it?
• Is it ethical to measure it?
• Does attempting to measure it lead to poor
performance and unhappy programmers?
Calling Structure
Are files that have a high rate of interaction with other files more fault-prone?
[Diagram: File Q, containing Method 1 and Method 2, with its callers (e.g., Files A and B) and its callees (e.g., Files X, Y, and Z).]
Calling Structure Attributes Investigated
For each file:
● number of callers & callees
● number of new callers & callees
● number of prior new callers & callees
● number of prior changed callers & callees
● number of prior faulty callers & callees
● ratio of internal calls to total calls
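As an illustration, the first and last of these attributes might be computed from a call graph as below; the representation (calls[f] as the set of files f calls, and system_files as the set of files inside the system) is an assumption for the sketch:

```python
def calling_structure(f, calls, system_files):
    callees = calls.get(f, set())
    callers = {g for g, targets in calls.items() if f in targets}
    links = list(callers) + list(callees)
    internal = sum(1 for g in links if g in system_files)
    return {
        "callers":  len(callers),
        "callees":  len(callees),
        # Ratio of calls staying inside the system to all calls.
        "internal_call_ratio": internal / len(links) if links else 0.0,
    }
```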
Fault Prediction by Multi-variable Models
● Code and history attributes, no calling structure
● Code and history attributes, including calling structure
● Code attributes only, including calling structure
Fault Prediction by Multi-variable Models
● Models applied to C, C++, and C-SQL files of one of the systems studied.
● First model built from the single best attribute.
● Each succeeding model built by adding the attribute that most improves the prediction.
● Stop when no attribute improves the prediction (see the sketch below).
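This procedure is greedy forward selection. A sketch, where fit_and_score is a hypothetical helper that fits the negative binomial model on the chosen attributes and returns an accuracy score such as the top-20% fault percentage:

```python
def forward_select(attributes, fit_and_score):
    chosen, best_score = [], float("-inf")
    remaining = set(attributes)
    while remaining:
        # Try each remaining attribute; keep the one that helps most.
        candidate, score = max(
            ((a, fit_and_score(chosen + [a])) for a in remaining),
            key=lambda pair: pair[1])
        if score <= best_score:
            break                     # no attribute improves the model
        chosen.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return chosen
```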
[Results charts for three attribute sets: code and history attributes, no calling structure; code, history, and calling structure attributes; code and calling structure attributes, but not numbers of faults or changes in previous releases.]
Summary
● Calling structure attributes do not increase the accuracy of predictions.
● History attributes (prior changes, prior faults) increase accuracy, with or without calling structure.
● We studied these issues for only two of the systems.
Overall Summary
● The Standard Model performs very well on all nine industrial systems we have examined.
● The augmented models add little or no additional accuracy.
● Cumulative developer count is the most effective addition to the Standard Model, but even it neither guarantees improved predictions nor yields significant improvement.
What’s Ahead?
◦ Will our standard model make accurate predictions for open-source systems?
◦ Will our standard model make accurate predictions for agile systems?
◦ Can we predict which files will contain the faults with the highest severities?
◦ Can predictions be made for units smaller than files?
◦ Can run-time attributes (execution time, execution frequency, memory use, ...) be used to make fault predictions?
◦ What is the most meaningful way to assess the effectiveness and accuracy of the predictions?