FIESTA: Effective Fault Localization to Mitigate the
Negative Effect of Coincidentally Correct Tests
Seokhyeon Mun, Yunho Kim, Moonzoo Kim∗
CS Dept. KAIST, South Korea
Abstract
One of the obstacles for precise coverage-based fault localization (CFL) is the
existence of Coincidentally Correct Test cases (CCTs), which are the test cases
that pass despite executing a faulty statement. To mitigate the negative effects
of CCTs, we have proposed Fault-weIght and atomizEd condition baSed localizaTion Algorithm (FIESTA) that transforms a target program with compound
conditional statements to a semantically equivalent one with atomic conditions
to reduce the number of CCTs. In addition, FIESTA utilizes a fault weight on a test case t, which indicates the likelihood that t executes a faulty statement. We
have evaluated the effectiveness of the transformation technique and the fault
weight metric through a series of experiments on 12 programs including five nontrivial real-world programs. Through the experiments, we have demonstrated
that FIESTA is more precise than Tarantula, Ochiai, and Op2 for the target
programs on average. Furthermore, the transformation technique improves the
precision of CFL techniques in general (i.e., increasing the precision of all four
CFL techniques–Tarantula, Ochiai, Op2, and FIESTA).
Keywords: Coverage-based fault localization, coincidentally correct test cases,
debuggability transformation, fault weights of test cases
1. Introduction
Although many automated verification and validation (V&V) techniques for software have been proposed, developers still spend a large amount of time identifying the root causes of program failures. Fault localization is an expensive step in the whole debugging process [1, 2], because it usually requires human developers to understand the complex internal logic of the target program and reason about the differences between passing test runs and failing test runs step-by-step. Thus, there is large demand for automated fault localization techniques.
∗Corresponding author.
Email addresses: [email protected] (Seokhyeon Mun), [email protected] (Yunho Kim), [email protected] (Moonzoo Kim)
Preprint submitted to Elsevier, September 11, 2013
Coverage-based fault localization (CFL) is a fault localization technique that
locates faults by assigning suspiciousness scores to program entities (i.e., statements or branches) based on the code coverage information. A key idea of CFL
is to utilize the correlation between failing test runs and faulty statements to
identify the location of faults [3, 4, 5]. However, in practice, this correlation may
not be strong enough to localize faults precisely [6]. One of the main causes that weaken such correlation is coincidentally correct test cases (CCTs), which are passing test cases that execute faulty statements. This is because faulty statements executed by CCTs are often considered non-faulty, since they are executed by passing test cases [7, 3, 8, 9].
In this paper, we propose Fault-weIght and atomizEd condition baSed localizaTion Algorithm (FIESTA) to mitigate this negative effect of CCTs in the
following ways:
1. To transform a target program to reduce the number of CCTs (debuggability transformation).
2. To develop a new suspiciousness metric that approximately alleviates the negative effect of CCTs.
First, FIESTA transforms a target program P into a semantically equivalent program P′ by converting a compound conditional statement into nested conditional statements with atomic conditions (see Section 2). Through this debuggability transformation, the precision of CFL techniques (not only FIESTA) can increase due to the decreased number of CCTs (see Section 5.1). Note that this debuggability transformation technique can be applied to any CFL technique, such as Tarantula [3], Ochiai [4], and Op2 [5], to improve its precision further (Section 5.4). We can consider this transformation as a parallel to ‘testability’ transformation [10].
Second, FIESTA defines a new suspiciousness metric that reduces the negative effect of CCTs by considering fault weights on test cases. We define a fault weight on a test case t to indicate the likelihood that t executes a faulty statement (see Section 3) and utilize the fault weight values of the test cases that cover a target statement s to calculate a suspiciousness score of s.
In this paper, we set up the following research questions regarding the effectiveness of the debuggability transformation and the fault weight metric as well
as the precision of FIESTA:
• RQ1: How effective is the transformation of a target program, in terms of the reduced number of CCTs?
• RQ2: How accurate is the fault weight metric on test cases, in terms of the probability that a passing test case with a high fault weight is a CCT?
• RQ3: How precise is FIESTA, specifically compared with Tarantula,
Ochiai, and Op2, in terms of the % of executed statements examined
to localize a fault?
• RQ4: How effective is the transformation of a target program, in terms
of the improved precision of Tarantula, Ochiai, Op2, and FIESTA?
To answer these research questions, we have performed a series of experiments by applying Tarantula, Ochiai, Op2, and FIESTA to 12 programs including the seven Siemens programs and five non-trivial real-world programs (see
Table 1). The experimental results confirm that the transformation technique
reduces the number of CCTs effectively and the fault weight metric is accurate
enough to distinguish CCTs from truly correct test cases (TCTs), which are
passing test cases that do not execute a faulty statement. Consequently, the experimental results demonstrate that FIESTA is 49%, 22%, and 14% relatively
more precise than Tarantula, Ochiai, and Op2 on average, respectively. Furthermore, we demonstrate that the debuggability transformation technique can
improve the precision of CFL techniques in general (i.e., increasing the precision
of all four CFL techniques: Tarantula, Ochiai, Op2, and FIESTA).
The contributions of this paper are as follows:
• We propose debuggability transformation (Section 2) to improve the precision of CFL techniques in general by reducing the number of CCTs
(Sections 5.1 and 5.4).
• We define a fault weight metric on test cases (Section 3), which can improve
the precision of fault localization (Section 5.2).
• We demonstrate that FIESTA, which has been developed based on the
above two techniques, is more precise than Tarantula, Ochiai, and Op2
through the empirical study on the 12 programs including five real-world
programs (Section 5.3).
The remainder of this paper is organized as follows. Section 2 describes
the debuggability transformation to reduce the number of CCTs. Section 3
explains the new suspiciousness metric of FIESTA that utilizes fault weights of
test cases. Section 4 shows the experiment setup for the empirical evaluation of
the techniques on the 12 target programs. Section 5 explains and analyzes the
experiment results regarding the research questions. Section 6 discusses several
limitations of CFL techniques observed through the experiments. Section 7
describes related work and compares the proposed techniques with the related
work. Section 8 concludes this paper with future work.
2. Debuggability Transformation of a Target Program to Reduce CCTs
2.1. Overview
FIESTA transforms a target program P into a semantically equivalent program P′ by converting a compound conditional statement into nested conditional statements with atomic boolean conditions (i.e., conditions without
// Original program P
1: if (x > 0 && y == 0 && z < 0) { // Fault: z<0 should be z==0
2:   stmt;
3: }

// Transformed program P′
11: if (x > 0) {
12:   if (y == 0) {
13:     if (z < 0) { // Fault: z<0 should be z==0
14:       stmt;
15: }}}

Figure 1: Example of P and P′ that has been transformed from P by FIESTA
boolean operators).1 A motivation for this transformation is the observation that a target program often contains a compound conditional statement scomp whose boolean condition (i.e., a predicate) consists of multiple clauses c1, c2, ..., cn combined with logical operators. If one such clause cm is faulty, scomp becomes faulty and a passing test case tcp that executes scomp becomes a CCT, although tcp may not execute cm due to short-circuit evaluation. Consequently, tcp decreases the suspiciousness score of scomp, since tcp covers scomp but passes. We can alleviate this problem by decomposing a compound conditional statement scomp into semantically equivalent nested atomic conditional statements.
For example, FIESTA transforms the compound conditional statement at line 1 of P in Figure 1 into the semantically equivalent nested conditional statements in P′ (lines 11–13). Suppose that the last atomic clause of line 1 in P is a fault (e.g., it should be z==0, not z<0) and a test case tc1 (x=0,y=0,z=0) that executes line 1 is a CCT for P. Note that tc1 is not a CCT for P′, since tc1 does not execute the corresponding faulty statement (line 13) of P′ (see Section 2.2.1). CCTs that execute line 1 of P are not CCTs for P′ unless they satisfy x>0 && y==0. Thus, through the transformation of a target program P to P′, such CCTs become truly correct test cases (TCTs) (i.e., passing test cases that do not execute a faulty statement) and the precision of coverage-based fault localization can be improved on P′ due to the decreased number of CCTs.2
Once the suspiciousness scores on the statements of P′ are obtained, these scores are mapped to the statements of P (see Section 2.2.2). Finally, the user reviews the statements of P that have high suspiciousness scores to identify a faulty statement as usual. Note that this debuggability transformation technique can be applied to any CFL technique, including FIESTA, to improve its precision further.

1 FIESTA also transforms exp?v1:v2 in P to if(exp) tmp=v1; else tmp=v2; in P′.
2 One might expect that a loop unrolling transformation or a function inlining transformation can decrease the number of CCTs as well. However, our preliminary experimental results show that those transformations decrease the precision due to the increased number of faults through unrolling and inlining (i.e., transforming a single-fault program into a multiple-fault program).
2.2. Mapping of Statements between P and P′
FIESTA utilizes CIL (C Intermediate Language) [11] to transform a target C program P into P′, and CIL generates basic mapping information between statements of P and P′. Based on this mapping information, FIESTA maps a faulty statement in P to statement(s) in P′ (Section 2.2.1). In addition, FIESTA maps the suspiciousness scores on the statements in P′ to the original statements in P (Section 2.2.2).
2.2.1. Mapping Faulty Statements of P to P′
To measure how many CCTs are reduced through the transformation, we map a faulty compound conditional statement of P to atomic conditional statements of P′. There are four cases to consider for the mapping:
1. When a clause of a compound condition is faulty (e.g., z<0 is faulty in if(x>0 && y==0 && z<0), since it should be z==0): Statement(s) of P′ that correspond to the faulty clause (i.e., if(z<0) in P′) are marked as faulty statements.
2. When a boolean operator is faulty (e.g., if(x>0 && y==0) {s1;} should be if(x>0 || y==0) {s1;}): Statement(s) of P′ that correspond to all operands of the faulty operator (e.g., if(x>0) and if(y==0) in P′) are marked as faulty statements. This is because each of the statements that correspond to the operands can trigger errors. For example, with tc2 (x=1,y=1), the correct program (i.e., if(x>0 || y==0) {s1;}) will execute s1, but the faulty transformed program (i.e., if(x>0) { if(y==0) {s1;}}) will not execute s1 due to if(y==0). Similarly, with tc3 (x=0,y=0), the correct program will execute s1, but the faulty transformed program will not execute s1 due to if(x>0).
3. When a new clause is added (e.g., w!=0 is conjunctively added): Statement(s) of P′ that correspond to the newly added clause (e.g., if(w!=0) in P′) are marked as faulty statements.
4. When a clause is omitted (e.g., if(x<0 /*&& y<0 omitted*/) {s1;}): Statement(s) of P′ that correspond to the statements that are control dependent on the omitted clause (i.e., s1 in the example, which is control dependent on the omitted y<0) are marked as faulty, since such statements can trigger errors.
2.2.2. Mapping Suspiciousness Scores on the Statements of P′ to P
Although we obtain suspiciousness scores on the statements of P′, we have to examine statements of P in the descending order of their suspiciousness scores, because we would like to identify a fault in the original program eventually. In addition, P′ can be larger than P through the transformation, which may cause a user to examine more statements even when the % of the executed statements to examine to find a faulty statement for P′ is smaller than the % of the executed statements to examine for P. Thus, FIESTA maps the suspiciousness scores on the statements of P′ to the statements of P.
Suppose that a compound conditional statement scomp of P is transformed into nested conditional statements with atomic conditions s′1, s′2, ..., s′n. Then, the suspiciousness score of scomp is defined as the score of s′m (1 ≤ m ≤ n) that has the highest suspiciousness score among the scores of all s′1, s′2, ..., s′n. In addition, FIESTA reports s′m to a user to help the user examine suspicious statements to localize a fault. The suspiciousness scores of the statements of P′ that are not expanded from a compound conditional statement in P are directly mapped to their original statements in P.
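This score mapping can be sketched as follows (a minimal illustration with hypothetical identifiers and scores; FIESTA itself derives the expansion mapping from CIL):

```python
# Sketch of mapping P' scores back to P: each expanded compound statement
# in P takes the maximum score among its atomic statements in P';
# unexpanded statements map one-to-one (a singleton expansion list).

def map_scores_to_P(scores_P_prime, expansion):
    # scores_P_prime: {statement in P': suspiciousness score}
    # expansion: {statement in P: [statements in P' it was expanded into]}
    return {s: max(scores_P_prime[a] for a in atomic_stmts)
            for s, atomic_stmts in expansion.items()}

# Hypothetical example: s1 in P expanded into s11-s13; s2 unchanged.
scores = {"s11": 0.4, "s12": 0.9, "s13": 0.7, "s20": 0.5}
expansion = {"s1": ["s11", "s12", "s13"], "s2": ["s20"]}
print(map_scores_to_P(scores, expansion))  # s1 gets 0.9, s2 gets 0.5
```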
3. Fault Weight Based Localization Metric
3.1. Background
Coverage-based fault localization techniques calculate suspiciousness scores of coverage entities such as statements and branches by analyzing the difference between passing test runs and failing test runs. High suspiciousness of a statement indicates that the statement is likely to be a faulty one. For example, Tarantula [3] calculates a suspiciousness score of a statement s by using the following metric:

suspTaran(s) = (|failing(s)| / |Tf|) / (|failing(s)| / |Tf| + |passing(s)| / |Tp|)

where Tf is a set of failing test cases and failing(s) is a set of failing test cases that execute s (Tp and passing(s) are defined similarly). Similarly, the suspiciousness metrics of Ochiai [4] and Op2 [5] are as follows:

suspOchiai(s) = |failing(s)| / sqrt(|Tf| * (|failing(s)| + |passing(s)|))

suspOp2(s) = |failing(s)| - |passing(s)| / (|Tp| + 1)
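As a hedged sketch, the three metrics above can be computed directly from per-statement test counts; the sample values below reproduce the Figure 2 example, where line 3 is covered by one failing and three passing tests and line 7 by one failing and two passing tests, out of |Tf| = 1 and |Tp| = 5:

```python
import math

# Sketch of the three baseline suspiciousness metrics defined above.
# n_cf/n_cp: failing/passing tests covering s; n_f/n_p: total counts.

def tarantula(n_cf, n_cp, n_f, n_p):
    ff = n_cf / n_f
    pp = n_cp / n_p
    return ff / (ff + pp)

def ochiai(n_cf, n_cp, n_f, n_p):
    return n_cf / math.sqrt(n_f * (n_cf + n_cp))

def op2(n_cf, n_cp, n_f, n_p):
    return n_cf - n_cp / (n_p + 1)

print(tarantula(1, 2, 1, 5))  # line 7: 1/(1 + 2/5) ≈ 0.71
print(tarantula(1, 3, 1, 5))  # line 3: 1/(1 + 3/5) = 0.625 (0.63 in the text)
```

Note how the extra passing (coincidentally correct) tests covering line 3 push its Tarantula score below that of the non-faulty line 7.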
Figure 2 shows an example of how the above metrics localize a fault. Let us assume that we have six test cases (tc1 to tc6) and line 3 is the faulty statement, since it should be x=x+2. The statements executed by each test case are marked with a black dot in the figure. In the figure, tc1 is the only failing test case, since example() with tc1 will return 3, while it would return 1 if there were no fault (i.e., if line 3 were x=x+2). All other test cases tc2 to tc6 are passing test cases, since example() with these test cases will return a correct value (i.e., 1 with tc2 to tc4 and 2 with tc5 and tc6) in spite of the fault. Thus, tc2, tc3, and tc4 are coincidentally correct test cases (CCTs), since they are passing test cases that execute line 3, the faulty statement.
Figure 2: Example of coverage-based fault localization
With the suspiciousness metric of Tarantula, line 7 (ret=ret+2) has the highest suspiciousness score (i.e., (1/1) / (1/1 + 2/5) = 0.71) and is marked as rank 1, while the faulty line (line 3) has (1/1) / (1/1 + 3/5) = 0.63 as a suspiciousness score and is marked as
rank 3. This is because there are more passing test cases that execute line 3 (i.e.,
tc2, tc3, and tc4) than line 7 (i.e., tc5 and tc6). For Tarantula, Ochiai, and Op2,
CCTs such as tc2 to tc4 make the suspiciousness of the faulty statement low,
since they increase |passing(s)| in the denominator of the suspiciousness metrics
(Tarantula and Ochiai) or |passing(s)| in the negative term of the suspiciousness
metric (Op2). Thus, fault localization techniques like Tarantula, Ochiai, and Op2 become imprecise as the number of CCTs grows. Our new fault localization technique (i.e., FIESTA) alleviates this problem by considering fault weights on test cases.
3.2. Fault Weights on Test Cases
We define a fault weight wf on a test case t, which indicates the likelihood that t executes a faulty statement. For a failing test case tf, wf(tf) is defined as 1. For a passing test case t, the fault weight wf(t) is defined as follows:

wf(t) = ( Σ_{tf ∈ Tf} |stmt(t) ∩ stmt(tf)| / |stmt(tf)| ) / |Tf|

where stmt(t) is the set of statements executed by a test case t and Tf is the set of failing test cases. wf(t) represents the degree of overlap between the statements executed by t (i.e., stmt(t)) and the statements executed by a failing test case tf ∈ Tf (i.e., stmt(tf)), averaged over Tf. Thus, a high wf(t) score indicates that t is likely to execute a faulty statement.3
For the example of Figure 2 where Tf = {tc1}, we can calculate the fault weights of the six test cases as follows:
• wf(tc1) = 1, since tc1 is a failing test case.
• wf(tc2) = (|stmt(tc2) ∩ stmt(tc1)| / |stmt(tc1)|) / 1 = (6/7) / 1 = 0.86
• wf(tc3) = wf(tc4) = wf(tc2) = 0.86, since stmt(tc3) = stmt(tc4) = stmt(tc2).
• wf(tc5) = (|stmt(tc5) ∩ stmt(tc1)| / |stmt(tc1)|) / 1 = (5/7) / 1 = 0.71
• wf(tc6) = wf(tc5) = 0.71, since stmt(tc6) = stmt(tc5).
The fault weights (i.e., 0.86) of the CCTs (i.e., tc2, tc3, and tc4) are higher than the fault weights (i.e., 0.71) of the TCTs (i.e., tc5 and tc6), which are passing test cases that do not execute a faulty statement. If wf(tp) is high for a passing test case tp, tp is likely a CCT (see Figure 3) and we can build a precise suspiciousness metric that gives a high suspiciousness score to the statements executed by such tp.
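The fault weight computation can be sketched as follows (an illustrative Python rendering of the formula above; the coverage sets are hypothetical but consistent with the Figure 2 numbers, where tc1 executes 7 statements, tc2 shares 6 of them, and tc5 shares 5):

```python
# Sketch of the fault weight metric: average overlap between the
# statements a test executes and those each failing test executes.

def fault_weight(t, failing, stmt):
    if t in failing:
        return 1.0
    return sum(len(stmt[t] & stmt[tf]) / len(stmt[tf])
               for tf in failing) / len(failing)

# Hypothetical coverage sets matching the Figure 2 numbers.
stmt = {
    "tc1": {1, 2, 3, 4, 5, 6, 7},
    "tc2": {1, 2, 3, 4, 5, 6},
    "tc5": {1, 2, 4, 5, 7},
}
failing = {"tc1"}
print(round(fault_weight("tc2", failing, stmt), 2))  # 0.86
print(round(fault_weight("tc5", failing, stmt), 2))  # 0.71
```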
3.3. Suspiciousness Metric of FIESTA
We define a new suspiciousness metric suspF on a statement s using the fault weights on test cases as follows:

suspF(s) = ( |failing(s)| / |Tf| + (Σ_{t ∈ test(s)} wf(t)) / |test(s)| ) / 2

where test(s) is the set of test cases that execute s (i.e., test(s) = passing(s) ∪ failing(s)).
The left term |failing(s)| / |Tf| in the metric increases the suspiciousness score if s is executed by more failing test cases.4 The right term (Σ_{t ∈ test(s)} wf(t)) / |test(s)| increases the suspiciousness score if s is executed by test cases with higher fault weights.
3 The fault weight can be considered as a ‘distance’ between a given test case and all failing test cases on average. There are other CFL techniques which utilize different ‘distance’ metrics for a similar purpose (see Section 7.2).
4 The left term may make FIESTA localize a fault less precisely on a target program with multiple faults spanning multiple lines than on one with a single-line fault, which occurs commonly with most CFL techniques. However, Digiuseppe et al. [12] showed that CFL techniques are still effective on target programs with multiple faults (i.e., in situations where different faults interfere with each other's localizability) as the number of faults increases. In addition, there are several techniques (i.e., Liu et al. [13], Zheng et al. [14], and Jones et al. [15]) to cluster test cases targeting an individual fault to reduce interference among multiple faults.
Finally, to normalize the metric value within the range 0 to 1, the sum of the left term and the right term is divided by 2.
For example, with the example of Figure 2, the new fault localization metric calculates the suspiciousness of line 3 as 0.95 and marks line 3 (together with line 4) as the first rank as follows:

( 1/1 + (wf(tc1) + wf(tc2) + wf(tc3) + wf(tc4)) / 4 ) / 2 = 0.95
Note that the suspiciousness metric of FIESTA is more precise than those of
Tarantula, Ochiai, and Op2 for this example, because FIESTA utilizes the fault
weights to increase suspiciousness scores of the statements executed by CCTs.
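A hedged sketch of the FIESTA metric, reusing the fault weight definition above (the coverage sets are hypothetical but consistent with the Figure 2 numbers: tc2–tc4 share 6 of tc1's 7 statements, and line 3 is covered by tc1–tc4):

```python
# Sketch of suspF(s): average of the failing-test ratio and the mean
# fault weight of the tests covering s, as defined above.

def fault_weight(t, failing, stmt):
    if t in failing:
        return 1.0
    return sum(len(stmt[t] & stmt[tf]) / len(stmt[tf])
               for tf in failing) / len(failing)

def susp_fiesta(covering, failing, stmt):
    # covering: the set of tests that execute statement s.
    n_cf = len(covering & failing)
    mean_wf = sum(fault_weight(t, failing, stmt)
                  for t in covering) / len(covering)
    return (n_cf / len(failing) + mean_wf) / 2

# Hypothetical coverage consistent with Figure 2.
base = {1, 2, 3, 4, 5, 6}
stmt = {"tc1": base | {7}, "tc2": base, "tc3": base, "tc4": base}
failing = {"tc1"}
line3_tests = {"tc1", "tc2", "tc3", "tc4"}
print(round(susp_fiesta(line3_tests, failing, stmt), 2))  # 0.95
```

The high fault weights of the CCTs tc2–tc4 keep the score of the faulty line high, matching the 0.95 computed in the text.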
4. Empirical Study Setup
We designed the following four research questions to study the effectiveness of the transformation technique and the fault weight metric as well as the precision of FIESTA:
RQ1: How effective is the transformation of a target program, in terms of the reduced number of CCTs?
We evaluate the effectiveness of the debuggability transformation by measuring both the number of CCTs for a target program P and the number of CCTs for the transformed target program P′.
RQ2: How accurate is the fault weight metric of FIESTA on test cases, in terms of the probability that a passing test case with a high fault weight is a CCT?
We have to check that the fault weight metric is accurate enough to improve the precision of fault localization. We evaluate the accuracy of the fault weight metric by calculating, over all versions of the target programs, the probability that a passing test case whose fault weight is larger than 0.9 is a CCT.
RQ3: How precise is FIESTA, specifically compared to Tarantula, Ochiai, and
Op2 in terms of the % of executed statements examined to localize a fault?
We have to show how precise FIESTA is by measuring the % of executed statements examined to localize a fault on the target benchmark programs on average. The precision of fault localization depends on multiple factors including a
target program, test cases, and faults in a target program. Thus, we have to
perform experiments with various settings of such factors carefully.
RQ4: How effective is the transformation of a target program, in terms of
improved precision of Tarantula, Ochiai, Op2, and FIESTA?
The ultimate goal of the debuggability transformation is to improve the precision of fault localization techniques by reducing the number of CCTs. Thus,
we should measure how much the debuggability transformation improves the
precision of the fault localization techniques (i.e., Tarantula, Ochiai, Op2, and
Target programs   LOC      # of faulty  # of provided  Description
                           ver. used    test cases
print tokens      563      7            4130           lexical analyzer
print tokens2     510      9            4115           lexical analyzer
replace           563      30           5542           pattern replacement
schedule          412      5            2650           priority scheduler
schedule2         307      9            2710           priority scheduler
tcas              173      40           1608           altitude separation
tot info          406      23           1052           information measure
flex              12423    16           567            lexical analyzer generator
grep              12653    4            809            pattern matcher
gzip              6576     5            214            file compressor
sed               11990    3            360            stream editor
space             9564     28           13585          ADL interpreter
Average           4678.33  14.92        3111.83

Table 1: Experiment objects
FIESTA) by comparing the precision of these techniques with and without the debuggability transformation.
To answer these research questions, we have performed a series of experiments by applying Tarantula, Ochiai, Op2, and FIESTA to the 12 target programs. We describe the details of the experiment setup in the following subsections.
4.1. Objects of Analysis
As objects of the empirical study, we used the seven Siemens programs (419.1 LOC on average) and five non-trivial real-world programs (flex 2.4.7, grep 2.2, gzip 1.1.2, sed 1.18, and space) in the SIR benchmark suite [16] (10641.2 LOC on average). Among the faulty versions of those programs, we excluded the faulty versions for which no test case or only one test case fails. Table 1 describes the target programs and the number of their faulty versions on which we conducted fault localization experiments. The target programs have 154 faulty versions each of which has a single fault (only one line is different from the non-faulty version) and 25 faulty versions each of which has multiple faults (more than one line is different from the non-faulty version). In addition, we used all test cases in the SIR benchmark except the ones that crash a target program, since we used gcov to measure coverage information and gcov cannot measure coverage information if a target program crashes.
4.2. Variables and Measures
To address our research questions, our experiments manipulated two independent variables:
IV1: Debuggability transformation
We examined the reduced number of CCTs and the increased precision of
the four fault localization techniques through the debuggability transformation (regarding RQ1 and RQ4).
IV2: Fault localization technique
We examined four fault localization techniques, Tarantula, Ochiai, Op2,
and FIESTA, each of whose precision is evaluated (regarding RQ3 and
RQ4).
We have the following dependent variables that are measured through the
experiments to answer the research questions:
DV1: Numbers of CCTs for the original target program P and for the transformed target program P′
To measure how much the debuggability transformation technique reduces the number of CCTs, we measured the number of CCTs for P and P′ (regarding RQ1). If the transformation technique is effective, the number of CCTs will significantly decrease.
DV2: Numbers of CCTs and TCTs in the given range of fault weight values
To measure the accuracy of the fault weight metric, we equally divided the range of a fault weight value (i.e., [0, 1]) into 10 partitions and measured the numbers of CCTs and TCTs whose weights fall into each partition (regarding RQ2). If the fault weight metric is effective, test cases with higher fault weights will be more likely to be CCTs.
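This partitioning can be sketched as a simple histogram over passing tests (a hedged illustration; the sample data below is hypothetical, not from the experiments):

```python
# Sketch of DV2: bin passing tests by fault weight into 10 equal
# partitions of [0, 1] and report the CCT ratio per bin.

def cct_ratio_per_bin(passing, n_bins=10):
    # passing: list of (fault_weight, is_cct) pairs.
    bins = [[0, 0] for _ in range(n_bins)]  # [ccts, total] per bin
    for w, is_cct in passing:
        i = min(int(w * n_bins), n_bins - 1)  # weight 1.0 goes in the last bin
        bins[i][1] += 1
        bins[i][0] += int(is_cct)
    return [c / t if t else None for c, t in bins]

# Hypothetical data: high-weight passing tests are mostly CCTs.
passing = [(0.95, True), (0.92, True), (0.97, False), (0.35, False),
           (0.31, True), (0.38, False), (0.36, False)]
ratios = cct_ratio_per_bin(passing)
print(round(ratios[9], 2), ratios[3])  # 0.67 0.25
```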
DV3: The percentage of executed statements examined until a fault is
identified
We measured the % of executed statements examined to localize a fault
as a measurement of the precision of Tarantula, Ochiai, Op2, and FIESTA (regarding RQ3 and RQ4). The precision of the fault localization
technique will inversely correlate with the percentage.
4.3. Experiment Setup
We implemented Tarantula, Ochiai, Op2, and FIESTA in 2600 lines of C++ code with CIL 1.3.7. The target programs are compiled and instrumented at the statement level by using gcc. After each execution of the instrumented object program finishes, we use gcov to get the statement coverage of the program execution on a given test case. Our tools parse the statement coverage reported by gcov for each test case and construct a table that shows which statements are covered by which test cases (see Figure 2). In our experiments, we consider only executed statements that can be tracked by gcov. Each executed statement is given a suspiciousness score according to the suspiciousness metrics of Tarantula, Ochiai, Op2, and FIESTA and is ranked in descending order of the suspiciousness scores. Statements with the same suspiciousness score are assigned the lowest rank the statements can have, since we assume that the developer must examine all of the statements with the same suspiciousness to find faulty statements. All experiments were performed on 5 machines equipped with Intel i5 3.6 GHz CPUs and 8 GB of memory, running Debian 6.05.
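The tie-aware ranking rule described above (each tied statement receives the worst rank in its tie group) can be sketched as follows (illustrative; statement names and scores are hypothetical):

```python
# Sketch of the ranking rule above: a statement's rank is the number of
# statements with suspiciousness >= its own, i.e., the lowest (worst)
# rank within its tie group, since the developer must examine every
# equally-suspicious statement.

def rank_with_ties(scores):
    # scores: {statement: suspiciousness}; returns {statement: rank}.
    return {s: sum(1 for v in scores.values() if v >= w)
            for s, w in scores.items()}

ranks = rank_with_ties({"s1": 0.9, "s2": 0.7, "s3": 0.7, "s4": 0.5})
print(ranks)  # s1 -> 1, s2 and s3 -> 3 (shared worst rank), s4 -> 4
```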
4.4. Threats to Validity
The primary threat to external validity for our study involves the representativeness of our object programs, since we have examined only 12 C programs. However, since the target programs include various faulty versions of five real-world target programs with test suites that support controlled experimentation, and the target programs are widely used in fault localization research, we believe that this threat to external validity is limited.
The second threat involves the representativeness of the coverage-based fault
localization (CFL) techniques that we applied (i.e., Tarantula, Ochiai, Op2, and
FIESTA). However, since Tarantula and Ochiai are considered as representative
CFL techniques in literature and Op2 is known for its high precision [17], we
believe that this threat is also limited.
The third possible threat is the assumption that we already have many test cases when applying CFL techniques to a target program. However, since we can generate many test cases using automated test generation techniques (e.g., random testing, concolic testing, or test generation strategies for fault localization [18, 19, 20]), this threat is limited too.
A primary threat to internal validity is possible faults in the tools that implement Tarantula, Ochiai, Op2, and FIESTA. We controlled this threat through
extensive testing of our tool.
5. Results of the Experiments
We have applied our implementation of Tarantula, Ochiai, Op2, and FIESTA to the 179 versions of the 12 programs (the whole set of experiments took around three hours running on 5 machines). From the data obtained from the series of experiments, we answer the four research questions in the following subsections.
5.1. Regarding RQ1: Reduced Number of CCTs through the Debuggability Transformation
Table 2 describes statistics on the reduced number of CCTs through the transformation (see Section 2) for the target program versions that have faulty compound conditional statements.5 Among the total 179 versions of the target programs, only 42 (=3+11+4+20+1+3) versions have faulty compound conditional statements (see the second column of Table 2). For the 42 versions of the target programs that have faulty compound conditional statements, the

5 For a target program version that does not contain a faulty compound conditional statement, the number of CCTs does not change for P′.
Target     # of faulty  Avg. # of     Avg. # of     Avg. # of CCTs           Reduced
programs   ver. with    faulty comp.  lines of the  Original    Transformed  number
           faulty       stmt. per     expanded      program     program      of CCTs
           comp. stmt.  version       faulty stmt.  P           P′           (%)
ptok2      3            1.00          2.67          1947.33     1897.33      2.57
replace    11           1.00          5.27          1647.55     1086.09      34.08
schedule2  4            1.00          4.00          2618.75     2166.25      17.28
tcas       20           1.20          5.38          1049.37     716.68       31.70
tot info   1            1.00          2.00          844.00      806.00       4.50
flex       3            1.00          2.67          330.67      58.67        82.26
Average    7.0          1.03          3.66          1406.28     1121.84      28.73

Table 2: Reduced number of CCTs through the debuggability transformation
number of CCTs decreases by 28.73% on average (see the last column of Table 2). For example, three versions among the 16 faulty versions of flex have 1.00 faulty compound conditional statement each on average (see the first to the third columns of the seventh row of Table 2). In addition, each such faulty compound statement is transformed into 2.67 statements on average (see the fourth column of the seventh row of Table 2) and the number of CCTs decreases from 330.67 for P to 58.67 for P′ on average (i.e., 82% of the CCTs are removed) (see the fifth to the seventh columns of the table).
Note that bug studies in the literature show that incorrect conditional statements are often the causes of bugs. For example, the causes of 22% of the bugs in an operating system at IBM [21] and 25% of the bugs in 12 open-source programs [22] resided in incorrect conditional statements. Therefore, we can conclude that the debuggability transformation of FIESTA, which targets incorrect compound conditional statements, reduces the number of CCTs effectively.
5.2. Regarding RQ2: Accuracy of the Fault Weight Metric on Test Cases
Table 3 describes statistics on the test cases used to localize a fault in the target programs. The average number of CCTs (i.e., 1111.5) is comparable to that of TCTs (i.e., 1717.3) (see the third column and the fifth column of the last row of the table).6 The average fault weight of the CCTs (i.e., 0.86) is higher than that of the TCTs (i.e., 0.66) (see the fourth column and the sixth column of the last row of the table).
Figure 3 shows the ratios of CCTs and TCTs for the 179 versions of the 12 target programs according to their fault weight values. It shows that passing test cases with high weights are more likely to be CCTs. For example, a set of passing test cases whose weights are between 0.9 and 1 (mean value 0.95)

6 Note that the numbers of CCTs may be different for different versions of a target program P due to the different faulty lines in the different versions of P.
Target      Failing      Passing test cases                          Total test
programs    test cases   CCT                    TCT                  cases used
            Avg. # of    Avg. # of   Avg. fault Avg. # of Avg. fault Avg. # of  Avg. fault
            test cases   test cases  weight     test cases weight    test cases weight
ptok        69.1         1960.1      0.75       2100.7     0.69      4130.0     0.72
ptok2       229.3        1187.9      0.83       2697.8     0.76      4115.0     0.80
replace     100.7        2218.8      0.79       3222.5     0.65      5542.0     0.72
schedule    142.8        1431.2      0.96       1069.0     0.64      2643.0     0.83
schedule2   29.0         2518.9      0.90       153.6      0.50      2701.4     0.87
tcas        40.6         852.6       0.90       714.8      0.62      1608.0     0.78
tot info    83.7         680.1       0.85       288.1      0.50      1052.0     0.77
flex        255.7        100.1       0.83       183.2      0.73      538.9      0.88
grep        241.5        36.0        0.87       486.8      0.67      764.3      0.78
gzip        47.6         37.4        0.96       128.0      0.81      213.0      0.91
sed         48.7         13.0        0.86       234.0      0.72      295.7      0.77
space       1954.2       2302.1      0.83       9328.6     0.67      13585.0    0.74
Average     270.3        1111.5      0.86       1717.3     0.66      3099.0     0.80

Table 3: Fault weights of test cases for the target programs
Figure 3: Ratios of CCTs and TCTs for different fault weights
consists of 86.4% CCTs and 13.6% TCTs (see the small circle at the top right
part of Figure 3). In other words, the probability of a passing test case with
weight larger than 0.9 to be a CCT is 86.4%. In contrast, the probability of a
passing test case with a fault weight between 0.3 and 0.4 (mean value 0.35) to
be a CCT is only 7.1% (see the small circle at the bottom of Figure 3).
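FIESTA's fault weight metric itself is defined in Section 3. As a rough illustration of the intuition measured here, namely that a passing test whose coverage overlaps heavily with the failing tests' coverage is more likely to be a CCT, one could compute an overlap ratio like the following. This is an illustrative sketch under simplified assumptions of our own, not FIESTA's actual formula, and the coverage bitmaps are hypothetical:

```c
#include <assert.h>

#define N_STMTS 8  /* hypothetical number of executed statements */

/* Overlap-based weight: the fraction of the statements covered by a
   passing test that are also covered by some failing test. Higher
   values suggest the passing run tracked the failing runs closely,
   which makes it a CCT candidate. */
static double overlap_weight(const int cov_pass[N_STMTS],
                             const int cov_fail_union[N_STMTS]) {
    int covered = 0, overlap = 0;
    for (int i = 0; i < N_STMTS; i++) {
        if (cov_pass[i]) {
            covered++;
            if (cov_fail_union[i])
                overlap++;
        }
    }
    return covered ? (double)overlap / covered : 0.0;
}
```

Under this sketch, a passing test that stays entirely inside the failing tests' footprint gets weight 1.0, while one that mostly exercises unrelated code gets a weight near 0, matching the CCT/TCT separation observed in Figure 3.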
| Target program | Tarantula | Ochiai | Op2 | FIESTA |
|---|---|---|---|---|
| print tokens | 27.45 | 19.02 | **10.41** | 11.20 |
| print tokens2 | 12.88 | 7.47 | **1.62** | 2.31 |
| replace | 9.19 | 6.98 | **5.44** | 7.74 |
| schedule | 5.95 | **3.51** | 17.17 | 10.45 |
| schedule2 | 52.85 | 46.83 | 40.48 | **28.55** |
| tcas | 28.31 | 28.62 | 27.14 | **21.17** |
| tot info | 27.30 | 21.31 | **13.17** | 16.11 |
| flex | 29.55 | **12.51** | 13.54 | 12.84 |
| grep | 21.71 | 1.68 | **1.23** | **1.23** |
| gzip | 17.15 | 7.91 | 7.91 | **7.49** |
| sed | 4.94 | 1.32 | **1.29** | **1.29** |
| space | 5.83 | **2.80** | 4.90 | 3.70 |
| Average | 20.26 | 13.33 | 12.03 | **10.34** |

Table 4: % of executed statements examined to localize a fault in the target programs (i.e., precision); the most precise result for each program is in bold
Therefore, we can conclude that the proposed fault weight metric is accurate
enough to approximately distinguish CCTs from TCTs. Consequently, FIESTA,
which utilizes the fault weight metric, can improve the precision of fault localization by reducing the negative effect of CCTs.
5.3. Regarding RQ3: Precision of FIESTA
Table 4 shows the precision of fault localization on the target programs
achieved by Tarantula, Ochiai, Op2, and FIESTA; we mark the most precise
result for each program in bold font. For the target programs, FIESTA localizes
a fault after reviewing 10.34% of the executed statements on average (see the
last column of the last row of the table).
The comparison results with Tarantula, Ochiai, and Op2 are as follows.
FIESTA is 49% (= (20.26 − 10.34)/20.26), 22% (= (13.33 − 10.34)/13.33), and
14% (= (12.03 − 10.34)/12.03) relatively more precise than Tarantula, Ochiai,
and Op2 on average, respectively
(see the last row of Table 4). In addition, FIESTA localizes a fault more
precisely than Tarantula in all target programs except schedule. With the
exception of replace, schedule, flex, and space, FIESTA localizes a fault
more precisely than Ochiai. Compared to Op2, FIESTA localizes a fault more
precisely in six programs (i.e., schedule, schedule2, tcas, flex, gzip, and
space) and with the same precision in grep and sed. Furthermore, in the five
real-world programs, FIESTA localizes a fault at least as precisely as Op2.
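The relative-precision figures above all follow from one formula, (baseline − FIESTA)/baseline, applied to the averages in Table 4. A quick check of the 49%, 22%, and 14% claims:

```c
#include <assert.h>
#include <math.h>

/* Relative improvement in percent: how much smaller FIESTA's
   "% of executed statements examined" is than a baseline's. */
static double rel_improvement(double baseline, double fiesta) {
    return 100.0 * (baseline - fiesta) / baseline;
}
```

For example, rel_improvement(20.26, 10.34) evaluates to about 48.96, which the paper rounds to 49%.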
Also, Figure 4 shows that FIESTA is more precise than Tarantula, Ochiai,
and Op2 from a different angle: it plots, for a given percentage of executed
statements examined (x axis), the percentage of target program versions whose
faults are detected (y axis). As shown in Figure 4, FIESTA always detects a
fault in more target versions than Tarantula does after examining the same
number of target statements. FIESTA also detects a fault in more target
versions than Ochiai and Op2 after examining the same number of target
statements, except when the percentage of executed statements examined is
between 8-12% for Ochiai and between 4-20% for Op2. For example, by examining
2% of executed statements, FIESTA can find faults in 37% of all 179 target
program versions (i.e., 67 versions) while Tarantula, Ochiai, and Op2 can find
faults in 22% (40 versions), 31% (56 versions), and 33% (59 versions) of the
target program versions, respectively (see the small circles at the left part of
the figure).

Figure 4: Accumulated % of target versions whose faults are localized after examining a certain amount of target statements
5.4. Regarding RQ4: Improved Precision due to the Debuggability Transformation
Table 5 shows how much the debuggability transformation improves the
precision of the four CFL techniques. The table shows that the debuggability
transformation relatively increases the precision of Tarantula, Ochiai, Op2, and
FIESTA by 4.62%, 6.51%, 6.11%, and 7.72% on average, respectively (see the
last row of the table). For example, for replace, FIESTA without the transformation finds a fault after reviewing 10.88% of the executed statements, while
FIESTA with the transformation does so after reviewing 7.74%, which is a
28.82% (= (10.88 − 7.74)/10.88) relative improvement (see the fourth row of
Table 5).

In Tables 5 and 6, "w/o" and "w/" denote the percentage of executed statements examined without and with the debuggability transformation, respectively, and "Rel. (%)" denotes the relative improvement.

| Target program | Tarantula w/o | w/ | Rel. (%) | Ochiai w/o | w/ | Rel. (%) | Op2 w/o | w/ | Rel. (%) | FIESTA w/o | w/ | Rel. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p tokens | 27.45 | 27.69 | -0.86 | 19.02 | 19.02 | 0.00 | 10.41 | 10.41 | 0.00 | 11.20 | 11.20 | 0.00 |
| p tokens2 | 12.88 | 13.45 | -4.48 | 7.47 | 7.57 | -1.42 | 1.62 | 1.68 | -3.21 | 2.41 | 2.31 | 4.29 |
| replace | 9.19 | 8.54 | 7.15 | 6.98 | 6.34 | 9.21 | 5.44 | 4.69 | 13.90 | 10.88 | 7.74 | 28.82 |
| schedule | 5.95 | 6.49 | -9.11 | 3.51 | 3.65 | -3.85 | 17.17 | 17.30 | -0.79 | 10.32 | 10.45 | -1.27 |
| schedule2 | 52.85 | 32.37 | 38.75 | 46.83 | 29.04 | 37.99 | 40.48 | 26.27 | 35.09 | 43.01 | 28.55 | 33.62 |
| tcas | 28.31 | 19.10 | 32.53 | 28.62 | 17.50 | 38.86 | 27.14 | 19.65 | 27.61 | 27.29 | 21.17 | 22.44 |
| tot info | 27.30 | 27.84 | -2.00 | 21.31 | 21.89 | -2.71 | 13.17 | 13.41 | -1.80 | 15.76 | 16.11 | -2.22 |
| flex | 29.55 | 28.31 | 4.21 | 12.51 | 11.51 | 8.01 | 13.54 | 12.72 | 6.02 | 13.75 | 12.84 | 6.67 |
| grep | 21.71 | 22.11 | -2.30 | 1.68 | 1.75 | -4.30 | 1.23 | 1.23 | 0.00 | 1.23 | 1.23 | 0.00 |
| gzip | 17.15 | 17.50 | -2.04 | 7.91 | 7.91 | 0.00 | 7.91 | 7.91 | 0.00 | 7.51 | 7.49 | 0.30 |
| sed | 4.94 | 5.25 | -6.11 | 1.32 | 1.36 | -3.41 | 1.29 | 1.33 | -3.49 | 1.29 | 1.29 | 0.00 |
| space | 5.86 | 5.88 | -0.32 | 2.80 | 2.81 | -0.24 | 4.90 | 4.90 | -0.05 | 3.70 | 3.70 | 0.00 |
| Average | 20.26 | 17.89 | 4.62 | 13.33 | 10.86 | 6.51 | 12.03 | 10.13 | 6.11 | 12.36 | 10.34 | 7.72 |

Table 5: Effect of the transformation technique on the precision of Tarantula, Ochiai, Op2, and FIESTA

| Target program | Tarantula w/o | w/ | Rel. (%) | Ochiai w/o | w/ | Rel. (%) | Op2 w/o | w/ | Rel. (%) | FIESTA w/o | w/ | Rel. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p tokens2 | 20.03 | 20.97 | -4.69 | 10.64 | 10.49 | 1.47 | 1.72 | 1.72 | 0.00 | 1.88 | 1.41 | 25.00 |
| replace | 7.13 | 4.09 | 42.66 | 5.84 | 3.43 | 41.36 | 4.21 | 1.87 | 55.59 | 8.43 | 2.57 | 69.49 |
| schedule2 | 84.41 | 35.60 | 57.82 | 77.29 | 34.89 | 54.85 | 65.22 | 31.45 | 51.78 | 65.41 | 31.08 | 52.48 |
| tcas | 38.72 | 16.08 | 58.47 | 39.89 | 14.91 | 62.63 | 37.24 | 20.85 | 44.01 | 37.94 | 24.05 | 36.62 |
| tot info | 42.37 | 34.75 | 18.00 | 24.58 | 20.34 | 17.24 | 13.56 | 8.47 | 37.50 | 14.41 | 10.17 | 29.41 |
| flex | 19.38 | 10.96 | 43.44 | 9.26 | 3.63 | 60.76 | 6.69 | 2.07 | 69.12 | 7.14 | 2.14 | 69.97 |
| Average | 35.34 | 20.41 | 35.95 | 27.92 | 14.61 | 39.72 | 21.44 | 11.07 | 43.00 | 22.53 | 11.90 | 47.16 |

Table 6: Effect of the transformation technique on the precision of Tarantula, Ochiai, Op2, and FIESTA for versions containing faulty compound conditional statements
Note that the transformation improves the precision substantially for the
target versions that contain faulty compound conditional statements. Table 6
shows the improved precision on such versions only (i.e., 42 out of 179 versions
(see the second column of Table 2)). For such versions of the target programs,
the debuggability transformation relatively increases the precision of Tarantula,
Ochiai, Op2, and FIESTA significantly by 35.95%, 39.72%, 43.00%, and 47.16%,
respectively (see the last row of Table 6).

Footnote 7: For the 137 versions that have no faulty compound conditional statements, the transformation relatively decreases the precision of Tarantula, Ochiai, Op2, and FIESTA by 5.53%, 3.96%, 2.46%, and 0.76%, respectively.

Therefore, if we can generalize the experiment results, we can conclude that
the debuggability transformation technique can improve the precision of CFL
techniques in general.
6. Limitations
Although FIESTA localizes a fault after examining 10.34% of the executed
statements on average (see Table 4), there are 8 versions out of the total 179
versions of the target programs for which FIESTA examined more than 50% of
the executed statements to localize a fault. We can classify these 8 versions
into the following four categories according to the reasons for the low precision:
• A large number of statements covered together (one version)
• Correct statements executed by only failing test cases and CCTs (one
version)
• A fault in a large basic block (three versions)
• Narrow distribution of the suspiciousness scores (three versions)
Through the detailed analysis of these versions, we found that these limitations are common to most coverage-based fault localization techniques, not
specific to FIESTA. For example, the precision of FIESTA for these versions is
62.99% on average, and the precision of Tarantula, Ochiai, and Op2 for these
versions is 74.40%, 67.26%, and 53.66% on average, respectively.
6.1. Large Number of Test Cases that Cover the Same Statements
The precision of FIESTA for flex version 8 was 52.11%. A main reason for
this imprecise result is that more than half of the executed statements are
covered by the same set of test cases. Specifically, 396 statements (52.11% of
the executed statements), including the faulty statements, are covered together
by each of 553 test cases (out of the 567 test cases in total), so those 396
statements have the same suspiciousness scores. Although the suspiciousness
score of the faulty statement is the highest, the percentage of code to examine
exceeds 50%, since the 396 statements that share the same suspiciousness score
must be examined together. This problem could be mitigated by constructing
test cases that exercise different execution paths.
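The tie effect described above comes directly from how the examination cost is computed: every statement whose suspiciousness is at least the faulty statement's must be examined, so a large tied group dominates the cost. The following is a sketch of this common worst-case convention; the score arrays are hypothetical:

```c
#include <assert.h>

/* Worst-case % of executed statements examined: all statements whose
   suspiciousness is >= that of the faulty statement, ties included. */
static double pct_examined(const double susp[], int n, int faulty_idx) {
    int examined = 0;
    for (int i = 0; i < n; i++)
        if (susp[i] >= susp[faulty_idx])  /* examined before/with the fault */
            examined++;
    return 100.0 * examined / n;
}
```

With three statements tied at the top score, the fault costs 75% of a four-statement program to reach even though its score is the highest, mirroring the flex case above.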
6.2. Correct Statements Executed by Only Failing Test Cases and CCTs
The precision of FIESTA for tot info version 10 was 51.75%. An important
reason is that this version has statements that are executed by only failing test
cases and CCTs. Figure 5 shows the case of tot info version 10. S1 is a correct
statement that has a higher rank than a faulty statement S2 for the following
reason. We found that all 8 failing test cases execute both S1 and S2, and there
are 789 CCTs that execute S2 (121 CCTs execute both S1 and S2 and 668 CCTs
execute S2 only) and passing(S1) ⊂ passing(S2) (i.e., S1 is not executed by
TCTs). In this situation, the 668 CCTs have lower fault weights on average
(i.e., 0.84) than the 121 CCTs (i.e., 0.92), because the 668 CCTs execute fewer
of the statements that are executed by the failing test cases than the 121 CCTs
do. Thus, S1 has a higher suspiciousness score than S2.

Figure 5: Exceptional case where a correct statement S1 is executed by only failing test cases and CCTs
Note that this case is exceptional and problematic for most coverage-based
fault localization techniques such as Tarantula, Ochiai, and Op2. For example,
the precision of Tarantula, Ochiai, and Op2 for this version is 60.53%, 58.77%,
and 51.75%, respectively.
6.3. Fault in a Large Basic Block
tcas versions 3, 12, and 34 have a fault in the initial basic block of the main()
function. This basic block consists of 26 statements, which are more than 40%
of the executed statements of tcas (i.e., these 26 statements are always executed
together). Although the suspiciousness score of the fault is 1 (i.e., the highest
score), the percentage of executed statements to examine is at least 40%, since
the 26 statements in the same basic block must be examined together.
6.4. Narrow Distribution of the Suspiciousness Scores
For the remaining cases, we found that the difference between the suspiciousness score of a faulty statement and the high scores of non-faulty statements
is negligibly small. In other words, although the suspiciousness score of a faulty
statement is very high, the suspiciousness rank of the faulty statement can be
low, because there can be many non-faulty statements whose suspiciousness
scores are higher than that of the faulty statement by a very small degree (i.e.,
a difference of less than 0.03).
Table 7 shows that the percentages of executed statements examined by
FIESTA in schedule2 versions 7 and 10 and tot info version 21 are 58.87% on
average (see the third column of the last row of Table 7). However, the differences between the suspiciousness scores of the faulty statements and the highest
suspiciousness scores of non-faulty statements are very small (i.e., 0.014 on average). For example, for schedule2 version 7 (see the second row of Table 7),
the faulty statement has a very high suspiciousness score, almost the same as
the top score, but 78 non-faulty statements have slightly higher suspiciousness
scores, which makes the percentage of executed statements examined high (i.e.,
61.59%).
| Target program | Version | % of exec. stmts examined | Highest susp. | Susp. of the fault | Difference |
|---|---|---|---|---|---|
| sched2 | 7 | 61.59 | 0.961 | 0.945 | 0.016 |
| sched2 | 10 | 51.85 | 0.962 | 0.942 | 0.020 |
| tot info | 21 | 63.16 | 0.919 | 0.914 | 0.005 |
| Avg. | | 58.87 | 0.947 | 0.934 | 0.014 |

Table 7: Differences between the suspiciousness scores of the faulty statements and high-rank non-faulty statements
7. Related Work
Wong et al. [23] and Alipour [24] survey various fault localization approaches,
including program state-based methods [2], machine learning-based fault localization methods [25], coverage-based fault localization methods, and so on.
Since coverage-based fault localization techniques utilize only the test coverage
of a target program, CFL can be used in a complementary way together with
other fault localization techniques such as program slicing [26], which utilizes
data dependencies, or delta debugging [2], which utilizes execution states. For
example, we can use FIESTA together with JINSI [27], which employs delta
debugging and program slicing, to locate the suspicious statements pointed to
by both techniques and thereby improve precision. In this section, we focus on
reviewing related work on coverage-based fault localization techniques.
7.1. Coverage-based Fault Localization (CFL) Techniques
Tarantula [3] and Ochiai [4] are popular CFL techniques that use statement
coverage and test execution results to compute the suspiciousness of each statement (see Section 3.1). Op2 [5] is a relatively new CFL technique that is known
to be more precise than Tarantula and Ochiai [17]. Dallmeier et al. [28]
collect execution data of a Java program and analyze method call sequences to
identify a fault. Santelices et al. [29] extend Tarantula by targeting not only
statements, but also branches and du-pairs. The technique maps branches and
du-pairs to corresponding statements and calculates the average suspiciousness
scores of these statements to improve the precision of fault localization. Wong
et al. [30] consider ‘contribution’ of a test case for calculating suspiciousness of
a statement based on the idea that a test case that executes a target statement s
later should contribute less to the suspiciousness score of s than a test case that
executed s earlier. Yoo [31] mechanically generates and evaluates various suspiciousness formulas through genetic programming techniques on five variables
such as |passing(s)| and |failing(s)| that are used in the suspiciousness formulas of
popular CFL techniques such as Tarantula, Ochiai, and Op2. Agrawal et al. [32]
compute a 'dice', a set of statements that are executed by a failing test case
but not by a passing test case and thus suspected to contain a fault. Renieres
et al. [33] extend Agrawal et al. [32] and compare the spectra of passing test
runs and a failing test run to select the passing test run that has the shortest
distance to the given failing run. Baah et al. [34, 35] utilize dynamic data
dependencies to
improve the precision of CFL through a causal-inference technique (see Footnote 8). Gopinath et al. [19] improve the precision of CFL techniques by giving
high suspiciousness scores to the statements pointed to by both a SAT-based
fault localization technique [36] and Tarantula [3].
7.2. Techniques to Solve CCT Problems
Masri et al. [9] present an approach that removes CCTs by using a clustering
technique. The technique selects a set of suspicious statements (called CCEs)
that are executed by all failing test cases and by a number of passing test
cases (the number is given by a user as a threshold). The technique then
clusters test cases into two groups based on the similarity between the executed
statements of the test cases and the CCEs, and the group that is more closely
related to the CCEs is removed.
A coverage refinement approach adjusts program coverage to strengthen the
correlation between faults and failing test runs [8]. The work defines 'context
patterns', which are control and data flows before and after the execution of the
faulty statements. The authors define context patterns according to fault types
(e.g., missing assignment) and use them to adjust the coverage of test cases to
reduce the negative effect of CCTs. However, this approach requires human
knowledge of the fault types to define context patterns. In contrast, FIESTA
does not require the types of program faults to be known, but still filters out
the negative effect of CCTs through the debuggability transformation (see
Section 2) and the fault weights on test cases (see Section 3).
The motivation of the fault weight metric of FIESTA is similar to that of the
related work described in this section, in the sense that all of these techniques
try to estimate the 'distance' between passing test runs and failing test runs
to localize a fault precisely. However, the distance metrics utilized vary
depending on the technique (fault weight for FIESTA, CCE distance for Masri
et al. [9], Ulam distance for Renieres et al. [33], etc.). Furthermore, although
all related work mentioned in this section may suffer from false positives or
false negatives in recognizing CCTs, the debuggability transformation of
FIESTA reduces the number of CCTs without false positives or false negatives,
which can increase the precision of CFL techniques in general.
Similarly to the fault weight of FIESTA, Bandyopadhyay [37] assigns a
weight to a passing test case tp based on the CC-proximity [38] of tp with all the
failing test cases. However, the definition of the weight metric itself is different
from that of FIESTA since the weight of tp decreases as the proximity of tp
increases if the proximity is not between a low threshold L and a high threshold
H; the weight increases as the proximity increases otherwise. Thus, Bandyopadhyay [37] requires a user to select L and H, which affect the precision of the
fault localization, but FIESTA does not.

Footnote 8: The precision of FIESTA is higher than that of Baah et al. [34, 35], since the positive improvement of FIESTA over Ochiai is 10.79% while that of Baah et al. [34, 35] is 7.77%, although the target subjects are not identical.

Furthermore, FIESTA clearly demonstrated
the improved precision on the 12 target programs including the five real-world
programs compared with Tarantula, Ochiai, and Op2, whereas Bandyopadhyay [37]
applied Ochiai and his method only to the small Siemens programs and did not
show a clear precision improvement. Finally, we showed empirically that the
fault weight metric is accurate enough to recognize CCTs (see Figure 3), but
Bandyopadhyay did not.
7.3. Techniques to Utilize Predicates as a Target Program Spectrum
There have been several fault localization techniques that utilize predicates
in the program code and additional predicates on program execution information. For example, Liblit et al. [39] and Liu et al. [40] instrument/transform
a target program to record how program predicates (and additional predicates
such as if a return value of a function is equal to zero) are evaluated and rank
suspicious predicates. In other words, they utilize the numbers of failing tests
and passing tests where a predicate is true (or false) to localize a fault.
These techniques can also be considered to perform a debuggability transformation to extract useful execution information for fault localization. The
debuggability transformation of FIESTA, however, improves the precision of
fault localization actively by explicitly reducing the number of CCTs (see
Sections 2 and 5.1), while the above techniques improve the precision passively
by observing more execution information. In addition, those techniques handle
a predicate consisting of multiple clauses as a whole (although the effect of
compound predicates was briefly discussed in Abreu et al. [41]). In contrast,
FIESTA targets each clause of a predicate individually through the debuggability transformation.
8. Conclusion and Future Work
In this paper, we have described FIESTA which improves the precision of
coverage-based fault localization by reducing the negative effects of CCTs based
on the debuggability transformation and the fault weight metric on test cases.
Through the experiments on the 12 programs including five large real-world
programs, we have demonstrated that FIESTA is 49%, 22%, and 14% relatively
more precise than Tarantula, Ochiai, and Op2 on average, respectively. In
addition, compared with Tarantula, Ochiai, and Op2, FIESTA localizes a fault
equally or more precisely on 11, 8, and 8 of the 12 target programs, respectively.
Furthermore, the debuggability transformation improves the precision of CFL
techniques in general (i.e., it relatively increases the precision of Tarantula,
Ochiai, Op2, and FIESTA by 4.62%, 6.51%, 6.11%, and 7.72% on average,
respectively).
We will extend the debuggability transformation technique to generate not
only semantically equivalent programs but also mutants through mutation
operators. Our basic intuition is based on the observation that a faulty statement will still be faulty after mutating the statement, while a non-faulty statement will become faulty after the mutation (a similar observation is made by
Papadakis et al. [42]). Finally, we plan to extend our tool to support integrated
environments (e.g., coverage visualization, code inspection features, etc.) for
practical debugging, as discussed in Parnin and Orso [43].
References
[1] I. Vessey, Expertise in Debugging Computer Programs, International Journal of Man-Machine Studies 23 (5) (1985) 459–494.
[2] A. Zeller, Isolating Cause-Effect Chains from Computer Programs, in: European Software Engineering Conference/Foundations of Software Engineering (ESEC/FSE), 2002.
[3] J. A. Jones, M. J. Harrold, J. Stasko, Visualization of Test Information to Assist Fault Localization, in: International Conference on Software Engineering (ICSE), 2002.
[4] R. Abreu, P. Zoeteweij, A. van Gemund, An Evaluation of Similarity Coefficients for Software Fault Localization, in: Pacific Rim International Symposium on Dependable Computing (PRDC), 2006.
[5] L. Naish, H. J. Lee, K. Ramamohanarao, A Model for Spectra-based Software Diagnosis, ACM Transactions on Software Engineering and Methodology 20 (3) (2011) 11:1–11:32.
[6] M. Daran, Software Error Analysis: A Real Case Study Involving Real Faults and Mutations, in: International Symposium on Software Testing and Analysis (ISSTA), 1996.
[7] B. Baudry, F. Fleurey, Y. L. Traon, Improving Test Suites for Efficient Fault Localization, in: International Conference on Software Engineering (ICSE), 2006.
[8] X. Wang, S. C. Cheung, W. K. Chan, Z. Zhang, Taming Coincidental Correctness: Coverage Refinement with Context Patterns to Improve Fault Localization, in: International Conference on Software Engineering (ICSE), 2009.
[9] W. Masri, R. A. Assi, Cleansing Test Suites from Coincidental Correctness to Enhance Fault-Localization, in: International Conference on Software Testing, Verification and Validation (ICST), 2010.
[10] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, M. Roper, Testability Transformation, IEEE Transactions on Software Engineering 30 (1) (2004) 3–16.
[11] G. Necula, S. McPeak, S. Rahul, W. Weimer, CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs, in: International Conference on Compiler Construction (CC), 2002.
[12] N. DiGiuseppe, J. Jones, On the Influence of Multiple Faults on Coverage-based Fault Localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.
[13] C. Liu, J. Han, Failure Proximity: A Fault Localization-based Approach, in: International Symposium on Foundations of Software Engineering (FSE), 2006.
[14] A. Zheng, M. Jordan, B. Liblit, M. Naik, A. Aiken, Statistical Debugging: Simultaneous Identification of Multiple Bugs, in: International Conference on Machine Learning (ICML), 2006.
[15] J. Jones, J. Bowring, M. Harrold, Debugging in Parallel, in: International Symposium on Software Testing and Analysis (ISSTA), 2007.
[16] H. Do, S. Elbaum, G. Rothermel, Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact, Empirical Software Engineering 10 (4) (2005) 405–435.
[17] X. Xie, T. Y. Chen, F. C. Kuo, B. Xu, A Theoretical Analysis of the Risk Evaluation Formulas for Spectrum-Based Fault Localization, ACM Transactions on Software Engineering and Methodology (TOSEM).
[18] S. Artzi, J. Dolby, F. Tip, M. Pistoia, Directed Test Generation for Effective Fault Localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2010.
[19] D. Gopinath, R. N. Zaeem, S. Khurshid, Improving Effectiveness of Spectra-Based Fault Localization Using Specifications, in: Automated Software Engineering (ASE), 2012.
[20] J. Rößler, G. Fraser, A. Zeller, A. Orso, Isolating Failure Causes through Test Case Generation, in: International Symposium on Software Testing and Analysis (ISSTA), 2012.
[21] R. Chillarege, W. Kao, R. G. Condit, Defect Type and its Impact on the Growth Curve, in: International Conference on Software Engineering (ICSE), 1991.
[22] J. A. Duraes, H. S. Madeira, Emulation of Software Faults: A Field Data Study and a Practical Approach, IEEE Transactions on Software Engineering 32 (11) (2006) 849–867.
[23] W. E. Wong, V. Debroy, A Survey of Software Fault Localization, Technical Report UTDCS-45-09, University of Texas at Dallas, 2009.
[24] M. A. Alipour, Automated Fault Localization Techniques: A Survey, Technical Report, Oregon State University, 2012.
[25] Y. Brun, M. D. Ernst, Finding Latent Code Errors via Machine Learning over Program Executions, in: International Conference on Software Engineering (ICSE), 2004.
[26] F. Tip, A Survey of Program Slicing Techniques, Journal of Programming Languages 3 (1995) 121–189.
[27] M. Burger, A. Zeller, Minimizing Reproduction of Software Failures, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.
[28] V. Dallmeier, C. Lindig, A. Zeller, Lightweight Defect Localization for Java, in: European Conference on Object-Oriented Programming (ECOOP), 2005.
[29] R. Santelices, J. A. Jones, Y. Yu, M. J. Harrold, Lightweight Fault-Localization Using Multiple Coverage Types, in: International Conference on Software Engineering (ICSE), 2009.
[30] W. E. Wong, V. Debroy, B. Choi, A Family of Code Coverage-based Heuristics for Effective Fault Localization, Journal of Systems and Software 83 (2010) 188–208.
[31] S. Yoo, Evolving Human Competitive Spectra-based Fault Localisation Techniques, in: Search Based Software Engineering (SBSE), 2012.
[32] H. Agrawal, J. R. Horgan, S. London, W. E. Wong, Fault Localization Using Execution Slices and Dataflow Tests, in: International Symposium on Software Reliability Engineering (ISSRE), 1995.
[33] M. Renieres, S. P. Reiss, Fault Localization with Nearest Neighbor Queries, in: International Conference on Automated Software Engineering (ASE), 30–39, 2003.
[34] G. K. Baah, A. Podgurski, M. J. Harrold, Causal Inference for Statistical Fault Localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2010.
[35] G. K. Baah, A. Podgurski, M. J. Harrold, Mitigating the Confounding Effects of Program Dependences for Effective Fault Localization, in: European Software Engineering Conference/Foundations of Software Engineering (ESEC/FSE), 2011.
[36] M. Jose, R. Majumdar, Cause Clue Clauses: Error Localization Using Maximum Satisfiability, in: Programming Language Design and Implementation (PLDI), 2011.
[37] A. Bandyopadhyay, Improving Spectrum-based Fault Localization Using Proximity-based Weighting of Test Cases, in: Automated Software Engineering (ASE), 2011.
[38] C. Liu, X. Zhang, J. Han, A Systematic Study of Failure Proximity, IEEE Transactions on Software Engineering 34 (2008) 826–843.
[39] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, M. I. Jordan, Scalable Statistical Bug Isolation, in: Programming Language Design and Implementation (PLDI), 2005.
[40] C. Liu, X. Yan, L. Fei, J. Han, S. P. Midkiff, SOBER: Statistical Model-based Bug Localization, in: Foundations of Software Engineering (FSE), 2005.
[41] R. Abreu, W. Mayer, M. Stumptner, A. van Gemund, Refining Spectrum-based Fault Localization Rankings, in: ACM Symposium on Applied Computing (SAC), 2009.
[42] M. Papadakis, Y. L. Traon, Using Mutants to Locate "Unknown" Faults, in: International Workshop on Mutation Analysis, 2012.
[43] C. Parnin, A. Orso, Are Automated Debugging Techniques Actually Helping Programmers?, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.