FIESTA: Effective Fault Localization to Mitigate the Negative Effect of Coincidentally Correct Tests

Seokhyeon Mun, Yunho Kim, Moonzoo Kim*
CS Dept. KAIST, South Korea

Abstract

One of the obstacles to precise coverage-based fault localization (CFL) is the existence of coincidentally correct test cases (CCTs), which are test cases that pass despite executing a faulty statement. To mitigate the negative effects of CCTs, we have proposed the Fault-weIght and atomizEd condition baSed localizaTion Algorithm (FIESTA), which transforms a target program with compound conditional statements into a semantically equivalent one with atomic conditions to reduce the number of CCTs. In addition, FIESTA utilizes a fault weight on a test case t that indicates the "likelihood" of t to execute a faulty statement. We have evaluated the effectiveness of the transformation technique and the fault weight metric through a series of experiments on 12 programs including five nontrivial real-world programs. Through the experiments, we have demonstrated that FIESTA is more precise than Tarantula, Ochiai, and Op2 for the target programs on average. Furthermore, the transformation technique improves the precision of CFL techniques in general (i.e., it increases the precision of all four CFL techniques: Tarantula, Ochiai, Op2, and FIESTA).

Keywords: Coverage-based fault localization, coincidentally correct test cases, debuggability transformation, fault weights of test cases

1. Introduction

Although many automated verification and validation (V&V) techniques for software have been proposed, developers still spend a large amount of time identifying the root causes of program failures. Fault localization is an expensive step in the whole debugging process [1, 2], because it usually requires human developers to understand the complex internal logic of the target program and reason about the differences between passing test runs and failing test runs step by step. Thus, there is large demand for automated fault localization techniques.

* Corresponding author
Email addresses: [email protected] (Seokhyeon Mun), [email protected] (Yunho Kim), [email protected] (Moonzoo Kim)

Preprint submitted to Elsevier, September 11, 2013

Coverage-based fault localization (CFL) is a fault localization technique that locates faults by assigning suspiciousness scores to program entities (i.e., statements or branches) based on code coverage information. A key idea of CFL is to utilize the correlation between failing test runs and faulty statements to identify the location of faults [3, 4, 5]. However, in practice, this correlation may not be strong enough to localize faults precisely [6]. One of the main causes that weaken this correlation is coincidentally correct test cases (CCTs), which are passing test cases that execute faulty statements. Faulty statements executed by CCTs are often considered non-faulty, since they are executed by passing test cases [7, 3, 8, 9].

In this paper, we propose the Fault-weIght and atomizEd condition baSed localizaTion Algorithm (FIESTA) to mitigate this negative effect of CCTs in the following two ways:

1. Transforming a target program to reduce the number of CCTs (debuggability transformation).
2. Developing a new suspiciousness metric that approximately alleviates the negative effect of CCTs.
First, FIESTA transforms a target program P into a semantically equivalent program P' by converting a compound conditional statement into nested conditional statements with atomic conditions (see Section 2). Through this debuggability transformation, the precision of CFL techniques (not only FIESTA) can increase due to the decreased number of CCTs (see Section 5.1). Note that this debuggability transformation technique can be applied to any CFL technique, such as Tarantula [3], Ochiai [4], and Op2 [5], to further improve its precision (Section 5.4). We can consider this transformation a parallel to 'testability' transformation [10].

Second, FIESTA defines a new suspiciousness metric that reduces the negative effect of CCTs by considering fault weights on test cases. We define a fault weight on a test case t to indicate the "likelihood" of t to execute a faulty statement (see Section 3) and utilize the fault weight values of the test cases that cover a target statement s to calculate the suspiciousness score of s.

In this paper, we set up the following research questions regarding the effectiveness of the debuggability transformation and the fault weight metric as well as the precision of FIESTA:

• RQ1: How effective is the transformation of a target program, in terms of the reduced number of CCTs?
• RQ2: How accurate is the fault weight metric on test cases, in terms of the probability that a passing test case with a high fault weight is a CCT?
• RQ3: How precise is FIESTA, specifically compared with Tarantula, Ochiai, and Op2, in terms of the % of executed statements examined to localize a fault?
• RQ4: How effective is the transformation of a target program, in terms of the improved precision of Tarantula, Ochiai, Op2, and FIESTA?

To answer these research questions, we have performed a series of experiments by applying Tarantula, Ochiai, Op2, and FIESTA to 12 programs including the seven Siemens programs and five non-trivial real-world programs (see Table 1). The experimental results confirm that the transformation technique effectively reduces the number of CCTs and that the fault weight metric is accurate enough to distinguish CCTs from truly correct test cases (TCTs), which are passing test cases that do not execute a faulty statement. Consequently, the experimental results demonstrate that FIESTA is 49%, 22%, and 14% relatively more precise than Tarantula, Ochiai, and Op2 on average, respectively. Furthermore, we demonstrate that the debuggability transformation technique can improve the precision of CFL techniques in general (i.e., it increases the precision of all four CFL techniques: Tarantula, Ochiai, Op2, and FIESTA).

The contributions of this paper are as follows:

• We propose debuggability transformation (Section 2) to improve the precision of CFL techniques in general by reducing the number of CCTs (Section 5.1 and Section 5.4).
• We define a fault weight metric on test cases (Section 3), which can improve the precision of fault localization (Section 5.2).
• We demonstrate that FIESTA, which has been developed based on the above two techniques, is more precise than Tarantula, Ochiai, and Op2 through an empirical study on the 12 programs including five real-world programs (Section 5.3).

The remainder of this paper is organized as follows. Section 2 describes the debuggability transformation to reduce the number of CCTs. Section 3 explains the new suspiciousness metric of FIESTA that utilizes fault weights of test cases.
Section 4 shows the experiment setup for the empirical evaluation of the techniques on the 12 target programs. Section 5 explains and analyzes the experiment results regarding the research questions. Section 6 discusses several limitations of CFL techniques observed through the experiments. Section 7 describes related work and compares the proposed techniques with it. Section 8 concludes this paper with future work.

2. Debuggability Transformation of a Target Program to Reduce CCTs

2.1. Overview

FIESTA transforms a target program P into a semantically equivalent program P' by converting a compound conditional statement into nested conditional statements with atomic boolean conditions (i.e., conditions without boolean operators).^1

// Original program P
1: if (x > 0 && y == 0 && z < 0) { // Fault: z<0 should be z==0
2:   stmt;
3: }

// Transformed program P'
11: if (x > 0) {
12:   if (y == 0) {
13:     if (z < 0) { // Fault: z<0 should be z==0
14:       stmt;
15: }}}

Figure 1: Example of P and P' that has been transformed from P by FIESTA

A motivation for this transformation is the observation that a target program often contains a compound conditional statement s_comp whose boolean condition (i.e., a predicate) consists of multiple clauses c1, c2, ..., cn combined with logical operators. If one such clause cm is faulty, s_comp becomes faulty and a passing test case tc_p that executes s_comp becomes a CCT, although tc_p may not execute cm due to short-circuit evaluation. Consequently, tc_p decreases the suspiciousness score of s_comp, since tc_p covers s_comp but passes.

We can alleviate this problem by decomposing a compound conditional statement s_comp into semantically equivalent nested atomic conditional statements. For example, FIESTA transforms the compound conditional statement at line 1 of P in Figure 1 into the semantically equivalent nested conditional statements in P' (lines 11-13). Suppose that the last atomic clause of line 1 in P is a fault (e.g., it should be z==0, not z<0) and a test case tc1 (x=0, y=0, z=0) that executes line 1 is a CCT for P. Note that tc1 is not a CCT for P', since tc1 does not execute the corresponding faulty statement (line 13) of P' (see Section 2.2.1). CCTs that execute line 1 of P are not CCTs for P' unless they satisfy x>0 && y==0. Thus, through the transformation of a target program P to P', such CCTs become truly correct test cases (TCTs) (i.e., passing test cases that do not execute a faulty statement) and the precision of coverage-based fault localization can be improved on P' due to the decreased number of CCTs.^2

Once the suspiciousness scores on the statements of P' are obtained, these scores are mapped to the statements of P (see Section 2.2.2). Finally, the user reviews the statements of P that have high suspiciousness scores to identify a faulty statement as usual. Note that this debuggability transformation technique can be applied to any CFL technique, including FIESTA, to further improve its precision.

^1 FIESTA also transforms exp?v1:v2 in P to if(exp) tmp=v1; else tmp=v2; in P'.
^2 One might expect that a loop unrolling transformation or a function inlining transformation could decrease the number of CCTs as well. However, our preliminary experimental results show that those transformations decrease the precision due to the increased number of faults introduced through unrolling and inlining (i.e., transforming a single-fault program into a multiple-fault program).
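To make the equivalence argument concrete, the following self-contained C sketch contrasts the two forms of Figure 1. The function names and the test harness are ours for illustration; FIESTA performs this rewrite automatically via CIL, so this is a hand-worked sketch of the idea, not the tool's literal output.

#include <stdio.h>

/* Original compound condition of Figure 1, with the seeded fault. */
static int p_compound(int x, int y, int z) {
    if (x > 0 && y == 0 && z < 0)   /* fault: z < 0 should be z == 0 */
        return 1;                   /* stmt */
    return 0;
}

/* Atomized form: each clause becomes its own conditional statement, so
 * a test that already fails x > 0 or y == 0 never covers the faulty
 * atomic condition and therefore cannot be a CCT for it. */
static int p_atomized(int x, int y, int z) {
    if (x > 0)
        if (y == 0)
            if (z < 0)              /* fault isolated to this line */
                return 1;           /* stmt */
    return 0;
}

int main(void) {
    /* tc1 (x=0, y=0, z=0) from Section 2.1: both versions behave the
     * same (semantic equivalence), but tc1 covers the compound faulty
     * line in P while never reaching the faulty atomic line in P'. */
    printf("P: %d  P': %d\n", p_compound(0, 0, 0), p_atomized(0, 0, 0));
    return 0;
}

A disjunction such as if (a || b) s1; would need the guarded statement duplicated (e.g., if (a) s1; else if (b) s1;) to preserve short-circuit behavior; the paper shows only the conjunction case, so this variant is our assumption about how atomization generalizes.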
2.2. Mapping of Statements between P and P'

FIESTA utilizes CIL (C Intermediate Language) [11] to transform a target C program P to P', and CIL generates basic mapping information between statements of P and P'. Based on this mapping information, FIESTA maps a faulty statement in P to statement(s) in P' (Section 2.2.1). In addition, FIESTA maps the suspiciousness scores on the statements in P' back to the original statements in P (Section 2.2.2).

2.2.1. Mapping Faulty Statements of P to P'

To measure how many CCTs are reduced through the transformation, we map a faulty compound conditional statement of P to atomic conditional statements of P'. There are four cases to consider for the mapping:

1. When a clause of a compound condition is faulty (e.g., z<0 is faulty in if(x>0 && y==0 && z<0), since it should be z==0): the statement(s) of P' that correspond to the faulty clause (i.e., if(z<0) in P') are marked as faulty statements.

2. When a boolean operator is faulty (e.g., if(x>0 && y==0) {s1;} should be if(x>0 || y==0) {s1;}): the statement(s) of P' that correspond to all operands of the faulty operator (e.g., if(x>0) and if(y==0) in P') are marked as faulty statements. This is because each of the statements that correspond to the operands can trigger errors. For example, with tc2 (x=1, y=1), the correct program (i.e., if(x>0 || y==0) {s1;}) will execute s1, but the faulty transformed program (i.e., if(x>0) { if(y==0) {s1;}}) will not execute s1 due to if(y==0). Similarly, with tc3 (x=0, y=0), the correct program will execute s1, but the faulty transformed program will not execute s1 due to if(x>0).

3. When a new clause is added (e.g., w!=0 is conjunctively added): the statement(s) of P' that correspond to the newly added clause (e.g., if(w!=0) in P') are marked as faulty statements.

4. When a clause is omitted (e.g., if(x<0 /*&& y<0 omitted*/) {s1;}): the statement(s) of P' that correspond to the statements that are control dependent on the omitted clause in P (i.e., y<0 in the example) are marked as faulty (i.e., s1 in the example), since such statements can trigger errors.

2.2.2. Mapping Suspiciousness Scores on the Statements of P' to P

Although we obtain suspiciousness scores on the statements of P', we have to examine the statements of P in descending order of their suspiciousness scores, because we would like to identify a fault in the original program eventually. In addition, P' can be larger than P through the transformation, which may cause a user to examine more statements even when the % of executed statements to examine to find a faulty statement for P' is smaller than the % of executed statements to examine for P. Thus, FIESTA maps the suspiciousness scores on the statements of P' to the statements of P.

Suppose that a compound conditional statement s_comp of P is transformed into nested conditional statements with atomic conditions s'_1, s'_2, ..., s'_n. Then, the suspiciousness score of s_comp is defined as the score of the s'_m (1 ≤ m ≤ n) that has the highest suspiciousness score among all of s'_1, s'_2, ..., s'_n. In addition, FIESTA reports s'_m to the user to help the user examine suspicious statements to localize a fault. The suspiciousness scores of the statements of P' that are not expanded from a compound conditional statement in P are directly mapped to their original statements in P.
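The score mapping above reduces to taking a maximum. A minimal sketch in C, with an assumed array layout (one score per expanded atomic statement of a compound conditional; the function name is ours):

#include <stddef.h>

/* Suspiciousness of a compound statement s_comp of P, defined as the
 * maximum score among its expansion s'_1 ... s'_n in P' (Section 2.2.2).
 * Assumes n >= 1.  *which receives the index of the maximizing atomic
 * statement, so the tool can also report s'_m to the user. */
static double map_score(const double expanded[], size_t n, size_t *which)
{
    double best = expanded[0];
    size_t best_i = 0;
    for (size_t i = 1; i < n; i++)
        if (expanded[i] > best) { best = expanded[i]; best_i = i; }
    if (which) *which = best_i;
    return best;
}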
3. Fault Weight Based Localization Metric

3.1. Background

Coverage-based fault localization techniques calculate suspiciousness scores of coverage entities such as statements and branches by analyzing the difference between passing test runs and failing test runs. High suspiciousness of a statement indicates that the statement is likely to be faulty. For example, Tarantula [3] calculates the suspiciousness score of a statement s by using the following metric:

$$susp_{Taran}(s) = \frac{|failing(s)|/|T_f|}{|failing(s)|/|T_f| + |passing(s)|/|T_p|}$$

where T_f is the set of failing test cases and failing(s) is the set of failing test cases that execute s (T_p and passing(s) are defined similarly). Similarly, the suspiciousness metrics of Ochiai [4] and Op2 [5] are as follows:

$$susp_{Ochiai}(s) = \frac{|failing(s)|}{\sqrt{|T_f| \times (|failing(s)| + |passing(s)|)}}$$

$$susp_{Op2}(s) = |failing(s)| - \frac{|passing(s)|}{|T_p| + 1}$$

Figure 2 shows an example of how the above metrics localize a fault. Let us assume that we have six test cases (tc1 to tc6) and line 3 is the faulty statement, since it should be x=x+2. The statements executed by each test case are marked with a black dot in the figure. In the figure, tc1 is the only failing test case, since example() with tc1 will return 3, while it would return 1 if there were no fault (i.e., if line 3 were x=x+2). All other test cases tc2 to tc6 are passing test cases, since example() with these test cases returns a correct value (i.e., 1 with tc2 to tc4 and 2 with tc5 and tc6) in spite of the fault. Thus, tc2, tc3, and tc4 are coincidentally correct test cases (CCTs), since they are passing test cases that execute line 3, the faulty statement.

Figure 2: Example of coverage-based fault localization

With the suspiciousness metric of Tarantula, line 7 (ret=ret+2) has the highest suspiciousness score (i.e., $\frac{1/1}{1/1 + 2/5} = 0.71$) and is marked as rank 1, while the faulty line (line 3) has $\frac{1/1}{1/1 + 3/5} = 0.63$ as its suspiciousness score and is marked as rank 3. This is because more passing test cases execute line 3 (i.e., tc2, tc3, and tc4) than line 7 (i.e., tc5 and tc6). For Tarantula, Ochiai, and Op2, CCTs such as tc2 to tc4 lower the suspiciousness of the faulty statement, since they increase |passing(s)| in the denominator of the suspiciousness metric (Tarantula and Ochiai) or in the negative term of the metric (Op2). Thus, fault localization techniques like Tarantula, Ochiai, and Op2 become imprecise as there are more CCTs. Our new fault localization technique (i.e., FIESTA) alleviates this problem by considering fault weights on test cases.
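All three metrics are functions of four per-statement counts. The C sketch below (the variable names are ours) evaluates them and reproduces the Tarantula scores of lines 3 and 7 from Figure 2; compile with -lm for sqrt.

#include <math.h>
#include <stdio.h>

/* f = |failing(s)|, p = |passing(s)|, F = |Tf|, P = |Tp| */
static double tarantula(double f, double p, double F, double P) {
    return (f / F) / (f / F + p / P);
}
static double ochiai(double f, double p, double F, double P) {
    (void)P;  /* Ochiai does not use |Tp| */
    return f / sqrt(F * (f + p));
}
static double op2(double f, double p, double F, double P) {
    (void)F;  /* Op2 does not use |Tf| */
    return f - p / (P + 1.0);
}

int main(void) {
    /* Figure 2: |Tf| = 1, |Tp| = 5; line 3 is covered by 1 failing and
     * 3 passing tests, line 7 by 1 failing and 2 passing tests. */
    printf("Tarantula line 3: %.3f\n", tarantula(1, 3, 1, 5)); /* 0.625 */
    printf("Tarantula line 7: %.3f\n", tarantula(1, 2, 1, 5)); /* 0.714 */
    printf("Ochiai    line 3: %.3f\n", ochiai(1, 3, 1, 5));
    printf("Op2       line 3: %.3f\n", op2(1, 3, 1, 5));
    return 0;
}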
3.2. Fault Weights on Test Cases

We define a fault weight w_f on a test case t, which indicates the "likelihood" of t to execute a faulty statement. For a failing test case t_f, w_f(t_f) is defined as 1. For a passing test case t, the fault weight w_f(t) is defined as follows:

$$w_f(t) = \frac{\sum_{t_f \in T_f} \frac{|stmt(t) \cap stmt(t_f)|}{|stmt(t_f)|}}{|T_f|}$$

where stmt(t) is the set of statements executed by a test case t and T_f is the set of failing test cases. w_f(t) represents the degree of overlap between the statements executed by t (i.e., stmt(t)) and the statements executed by a failing test case t_f ∈ T_f (i.e., stmt(t_f)), averaged over T_f. Thus, a high w_f(t) score indicates that t is likely to execute a faulty statement.^3

For the example of Figure 2, where T_f = {tc1}, we can calculate the fault weights of the six test cases as follows:

• w_f(tc1) = 1, since tc1 is a failing test case.
• w_f(tc2) = |stmt(tc2) ∩ stmt(tc1)| / |stmt(tc1)| = 6/7 = 0.86
• w_f(tc3) = w_f(tc4) = w_f(tc2) = 0.86, since stmt(tc3) = stmt(tc4) = stmt(tc2).
• w_f(tc5) = |stmt(tc5) ∩ stmt(tc1)| / |stmt(tc1)| = 5/7 = 0.71
• w_f(tc6) = w_f(tc5) = 0.71, since stmt(tc6) = stmt(tc5).

The fault weights (i.e., 0.86) of the CCTs (i.e., tc2, tc3, and tc4) are higher than the fault weights (i.e., 0.71) of the TCTs (i.e., tc5 and tc6), which are passing test cases that do not execute a faulty statement. If w_f(t_p) is high for a passing test case t_p, t_p is likely a CCT (see Figure 3), and we can build a precise suspiciousness metric that gives high suspiciousness scores to the statements executed by such t_p.

^3 The fault weight can be considered a 'distance' between a given test case and all failing test cases on average. Other CFL techniques utilize different 'distance' metrics for a similar purpose (see Section 7.2).

3.3. Suspiciousness Metric of FIESTA

We define a new suspiciousness metric susp_F on a statement s using the fault weights on test cases as follows:

$$susp_F(s) = \left(\frac{|failing(s)|}{|T_f|} + \frac{\sum_{t \in test(s)} w_f(t)}{|test(s)|}\right) / 2$$

where test(s) is the set of test cases that execute s (i.e., test(s) = passing(s) ∪ failing(s)). The left term |failing(s)|/|T_f| in the metric increases the suspiciousness score if s is executed by more failing test cases.^4 The right term $\sum_{t \in test(s)} w_f(t) / |test(s)|$ increases the suspiciousness score if s is executed by test cases with higher fault weights. Finally, to normalize the metric value within the range 0 to 1, the sum of the left term and the right term is divided by 2.

For example, with the example of Figure 2, the new fault localization metric calculates the suspiciousness of line 3 as 0.95 and marks line 3 (and line 4 together with it) as the first rank:

$$\left(\frac{1}{1} + \frac{w_f(tc1) + w_f(tc2) + w_f(tc3) + w_f(tc4)}{4}\right) / 2 = 0.95$$

Note that the suspiciousness metric of FIESTA is more precise than those of Tarantula, Ochiai, and Op2 for this example, because FIESTA utilizes the fault weights to increase the suspiciousness scores of the statements executed by CCTs.

^4 The left term may make FIESTA localize a fault less precisely on a target program with multiple faults spanning multiple lines than on one with a single-line fault, which is common to most CFL techniques. However, Digiuseppe et al. [12] showed that CFL techniques remain effective on target programs with multiple faults (i.e., in the situation where different faults interfere with each other's localizability) as the number of faults increases. In addition, there are several techniques (i.e., Liu et al. [13], Zheng et al. [14], and Jones et al. [15]) to cluster test cases targeting an individual fault to reduce interference among multiple faults.
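Both definitions can be computed directly from a coverage matrix. The C sketch below is ours: the matrix is not Figure 2 itself but a hypothetical one with the same overlap sizes quoted in the text (the failing test covers 7 statements; one passing test shares 6 of them, another shares 5), so it reproduces the weights 0.86 and 0.71.

#include <stdio.h>

enum { T = 3, S = 7 };          /* hypothetical: 3 tests, 7 statements */

/* cov[t][s] = 1 iff test t executes statement s. */
static const int cov[T][S] = {
    {1, 1, 1, 1, 1, 1, 1},      /* "tc1": failing, covers 7 statements */
    {1, 1, 1, 1, 1, 1, 0},      /* "tc2": passing, shares 6 with tc1   */
    {1, 1, 1, 1, 1, 0, 0},      /* "tc5": passing, shares 5 with tc1   */
};
static const int failing[T] = {1, 0, 0};

/* w_f(t), Section 3.2: 1 for a failing test; otherwise the average over
 * failing tests tf of |stmt(t) ∩ stmt(tf)| / |stmt(tf)|. */
static double fault_weight(int t) {
    if (failing[t]) return 1.0;
    double sum = 0.0;
    int nf = 0;
    for (int tf = 0; tf < T; tf++) {
        if (!failing[tf]) continue;
        int common = 0, total = 0;
        for (int s = 0; s < S; s++) {
            total  += cov[tf][s];
            common += cov[tf][s] && cov[t][s];
        }
        sum += (double)common / total;
        nf++;
    }
    return sum / nf;
}

/* susp_F(s), Section 3.3: (|failing(s)|/|Tf| + mean w_f over test(s)) / 2.
 * Assumes s is covered by at least one test. */
static double susp_fiesta(int s) {
    int fs = 0, Tf = 0, ncover = 0;
    double wsum = 0.0;
    for (int t = 0; t < T; t++) {
        if (failing[t]) Tf++;
        if (cov[t][s]) {
            ncover++;
            wsum += fault_weight(t);
            if (failing[t]) fs++;
        }
    }
    return ((double)fs / Tf + wsum / ncover) / 2.0;
}

int main(void) {
    printf("w_f(tc2) = %.2f\n", fault_weight(1));  /* 6/7 = 0.86 */
    printf("w_f(tc5) = %.2f\n", fault_weight(2));  /* 5/7 = 0.71 */
    printf("susp_F(statement 0) = %.2f\n", susp_fiesta(0));
    return 0;
}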
4. Empirical Study Setup

We designed the following four research questions to study the effectiveness of the transformation technique and the fault weight metric as well as the precision of FIESTA:

RQ1: How effective is the transformation of a target program, in terms of the reduced number of CCTs?

We evaluate the effectiveness of the debuggability transformation by measuring both the number of CCTs for a target program P and the number of CCTs for the transformed target program P'.

RQ2: How accurate is the fault weight metric of FIESTA on test cases, in terms of the probability that a passing test case with a high fault weight is a CCT?

We have to check that the fault weight metric is accurate enough to improve the precision of fault localization. We evaluate the accuracy of the fault weight metric by calculating the probability that a passing test case whose fault weight is larger than 0.9 is a CCT, over all versions of the target programs.

RQ3: How precise is FIESTA, specifically compared to Tarantula, Ochiai, and Op2, in terms of the % of executed statements examined to localize a fault?

We have to show how precise FIESTA is by measuring the % of executed statements examined to localize a fault on the target benchmark programs on average. The precision of fault localization depends on multiple factors including the target program, test cases, and faults in the target program. Thus, we have to perform experiments with various settings of such factors carefully.

RQ4: How effective is the transformation of a target program, in terms of the improved precision of Tarantula, Ochiai, Op2, and FIESTA?

The ultimate goal of the debuggability transformation is to improve the precision of fault localization techniques by reducing the number of CCTs. Thus, we should measure how much the debuggability transformation improves the precision of the fault localization techniques (i.e., Tarantula, Ochiai, Op2, and FIESTA) by comparing the precision of these techniques with and without the debuggability transformation.

Target programs   LOC      # of faulty   # of provided   Description
                           ver. used     test cases
print tokens      563      7             4130            lexical analyzer
print tokens2     510      9             4115            lexical analyzer
replace           563      30            5542            pattern replacement
schedule          412      5             2650            priority scheduler
schedule2         307      9             2710            priority scheduler
tcas              173      40            1608            altitude separation
tot info          406      23            1052            information measure
flex              12423    16            567             lexical analyzer generator
grep              12653    4             809             pattern matcher
gzip              6576     5             214             file compressor
sed               11990    3             360             stream editor
space             9564     28            13585           ADL interpreter
Average           4678.33  14.92         3111.83

Table 1: Experiment objects

To answer these research questions, we have performed a series of experiments by applying Tarantula, Ochiai, Op2, and FIESTA to the 12 target programs. We describe the details of the experiment setup in the following subsections.

4.1. Objects of Analysis

As objects of the empirical study, we used the seven Siemens programs (419.1 LOC on average) and five non-trivial real-world programs (flex 2.4.7, grep 2.2, gzip 1.1.2, sed 1.18, and space) in the SIR benchmark suite [16] (10641.2 LOC on average). Among the faulty versions of those programs, we excluded faulty versions for which no test case or only one test case fails. Table 1 describes the target programs and the number of their faulty versions on which we conducted fault localization experiments. The target programs have 154 faulty versions each of which has a single fault (only one line differs from the non-faulty version) and 25 faulty versions each of which has multiple faults (more than one line differs from the non-faulty version). In addition, we used all test cases in the SIR benchmark except the ones that crash a target program, since we used gcov to measure coverage information and gcov cannot measure coverage information if the target program crashes.

4.2. Variables and Measures

To address our research questions, our experiments manipulated two independent variables:

IV1: Debuggability transformation. We examined the reduced number of CCTs and the increased precision of the four fault localization techniques through the debuggability transformation (regarding RQ1 and RQ4).

IV2: Fault localization technique. We examined four fault localization techniques, Tarantula, Ochiai, Op2, and FIESTA, each of whose precision is evaluated (regarding RQ3 and RQ4).

We have the following dependent variables that are measured through the experiments to answer the research questions:

DV1: Numbers of CCTs for the original target program P and for the transformed target program P'. To measure how much the debuggability transformation technique reduces the number of CCTs, we measured the number of CCTs for P and P' (regarding RQ1). If the transformation technique is effective, the number of CCTs will decrease significantly.

DV2: Numbers of CCTs and TCTs in a given range of fault weight values. To measure the accuracy of the fault weight metric, we equally divided the range of fault weight values (i.e., [0, 1]) into 10 partitions and measured the number of CCTs and TCTs whose weights fall into each partition (regarding RQ2; a small counting sketch follows this list). If the fault weight metric is effective, test cases with higher fault weights will be more likely to be CCTs.

DV3: The percentage of executed statements examined until a fault is identified. We measured the % of executed statements examined to localize a fault as a measurement of the precision of Tarantula, Ochiai, Op2, and FIESTA (regarding RQ3 and RQ4). The precision of a fault localization technique inversely correlates with this percentage.
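The DV2 measurement is a simple histogram. A minimal C sketch (the input arrays are assumed to come from the coverage analysis; the names are ours) that partitions [0, 1] into the 10 bins used for Figure 3 and counts CCTs and TCTs per bin:

#include <stdio.h>

/* Count passing tests per fault-weight bin, split into CCTs and TCTs.
 * weights[i] is the fault weight of passing test i; is_cct[i] says
 * whether test i executes the faulty statement (both assumed inputs). */
static void bin_weights(const double *weights, const int *is_cct, int n,
                        int cct[10], int tct[10]) {
    for (int b = 0; b < 10; b++) cct[b] = tct[b] = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)(weights[i] * 10.0);
        if (b > 9) b = 9;               /* weight 1.0 falls in [0.9, 1] */
        if (is_cct[i]) cct[b]++; else tct[b]++;
    }
}

int main(void) {
    const double w[] = {0.95, 0.92, 0.35, 0.71};  /* toy data */
    const int    c[] = {1,    1,    0,    0};
    int cct[10], tct[10];
    bin_weights(w, c, 4, cct, tct);
    for (int b = 0; b < 10; b++)
        if (cct[b] + tct[b])
            printf("[%.1f, %.1f): %d CCTs, %d TCTs\n",
                   b / 10.0, (b + 1) / 10.0, cct[b], tct[b]);
    return 0;
}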
4.3. Experiment Setup

We implemented Tarantula, Ochiai, Op2, and FIESTA in 2600 lines of C++ code with CIL 1.3.7. The target programs are compiled and instrumented at the statement level by using gcc. After each execution of the instrumented object program finishes, we use gcov to get the statement coverage of the program execution on the given test case. Our tools parse the statement coverage reported by gcov for each test case and construct a table that shows which statements are covered by which test cases (see Figure 2). In our experiments, we consider only executed statements that can be tracked by gcov. Each executed statement is given a suspiciousness score according to the suspiciousness metrics of Tarantula, Ochiai, Op2, and FIESTA and is ranked in descending order of the suspiciousness scores. Statements with the same suspiciousness score are assigned the lowest rank those statements can have, since we assume that the developer must examine all of the statements with the same suspiciousness to find faulty statements.
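Under this tie rule, the rank charged for a statement equals the number of statements scoring at least as high. A minimal C sketch of that reading (our own, with assumed arrays):

#include <stdio.h>

/* Rank of statement s under the tie rule described above: all
 * statements with the same score share the worst rank among them, so
 * the rank is the number of statements whose score is >= score[s]. */
static int rank_with_ties(const double *score, int n, int s) {
    int r = 0;
    for (int i = 0; i < n; i++)
        if (score[i] >= score[s]) r++;
    return r;
}

int main(void) {
    const double score[] = {0.95, 0.95, 0.70, 0.95, 0.50};
    /* all three 0.95-statements get rank 3, not ranks 1..3 */
    printf("rank of stmt 0 = %d\n", rank_with_ties(score, 5, 0));
    return 0;
}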
All experiments were performed on 5 machines equipped with Intel i5 3.6 GHz CPUs and 8 gigabytes of memory, running Debian 6.05.

4.4. Threats to Validity

The primary threat to external validity for our study involves the representativeness of our object programs, since we have examined only 12 C programs. However, since the target programs include various faulty versions of five real-world target programs with test suites to support controlled experimentation, and the target programs are widely used in fault localization research, we believe that this threat to external validity is limited. The second threat involves the representativeness of the coverage-based fault localization (CFL) techniques that we applied (i.e., Tarantula, Ochiai, Op2, and FIESTA). However, since Tarantula and Ochiai are considered representative CFL techniques in the literature and Op2 is known for its high precision [17], we believe that this threat is also limited. The third possible threat is the assumption that we already have many test cases when applying CFL techniques to a target program. However, since we can generate many test cases using automated test generation techniques (i.e., random testing, concolic testing, or test generation strategies for fault localization [18, 19, 20]), this threat is limited too.

A primary threat to internal validity is possible faults in the tools that implement Tarantula, Ochiai, Op2, and FIESTA. We controlled this threat through extensive testing of our tools.

5. Results of the Experiments

We have applied our implementations of Tarantula, Ochiai, Op2, and FIESTA to the 179 versions of the 12 programs (the whole set of experiments took around three hours using the 5 machines). From the data obtained from this series of experiments, we answer the four research questions in the following subsections.

5.1. Regarding RQ1: Reduced Number of CCTs through the Debuggability Transformation

Table 2 describes statistics on the reduced number of CCTs through the transformation for the target program versions that have faulty compound conditional statements (see Section 2).^5 Among the total 179 versions of the target programs, only 42 (= 3+11+4+20+1+3) versions have faulty compound conditional statements (see the second column of Table 2). For these 42 versions, the number of CCTs decreases by 28.73% on average (see the last column of Table 2).

Target      # of faulty    Avg. # of      Avg. # of      Avg. # of CCTs           Reduced
programs    ver. with      faulty comp.   lines of the   Original     Transformed  number of
            faulty comp.   stmt. per      expanded       program P    program P'   CCTs (%)
            stmt.          version        faulty stmt.
ptok2       3              1.00           2.67           1947.33      1897.33      2.57
replace     11             1.00           5.27           1647.55      1086.09      34.08
schedule2   4              1.00           4.00           2618.75      2166.25      17.28
tcas        20             1.20           5.38           1049.37      716.68       31.70
tot info    1              1.00           2.00           844.00       806.00       4.50
flex        3              1.00           2.67           330.67       58.67        82.26
Average     7.0            1.03           3.66           1406.28      1121.84      28.73

Table 2: Reduced number of CCTs through the debuggability transformation

^5 For a target program version that does not contain a faulty compound conditional statement, the number of CCTs does not change for P'.

For example, three versions among the 16 faulty versions of flex have 1.00 faulty compound conditional statements each on average (see the first to third columns of the flex row of Table 2). In addition, each such faulty compound statement is transformed into 2.67 statements on average (see the fourth column of the flex row of Table 2), and the number of CCTs decreases from 330.67 for P to 58.67 for P' on average (i.e., 82% of the CCTs are removed) (see the fifth to seventh columns of the table).

Note that bug studies in the literature show that incorrect conditional statements are often the causes of bugs. For example, the causes of 22% of the bugs in an operating system at IBM [21] and 25% of the bugs in 12 open-source programs [22] resided in incorrect conditional statements. Therefore, we can conclude that the debuggability transformation of FIESTA, which targets incorrect compound conditional statements, effectively reduces the number of CCTs.

5.2. Regarding RQ2: Accuracy of the Fault Weight Metric on Test Cases

Table 3 describes statistics on the test cases used to localize a fault in the target programs. The average number of CCTs (i.e., 1111.5) is similar to that of TCTs (i.e., 1717.3) (see the third column and the fifth column of the last row of the table).^6
^6 Note that the numbers of CCTs may differ across versions of a target program P due to the different faulty lines in the different versions of P.

The average fault weight of CCTs (i.e., 0.86) is higher than that of TCTs (i.e., 0.66) (see the fourth column and the sixth column of the last row of the table).

Target      Failing      CCTs                    TCTs                    Total test cases used
programs    avg. # of    avg. # of  avg. fault   avg. # of  avg. fault   avg. # of  avg. fault
            test cases   test cases weight       test cases weight       test cases weight
ptok        69.1         1960.1     0.75         2100.7     0.69         4130.0     0.72
ptok2       229.3        1187.9     0.83         2697.8     0.76         4115.0     0.80
replace     100.7        2218.8     0.79         3222.5     0.65         5542.0     0.72
schedule    142.8        1431.2     0.96         1069.0     0.64         2643.0     0.83
schedule2   29.0         2518.9     0.90         153.6      0.50         2701.4     0.87
tcas        40.6         852.6      0.90         714.8      0.62         1608.0     0.78
tot info    83.7         680.1      0.85         288.1      0.50         1052.0     0.77
flex        255.7        100.1      0.83         183.2      0.73         538.9      0.88
grep        241.5        36.0       0.87         486.8      0.67         764.3      0.78
gzip        47.6         37.4       0.96         128.0      0.81         213.0      0.91
sed         48.7         13.0       0.86         234.0      0.72         295.7      0.77
space       1954.2       2302.1     0.83         9328.6     0.67         13585.0    0.74
Average     270.3        1111.5     0.86         1717.3     0.66         3099.0     0.80

Table 3: Fault weights of test cases for the target programs

Figure 3 shows the ratios of CCTs and TCTs for the 179 versions of the 12 target programs according to their fault weight values. Figure 3 shows that passing test cases with high weights are more likely to be CCTs. For example, the set of passing test cases whose weights are between 0.9 and 1 (mean value 0.95) consists of 86.4% CCTs and 13.6% TCTs (see the small circle at the top right part of Figure 3). In other words, the probability that a passing test case with a weight larger than 0.9 is a CCT is 86.4%. In contrast, the probability that a passing test case with a fault weight between 0.3 and 0.4 (mean value 0.35) is a CCT is only 7.1% (see the small circle at the bottom of Figure 3).

Figure 3: Ratios of CCTs and TCTs for different fault weights

Therefore, we can conclude that the proposed fault weight metric is accurate enough to distinguish CCTs from TCTs approximately. Consequently, FIESTA, which utilizes the fault weight metric, can improve the precision of fault localization by reducing the negative effect of CCTs.

5.3. Regarding RQ3: Precision of FIESTA

Table 4 enumerates the precision of fault localization on the target programs using Tarantula, Ochiai, Op2, and FIESTA. We mark the most precise result in each row with an asterisk (*).

Target programs    % of executed stmts examined
                   Tarantula  Ochiai   Op2      FIESTA
print tokens       27.45      19.02    10.41*   11.20
print tokens2      12.88       7.47     1.62*    2.31
replace             9.19       6.98     5.44*    7.74
schedule            5.95       3.51*   17.17    10.45
schedule2          52.85      46.83    40.48    28.55*
tcas               28.31      28.62    27.14    21.17*
tot info           27.30      21.31    13.17*   16.11
flex               29.55      12.51*   13.54    12.84
grep               21.71       1.68     1.23*    1.23*
gzip               17.15       7.91     7.91     7.49*
sed                 4.94       1.32     1.29*    1.29*
space               5.83       2.80*    4.90     3.70
Average            20.26      13.33    12.03    10.34*

Table 4: % of executed statements examined to localize a fault in the target programs (i.e., precision)

For the target programs, FIESTA localizes a fault after reviewing 10.34% of the executed statements on average (the last column of the last row of the table). The comparison with Tarantula, Ochiai, and Op2 is as follows: FIESTA is 49% (= (20.26−10.34)/20.26), 22% (= (13.33−10.34)/13.33), and 14% (= (12.03−10.34)/12.03) relatively more precise than Tarantula, Ochiai, and Op2 on average, respectively (see the last row of Table 4).
In addition, FIESTA localizes a fault more precisely than Tarantula in all target programs except schedule. With the exception of replace, schedule, flex, and space, FIESTA localizes a fault more precisely than Ochiai. Compared to Op2, FIESTA localizes a fault more precisely in six programs (i.e., schedule, schedule2, tcas, flex, gzip, and space) and localizes a fault with the same precision in grep and sed. Furthermore, in the five real-world programs, FIESTA localizes a fault more precisely than or equally precisely to Op2.

Also, Figure 4 shows that FIESTA is more precise than Tarantula, Ochiai, and Op2 in a different respect. Figure 4 illustrates how many target program versions (y axis) have their faults detected by examining a given % of executed statements of the target program (x axis). As shown in Figure 4, FIESTA always detects a fault in more target versions than Tarantula does after examining the same number of target statements. FIESTA also always detects a fault in more target versions than Ochiai and Op2 after examining the same number of target statements, except when the % of executed statements examined is between 8-12% and 4-20% for Ochiai and Op2, respectively. For example, by examining 2% of executed statements, FIESTA can find faults in 37% of all 179 target program versions (i.e., 67 (= 179 × 37%) versions) while Tarantula, Ochiai, and Op2 can find faults in 22% (40 versions), 31% (56 versions), and 33% (59 versions) of the target program versions, respectively (see the small circles at the left part of the figure).

Figure 4: Accumulated % of target versions whose faults are localized after examining a certain amount of target statements

5.4. Regarding RQ4: Improved Precision due to the Debuggability Transformation

Table 5 shows how much the debuggability transformation improves the precision of the four CFL techniques. The table shows that the debuggability transformation relatively increases the precision of Tarantula, Ochiai, Op2, and FIESTA by 4.62%, 6.51%, 6.11%, and 7.72% on average, respectively (see the last row of the table). For example, for replace, FIESTA without the transformation finds a fault after reviewing 10.88% of the executed statements, while FIESTA with the transformation does so after reviewing 7.74% of the executed statements, which is a 28.82% (= (10.88−7.74)/10.88) relative improvement in precision (see the replace row of Table 5).
            Tarantula              Ochiai                 Op2                    FIESTA
Target      w/o    w/     Rel.    w/o    w/     Rel.    w/o    w/     Rel.    w/o    w/     Rel.
programs    Trans. Trans. Imprv.  Trans. Trans. Imprv.  Trans. Trans. Imprv.  Trans. Trans. Imprv.
                          (%)                   (%)                    (%)                    (%)
p tokens    27.45  27.69  -0.86   19.02  19.02   0.00   10.41  10.41   0.00   11.20  11.20   0.00
p tokens2   12.88  13.45  -4.48    7.47   7.57  -1.42    1.62   1.68  -3.21    2.41   2.31   4.29
replace      9.19   8.54   7.15    6.98   6.34   9.21    5.44   4.69  13.90   10.88   7.74  28.82
schedule     5.95   6.49  -9.11    3.51   3.65  -3.85   17.17  17.30  -0.79   10.32  10.45  -1.27
schedule2   52.85  32.37  38.75   46.83  29.04  37.99   40.48  26.27  35.09   43.01  28.55  33.62
tcas        28.31  19.10  32.53   28.62  17.50  38.86   27.14  19.65  27.61   27.29  21.17  22.44
tot info    27.30  27.84  -2.00   21.31  21.89  -2.71   13.17  13.41  -1.80   15.76  16.11  -2.22
flex        29.55  28.31   4.21   12.51  11.51   8.01   13.54  12.72   6.02   13.75  12.84   6.67
grep        21.71  22.11  -2.30    1.68   1.75  -4.30    1.23   1.23   0.00    1.23   1.23   0.00
gzip        17.15  17.50  -2.04    7.91   7.91   0.00    7.91   7.91   0.00    7.51   7.49   0.30
sed          4.94   5.25  -6.11    1.32   1.36  -3.41    1.29   1.33  -3.49    1.29   1.29   0.00
space        5.86   5.88  -0.32    2.80   2.81  -0.24    4.90   4.90  -0.05    3.70   3.70   0.00
Average     20.26  17.89   4.62   13.33  10.86   6.51   12.03  10.13   6.11   12.36  10.34   7.72

Table 5: Effect of the transformation technique on the precision of Tarantula, Ochiai, Op2, and FIESTA ("w/o Trans." and "w/ Trans." give the % of executed statements examined without and with the transformation; "Rel. Imprv." gives the relative improvement)

Note that the transformation improves the precision considerably for the target versions that contain faulty compound conditional statements. Table 6 shows the improved precision on such versions only (i.e., 42 out of 179 versions; see the second column of Table 2). For these versions of the target programs, the debuggability transformation relatively increases the precision of Tarantula, Ochiai, Op2, and FIESTA significantly, by 35.95%, 39.72%, 43.00%, and 47.16%, respectively (see the last row of Table 6).^7

            Tarantula              Ochiai                 Op2                    FIESTA
Target      w/o    w/     Rel.    w/o    w/     Rel.    w/o    w/     Rel.    w/o    w/     Rel.
programs    Trans. Trans. Imprv.  Trans. Trans. Imprv.  Trans. Trans. Imprv.  Trans. Trans. Imprv.
                          (%)                   (%)                    (%)                    (%)
p tokens2   20.03  20.97  -4.69   10.64  10.49   1.47    1.72   1.72   0.00    1.88   1.41  25.00
replace      7.13   4.09  42.66    5.84   3.43  41.36    4.21   1.87  55.59    8.43   2.57  69.49
schedule2   84.41  35.60  57.82   77.29  34.89  54.85   65.22  31.45  51.78   65.41  31.08  52.48
tcas        38.72  16.08  58.47   39.89  14.91  62.63   37.24  20.85  44.01   37.94  24.05  36.62
tot info    42.37  34.75  18.00   24.58  20.34  17.24   13.56   8.47  37.50   14.41  10.17  29.41
flex        19.38  10.96  43.44    9.26   3.63  60.76    6.69   2.07  69.12    7.14   2.14  69.97
Average     35.34  20.41  35.95   27.92  14.61  39.72   21.44  11.07  43.00   22.53  11.90  47.16

Table 6: Effect of the transformation technique on the precision of Tarantula, Ochiai, Op2, and FIESTA for versions containing faulty compound conditional statements

^7 For the 137 versions that have no faulty compound conditional statements, the transformation relatively decreases the precision of Tarantula, Ochiai, Op2, and FIESTA by 5.53%, 3.96%, 2.46%, and 0.76%, respectively.

Therefore, if we can generalize the experiment results, we can conclude that the debuggability transformation technique can improve the precision of CFL techniques in general.

6. Limitations

Although FIESTA detected a fault by examining 10.34% of the executed statements on average (see Table 4), there are 8 versions out of the total 179 versions of the target programs for which FIESTA examined more than 50% of the executed statements to localize a fault.
We can classify these 8 versions into the following four categories according to the reasons for the low precision:

• A large number of statements covered together (one version)
• Correct statements executed by only failing test cases and CCTs (one version)
• A fault in a large basic block (three versions)
• Narrow distribution of the suspiciousness scores (three versions)

Through a detailed analysis of these versions, we found that these limitations are common to most coverage-based fault localization techniques, not specific to FIESTA. For example, the precision of FIESTA for these versions is 62.99% on average, and the precision of Tarantula, Ochiai, and Op2 for these versions is 74.40%, 67.26%, and 53.66% on average, respectively.

6.1. Large Number of Test Cases that Cover the Same Statements

The precision of FIESTA for flex version 8 was 52.11%. A main reason for this imprecise result is that more than half of the executed statements are covered by the same set of test cases. Specifically, 396 statements (52.11% of the executed statements), including the faulty statements, are covered together by each of 553 test cases (out of the total 567 test cases), so those 396 statements have the same suspiciousness score. Although the suspiciousness score of the faulty statement is the highest, the % of code to examine is more than 50%, since the 396 statements that have the same suspiciousness score must be examined together. This problem can be solved by constructing test cases that execute different execution paths.

6.2. Correct Statements Executed by Only Failing Test Cases and CCTs

The precision of FIESTA for tot info version 10 was 51.75%. An important reason is that this version has statements that are executed only by failing test cases and CCTs. Figure 5 shows the case of tot info version 10.

Figure 5: Exceptional case where a correct statement S1 is executed by only failing test cases and CCTs

S1 is a correct statement that has a higher rank than the faulty statement S2 for the following reason. We found that all 8 failing test cases execute both S1 and S2, that there are 789 CCTs that execute S2 (121 CCTs execute both S1 and S2 and 668 CCTs execute S2 only), and that passing(S1) ⊂ passing(S2) (i.e., S1 is not executed by TCTs). In this situation, the 668 CCTs have lower fault weights on average (i.e., 0.84) than the 121 CCTs (i.e., 0.92), because the 668 CCTs execute fewer of the statements that are executed by the failing test cases than the 121 CCTs do. Thus, S1 has a higher suspiciousness score than S2. Note that this case is exceptional and problematic for most coverage-based fault localization techniques such as Tarantula, Ochiai, and Op2. For example, the precision of Tarantula, Ochiai, and Op2 for this version is 60.53%, 58.77%, and 51.75%, respectively.

6.3. Fault in a Large Basic Block

tcas versions 3, 12, and 34 have a fault in the initial basic block of the main() function. This basic block consists of 26 statements, which are more than 40% of the executed statements of tcas (i.e., these 26 statements are always executed together). Although the suspiciousness of the fault is 1 (i.e., the highest score), the % of executed statements to examine is at least 40%, since the 26 statements in the same basic block must be examined together.

6.4. Narrow Distribution of the Suspiciousness Scores

For the remaining cases, we found that the difference between the suspiciousness score of a faulty statement and the high scores of non-faulty statements is negligibly small.
In other words, although the suspiciousness score of a faulty statement is very high, the suspiciousness rank of the faulty statement can be low, because there can be many non-faulty statements whose suspiciousness scores are higher than that of the faulty statement by a very small margin (i.e., a difference of less than 0.03). Table 7 shows that the percentages of executed statements examined by FIESTA in schedule2 versions 7 and 10 and tot info version 21 are 58.87% on average (see the third column of the last row in Table 7). However, the differences between the suspiciousness scores of the faulty statements and the highest suspiciousness scores of non-faulty statements are very small (i.e., 0.014 on average). For example, for schedule2 version 7 (see the first data row of Table 7), the faulty statement has a very high suspiciousness score, almost the same as the top score, but there are 78 non-faulty statements with slightly higher suspiciousness scores, which makes the % of executed statements examined high (i.e., 61.59%).

Target program  Version  % of exec.       Highest  Susp. of    Difference
                         stmts examined   susp.    the fault
schedule2       7        61.59            0.961    0.945       0.016
schedule2       10       51.85            0.962    0.942       0.020
tot info        21       63.16            0.919    0.914       0.005
Avg.                     58.87            0.947    0.934       0.014

Table 7: Differences between the suspiciousness scores of the faulty statements and high-ranked non-faulty statements

7. Related Work

Wong et al. [23] and Alipour [24] survey various fault localization approaches including program state-based methods [2], machine learning-based fault localization methods [25], coverage-based fault localization methods, and so on. Since coverage-based fault localization techniques utilize only the test coverage of a target program, we can use CFL together with other fault localization techniques, such as program slicing [26], which utilizes data dependencies, or delta debugging [2], which utilizes execution states, in a complementary way. For example, we can use FIESTA together with JINSI [27], which employs delta debugging and program slicing, and locate suspicious statements pointed to by both techniques to improve precision. In this section, we focus on reviewing related work on coverage-based fault localization techniques.

7.1. Coverage-based Fault Localization (CFL) Techniques

Tarantula [3] and Ochiai [4] are popular CFL techniques that use statement coverage and test execution results to compute the suspiciousness of each statement (see Section 3.1). Op2 [5] is a relatively new CFL technique that is known to be more precise than Tarantula and Ochiai [17]. Dallmeier et al. [28] collect execution data of a Java program and analyze method call sequences to identify a fault. Santelices et al. [29] extend Tarantula by targeting not only statements but also branches and du-pairs. Their technique maps branches and du-pairs to corresponding statements and calculates the average suspiciousness scores of these statements to improve the precision of fault localization. Wong et al. [30] consider the 'contribution' of a test case when calculating the suspiciousness of a statement, based on the idea that a test case that executes a target statement s later should contribute less to the suspiciousness score of s than a test case that executed s earlier. Yoo [31] mechanically generates and evaluates various suspiciousness formulas through genetic programming over five variables, such as |passing(s)| and |failing(s)|, that are used in the suspiciousness formulas of popular CFL techniques such as Tarantula, Ochiai, and Op2.
Agrawal et al. [32] compute a 'dice', which is a set of statements that are executed by a failing test case but not by a passing test case and are suspected to contain a fault. Renieres et al. [33] extend Agrawal et al. [32] and compare the spectra of passing test runs and a failing test run to select the passing test run that has the shortest distance to the given failing run. Baah et al. [34, 35] utilize dynamic data dependencies to improve the precision of CFL through a causal-inference technique.^8 Gopinath et al. [19] improve the precision of CFL techniques by giving high suspiciousness to the statements that are pointed to by both a SAT-based fault localization technique [36] and Tarantula [3].

^8 The precision of FIESTA is higher than that of Baah et al. [34, 35]: the positive improvement of FIESTA over Ochiai is 10.79% while that of Baah et al. [34, 35] is 7.77%, although the target subjects are not identical.

7.2. Techniques to Solve CCT Problems

Masri et al. [9] present an approach that removes CCTs by using a clustering technique. The technique selects a set of suspicious statements (called CCEs) that are executed by all failing test cases and by a number of passing test cases (the number is given by the user as a threshold). The technique then clusters the test cases into two groups based on the similarity between the statements executed by the test cases and the CCEs, and the group that is more closely related to the CCEs is removed.

A coverage refinement approach adjusts program coverage to strengthen the correlation between faults and failing test runs [8]. That work defines 'context patterns', which are control and data flows before and after the execution of the faulty statements. The authors define context patterns according to fault types (e.g., missing assignment) and use the context patterns to adjust the coverage of test cases to reduce the negative effect of CCTs. However, this approach requires human knowledge of the types of faults to define context patterns. In contrast, FIESTA does not require the types of program faults to be known, but still filters out the negative effect of CCTs through the debuggability transformation (see Section 2) and the fault weights on test cases (see Section 3).

The motivation of the fault weight metric of FIESTA is similar to that of the related work described in this section, in the sense that all of these techniques try to estimate the 'distance' between passing test runs and failing test runs to localize a fault precisely. However, the distance metrics utilized vary across the techniques (fault weight for FIESTA, CCE distance for Masri et al. [9], Ulam distance for Renieres et al. [33], etc.). Furthermore, although all the related work mentioned in this section may suffer false positives or false negatives in recognizing CCTs, the debuggability transformation of FIESTA reduces the number of CCTs without false positives or false negatives, which can increase the precision of CFL techniques in general.

Similarly to the fault weight of FIESTA, Bandyopadhyay [37] assigns a weight to a passing test case t_p based on the CC-proximity [38] of t_p to all the failing test cases. However, the definition of the weight metric itself differs from that of FIESTA, since the weight of t_p decreases as the proximity of t_p increases when the proximity is not between a low threshold L and a high threshold H; the weight increases as the proximity increases otherwise. Thus, Bandyopadhyay [37] requires a user to select L and H, which affect the precision of the fault localization, but FIESTA does not.
Furthermore, FIESTA clearly demonstrated the improved precision on the 12 target programs including the five real-world programs compared with Tarantula, Ochiai, and Op2, whereas Bandyopadhyay [37] applied Ochiai and his method only to the small Siemens programs and did not show a clear precision improvement. Finally, we showed empirically that the fault weight metric is accurate enough to recognize CCTs (see Figure 3), but Bandyopadhyay did not.

7.3. Techniques to Utilize Predicates as a Target Program Spectrum

There have been several fault localization techniques that utilize predicates in the program code and additional predicates on program execution information. For example, Liblit et al. [39] and Liu et al. [40] instrument/transform a target program to record how program predicates (and additional predicates, such as whether the return value of a function is equal to zero) are evaluated and rank suspicious predicates. In other words, they utilize the numbers of failing tests and passing tests in which a predicate is true (or false) to localize a fault. These techniques can also be considered to perform a debuggability transformation to extract useful execution information for fault localization. The debuggability transformation of FIESTA, however, improves the precision of fault localization actively by explicitly reducing the number of CCTs (see Sections 2 and 5.1), while the above techniques improve the precision passively by observing more execution information. In addition, those techniques handle a predicate consisting of multiple clauses as a whole (although the effect of compound predicates was briefly discussed in Abreu et al. [41]). In contrast, FIESTA targets each clause of a predicate individually through the debuggability transformation.

8. Conclusion and Future Work

In this paper, we have described FIESTA, which improves the precision of coverage-based fault localization by reducing the negative effects of CCTs based on the debuggability transformation and the fault weight metric on test cases. Through the experiments on the 12 programs including five real-world programs of large sizes, we have demonstrated that FIESTA is 49%, 22%, and 14% relatively more precise than Tarantula, Ochiai, and Op2 on average, respectively. In addition, compared with Tarantula, Ochiai, and Op2, FIESTA localizes a fault equally or more precisely on 11, 8, and 8 of the 12 target programs on average, respectively. Furthermore, the debuggability transformation improves the precision of CFL techniques in general (i.e., it relatively increases the precision of Tarantula, Ochiai, Op2, and FIESTA by 4.62%, 6.51%, 6.11%, and 7.72% on average, respectively).

We will extend the debuggability transformation technique by not only generating semantically equivalent programs, but also generating mutants through mutation operators. Our basic intuition is based on the observation that a faulty statement will still be faulty after mutating the statement, while a non-faulty statement will become faulty after the mutation (a similar observation is made by Papadakis et al. [42]). Finally, we plan to extend our tool to support integrated environments (e.g., coverage visualization, a code inspection feature, etc.) for practical debugging, as discussed in Parnin and Orso [43].

References

[1] I. Vessey, Expertise in Debugging Computer Programs, International Journal of Man-Machine Studies 23 (5) (1985) 459–494.
[2] A. Zeller, Isolating Cause-Effect Chains from Computer Programs, in: European Software Engineering Conference/Foundations of Software Engineering (ESEC/FSE), 2002.
[3] J. A. Jones, M. J. Harrold, J. Stasko, Visualization of Test Information to Assist Fault Localization, in: International Conference on Software Engineering (ICSE), 2002.
[4] R. Abreu, P. Zoeteweij, A. van Gemund, An evaluation of similarity coefficients for software fault localization, in: Pacific Rim International Symposium on Dependable Computing (PRDC), 2006.
[5] L. Naish, H. J. Lee, K. Ramamohanarao, A model for spectra-based software diagnosis, ACM Transactions on Software Engineering and Methodology 20 (3) (2011) 11:1–11:32.
[6] M. Daran, Software Error Analysis: A Real Case Study Involving Real Faults and Mutations, in: International Symposium on Software Testing and Analysis (ISSTA), 1996.
[7] B. Baudry, F. Fleurey, Y. Le Traon, Improving test suites for efficient fault localization, in: International Conference on Software Engineering (ICSE), 2006.
[8] X. Wang, S. C. Cheung, W. K. Chan, Z. Zhang, Taming coincidental correctness: Coverage refinement with context patterns to improve fault localization, in: International Conference on Software Engineering (ICSE), 2009.
[9] W. Masri, R. A. Assi, Cleansing Test Suites from Coincidental Correctness to Enhance Fault-Localization, in: International Conference on Software Testing, Verification and Validation (ICST), 2010.
[10] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, M. Roper, Testability Transformation, IEEE Transactions on Software Engineering 30 (1) (2004) 3–16.
[11] G. Necula, S. McPeak, S. Rahul, W. Weimer, CIL: Intermediate language and tools for analysis and transformation of C programs, in: International Conference on Compiler Construction (CC), 2002.
[12] N. Digiuseppe, J. Jones, On the influence of multiple faults on coverage-based fault localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.
[13] C. Liu, J. Han, Failure proximity: a fault localization-based approach, in: International Symposium on Foundations of Software Engineering (FSE), 2006.
[14] A. Zheng, M. Jordan, B. Liblit, M. Naik, A. Aiken, Statistical debugging: simultaneous identification of multiple bugs, in: International Conference on Machine Learning (ICML), 2006.
[15] J. Jones, J. Bowring, M. Harrold, Debugging in parallel, in: International Symposium on Software Testing and Analysis (ISSTA), 2007.
[16] H. Do, S. Elbaum, G. Rothermel, Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact, Empirical Software Engineering Journal 10 (4) (2005) 405–435.
[17] X. Xie, T. Y. Chen, F. C. Kuo, B. Xu, A Theoretical Analysis of the Risk Evaluation Formulas for Spectrum-Based Fault Localization, ACM Transactions on Software Engineering and Methodology (TOSEM).
[18] S. Artzi, J. Dolby, F. Tip, M. Pistoia, Directed test generation for effective fault localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2010.
[19] D. Gopinath, R. N. Zaeem, S. Khurshid, Improving Effectiveness of Spectra-Based Fault Localization Using Specifications, in: Automated Software Engineering (ASE), 2012.
[20] J. Rößler, G. Fraser, A. Zeller, A. Orso, Isolating Failure Causes through Test Case Generation, in: International Symposium on Software Testing and Analysis (ISSTA), 2012.
[21] R. Chillarege, W. Kao, R. G. Condit, Defect Type and its Impact on the Growth Curve, in: International Conference on Software Engineering (ICSE), 1991.
[22] J. A. Duraes, H. S. Madeira, Emulation of Software Faults: A Field Data Study and a Practical Approach, IEEE Transactions on Software Engineering 32 (11) (2006) 849–867.
[23] W. E. Wong, V. Debroy, A Survey of Software Fault Localization, Technical Report UTDCS-45-09, University of Texas at Dallas, 2009.
[24] M. A. Alipour, Automated Fault Localization Techniques: A Survey, Tech. Rep., Oregon State University, 2012.
[25] Y. Brun, M. D. Ernst, Finding Latent Code Errors via Machine Learning over Program Executions, in: International Conference on Software Engineering (ICSE), 2004.
[26] F. Tip, A survey of program slicing techniques, Journal of Programming Languages 3 (1995) 121–189.
[27] M. Burger, A. Zeller, Minimizing Reproduction of Software Failures, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.
[28] V. Dallmeier, C. Lindig, A. Zeller, Lightweight defect localization for Java, in: European Conference on Object-Oriented Programming (ECOOP), 2005.
[29] R. Santelices, J. A. Jones, Y. Yu, M. J. Harrold, Lightweight fault-localization using multiple coverage types, in: International Conference on Software Engineering (ICSE), 2009.
[30] W. E. Wong, V. Debroy, B. Choi, A family of code coverage-based heuristics for effective fault localization, Journal of Systems and Software 83 (2010) 188–208.
[31] S. Yoo, Evolving Human Competitive Spectra-based Fault Localisation Techniques, in: Search Based Software Engineering (SBSE), 2012.
[32] H. Agrawal, J. R. Horgan, S. London, W. E. Wong, Fault Localization using Execution Slices and Dataflow Tests, in: International Symposium on Software Reliability Engineering (ISSRE), 1995.
[33] M. Renieres, S. P. Reiss, Fault localization with nearest neighbor queries, in: International Conference on Automated Software Engineering (ASE), 2003, pp. 30–39.
[34] G. K. Baah, A. Podgurski, M. J. Harrold, Causal Inference for Statistical Fault Localization, in: International Symposium on Software Testing and Analysis (ISSTA), 2010.
[35] G. K. Baah, A. Podgurski, M. J. Harrold, Mitigating the Confounding Effects of Program Dependences for Effective Fault Localization, in: European Software Engineering Conference/Foundations of Software Engineering (ESEC/FSE), 2011.
[36] M. Jose, R. Majumdar, Cause Clue Clauses: Error Localization using Maximum Satisfiability, in: Programming Language Design and Implementation (PLDI), 2011.
[37] A. Bandyopadhyay, Improving spectrum-based fault localization using proximity-based weighting of test cases, in: Automated Software Engineering (ASE), 2011.
[38] C. Liu, X. Zhang, J. Han, A systematic study of failure proximity, IEEE Transactions on Software Engineering 34 (2008) 826–843.
[39] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, M. I. Jordan, Scalable statistical bug isolation, in: Programming Language Design and Implementation (PLDI), 2005.
[40] C. Liu, X. Yan, L. Fei, J. Han, S. P. Midkiff, SOBER: statistical model-based bug localization, in: Foundations of Software Engineering (FSE), 2005.
[41] R. Abreu, W. Mayer, M. Stumptner, A. van Gemund, Refining spectrum-based fault localization rankings, in: ACM Symposium on Applied Computing (SAC), 2009.
[42] M. Papadakis, Y. Le Traon, Using Mutants to Locate "Unknown" Faults, in: International Workshop on Mutation Analysis, 2012.
[43] C. Parnin, A. Orso, Are automated debugging techniques actually helping programmers?, in: International Symposium on Software Testing and Analysis (ISSTA), 2011.