2014 IEEE International Conference on Software Testing, Verification, and Validation Reproducing Field Failures for Programs with Complex Grammar Based Input Fitsum Meshesha Kifetew1 , Wei Jin2 , Roberto Tiella1 , Alessandro Orso2 , Paolo Tonella1 1 Fondazione Bruno Kessler, Trento, Italy Georgia Institute of Technology, USA [email protected], [email protected], [email protected], [email protected], [email protected] 2 Abstract—To isolate and fix failures that occur in the field, after deployment, developers must be able to reproduce and investigate such failures in-house. In practice, however, bug reports rarely provide enough information to recreate field failures, thus making in-house debugging an arduous task. This task becomes even more challenging for programs whose input must adhere to a formal specification, such as a grammar. To help developers address this issue, we propose an approach for automatically generating inputs that recreate field failures in-house. Given a faulty program and a field failure for this program, our approach exploits the potential of grammar-guided genetic programming to iteratively find legal inputs that can trigger the observed failure using a limited amount of runtime data collected in the field. When applied to 11 failures of 5 real-world programs, our approach was able to reproduce all but one of the failures while imposing a limited amount of overhead. I. This paper presents a failure reproduction technique, called SBFR (Search-Based Failure Reproduction), that is specifically designed to handle complex programs with highly structured inputs (e.g., compilers and interpreters), hence going beyond the features of the programs used in previous failure reproduction studies [3, 4, 5, 6, 7, 8, 9]. SBFR takes as input the failing program, a grammar describing the program input, and the (partial) call sequence for the failing execution, and uses genetic programming (a search-based optimisation algorithm) for generating failing inputs. Genetic programming relies on mutation and crossover operators to manipulate the parse tree of the structured program input, initially constructed by random application of grammar productions. Parse trees are evolved through mutation and crossover, so as to minimize a fitness function that measures the distance between the execution trace of each candidate input tree and the execution trace of the failing execution. The search stops when a parse tree is produced which is capable of successfully reproducing the target failing behavior. I NTRODUCTION Software systems are increasingly complex and can be used in unpredictable (and untested) ways. Hence, users experience field failures (failures occurring after deployment), which should be reproduced and fixed as quickly as possible. Field failure reproduction, however, is not an easy task for developers [2], as bug reports provide limited information on how the failure can be reproduced in the testing lab. To assess SBFR’s efficiency and effectiveness, we developed a tool that implements our approach and used it to perform an empirical evaluation on 11 failures of 5 real-world programs. Our results are promising: whereas BugRedux could not reproduce any of the 11 failures considered, SBFR was able to reproduce 10 of the 11 failures with a failure reproduction rate of 0.79 on average. In previous work, we developed BugRedux, a general approach for failure reproduction based on guided symbolic execution [3]. 
BugRedux takes as input a sequence of key intermediate points in the observed failing execution and guides the symbolic execution towards these points until it reaches the point of failure. In particular, empirical studies with BugRedux show that (partial) call sequences provide enough guidance for in-house synthesis of failing executions. Call sequences usually do not introduce privacy problems and can be gathered with limited overhead, so they represent a good trade off between the amount of information collected and the effectiveness of field failure reproduction.

The key limitation of BugRedux is that it relies on symbolic execution, which is known to be ineffective on (1) programs with highly structured inputs, such as inputs that must adhere to a (non-trivial) grammar, (2) programs that interact with external libraries, and (3) large complex programs in general (e.g., programs that generate constraints that the constraint solver cannot handle).

We presented an earlier version of this work as a short paper in ASE 2013's New Ideas Track [1]. This paper provides a more detailed and extensive technical description of our approach and an empirical evaluation that demonstrates the effectiveness of the approach.

II. BACKGROUND

In this section we provide the background and terminology necessary for describing our approach.

A. Evolutionary Search

Evolutionary search is an optimization heuristic that works by maintaining a population of candidate solutions that are evolved through generations via reproduction and mutation. There are a number of ways in which this evolutionary scheme can be implemented. Among them, the most dominant are Genetic Algorithms (GA) and Genetic Programming (GP).

In a GA, candidate solutions are encoded into individuals following various encoding schemes (e.g., binary strings). Initially, a population of individuals is generated, usually randomly. The GA then assigns a fitness value to each individual in the population by evaluating it using a fitness function—a function that measures the "goodness" of each individual with respect to the problem being solved. After performing this evaluation, the GA selects from the population pairs of individuals (parents) with better fitness values and subjects them to reproduction (crossover) to produce their offspring.

[Figure 1 here: a small example grammar and the step-by-step derivation of the sentence "cos num"; the graphical content cannot be reproduced in plain text.]
Fig. 1. An example of a derivation. Each non-terminal is expanded by one of its productions, until the final string is composed of terminal symbols only.

Crossover and mutation could be performed in a number of different ways. In GP, sub-tree crossover and mutation are commonly used. Sub-tree crossover between two individuals is performed by exchanging two sub-trees, rooted at the same non-terminal, from their tree representations. Sub-tree crossover ensures that the newly formed individuals are well formed with respect to the underlying grammar. An example of sub-tree crossover is shown in Figure 2.
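To make the sub-tree operators concrete, the following minimal Java sketch shows how sub-tree crossover and mutation can be applied to derivation trees. It is an illustration only, not the implementation used by GEVA or by our tool; the Node class, the Grammar interface, and its derive method are assumptions we introduce for the example.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // A derivation-tree node: a grammar symbol plus the children it was expanded into.
    final class Node {
        final String symbol;                       // e.g., "<exp>" or a terminal such as "num"
        final List<Node> children = new ArrayList<>();
        Node(String symbol) { this.symbol = symbol; }

        // Collect all nodes in this sub-tree labeled with the given non-terminal.
        void collect(String nonTerminal, List<Node> out) {
            if (symbol.equals(nonTerminal)) out.add(this);
            for (Node child : children) child.collect(nonTerminal, out);
        }
    }

    // Assumed grammar-driven generator used by mutation to derive a fresh sub-tree.
    interface Grammar {
        Node derive(String nonTerminal);
    }

    final class SubtreeOperators {
        private final Random rnd = new Random();

        // Sub-tree crossover: exchange two sub-trees rooted at the same non-terminal,
        // so both offspring stay well formed with respect to the grammar.
        void crossover(Node parent1, Node parent2, String nonTerminal) {
            List<Node> sites1 = new ArrayList<>();
            List<Node> sites2 = new ArrayList<>();
            parent1.collect(nonTerminal, sites1);
            parent2.collect(nonTerminal, sites2);
            if (sites1.isEmpty() || sites2.isEmpty()) return;   // no common crossover point
            Node a = sites1.get(rnd.nextInt(sites1.size()));
            Node b = sites2.get(rnd.nextInt(sites2.size()));
            List<Node> tmp = new ArrayList<>(a.children);       // swap the two expansions
            a.children.clear();
            a.children.addAll(b.children);
            b.children.clear();
            b.children.addAll(tmp);
        }

        // Sub-tree mutation: replace one sub-tree with a new one of the same type
        // derived from the grammar.
        void mutate(Node individual, String nonTerminal, Grammar grammar) {
            List<Node> sites = new ArrayList<>();
            individual.collect(nonTerminal, sites);
            if (sites.isEmpty()) return;
            Node site = sites.get(rnd.nextInt(sites.size()));
            Node fresh = grammar.derive(nonTerminal);
            site.children.clear();
            site.children.addAll(fresh.children);
        }
    }

Because both operators only ever substitute material rooted at the same non-terminal, the resulting individuals remain derivable from the grammar, which is the property grammar-guided GP relies on.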
The GA may further subject the offspring to a process of mutation, in which the individuals’ encoding is modified to introduce diversity into the population. At this point, there is a new generation of individuals that can further reproduce. This process of evaluation, selection, and reproduction continues until either the solution is found or a given stopping condition (e.g., number of cycles) is reached. Fig. 2. Sub-tree crossover: sub-trees of the same type (in circles) from parents are exchanged to create children. See Figure 1 for the derivation process. Similarly, sub-tree mutation on an individual is performed by replacing a sub-tree from its tree representation with a new sub-tree of the same type generated from the grammar. Figure 3 shows an example of sub-tree mutation. GP [10] follows a similar process. However, the individuals manipulated by the search are tree-structured data (programs, in the GP terminology) rather than encodings of solution instances. These complex data usually have a well formed structure defined by a set of formal rules (e.g., a grammar). While there are a number of variants of GP in the literature, in this work we focus on Grammar Guided GP (GGGP), and in particular on Grammatical Evolution (GE) [11]. In GGGP, individuals are sentences generated according to the formal rules prescribed by a grammar. Specifically, in the case of GE, sentences are generated from a Context Free Grammar (CFG), so that new individuals produced by the GP search operators (crossover and mutation) are guaranteed to be valid with respect to the associated grammar. In GP, the initial population of individuals is generated from the grammar following a number of techniques, mostly based on some form of random grammar-based generation. 1) Input Representation: An individual (a sentence from the grammar) in the population is represented by its syntax tree (derivation tree). The tree is built through the process of derivation: starting from the root (start) symbol of the grammar, productions are applied to substitute non-terminal symbols, resulting eventually in a terminal string. The process is shown in Figure 1. Tree representation of individuals is appropriate, as the underlying search operators, described below, are based on sub-tree manipulation. Fig. 3. Sub-tree mutation: a sub-tree is replaced by a new sub-tree of the same type generated starting from the grammar. Child1 from Figure 2 is mutated. B. Terminology A field failure is a failure of a deployed program while it executes on a user machine. We use the term execution data to refer to any runtime information collected from a program executing on a user machine. In particular, a call sequence is a specific type of execution data that consists 2) Evolution Operators: Evolution operators (crossover and mutation) play a crucial role in the evolutionary search 164 of a (sub)sequence of functions (or methods) invoked during the execution of a program on a user machine. We define a field failure reproduction technique as a technique that can synthesize, given a program P , a field execution E of P that results in a failure F , and a set of execution data D for E, an in-house execution E as follows. First, E should result in a failure F that is analogous to F , that is, F has the same observable behavior of F . If F is the violation of an assertion at a given location in P , for instance, F should violate the same assertion at the same point. 
Second, E should be an actual execution of P , that is, the approach should be sound and generate an actual input that, when provided to P , results in E . Finally, the approach should be able to generate E using only P and D, without the need for any additional information. III. To avoid this problem, the search can assign probabilities to the productions in such a way that the recursive ones are applied less frequently than the non-recursive ones. In this way, the sentences generated in the initial population will be of balanced depth and more representative than those generated using a uniform random approach. A simple, yet effective rule that we devised to achieve this result is called the 80/20 rule. According to this rule, a total probability of 0.2 is uniformly distributed across all recursive productions, while a total probability of 0.8 is uniformly distributed across the nonrecursive productions. Indirect recursion is regarded as plain recursion in the application of the 80/20 rule. With reference to the example in Figure 1, production 0 (a recursive production) for non-terminal <alt1*> is assigned probability 0.2, while production 1 (a non-recursive production) is assigned probability 0.8. S EARCH BASED FAILURE R EPRODUCTION For complex grammars, evolving meaningful and representative sentences may be quite difficult for GP if the initial population is not chosen carefully. One of the subject programs used in our experimental study, PicoC, is a good example of such a case. It is an interpreter for a subset of the C programming language. The grammar used to generate inputs for this program is fairly large (194 production rules) with complex and highly recursive structures. If the initial population consists of random sentences (obtained using the 80/20 rule), individuals will still be very different from real C programs, being mostly limited to shallow structures, such as paired braces. Common programming constructs, such as assignment statements, will be very difficult to generate from such a complex grammar using random techniques. SBFR’s goal is to reproduce an observed (field) failure based on a (ideally minimal) set of runtime information about the failure, or execution data. As commonly done in these cases, SBFR collects execution data by instrumenting the software under test (SUT, hereafter) before deploying it to users. Upon failure, the execution data for the failing execution are used by SBFR to perform a GP based search for inputs that can trigger the failure observed while the SUT was running on the user machine. This part of the approach would be ideally performed in the field (by leveraging free cycles on the user’s machine), so as to send back only the generated inputs instead of the actual execution data. As in our previous BugRedux work [3], execution data are used to provide guidance to the search. However, unlike BugRedux, SBFR performs the search for the failure inducing input using evolutionary search, and the guidance is provided indirectly, through the use of the fitness function discussed in Section III-C. More precisely, the individuals in the GP search are candidate test inputs for the SUT, that is, structured input strings that adhere to a formal grammar. The search maintains a population of individuals, evaluates them by measuring how close they get to the desired solution using a fitness function, and evolves them via genetic operators (see Section III-B). 
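As an illustration of the 80/20 rule described above, the following Java sketch assigns production probabilities for one non-terminal and samples a production accordingly. It is a simplified sketch under our own assumptions (the recursion test is passed in as a predicate, and degenerate cases where all productions are recursive, or all are non-recursive, fall back to a uniform split); it is not the code of the SBFR tool.

    import java.util.List;
    import java.util.Random;
    import java.util.function.Predicate;

    final class EightyTwentyRule {

        // Assign a probability to each production of one non-terminal following the
        // 80/20 rule: recursive productions share 0.2, non-recursive ones share 0.8.
        // Indirect recursion is treated exactly like direct recursion.
        static double[] assignProbabilities(List<List<String>> productions,
                                            Predicate<List<String>> isRecursive) {
            int nRec = 0;
            for (List<String> p : productions) if (isRecursive.test(p)) nRec++;
            int nNonRec = productions.size() - nRec;
            // Degenerate cases fall back to giving the whole mass to the existing group.
            double recTotal = (nRec > 0 && nNonRec > 0) ? 0.2 : (nRec > 0 ? 1.0 : 0.0);
            double nonRecTotal = 1.0 - recTotal;
            double[] prob = new double[productions.size()];
            for (int i = 0; i < productions.size(); i++) {
                prob[i] = isRecursive.test(productions.get(i))
                        ? recTotal / nRec
                        : nonRecTotal / nNonRec;
            }
            return prob;  // e.g., for <alt1*> in Figure 1: production 0 (recursive) gets 0.2,
                          //       production 1 (non-recursive) gets 0.8
        }

        // Pick a production index according to the assigned probabilities (roulette wheel).
        static int sample(double[] prob, Random rnd) {
            double r = rnd.nextDouble();
            double acc = 0.0;
            for (int i = 0; i < prob.length; i++) {
                acc += prob[i];
                if (r <= acc) return i;
            }
            return prob.length - 1;
        }
    }

With probabilities biased in this way, the derivation process used to seed the initial population tends to produce sentences of balanced depth; the search then proceeds over this population as described next.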
If at least one candidate is able to trigger the desired failure, a solution is found and the search terminates. If no candidate solution is found after consuming the whole search budget, the search is deemed unsuccessful. Instead of applying a set of production probabilities rigidly determined by the 80/20 rule, in these cases probabilities can be learned from examples of existing, human written inputs. Namely, the stochastic grammar used for the generation of the initial population is learned from a corpus of well formed, human written sentences for the SUT. It is usually easy to obtain such corpus for popular languages, such as C, since a huge amount of code is publicly available. Even if such corpus is not publicly available, the manual test suite which is usually shipped with the SUT itself could provide a good starting point. In our experiments, for subjects with very complex grammars, we used the Inside-Outside algorithm [12] to learn the production probabilities from a corpus of input sentences. The Inside-Outside algorithm estimates inside probabilities (probabilities of generating a given terminal sequence from each non-terminal) and outside probabilities (probabilities of generating each such non-terminal as well as the terminals outside the given sequence) from a corpus. These probabilities determine the weights of the generative stochastic grammar. The Inside-Outside algorithm tolerates arbitrary degree of grammar ambiguity, an important property for grammars originally used by parsers, as those considered in our experiments. A. Seeding the Search with Representative Inputs When SBFR generates the initial population, uniform selection of the grammar productions to apply tends to run into problems in the presence of recursive productions, which are quite frequent in commonly used grammars. In fact, repeated application of recursive productions (e.g., production 0 for <alt1*> in Figure 1) may result in a non-terminating derivation process. For practical purposes, the derivation process is continued until some maximum depth is reached. However, if the maximum depth is reached before substituting all nonterminals, the evolutionary search process discards the individual. If the application of recursive productions is not controlled, the individuals that are left after discarding those containing non-terminals tend to be associated with shallow derivation trees and short, non representative strings. B. Input Representation and Genetic Operators Once the initial population of individuals is generated, either using the 80/20 rule or by learning probabilities as discussed in the previous subsection, they are represented as syntax trees. The genetic operators manipulate these tree representations of the individuals. 165 We apply sub-tree crossover and mutation operations on the individuals (see Section II). We chose to use tree based operations because they preserve the well formedness of the resulting individuals—if both parents are well formed individuals (according to the grammar), the offspring produced by sub-tree crossover are also going to be well formed. Similarly, sub-tree mutation of a well formed individual results in a well formed individual. The probabilities used in sub-tree mutation are the same as those used to generate the initial population (i.e., they are either determined by the 80/20 rule or they are learned from a corpus). fitness evaluations, is exhausted. 
If successful, the search will produce an individual (i.e., an input) that causes the program to follow a trajectory similar to that of the observed failure, reach the point of failure, and fail at that point with the same observable behavior as the original failure (details on how our current implementation actually assesses whether the reproduced failing behavior matches the observed one are provided in the next section). To evaluate the fitness of an individual, its tree representation is “unparsed” into a string representation, which is passed to the SUT as input. Based on the execution of the SUT on the input string, the fitness of the individual is computed. Figure 4 shows an overall view of the prototype tool that implements SBFR. The tool consists of three main modules: GP Search, Instrumenter, and Learner. The GP Search component performs the evolutionary search, eventually producing a test case. To evaluate individuals in the search, it uses a SUT Runner component, which is a simple wrapper that executes the individual with a given timeout and returns the execution trace together with the exit status of the execution (including possible error messages). In cases where learning is employed, the Learner produces a stochastic CFG (see Sections II and III-A) starting from the SUT’s input grammar and a corpus. IV. C. Fitness Computation and Search Termination SBFR evaluates candidate solutions based on the trace obtained when executing them against the SUT. To evaluate how good a candidate individual is, the instrumented SUT is executed using the candidate individual as input, resulting in a set of execution data. In this work, we consider execution data that consist of call sequences and refer to a call sequence using the term trajectory. More formally, we define a trajectory as a sequence T = c1 , ..., cn , where each ci is a function/method call. We made this choice of execution data because our findings in previous work show that call sequences provide the best tradeoffs in terms of cost benefit for synthesizing inhouse executions [3]. Furthermore, with anecdotal evidence from manually checking the collected call sequences from our empirical study, call sequences are unlikely to reveal sensitive or confidential information about the original execution. In SBFR, we use a fitness function to compare how “similar” the trajectory of a candidate individual is to the trajectory of the failing execution obtained from the field. This comparison is implemented using sequence alignment between the two trajectories. That is, we propose a fitness function based on the distance between the trajectory of the failing execution and the trajectory of a candidate individual. Hence, our GP approach tries to minimize this distance with the objective of finding individuals that generate trajectories identical to that of the failing execution. The distance between two trajectories T1 and T2 can be defined as: distance(T1 , T2 ) = |T1 | + |T2 | − 2 ∗ |LCS(T1 , T2 )| I MPLEMENTATION Fig. 4. SBFR prototype: GP Search performs the evolutionary search guided by the Target trajectory. Fitness evaluation is performed by executing an individual via a SUT Runner component that runs the SUT with a timeout. The execution data and the exit status are returned to the search component for fitness evaluation. In cases where learning is applied, the grammar (in BNF) is augmented with probabilities learned from a corpus by the Learner. 
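The distance defined above (Equation 1) can be computed with a standard longest-common-subsequence dynamic program. The following Java sketch is ours, for illustration only; trajectories are simply lists of call names, and the main method reproduces the worked example used in Section III-C.

    import java.util.Arrays;
    import java.util.List;

    // Sketch of the fitness of Equation 1:
    //   distance(T1, T2) = |T1| + |T2| - 2 * |LCS(T1, T2)|
    final class TrajectoryDistance {

        // Classic O(n*m) dynamic program for the length of the longest common subsequence.
        static int lcsLength(List<String> t1, List<String> t2) {
            int n = t1.size(), m = t2.size();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    dp[i][j] = t1.get(i - 1).equals(t2.get(j - 1))
                            ? dp[i - 1][j - 1] + 1
                            : Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
            return dp[n][m];
        }

        static int distance(List<String> t1, List<String> t2) {
            return t1.size() + t2.size() - 2 * lcsLength(t1, t2);
        }

        public static void main(String[] args) {
            // Example from Section III-C: LCS = f, g, h, m, so distance = 6 + 5 - 2*4 = 3.
            List<String> failing   = Arrays.asList("f", "g", "g", "h", "m", "n");
            List<String> candidate = Arrays.asList("f", "g", "h", "m", "m");
            System.out.println(distance(failing, candidate));   // prints 3
        }
    }

The GP search minimizes this value, so a candidate whose trajectory matches the target call sequence exactly has distance 0.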
We have implemented the core evolutionary search component of our failure reproduction framework based on GEVA [14], a general purpose GE tool written in Java. GEVA provides the necessary infrastructure, such as representation of individuals, basic GP operators (e.g., sub-tree crossover, and mutation), and the general functionality to manage the overall search process. On top of this infrastructure, we have implemented customized operators for stochastic initialization and fitness evaluation, which are central to our proposed failure-reproduction scheme. (1) where LCS stands for Longest Common Subsequence [13], and |T | is the length of the trajectory T . For instance, T1 = f, g, g, h, m, n and T2 = f, g, h, m, m have LCS = f, g, h, m. Hence, their distance is 6 + 5 - 8 = 3, which corresponds to the number of calls that appear only in T1 (second g, n) or in T2 (second m). Fitness evaluation is performed by executing the instrumented version of the SUT externally using the SUT Runner, with the string representation of an individual as input. When the execution terminates, its trajectory is returned to the search component (GEVA), which computes the distance between the trajectory of the individual and that of the target. Since the sentences generated may contain constructs that lead the SUT to non-terminating executions, the Runner executes the SUT with a timeout. The fitness value of an individual in the GP will hence be computed as the distance between its trajectory and that of the target trajectory using Equation 1. The fitness value is then minimized by the search, with the ultimate objective of producing individuals that reproduce the desired failure. The search stops when a desired solution is found or the search budget, expressed as the maximum number of 166 The major computational cost associated with SBFR is the cost of the fitness evaluation. To reduce this cost, our tool uses caching, which minimizes the fitness-evaluation cost by avoiding the re-execution of previously evaluated inputs. Consequently, the search budget is computed as the total number of unique fitness evaluations. In the rest of this section, we present the subject programs and failures that we used for our experiments, illustrate our experiment protocol, and discuss our results and the possible threats to their validity. To determine whether an individual triggers a failure analogous to the one observed in the field, our tool proceeds as follows. For each candidate input, the error/exception possibly generated while executing the SUT is compared with that of the reported failure. This comparison is performed by the SUT Runner by comparing the error messages and the location where the errors manifest themselves and returns an Exit Status indicating success if the two failures match. In our empirical evaluation of SBFR, we consider eleven failures from five grammar based programs. We selected these programs because they are representative of the kind of programs we target and their grammars are available. As our approach deals with grammar-based programs, the corresponding grammars are generally available with the program itself. Even so, some work may be still necessary, for example to convert the available grammar into a format (BNF) accepted by our tool (this task is usually easy to automate). A. Program Subjects The Instrumenter module adds software probes to the SUT at compile time, for collecting call sequences when the SUT is executed. 
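For Java subjects, the kind of compile-time probe insertion performed by the Instrumenter can be sketched with the Javassist bytecode-manipulation library, on which our Java instrumenter is based (as described next). The sketch is illustrative rather than the actual Instrumenter: the Trace collector class is an assumption we introduce here, and a real setup would place it in a named package and reference it with a fully qualified name inside the injected code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtMethod;

    // Stand-in trace collector (our invention for this sketch): the call sequence is
    // kept in memory and would be dumped to a file only if the execution crashes.
    final class Trace {
        private static final List<String> CALLS =
                Collections.synchronizedList(new ArrayList<String>());
        public static void log(String call) { CALLS.add(call); }
        public static List<String> calls()  { return CALLS; }
    }

    // Probe insertion: prepend a Trace.log(...) call to every concrete method body.
    public final class CallSequenceInstrumenter {
        public static void instrument(String className, String outputDir) throws Exception {
            ClassPool pool = ClassPool.getDefault();
            CtClass target = pool.get(className);
            for (CtMethod method : target.getDeclaredMethods()) {
                if (method.isEmpty()) continue;              // skip abstract/native methods
                // In a real setup the probe class should be fully qualified here.
                method.insertBefore("Trace.log(\"" + method.getLongName() + "\");");
            }
            target.writeFile(outputDir);                     // emit the instrumented .class file
        }
    }

Running the subject with classes instrumented in this way yields, for each execution, the dynamic call sequence that SBFR uses as a trajectory.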
We implemented two versions of this module, one for C programs (based on the LLVM compiler infrastructure1 ) and the other for Java programs (based on the Javassist bytecode-manipulation library2 ). As a result, the instrumented version of the SUT can output the dynamic call sequence for the given input. Table I presents a summary of the subject programs used in our experimental study. Calc4 is an expression evaluator that accepts an input language including variable declarations and arbitrary expressions. bc5 is a command-line calculator commonly found in Linux/Unix systems. MDSL6 is an interpreter for the Minimalistic Domain Specific Language (MDSL), a language including programming constructs such as functions, loops, conditionals etc. PicoC7 is an interpreter for a subset of the C language. Lua8 is an interpreter for the Lua scripting language. Calc and MDSL are developed in Java based on the ANTLR parser generator. bc is developed in C based on the Lex/Yacc parser generator tools. PicoC and Lua are developed in C, but do not rely on a parser generator. We defined a BNF grammar for PicoC based on an existing C grammar suitably reduced to the subset of C accepted by PicoC. We also defined a BNF grammar for Lua based on the semi-formal specification of the language provided on the official website. Note that the search is completely language agnostic. Handling SUTs developed in another language L would simply amount to developing an instrumentation tool for L and giving the SUT Runner the ability to run programs written in L, so that call sequences can be collected during execution. The core components of the failure reproduction framework would remain unchanged. The Learner component, which generates a stochastic CFG from a CFG and a corpus of sample inputs, is an extension of a C implementation of the inside-outside algorithm,3 which takes as input a grammar and a set of sentences and produces as output a probability for each rule in the grammar. To be accepted by the implementation of the inside-outside algorithm that we use, a grammar has to be transformed into a weakly equivalent grammar (i.e., a grammar that generates the same language) with all terminals introduced by unary rules only and all empty rules removed. We implemented a tool that performs this transformation using an existing approach [15]. V. Table I reports the number of productions in the grammar of each application, ranging from 38, for Calc, to 194, for PicoC. These grammars are fairly large and complex, and bigger than those typically found in the GP literature. Even if the subject programs are not necessarily large in terms of LOCs, they are quite challenging for input-generation techniques. In particular, we also tried to apply vanilla BugRedux to reproduce the same field failures. However, BugRedux failed to generate any input for all faults considered in this study after 72 hours. The reason of ineffectiveness of BugRedux is that the current implementation does not leverage the grammar information (as done e.g. with “symbolic tokens” [16, 17]) and the guided symbolic execution search gets stuck in the lexical analysis functions because these functions usually contain a huge number of paths. E MPIRICAL E VALUATION The main goal of our empirical evaluation is to assess the effectiveness and practical applicability of our SBFR approach for programs with structured and complex input. To achieve this goal we performed a study on several real-world programs and real failures for these programs. 
Specifically, we investigated the following research questions:
• RQ1: How effective is SBFR in reproducing real field failures for programs with structured input?
• RQ2: What is the performance overhead imposed by the instrumentation required for failure reproduction?
• RQ3: What is the role of input seeding in search based failure reproduction?

Table I reports the number of faults (equal to the number of failures) considered for each subject. The faults in bc, PicoC, and Lua have been selected from their respective bug tracking systems and affect the latest versions of the programs. For instance, the bc bug crashes the bc program deployed with most modern Linux systems. The bugs for Calc and MDSL have been discovered by the authors while investigating the programs in a different work. Each fault causes a crashing failure, that is, a failure that results in the unexpected termination of the program. The execution data used to guide the search in SBFR are generated by test cases that expose these failures, and thus simulate the occurrence of a field failure.

TABLE I. SUBJECTS USED IN OUR EXPERIMENTAL STUDY.
Name    Language    Size (KLOC)    # Productions    # Faults
Calc    Java        2              38               2
bc      C           12             80               1
MDSL    Java        13             140              5
PicoC   C           11             194              1
Lua     C           17             106              2

1 http://llvm.org
2 http://www.csg.is.titech.ac.jp/∼chiba/javassist/
3 http://web.science.mq.edu.au/∼mjohnson/Software.htm
4 https://github.com/cmhulett/ANTLR-java-calculator/
5 http://www.gnu.org/software/bc/
6 http://mdsl.sourceforge.net/
7 https://code.google.com/p/picoc/
8 http://www.lua.org

We also measured the execution time of the SUT before and after instrumentation, so as to determine the time overhead imposed on the end user. Specifically, we ran all test cases available for each subject used in our experimental study and measured the associated execution time with (ET') and without (ET) instrumentation. The percentage increment of the test suite execution time is used to quantify the overhead introduced by the instrumentation. We also measure the size (SZ) of the trace files used to store the call sequences associated with failing executions, so as to assess the space overhead imposed by SBFR. We consider the size of the trace files both before and after compressing them (ZSZ), as in practice such data can be stored (and transferred over networks) in compressed format.

As there are several parameters that control the search process, we performed sensitivity analysis to determine appropriate values for the dominant search parameters in our experiments. The values we used are: population size of 500; crossover probability of 0.8; mutation probability of 0.2; three-way tournament selection, preserving the elite; total search budget of 10,000 unique fitness evaluations.

B. Experiment Protocol

We evaluated the effectiveness of SBFR using random grammar-based test case generation as a baseline. Specifically, we implemented a random generation technique (RND hereafter) that applies the 80/20 rule (discussed in Section III-A). RND generates a new input from the grammar based on the 80/20 rule and executes the SUT with that input. If the input triggers the desired failure, a solution is found and RND stops. If the input does not trigger the failure, another input is generated and evaluated. This process continues until either a solution is found or the search budget is finished.

C. Results

Table II presents the results of our empirical evaluation.
For each bug, the table reports FRP for both SBFR and RND, together with the results of the Wilcoxon statistical test of significance. For Bug3 of MDSL, both SBFR and RND are able to reproduce the failure, so we further performed a comparison of the search budget consumed by each to reproduce the failure (FIT metrics, discussed above). However, a Wilcoxon test (p-value 0.4813) shows that there is no significant difference in the consumption of search budget either.

When seeding is employed (related to RQ3), for each of the considered subjects we used the stochastic grammar learned from a corpus of human written tests (see Section IV) to generate inputs. Hence, in SBFR the initial population is generated from the stochastic grammar, rather than using the 80/20 rule. Similarly, in the case of RND, inputs are generated from the stochastic grammar.

Since both SBFR and RND involve non-deterministic actions, we ran both SBFR and RND 10 times for every failure considered. For each such run, we recorded whether or not the failure was reproduced and, if reproduced, how much of the search budget was consumed to reproduce it. If the failure was not reproduced after consuming the entire budget, the search was deemed unsuccessful. We calculate the failure reproduction probability (FRP) as the number of runs that reproduced the failure divided by the total number of runs (i.e., 10) for each subject. For example, using SBFR with the 80/20 rule, we reproduced Calc Bug 1 in 6 runs out of 10 runs, hence the FRP = 0.6.

TABLE II. FAILURE REPRODUCTION PROBABILITY FOR RND AND SBFR. p-VALUES WHICH ARE STATISTICALLY SIGNIFICANT ARE SHOWN IN BOLDFACE.
Subject:Bug    FRP (RND)    FRP (SBFR)    p-value
Calc Bug1      0.0          0.6           0.005016
Calc Bug2      0.0          0.8           0.00044
bc             0.0          1.0           1.59E-005
MDSL Bug1      0.0          1.0           1.59E-005
MDSL Bug2      0.0          1.0           1.59E-005
MDSL Bug3      1.0          1.0           NA
MDSL Bug4      0.0          1.0           1.59E-005
MDSL Bug5      0.0          1.0           1.59E-005
PicoC          0.0          0.0           NA
Lua Bug 1      0.0          0.0           NA
Lua Bug 2      0.0          0.0           NA

Table III shows the size of the execution trace collected for each crash before and after compression with the zip utility. As can be seen from the table, the size of the traces, especially after compression, is almost negligible.

When there was no statistically significant difference in FRP, we measured a secondary effectiveness indicator, which accounts for the computational cost incurred by each technique to achieve the measured FRP: the number of fitness evaluations (FIT). Fitness evaluation represents the main computational cost for both SBFR and RND, and largely dominates all other computational costs. Therefore, we used FIT as an indicator of the cost of failure reproduction and measured it to assess whether SBFR offers any cost saving as compared to RND when both achieve the same FRP. In our experiments, this happened only for one of the bugs (discussed below in Section V-C).

Table IV reports execution times with and without instrumentation. The execution time overhead ranges between 2.8% and 16.4%. Results have been obtained using an implementation that relies on buffering to minimize the number of disk writes. In practice, the size of the execution data, consisting only of call sequences, is usually small enough to be kept entirely in memory during a program execution. Since a trace is dumped to file only upon crash, for normal (non-failing) executions the entire trace can be kept in memory, and no disk write operation is required.

TABLE III. UNCOMPRESSED (SZ) AND COMPRESSED (ZSZ) EXECUTION TRACE SIZE.
Subject      SZ (kb)    ZSZ (kb)
Calc Bug1    3.50       0.49
Calc Bug2    1.60       0.40
bc           12.00      0.46
MDSL Bug1    1.20       0.56
MDSL Bug2    1.30       0.56
MDSL Bug3    2.90       0.60
MDSL Bug4    3.30       0.68
MDSL Bug5    0.66       0.46
PicoC        8.40       0.55
Lua Bug 1    75.92      2.37
Lua Bug 2    62.40      1.77

Based on the results we obtained, we can answer RQ1 and state that SBFR is effective in reproducing real field failures for programs with structured, grammar based input, while RND and BugRedux are not so. With respect to the overhead imposed by SBFR, which is the topic of RQ2, Table III shows that the size of the collected execution data (call sequences, in this case) is very small. In the worst case, for Lua Bug 1, the size of the uncompressed trace is 75.92Kb (2.37Kb compressed). Overall, on average, the size of the uncompressed trace is 15.74Kb, while the average compressed trace size is 0.8Kb.

TABLE IV. TEST SUITE EXECUTION TIME BEFORE AND AFTER INSTRUMENTATION.
Subject    ET (sec)    ET' (sec)    ΔET %
Calc       4.28        4.47         4.4%
bc         7.57        8.81         16.4%
MDSL       15.97       16.64        4.2%
PicoC      1.00        1.11         11%
Lua        1.38        1.42         2.8%

As Table IV shows, the execution time overhead imposed by SBFR's instrumentation (added to collect dynamic execution traces) is also acceptable for all five subjects considered, with an average overhead of about 8%. Moreover, we also expect these results to be really worst case scenarios for many reasons. First, all of these applications are processing-intensive applications with no interaction with the user. The overhead would typically decrease dramatically in the case of interactive applications, for which idle time is dominant. Second, these are for the most part very short executions, where the fixed cost of the instrumentation's initialization is not amortized over time. Third, it is always possible to sample and collect only partial execution data [3]. Finally, we use an unoptimized instrumentation; in our experience, sophisticated instrumentation can considerably reduce the time overhead imposed on the instrumented programs. Nevertheless, we plan to further reduce the overhead of collecting field data by incorporating different sophisticated techniques including the ones mentioned above.

Table V presents the results of stochastic grammar learning, used to generate the test cases of RND and to seed the initial population of SBFR. We consider only bugs that could not be reproduced by SBFR when the 80/20 rule is used for the initialisation (see Table II). As the table shows, the FRP for SBFR is significantly higher than that of RND for two of the three bugs considered. For the third bug (Lua Bug 1), both SBFR and RND are unable to reproduce the failure.

TABLE V. FAILURE REPRODUCTION PROBABILITY (FRP) FOR SBFR AND RND WITH INITIALIZATION USING THE LEARNED STOCHASTIC GRAMMAR, RATHER THAN THE 80/20 RULE.
Subject:Bug    FRP (RND)    FRP (SBFR)    p-value
PicoC          0.1          0.8           0.0025
Lua Bug 1      0.0          0.0           NA
Lua Bug 2      0.0          0.5           0.01365

In summary, we can answer RQ2 in a positive way. According to our results, SBFR imposes almost negligible space overhead and acceptable time overhead in all cases considered.

D. Discussion

As Table II shows, SBFR with 80/20 seeding was able to reproduce all failures but three—the failures of PicoC and Lua. Grammar based test case generation using RND (80/20), conversely, was able to reproduce only one of the eleven failures—Bug3 of MDSL, which both SBFR and RND are able to reproduce with probability of 1.
After further investigation of the results, we discovered that this failure is relatively easy to reproduce, as it is triggered by an input MDSL program that calls a method on an undeclared object, which is automatically initialised to null. BugRedux was unable to reproduce any of the considered failures within 72 hours. With respect to the role of input seeding (RQ3), as can be seen from Table V, for two of the three bugs, the FRP of SBFR has significantly improved with the aid of the learned grammar, while RND is still not able to reproduce any of the three bugs using the stochastic grammar. As we showed in Table II, these three failures are particularly difficult to reproduce using initialization with the 80/20 rule. For instance, the failure in PicoC is a segmentation fault caused by an incorrect use of pointers. The test case that triggers the failure, from the original bug report, contains the following statements: int n =5; int *k; k = &n; **k = &n; In particular, the failure is caused by the last assignment statement. Assignment statements, especially those associated with complex expressions, involve deeply nested and recursive grammar definitions. As a result, generating such type of statements from a grammar using randomized techniques is quite difficult and the derivation process for such constructs either stops prematurely or explodes exponentially. With learning, these kinds of constructs can be easily generated in the initial population. GP operators can then make use of these basic constructs, by manipulating and exchanging them, to evolve the desired trees with the constructs necessary for reproducing the failure at hand. Let us consider a specific example in which SBFR is successful and RND fails. The failure in bc is a segmentation fault that happens when performing memory allocation under very specific circumstances [18]; the failure is triggered by an instruction sequence that allocates at least 32 arrays and declares a number of variables higher than the number of allocated arrays. SBFR successfully recreates the input sequence that leads to this failure, while RND is unable to reproduce the failure. In this case, coverage of the grammar productions, is not enough to reproduce the failure, as the failure requires an input with very specific characteristics. Conversely, by using the call sequence as a guidance during the search for test cases, SBFR can successfully reproduce the failure. 169 Lua Bug 1 is not reproduced by neither SBFR nor RND, even with learning. This bug involves specific invocations of built-in functions of the language (in particular, calls to print and load). Such types of input are very difficult to generate from the grammar alone, because they depend on the identifiers instantiating the grammar tokens, in addition to the input structure. We have currently implemented a simple token instantiation strategy in SBFR, where a pool of random token instances is first generated; later, during test case generation and evolution, only token instances from the pool are used for newly created tokens. While this strategy works well for bugs that involve only user defined functions, it fails when built-in or library functions must be called. We plan to extend SBFR with a token instantiation strategy that augments the pool with built-in and library identifiers, to be kept only if they contribute to increase the fitness values. VI. 
R ELATED W ORK In this section, we focus on two research topics that are closely related to our approach: test input generation and field failure reproduction. A. Test Input Generation Symbolic execution [20] is a systematic approach for generating test inputs that traverse as many different control flow paths as possible (all paths, asymptotically). The dramatic growth in the computational power of today’s computers, together with the availability of increasingly powerful decision procedures, has resulted in a renewed interest in using symbolic execution for test-input generation [21, 22, 23, 24]. Despite these recent advances, however, symbolic execution is still an inherently limited technique, mostly due to the path explosion problem (i.e., the virtually infinite number of paths in the code), the environment problem (i.e., the challenges involved with handling interactions between the code and its environment, such as external libraries), and the limitations of constraint solvers in handling complex constraints and theories. When considering RQ3, based on our results we can conclude that for cases where the grammar of the SUT contains complex structures, learning a stochastic grammar from a corpus of existing inputs can improve substantially the effectiveness of search based failurereproduction techniques. Application of symbolic execution to programs with complex, structured inputs that must adhere to a formal language specification is a more recent topic of investigation. Existing approaches to this problem (e.g., [16, 17]) rely on the creation of symbolic grammar tokens associated with the non terminals in the grammar productions (i.e., All tokens returned by the lexer are replaced by symbolic variables.). These approaches propagate such symbolic grammar tokens, collect path constraints on these symbolic grammar tokens, and finally solve these path constraints to generate concrete tokens as input. While promising, these approaches also suffer the same limitations of traditional symbolic execution. E. Threats to Validity The main threats to validity for our results are internal, construct, conclusion, and external validity threats [19]. Internal validity threats concern factors that may affect a dependent variable and were not considered in the study. In our case, different grammar based test case generators may have different failure reproduction performance. We have chosen RND (with 80/20 or learned stochastic grammar), since it is representative of state-of-the-art tools for random, grammarbased test case generation. Further experiments using other generators are necessary to increase our confidence on the results. A type of test input generation technique alternative to symbolic execution relies on search based algorithms, and in particular genetic algorithms (see Section II) [25]. The fitness function used in these approaches typically accounts for the degree of coverage achieved or for the distance from the coverage target of each test case (e.g., [26, 25]). The main limitation of search based techniques, when compared to symbolic execution, is that they can get stuck in local optima and can only be indirectly (i.e., through the fitness function) guided towards a target of interest. However, search based techniques have the great advantage that they can scale to much larger programs and are not affected by the environment problem or by the complexity of the path constraints. Construct validity threats concern the relationship between theory and observation. 
We have carefully chosen the experimental measures used to answer RQ1, RQ2 and RQ3. In particular, the metrics used in the evaluation (FRP, FIT, SZ, ZSZ, ET) are direct measures of the effects under investigation. Moreover, all these metrics have been measured objectively, using tools. Conclusion validity threats concern the relationship between the treatment (SBFR vs. grammar based test generation) and the measured performance. To address this threat, we have drawn conclusions only when performance differences were reported to be statistically significant at level 0.05 by the Wilcoxon test. There are also grammar based test data generation techniques which are mainly focused on systematic enumeration of sentences from a given grammatical specification (e.g., [27]). While these techniques may work relatively well for achieving a certain level of coverage (mainly of the grammar productions), their applicability for generating test cases that reach specific targets, as in field failure reproduction, is limited. External validity threats are related to the generalizability of the results. We considered five subjects and eleven failures, with three subjects involving a moderately complex grammar and two subjects involving a fairly complex grammar. Generalization to other subjects should be done with care, especially if the associated grammar is highly complex. We plan to replicate our experiment on more subjects to increase our confidence in the external validity and generalizability of our findings. B. Field Failure Reproduction One type of techniques that can be used for field failure reproduction relies on capturing and recording program behaviors by monitoring or sampling field executions (e.g., [5, 28, 6]). These techniques, however, tend to either record 170 too much information to be practical or too little information to be effective. is the limited information available in crash stacks. In our previous empirical study in BugRedux [3], we observed that it is difficult to reproduce system-level field failure using only information provided in crash stacks. In addition, the fitness function employed by RECORE is quite different from the one used in SBFR. RECORE’s fitness function relies on aligning the stack traces and minimizing the object distance computed from the values in the stack. For system-level failure reproduction, the guidance gained from the distances between the objects (which is normalized to a value in [0,1) and averaged over all variables) is (in our experience) minimal, while the exposure of sensitive user data from these variables could be generally unacceptable. While value distances could be easily integrated into SBFR’s fitness function (e.g., by collecting parameter values as part of the call sequences [36]), in our experience this does not result in a significant gain in failure reproduction power that justifies the potential exposure of sensitive data. For this reason, researchers started investigating more sophisticated approaches to reproduce field failures using more limited information. Some debugging techniques, for instance, leverage weakest-precondition computation to generate inputs that can trigger certain types of exceptions in Java programs (e.g., [29, 30, 31]). Although potentially promising, these approaches tend to handle limited types of exceptions and operate mostly at the module level. SherLog [32] and its followup work LogEnhancer [33] use runtime logs to reconstruct and infer paths close to logging statements to help developers identify bugs. 
These techniques have shown to be effective, but they aim to highlight potential faulty code, rather than synthesizing failing executions. ReCrash [8] records partial object states at the method level dynamically to recreate an observed crash. It inspects the call stack (collected upon crash) at different levels of stack depth and tries to call each method in the stack with parameters capable of reproducing the failure. Although this approach can help reproduce a field failure, it either captures large amounts of program states, which makes it impractical, or reproduces the crash in a shallow way, at the module or even method level, which has limited usefulness (e.g., making a method fail by calling it with a null parameter does not provide useful information for the developer, who is rather interested in knowing why a null value reached the method). VII. C ONCLUSIONS AND F UTURE W ORK We have presented SBFR, a technique that leverages genetic programming to generate complex and structured test inputs capable of reproducing failures observed in the field. SBFR evolves a population of candidate failure inducing inputs by means of genetic operators that manipulate parse-tree representations of the inputs. Evolution is guided by a fitness function that measures the distance between the execution trace (call sequence) of the observed failure and those of the generated test cases. Both ESD [34] and CBZ [35] leverage symbolic execution to generate program inputs that reproduce an observed field failure. Specifically, ESD aims at reaching the point of failure (POF), whereas CBZ improves ESD by reproducing executions that follow partial branch traces, where the relevant branches are identified by different static and dynamic analyses. However, as some of the authors have shown in a previous paper [3], POFs and partial traces are unlikely to be successful for some failures. In our empirical evaluation, SBFR widely outperformed random grammar based test case generation, as well as BugRedux, our field failure reproduction technique based on guided symbolic execution. For subjects with moderately complex grammars that describe the structured input, no stochastic grammar learning is needed to produce the initial population evolved by SBFR. For subjects involving more complex grammars (e.g., a program that accepts as input a large subset of the C language), our results show that the learning component in our approach can dramatically improve the effectiveness of SBFR. Overall, SBFR was able to successfully reproduce 10 out of the 11 failures considered, while a purely random technique was able to reproduce only 1 of the failures, and BugRedux none of them. Similar to ESD and CBZ, BugRedux [3] is a general approach for synthesizing, in-house, an execution that mimics an observed field failure. BugRedux implements a guided symbolic execution algorithm that aims at reaching a sequence of intermediate points in the execution. Although the empirical evaluation of BugRedux has shown that it can reproduce realworld field failures effectively and efficiently, given a suitable set of field execution data, the approach is based on symbolic execution and suffers from the inherent problems of these kinds of techniques (see Section VI-A). Moreover, these approaches do not rely on any grammar as input, which makes them ineffective on programs with complex, structured grammarbased input. 
In future work, we will (1) conduct additional empirical studies on programs with highly complex input grammars (e.g., JavaScript), (2) investigate selective instrumentation techniques to further reduce the overhead imposed by SBFR without degrading its performance, and (3) investigate hybrid approaches to field failures reproduction that combine the strengths of symbolic execution and genetic programming and can handle a broader class of programs than the two approaches in isolation. Another approach, RECORE [9], applies genetic algorithms to synthesize executions from crash call stacks. However, the current empirical evaluation of RECORE focuses on unit-level, partial executions (i.e., executions of standalone library classes), so it is unclear whether the approach would be able to reproduce complete, system-level executions. Failures in library classes usually result in shallow crash stacks, and in our experience execution synthesis approaches based on symbolic execution work also quite well in these cases. At the system level, a potential fundamental limitation of RECORE ACKNOWLEDGEMENTS This work was partially supported by NSF awards CCF1320783 CCF-1161821, CCF-0964647, and by funding from Google, IBM Research and Microsoft Research. 171 [20] J. C. King, “Symbolic Execution and Program Testing,” Communications of the ACM, vol. 19, no. 7, pp. 385–394, 1976. [21] C. Cadar, D. Dunbar, and D. Engler, “KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008, pp. 209–224. [22] W. Visser, C. S. Pǎsǎreanu, and S. Khurshid, “Test Input Generation with Java PathFinder,” SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 97–107, 2004. [23] K. Sen, D. Marinov, and G. Agha, “CUTE: A Concolic Unit Testing Engine for C,” in Proceedings of the 10th European Software Engineering Conference and 13th ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2005, pp. 263–272. [24] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed Automated Random Testing,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005, pp. 213–223. [25] P. McMinn, “Search-based software test data generation: a survey,” Softw. Test. Verif. Reliab., vol. 14, no. 2, pp. 105–156, 2004. [26] M. Harman and P. McMinn, “A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search,” IEEE Transactions on Software Engineering, vol. 36, no. 2, pp. 226–247, 2010. [27] R. Lämmel and W. Schulte, “Controllable combinatorial coverage in grammar-based testing,” in Testing of Communicating Systems, ser. Lecture Notes in Computer Science, M. Uyar, A. Duale, and M. Fecko, Eds. Springer Berlin / Heidelberg, 2006, vol. 3964, pp. 19–38. [28] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalable Statistical Bug Isolation,” in PLDI 2005, 2005, pp. 15–26. [29] S. Chandra, S. J. Fink, and M. Sridharan, “Snugglebug: A Powerful Approach to Weakest Preconditions,” in Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009, pp. 363–374. [30] M. G. Nanda and S. Sinha, “Accurate Interprocedural NullDereference Analysis for Java,” in Proceedings of the 31st International Conference on Software Engineering, 2009, pp. 133–143. [31] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. 
Stata, “Extended Static Checking for Java,” in Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2002, pp. 234–245. [32] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy, “SherLog: Error Diagnosis by Connecting Clues from Run-time Logs,” in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010, pp. 143–154. [33] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, “Improving Software Diagnosability via Log Enhancement,” in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 3–14. [34] C. Zamfir and G. Candea, “Execution Synthesis: A Technique for Automated Software Debugging,” in Proceedings of the 5th European Conference on Computer Systems, 2010, pp. 321–334. [35] O. Crameri, R. Bianchini, and W. Zwaenepoel, “Striking a New Balance Between Program Instrumentation and Debugging Time,” in Proceedings of the 6th European Conference on Computer Systems, 2011, pp. 199–214. [36] F. M. Kifetew, “A search-based framework for failure reproduction,” in Proceedings of the 4th international conference on Search Based Software Engineering. Springer-Verlag, 2012, pp. 279–284. R EFERENCES [1] F. M. Kifetew, W. Jin, R. Tiella, A. Orso, and P. Tonella, “SBFR: A search-based approach for reproducing failures of programs with grammar based input.” in Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013. [2] T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schröter, and C. Weiss, “What Makes a Good Bug Report?” IEEE Transactions on Software Engineering, vol. 36, no. 5, pp. 618– 643, Sep. 2010. [3] W. Jin and A. Orso, “Bugredux: Reproducing field failures for in-house debugging,” in Proc. of the 34th International Conference on Software Engineering (ICSE), 2012, pp. 474– 484. [4] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani, “H OLMES: Effective Statistical Debugging via Efficient Path Profiling,” in ICSE 2009, 2009, pp. 34–44. [5] J. Clause and A. Orso, “A Technique for Enabling and Supporting Debugging of Field Failures,” in ICSE 2007, 2007, pp. 261–270. [6] “The Amazing VM Record/Replay Feature in VMware Workstation˜6,” http://communities.vmware.com/community/vmtn/cto/steve/blog/ 2007/04/18/the-amazing-vm-recordreplay-feature-in-vmware-workstation-6, Apr. 2012. [7] L. Jiang and Z. Su, “Context-aware Statistical Debugging: From Bug Predictors to Faulty Control Flow Paths,” in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, 2007, pp. 184–193. [8] S. Artzi, S. Kim, and M. D. Ernst, “ReCrash: Making Software Failures Reproducible by Preserving Object States,” in Proceedings of the 22nd European Conference on Object-Oriented Programming, 2008, pp. 542–565. [9] J. Rößler, A. Zeller, G. Fraser, C. Zamfir, and G. Candea, “Reconstructing core dumps,” in Proc. of the 6th International Conference on Software Testing (ICST), 2013. [10] R. I. McKay, N. X. Hoai, P. A. Whigham, Y. Shan, and M. O’Neill, “Grammar-based genetic programming: a survey,” Genetic Programming and Evolvable Machines, vol. 11, no. 3-4, pp. 365–396, May 2010. [11] M. O’Neill and C. Ryan, “Grammatical evolution,” Evolutionary Computation, IEEE Transactions on, vol. 5, no. 4, pp. 349–358, Aug. 2001. [12] K. Lari and S. J. 
Young, “The estimation of stochastic contextfree grammars using the inside-outside algorithm,” Computer speech & language, vol. 4, no. 1, pp. 35–56, 1990. [13] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. MIT Press, 1990. [14] M. O’Neill, E. Hemberg, C. Gilligan, E. Bartley, J. McDermott, and A. Brabazon, “Geva: grammatical evolution in java,” ACM SIGEVOlution, vol. 3, no. 2, pp. 17–22, 2008. [15] D. Grune and C. J. H. Jacobs, Parsing techniques: a practical guide. Chichester, England: Ellis Horwood Limited, 1990. [16] R. Majumdar and R.-G. Xu, “Directed test generation using symbolic grammars,” in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2007, pp. 134–143. [17] P. Godefroid, A. Kiezun, and M. Y. Levin, “Grammar-based whitebox fuzzing,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2008, pp. 206–215. [18] S. Lu, Z. Li, F. Qin, L. Tan, P. Zhou, and Y. Zhou, “BugBench: Benchmarks for Evaluating Bug Detection Tools,” in Workshop on the Evaluation of Software Defect Detection Tools, 2005. [19] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering - An Introduction. Kluwer Academic Publishers, 2000. 172