Reproducing Field Failures for Programs with Complex Grammar Based Input

Fitsum Meshesha Kifetew (1), Wei Jin (2), Roberto Tiella (1), Alessandro Orso (2), Paolo Tonella (1)
(1) Fondazione Bruno Kessler, Trento, Italy
(2) Georgia Institute of Technology, USA
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract—To isolate and fix failures that occur in the field,
after deployment, developers must be able to reproduce and investigate such failures in-house. In practice, however, bug reports
rarely provide enough information to recreate field failures, thus
making in-house debugging an arduous task. This task becomes
even more challenging for programs whose input must adhere
to a formal specification, such as a grammar. To help developers
address this issue, we propose an approach for automatically
generating inputs that recreate field failures in-house. Given a
faulty program and a field failure for this program, our approach
exploits the potential of grammar-guided genetic programming
to iteratively find legal inputs that can trigger the observed
failure using a limited amount of runtime data collected in the
field. When applied to 11 failures of 5 real-world programs, our
approach was able to reproduce all but one of the failures while
imposing a limited amount of overhead.
(We presented an earlier version of this work as a short paper in the ASE 2013 New Ideas Track [1]. This paper provides a more detailed and extensive technical description of our approach and an empirical evaluation that demonstrates its effectiveness.)
I. INTRODUCTION
Software systems are increasingly complex and can be
used in unpredictable (and untested) ways. Hence, users experience field failures (failures occurring after deployment),
which should be reproduced and fixed as quickly as possible.
Field failure reproduction, however, is not an easy task for
developers [2], as bug reports provide limited information on
how the failure can be reproduced in the testing lab.
In previous work, we developed BugRedux, a general
approach for failure reproduction based on guided symbolic
execution [3]. BugRedux takes as input a sequence of key
intermediate points in the observed failing execution and
guides the symbolic execution towards these points until it
reaches the point of failure. In particular, empirical studies with
BugRedux show that (partial) call sequences provide enough
guidance for in-house synthesis of failing executions. Call
sequences usually do not introduce privacy problems and can
be gathered with limited overhead, so they represent a good
trade-off between the amount of information collected and the
effectiveness of field failure reproduction.
The key limitation of BugRedux is that it relies on symbolic execution, which is known to be ineffective on (1) programs with highly structured inputs, such as inputs that must adhere to a (non-trivial) grammar, (2) programs that interact with external libraries, and (3) large complex programs in general (e.g., programs that generate constraints that the constraint solver cannot handle).

This paper presents a failure reproduction technique, called SBFR (Search-Based Failure Reproduction), that is specifically designed to handle complex programs with highly structured inputs (e.g., compilers and interpreters), hence going beyond the features of the programs used in previous failure reproduction studies [3, 4, 5, 6, 7, 8, 9]. SBFR takes as input the failing program, a grammar describing the program input, and the (partial) call sequence for the failing execution, and uses genetic programming (a search-based optimisation algorithm) to generate failing inputs. Genetic programming relies on mutation and crossover operators to manipulate the parse tree of the structured program input, initially constructed by random application of grammar productions. Parse trees are evolved through mutation and crossover, so as to minimize a fitness function that measures the distance between the execution trace of each candidate input tree and the execution trace of the failing execution. The search stops when a parse tree is produced that successfully reproduces the target failing behavior.

To assess SBFR's efficiency and effectiveness, we developed a tool that implements our approach and used it to perform an empirical evaluation on 11 failures of 5 real-world programs. Our results are promising: whereas BugRedux could not reproduce any of the 11 failures considered, SBFR was able to reproduce 10 of the 11 failures, with a failure reproduction rate of 0.79 on average.

II. BACKGROUND
In this section we provide the background and terminology
necessary for describing our approach.
A. Evolutionary Search
Evolutionary search is an optimization heuristic that works
by maintaining a population of candidate solutions that are
evolved through generations via reproduction and mutation.
There are a number of ways in which this evolutionary scheme
can be implemented. Among them, the most dominant are
Genetic Algorithms (GA) and Genetic Programming (GP).
In a GA, candidate solutions are encoded into individuals
following various encoding schemes (e.g., binary strings).
Initially, a population of individuals is generated, usually randomly. The GA then assigns a fitness value to each individual
in the population by evaluating it using a fitness function—
a function that measures the “goodness” of each individual
with respect to the problem being solved. After performing
this evaluation, GA selects from the population pairs of
individuals (parents) with better fitness values and subjects
them to reproduction (crossover) to produce their offspring.
The GA may further subject the offspring to a process of
mutation, in which the individuals’ encoding is modified to
introduce diversity into the population. At this point, there is a
new generation of individuals that can further reproduce. This
process of evaluation, selection, and reproduction continues
until either the solution is found or a given stopping condition
(e.g., number of cycles) is reached.
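The following minimal Java sketch illustrates this evaluation-selection-reproduction loop (Java is used for all code sketches here); the Individual type and the select, crossover, and mutate operators are placeholders for problem-specific components, and the concrete settings shown are illustrative assumptions.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Minimal sketch of the evolutionary scheme described above; the concrete
// operators are problem-specific placeholders.
abstract class EvolutionarySearch<Individual> {
    private final Random rnd = new Random();

    abstract double fitnessOf(Individual ind);         // lower is better here
    abstract Individual select(List<Individual> pop);  // e.g., tournament selection
    abstract Individual crossover(Individual a, Individual b);
    abstract Individual mutate(Individual ind);

    Individual run(List<Individual> population, int maxGenerations) {
        for (int gen = 0; gen < maxGenerations; gen++) {
            // Stop as soon as a perfect individual (fitness 0) is found.
            for (Individual ind : population)
                if (fitnessOf(ind) == 0.0) return ind;
            List<Individual> next = new ArrayList<>();
            while (next.size() < population.size()) {
                Individual child = crossover(select(population), select(population));
                if (rnd.nextDouble() < 0.2) child = mutate(child); // mutation probability
                next.add(child);
            }
            population = next;  // the new generation replaces the old one
        }
        // Budget exhausted: return the best individual found so far.
        return population.stream().min(Comparator.comparingDouble(this::fitnessOf)).get();
    }
}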
GP [10] follows a similar process. However, the individuals
manipulated by the search are tree-structured data (programs,
in the GP terminology) rather than encodings of solution
instances. These complex data usually have a well-formed structure defined by a set of formal rules (e.g., a grammar).
While there are a number of variants of GP in the literature,
in this work we focus on Grammar Guided GP (GGGP),
and in particular on Grammatical Evolution (GE) [11]. In
GGGP, individuals are sentences generated according to the
formal rules prescribed by a grammar. Specifically, in the
case of GE, sentences are generated from a Context Free
Grammar (CFG), so that new individuals produced by the GP
search operators (crossover and mutation) are guaranteed to
be valid with respect to the associated grammar. In GP, the
initial population of individuals is generated from the grammar
following a number of techniques, mostly based on some form
of random grammar-based generation.
1) Input Representation: An individual (a sentence from the grammar) in the population is represented by its syntax tree (derivation tree). The tree is built through the process of derivation: starting from the root (start) symbol of the grammar, productions are applied to substitute non-terminal symbols, resulting eventually in a terminal string. The process is shown in Figure 1. Tree representation of individuals is appropriate, as the underlying search operators, described below, are based on sub-tree manipulation.

[Figure 1 about here: a sample expression grammar (with tokens such as id, num, sin, cos, tan) and the step-by-step derivation sequence of a sentence.]
Fig. 1. An example of a derivation. Each non-terminal is expanded by one of its productions, until the final string is composed of terminal symbols only.
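As an illustration, the following sketch builds a derivation tree by randomly expanding non-terminals up to a maximum depth, discarding the individual if the bound is reached before all non-terminals are substituted (see Section III-A); the map-based grammar representation is a simplifying assumption.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of random derivation: starting from a symbol, repeatedly apply
// productions until only terminals remain or a depth bound is hit.
class Derivation {
    static final Random RND = new Random();

    static class Node {
        final String symbol;
        final List<Node> children = new ArrayList<>();
        Node(String symbol) { this.symbol = symbol; }
    }

    // productions: non-terminal -> list of right-hand sides (symbol sequences)
    static Node derive(Map<String, List<List<String>>> productions,
                       String symbol, int depth, int maxDepth) {
        Node node = new Node(symbol);
        List<List<String>> rhss = productions.get(symbol);
        if (rhss == null) return node;        // terminal symbol: a leaf
        if (depth >= maxDepth) return null;   // bound hit: individual is discarded
        // Uniform choice of a production; Section III-A replaces this with
        // a probability-weighted choice.
        List<String> rhs = rhss.get(RND.nextInt(rhss.size()));
        for (String s : rhs) {
            Node child = derive(productions, s, depth + 1, maxDepth);
            if (child == null) return null;   // propagate the failure upward
            node.children.add(child);
        }
        return node;
    }

    // "Unparsing": the sentence is the left-to-right sequence of leaves.
    static void unparse(Node n, StringBuilder out) {
        if (n.children.isEmpty()) out.append(n.symbol).append(' ');
        for (Node c : n.children) unparse(c, out);
    }
}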
2) Evolution Operators: Evolution operators (crossover and mutation) play a crucial role in the evolutionary search process. Crossover and mutation could be performed in a number of different ways. In GP, sub-tree crossover and mutation are commonly used. Sub-tree crossover between two individuals is performed by exchanging two sub-trees, rooted at the same non-terminal, from their tree representations. Sub-tree crossover ensures that the newly formed individuals are well formed with respect to the underlying grammar. An example of sub-tree crossover is shown in Figure 2.

Fig. 2. Sub-tree crossover: sub-trees of the same type (in circles) from parents are exchanged to create children. See Figure 1 for the derivation process.

Similarly, sub-tree mutation on an individual is performed by replacing a sub-tree from its tree representation with a new sub-tree of the same type generated from the grammar. Figure 3 shows an example of sub-tree mutation.

Fig. 3. Sub-tree mutation: a sub-tree is replaced by a new sub-tree of the same type generated starting from the grammar. Child1 from Figure 2 is mutated.
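A sketch of the two operators on such derivation trees follows; it reuses the Node type from the previous sketch and assumes that candidate crossover and mutation points are sub-trees rooted at a common non-terminal.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of sub-tree crossover and mutation on derivation trees.
class SubtreeOperators {
    static final Random RND = new Random();

    // Collect all sub-trees rooted at the given non-terminal.
    static void collect(Derivation.Node n, String nonTerminal, List<Derivation.Node> out) {
        if (n.symbol.equals(nonTerminal) && !n.children.isEmpty()) out.add(n);
        for (Derivation.Node c : n.children) collect(c, nonTerminal, out);
    }

    // Sub-tree crossover: swap two sub-trees rooted at the same non-terminal,
    // so that both offspring remain well formed.
    static void crossover(Derivation.Node p1, Derivation.Node p2, String nonTerminal) {
        List<Derivation.Node> c1 = new ArrayList<>(), c2 = new ArrayList<>();
        collect(p1, nonTerminal, c1);
        collect(p2, nonTerminal, c2);
        if (c1.isEmpty() || c2.isEmpty()) return;   // no common crossover point
        Derivation.Node a = c1.get(RND.nextInt(c1.size()));
        Derivation.Node b = c2.get(RND.nextInt(c2.size()));
        List<Derivation.Node> tmp = new ArrayList<>(a.children);
        a.children.clear(); a.children.addAll(b.children);
        b.children.clear(); b.children.addAll(tmp);
    }

    // Sub-tree mutation: replace a random sub-tree with a freshly derived
    // tree of the same type (re-using Derivation.derive from above).
    static void mutate(Derivation.Node root, Map<String, List<List<String>>> grammar,
                       String nonTerminal, int maxDepth) {
        List<Derivation.Node> sites = new ArrayList<>();
        collect(root, nonTerminal, sites);
        if (sites.isEmpty()) return;
        Derivation.Node site = sites.get(RND.nextInt(sites.size()));
        Derivation.Node fresh = Derivation.derive(grammar, nonTerminal, 0, maxDepth);
        if (fresh == null) return;                  // derivation hit the depth bound
        site.children.clear();
        site.children.addAll(fresh.children);
    }
}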
B. Terminology
A field failure is a failure of a deployed program while
it executes on a user machine. We use the term execution
data to refer to any runtime information collected from a
program executing on a user machine. In particular, a call
sequence is a specific type of execution data that consists
of a (sub)sequence of functions (or methods) invoked during the execution of a program on a user machine. We define a field failure reproduction technique as a technique that can synthesize, given a program P, a field execution E of P that results in a failure F, and a set of execution data D for E, an in-house execution E′ as follows. First, E′ should result in a failure F′ that is analogous to F, that is, F′ has the same observable behavior as F. If F is the violation of an assertion at a given location in P, for instance, F′ should violate the same assertion at the same point. Second, E′ should be an actual execution of P, that is, the approach should be sound and generate an actual input that, when provided to P, results in E′. Finally, the approach should be able to generate E′ using only P and D, without the need for any additional information.
III. SEARCH-BASED FAILURE REPRODUCTION

SBFR's goal is to reproduce an observed (field) failure based on an (ideally minimal) set of runtime information about the failure, or execution data. As commonly done in these cases, SBFR collects execution data by instrumenting the software under test (SUT, hereafter) before deploying it to users. Upon failure, the execution data for the failing execution are used by SBFR to perform a GP-based search for inputs that can trigger the failure observed while the SUT was running on the user machine. This part of the approach would ideally be performed in the field (by leveraging free cycles on the user's machine), so as to send back only the generated inputs instead of the actual execution data.

As in our previous BugRedux work [3], execution data are used to provide guidance to the search. However, unlike BugRedux, SBFR performs the search for the failure-inducing input using evolutionary search, and the guidance is provided indirectly, through the use of the fitness function discussed in Section III-C. More precisely, the individuals in the GP search are candidate test inputs for the SUT, that is, structured input strings that adhere to a formal grammar. The search maintains a population of individuals, evaluates them by measuring how close they get to the desired solution using a fitness function, and evolves them via genetic operators (see Section III-B). If at least one candidate is able to trigger the desired failure, a solution is found and the search terminates. If no candidate solution is found after consuming the whole search budget, the search is deemed unsuccessful.

A. Seeding the Search with Representative Inputs

When SBFR generates the initial population, uniform selection of the grammar productions to apply tends to run into problems in the presence of recursive productions, which are quite frequent in commonly used grammars. In fact, repeated application of recursive productions (e.g., production 0 for <alt1*> in Figure 1) may result in a non-terminating derivation process. For practical purposes, the derivation process is continued until some maximum depth is reached. However, if the maximum depth is reached before substituting all non-terminals, the evolutionary search process discards the individual. If the application of recursive productions is not controlled, the individuals that are left after discarding those containing non-terminals tend to be associated with shallow derivation trees and short, non-representative strings.
To avoid this problem, the search can assign probabilities
to the productions in such a way that the recursive ones
are applied less frequently than the non-recursive ones. In
this way, the sentences generated in the initial population
will be of balanced depth and more representative than those
generated using a uniform random approach. A simple, yet
effective rule that we devised to achieve this result is called the
80/20 rule. According to this rule, a total probability of 0.2 is
uniformly distributed across all recursive productions, while a
total probability of 0.8 is uniformly distributed across the non-recursive productions. Indirect recursion is regarded as plain
recursion in the application of the 80/20 rule. With reference
to the example in Figure 1, production 0 (a recursive production) for non-terminal <alt1*> is assigned probability 0.2,
while production 1 (a non-recursive production) is assigned
probability 0.8.
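The rule can be implemented as a per-non-terminal weight table, as in the following sketch; the isRecursive flags are assumed to be precomputed from the grammar, with indirect recursion treated as recursion.

import java.util.List;
import java.util.Random;

// Sketch of the 80/20 rule: a total probability of 0.2 is split uniformly
// among the recursive productions of a non-terminal, and 0.8 among the
// non-recursive ones.
class EightyTwentyRule {
    static double[] weights(List<List<String>> rhss, boolean[] isRecursive) {
        int rec = 0;
        for (boolean b : isRecursive) if (b) rec++;
        int nonRec = rhss.size() - rec;
        double[] w = new double[rhss.size()];
        for (int i = 0; i < w.length; i++) {
            if (rec == 0 || nonRec == 0)       // all productions of one kind: uniform
                w[i] = 1.0 / w.length;
            else
                w[i] = isRecursive[i] ? 0.2 / rec : 0.8 / nonRec;
        }
        return w;
    }

    // Weighted choice of a production index, replacing the uniform choice
    // used in the derivation sketch of Section II.
    static int pick(double[] w, Random rnd) {
        double r = rnd.nextDouble(), acc = 0.0;
        for (int i = 0; i < w.length; i++) {
            acc += w[i];
            if (r < acc) return i;
        }
        return w.length - 1;                   // guard against rounding error
    }
}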
For complex grammars, evolving meaningful and representative sentences may be quite difficult for GP if the
initial population is not chosen carefully. One of the subject
programs used in our experimental study, PicoC, is a good
example of such a case. It is an interpreter for a subset of
the C programming language. The grammar used to generate
inputs for this program is fairly large (194 production rules)
with complex and highly recursive structures. If the initial
population consists of random sentences (obtained using the
80/20 rule), individuals will still be very different from real
C programs, being mostly limited to shallow structures, such
as paired braces. Common programming constructs, such as
assignment statements, will be very difficult to generate from
such a complex grammar using random techniques.
Instead of applying a set of production probabilities rigidly determined by the 80/20 rule, in these cases probabilities can be learned from examples of existing, human-written inputs. Namely, the stochastic grammar used for the generation of the initial population is learned from a corpus of well-formed, human-written sentences for the SUT. It is usually easy to obtain such a corpus for popular languages, such as C, since a huge amount of code is publicly available. Even if such a corpus is not publicly available, the manual test suite which is usually shipped with the SUT itself could provide a good starting point. In our experiments, for subjects with very complex grammars, we used the Inside-Outside algorithm [12] to learn the production probabilities from a corpus of input sentences. The Inside-Outside algorithm estimates inside probabilities (probabilities of generating a given terminal sequence from each non-terminal) and outside probabilities (probabilities of generating each such non-terminal as well as the terminals outside the given sequence) from a corpus. These probabilities determine the weights of the generative stochastic grammar. The Inside-Outside algorithm tolerates an arbitrary degree of grammar ambiguity, an important property for grammars originally used by parsers, such as those considered in our experiments.
B. Input Representation and Genetic Operators
Once the initial population is generated, either using the 80/20 rule or by learning probabilities as discussed in the previous subsection, the individuals are represented as syntax trees. The genetic operators manipulate these tree representations of the individuals.
We apply sub-tree crossover and mutation operations on the individuals (see Section II). We chose tree-based operations because they preserve the well-formedness of the resulting individuals: if both parents are well formed (according to the grammar), the offspring produced by sub-tree crossover are also going to be well formed. Similarly, sub-tree mutation of a well-formed individual results in a well-formed individual. The probabilities used in sub-tree mutation are the same as those used to generate the initial population (i.e., they are either determined by the 80/20 rule or they are learned from a corpus).
To evaluate the fitness of an individual, its tree representation is “unparsed” into a string representation, which is passed
to the SUT as input. Based on the execution of the SUT on
the input string, the fitness of the individual is computed.
C. Fitness Computation and Search Termination
SBFR evaluates candidate solutions based on the trace obtained when executing them against the SUT. To evaluate how good a candidate individual is, the instrumented SUT is executed using the candidate individual as input, resulting in a set of execution data. In this work, we consider execution data that consist of call sequences and refer to a call sequence using the term trajectory. More formally, we define a trajectory as a sequence T = ⟨c1, ..., cn⟩, where each ci is a function/method call. We made this choice of execution data because our findings in previous work show that call sequences provide the best tradeoff in terms of cost and benefit for synthesizing in-house executions [3]. Furthermore, anecdotal evidence from manually checking the call sequences collected in our empirical study suggests that call sequences are unlikely to reveal sensitive or confidential information about the original execution.
In SBFR, we use a fitness function to compare how
“similar” the trajectory of a candidate individual is to the
trajectory of the failing execution obtained from the field. This
comparison is implemented using sequence alignment between
the two trajectories. That is, we propose a fitness function based
on the distance between the trajectory of the failing execution
and the trajectory of a candidate individual. Hence, our GP
approach tries to minimize this distance with the objective of
finding individuals that generate trajectories identical to that
of the failing execution. The distance between two trajectories
T1 and T2 can be defined as:
distance(T1, T2) = |T1| + |T2| − 2 · |LCS(T1, T2)|    (1)
where LCS stands for Longest Common Subsequence [13], and |T| is the length of the trajectory T.

For instance, T1 = ⟨f, g, g, h, m, n⟩ and T2 = ⟨f, g, h, m, m⟩ have LCS = ⟨f, g, h, m⟩. Hence, their distance is 6 + 5 − 8 = 3, which corresponds to the number of calls that appear only in T1 (the second g, and n) or only in T2 (the second m).
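The distance can be computed with the classic dynamic-programming formulation of the LCS, as in this sketch, which reproduces the example above:

// Sketch of the trajectory distance of Equation (1): classic dynamic
// programming for the Longest Common Subsequence over call sequences.
class TrajectoryDistance {
    static int lcs(String[] t1, String[] t2) {
        int[][] d = new int[t1.length + 1][t2.length + 1];
        for (int i = 1; i <= t1.length; i++)
            for (int j = 1; j <= t2.length; j++)
                d[i][j] = t1[i - 1].equals(t2[j - 1])
                        ? d[i - 1][j - 1] + 1
                        : Math.max(d[i - 1][j], d[i][j - 1]);
        return d[t1.length][t2.length];
    }

    static int distance(String[] t1, String[] t2) {
        return t1.length + t2.length - 2 * lcs(t1, t2);
    }

    public static void main(String[] args) {
        // The example from the text: distance = 6 + 5 - 2*4 = 3.
        String[] t1 = {"f", "g", "g", "h", "m", "n"};
        String[] t2 = {"f", "g", "h", "m", "m"};
        System.out.println(distance(t1, t2)); // prints 3
    }
}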
The fitness value of an individual in the GP search is hence computed as the distance between its trajectory and the target trajectory, using Equation (1). The fitness value is then minimized by the search, with the ultimate objective of producing individuals that reproduce the desired failure.
The search stops when a desired solution is found or the search budget, expressed as the maximum number of fitness evaluations, is exhausted. If successful, the search will produce an individual (i.e., an input) that causes the program to follow a trajectory similar to that of the observed failure, reach the point of failure, and fail at that point with the same observable behavior as the original failure (details on how our current implementation actually assesses whether the reproduced failing behavior matches the observed one are provided in the next section).

IV. IMPLEMENTATION

Figure 4 shows an overall view of the prototype tool that implements SBFR. The tool consists of three main modules: GP Search, Instrumenter, and Learner. The GP Search component performs the evolutionary search, eventually producing a test case. To evaluate individuals in the search, it uses a SUT Runner component, which is a simple wrapper that executes the individual with a given timeout and returns the execution trace together with the exit status of the execution (including possible error messages). In cases where learning is employed, the Learner produces a stochastic CFG (see Sections II and III-A) starting from the SUT's input grammar and a corpus.

Fig. 4. SBFR prototype: GP Search performs the evolutionary search guided by the Target trajectory. Fitness evaluation is performed by executing an individual via a SUT Runner component that runs the SUT with a timeout. The execution data and the exit status are returned to the search component for fitness evaluation. In cases where learning is applied, the grammar (in BNF) is augmented with probabilities learned from a corpus by the Learner.

We have implemented the core evolutionary search component of our failure reproduction framework based on GEVA [14], a general-purpose GE tool written in Java. GEVA provides the necessary infrastructure, such as representation of individuals, basic GP operators (e.g., sub-tree crossover and mutation), and the general functionality to manage the overall search process. On top of this infrastructure, we have implemented customized operators for stochastic initialization and fitness evaluation, which are central to our proposed failure-reproduction scheme.

Fitness evaluation is performed by executing the instrumented version of the SUT externally using the SUT Runner, with the string representation of an individual as input. When the execution terminates, its trajectory is returned to the search component (GEVA), which computes the distance between the trajectory of the individual and that of the target. Since the sentences generated may contain constructs that lead the SUT to non-terminating executions, the Runner executes the SUT with a timeout.
The major computational cost associated with SBFR is
the cost of the fitness evaluation. To reduce this cost, our
tool uses caching, which minimizes the fitness-evaluation cost
by avoiding the re-execution of previously evaluated inputs.
Consequently, the search budget is computed as the total
number of unique fitness evaluations.
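A minimal sketch of this caching scheme follows; runSutAndComputeDistance stands in for the actual execution of the instrumented SUT and the application of Equation (1):

import java.util.HashMap;
import java.util.Map;

// Sketch of fitness caching: previously evaluated inputs are looked up by
// their unparsed string, so only unique inputs consume the search budget.
class CachedFitness {
    private final Map<String, Integer> cache = new HashMap<>();
    private int uniqueEvaluations = 0;          // this counter is the consumed budget

    int fitness(String input) {
        Integer cached = cache.get(input);
        if (cached != null) return cached;      // re-execution avoided
        int value = runSutAndComputeDistance(input);
        uniqueEvaluations++;
        cache.put(input, value);
        return value;
    }

    int consumedBudget() { return uniqueEvaluations; }

    // Placeholder: run the instrumented SUT and apply Equation (1).
    private int runSutAndComputeDistance(String input) { return 0; }
}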
To determine whether an individual triggers a failure analogous to the one observed in the field, our tool proceeds as follows. For each candidate input, the error/exception possibly generated while executing the SUT is compared with that of the reported failure. The SUT Runner performs this comparison on the error messages and on the locations where the errors manifest themselves, and returns an Exit Status indicating success if the two failures match.
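The check can be sketched as a comparison of the two failure signatures; the Failure type and its fields are illustrative assumptions:

// Sketch of the failure-match check described above: two failures are
// considered analogous when both the error message and the location where
// the error manifests itself coincide.
class FailureMatcher {
    static class Failure {
        final String message;   // e.g., assertion text or signal name
        final String location;  // e.g., source file and line of the failure point
        Failure(String message, String location) {
            this.message = message;
            this.location = location;
        }
    }

    static boolean matches(Failure observed, Failure reproduced) {
        return observed.message.equals(reproduced.message)
            && observed.location.equals(reproduced.location);
    }
}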
The Instrumenter module adds software probes to the SUT at compile time, for collecting call sequences when the SUT is executed. We implemented two versions of this module, one for C programs (based on the LLVM compiler infrastructure, http://llvm.org) and the other for Java programs (based on the Javassist bytecode-manipulation library, http://www.csg.is.titech.ac.jp/~chiba/javassist/). As a result, the instrumented version of the SUT can output the dynamic call sequence for the given input.
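For the Java case, the probe insertion can be sketched with Javassist as follows; the Tracer helper (which buffers call names in memory and dumps them upon crash) is an assumed component, not part of Javassist:

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

// Minimal sketch of call-sequence instrumentation with Javassist: a probe
// is inserted at the entry of every declared method.
class Instrumenter {
    static void instrument(String className) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass cc = pool.get(className);
        for (CtMethod m : cc.getDeclaredMethods()) {
            // The probe records the fully qualified method name; Tracer is
            // an assumed helper, not a Javassist class.
            m.insertBefore("Tracer.enter(\"" + cc.getName() + "." + m.getName() + "\");");
        }
        cc.writeFile();  // write the instrumented .class file back to disk
    }
}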
Note that the search is completely language agnostic.
Handling SUTs developed in another language L would simply
amount to developing an instrumentation tool for L and giving
the SUT Runner the ability to run programs written in L,
so that call sequences can be collected during execution. The
core components of the failure reproduction framework would
remain unchanged.
The Learner component, which generates a stochastic CFG from a CFG and a corpus of sample inputs, is an extension of a C implementation of the inside-outside algorithm (http://web.science.mq.edu.au/~mjohnson/Software.htm), which takes as input a grammar and a set of sentences and produces as output a probability for each rule in the grammar. To be accepted by the implementation of the inside-outside algorithm that we use, a grammar has to be transformed into a weakly equivalent grammar (i.e., a grammar that generates the same language) with all terminals introduced by unary rules only and all empty rules removed. We implemented a tool that performs this transformation using an existing approach [15].
V. EMPIRICAL EVALUATION

The main goal of our empirical evaluation is to assess the effectiveness and practical applicability of our SBFR approach for programs with structured and complex input. To achieve this goal, we performed a study on several real-world programs and real failures for these programs. Specifically, we investigated the following research questions:

• RQ1: How effective is SBFR in reproducing real field failures for programs with structured input?

• RQ2: What is the performance overhead imposed by the instrumentation required for failure reproduction?

• RQ3: What is the role of input seeding in search-based failure reproduction?

In the rest of this section, we present the subject programs and failures that we used for our experiments, illustrate our experiment protocol, and discuss our results and the possible threats to their validity.

In our empirical evaluation of SBFR, we consider eleven failures from five grammar-based programs. We selected these programs because they are representative of the kind of programs we target and their grammars are available. As our approach deals with grammar-based programs, the corresponding grammars are generally available with the program itself. Even so, some work may still be necessary, for example to convert the available grammar into a format (BNF) accepted by our tool (this task is usually easy to automate).

A. Program Subjects

Table I presents a summary of the subject programs used in our experimental study. Calc (https://github.com/cmhulett/ANTLR-java-calculator/) is an expression evaluator that accepts an input language including variable declarations and arbitrary expressions. bc (http://www.gnu.org/software/bc/) is a command-line calculator commonly found in Linux/Unix systems. MDSL (http://mdsl.sourceforge.net/) is an interpreter for the Minimalistic Domain Specific Language (MDSL), a language including programming constructs such as functions, loops, conditionals, etc. PicoC (https://code.google.com/p/picoc/) is an interpreter for a subset of the C language. Lua (http://www.lua.org) is an interpreter for the Lua scripting language. Calc and MDSL are developed in Java based on the ANTLR parser generator. bc is developed in C based on the Lex/Yacc parser generator tools. PicoC and Lua are developed in C, but do not rely on a parser generator. We defined a BNF grammar for PicoC based on an existing C grammar suitably reduced to the subset of C accepted by PicoC. We also defined a BNF grammar for Lua based on the semi-formal specification of the language provided on the official website.

TABLE I. SUBJECTS USED IN OUR EXPERIMENTAL STUDY.

Name     Language    Size (KLOC)    # Productions    # Faults
Calc     Java        2              38               2
bc       C           12             80               1
MDSL     Java        13             140              5
PicoC    C           11             194              1
Lua      C           17             106              2
Table I reports the number of productions in the grammar
of each application, ranging from 38, for Calc, to 194, for
PicoC. These grammars are fairly large and complex, and
bigger than those typically found in the GP literature. Even
if the subject programs are not necessarily large in terms of
LOCs, they are quite challenging for input-generation techniques. In particular, we also tried to apply vanilla BugRedux
to reproduce the same field failures. However, BugRedux failed
to generate any input for all faults considered in this study
after 72 hours. The reason for BugRedux's ineffectiveness is
that the current implementation does not leverage the grammar
information (as done e.g. with “symbolic tokens” [16, 17]) and
the guided symbolic execution search gets stuck in the lexical
analysis functions because these functions usually contain a
huge number of paths.
Table I reports the number of faults (equal to the number
of failures) considered for each subject. The faults in bc,
PicoC, and Lua have been selected from their respective
bug tracking systems and affect the latest versions of the
programs. For instance, the bc bug crashes the bc program
deployed with most modern Linux systems. The bugs for
Calc and MDSL have been discovered by the authors while
investigating the programs in a different work. Each fault
causes a crashing failure, that is, a failure that results in the
unexpected termination of the program. The execution data
used to guide the search in SBFR are generated by test cases
that expose these failures, and thus simulate the occurrence of
a field failure.
B. Experiment Protocol
We evaluated the effectiveness of SBFR using random
grammar-based test case generation as a baseline. Specifically,
we implemented a random generation technique (RND hereafter) that applies the 80/20 rule (discussed in Section III-A).
RND generates a new input from the grammar based on the
80/20 rule and executes the SUT with that input. If the input
triggers the desired failure, a solution is found and RND
stops. If the input does not trigger the failure, another input is
generated and evaluated. This process continues until either a
solution is found or the search budget is exhausted.
When seeding is employed (related to RQ3), for each of the considered subjects we used the stochastic grammar learned from a corpus of human-written tests (see Section IV) to generate inputs. Hence, in SBFR the initial population is generated from the stochastic grammar, rather than using the 80/20 rule. Similarly, in the case of RND, inputs are generated from the stochastic grammar.

Since both SBFR and RND involve non-deterministic actions, we ran both SBFR and RND 10 times for every failure considered. For each such run, we recorded whether or not the failure was reproduced and, if reproduced, how much of the search budget was consumed to reproduce it. If the failure was not reproduced after consuming the entire budget, the search was deemed unsuccessful. We calculate the failure reproduction probability (FRP) as the number of runs that reproduced the failure divided by the total number of runs (i.e., 10) for each subject. For example, using SBFR with the 80/20 rule, we reproduced Calc Bug 1 in 6 runs out of 10, hence the FRP is 0.6.

When there was no statistically significant difference in FRP, we measured a secondary effectiveness indicator, which accounts for the computational cost incurred by each technique to achieve the measured FRP: the number of fitness evaluations (FIT). Fitness evaluation represents the main computational cost for both SBFR and RND, and largely dominates all other computational costs. Therefore, we used FIT as an indicator of the cost of failure reproduction and measured it to assess whether SBFR offers any cost saving as compared to RND when both achieve the same FRP. In our experiments, this happened only for one of the bugs (discussed below in Section V-C).

We also measured the execution time of the SUT before and after instrumentation, so as to determine the time overhead imposed on the end user. Specifically, we ran all test cases available for each subject used in our experimental study and measured the associated execution time with (ET') and without (ET) instrumentation. The percentage increment of the test suite execution time is used to quantify the overhead introduced by the instrumentation. We also measure the size (SZ) of the trace files used to store the call sequences associated with failing executions, so as to assess the space overhead imposed by SBFR. We consider the size of the trace files both before and after compressing them (ZSZ), as in practice such data can be stored (and transferred over networks) in compressed format.

As there are several parameters that control the search process, we performed sensitivity analysis to determine appropriate values for the dominant search parameters in our experiments. The values we used are: population size of 500; crossover probability of 0.8; mutation probability of 0.2; three-way tournament selection, preserving the elite; total search budget of 10,000 unique fitness evaluations.
C. Results
Table II presents the results of our empirical evaluation.
For each bug, the table reports FRP for both SBFR and RND,
together with the results of the Wilcoxon statistical test of
significance. For Bug3 of MDSL, both SBFR and RND are able
to reproduce the failure, so we further performed a comparison
of the search budget consumed by each to reproduce the failure
(the FIT metric, discussed above). However, a Wilcoxon test (p-value 0.4813) shows that there is no significant difference in
the consumption of search budget either.
TABLE II. FAILURE REPRODUCTION PROBABILITY FOR RND AND SBFR. P-VALUES WHICH ARE STATISTICALLY SIGNIFICANT ARE SHOWN IN BOLDFACE.

Subject:Bug    FRP (RND)    FRP (SBFR)    p-value
Calc Bug1      0.0          0.6           0.005016
Calc Bug2      0.0          0.8           0.00044
bc             0.0          1.0           1.59E-005
MDSL Bug1      0.0          1.0           1.59E-005
MDSL Bug2      0.0          1.0           1.59E-005
MDSL Bug3      1.0          1.0           NA
MDSL Bug4      0.0          1.0           1.59E-005
MDSL Bug5      0.0          1.0           1.59E-005
PicoC          0.0          0.0           NA
Lua Bug 1      0.0          0.0           NA
Lua Bug 2      0.0          0.0           NA
Table V presents the results of stochastic grammar learning, used to generate the test cases of RND and to seed the initial population of SBFR. We consider only bugs that could not be reproduced by SBFR when the 80/20 rule is used for the initialisation (see Table II). As the table shows, the FRP for SBFR is significantly higher than that of RND for two of the three bugs considered. For the third bug (Lua Bug 1), both SBFR and RND are unable to reproduce the failure.

TABLE V. FAILURE REPRODUCTION PROBABILITY (FRP) FOR SBFR AND RND WITH INITIALIZATION USING THE LEARNED STOCHASTIC GRAMMAR, RATHER THAN THE 80/20 RULE.

Subject:Bug    FRP (RND)    FRP (SBFR)    p-value
PicoC          0.1          0.8           0.0025
Lua Bug 1      0.0          0.0           NA
Lua Bug 2      0.0          0.5           0.01365

Based on the results we obtained, we can answer RQ1 and state that SBFR is effective in reproducing real field failures for programs with structured, grammar-based input, while RND and BugRedux are not.

Table III shows the size of the execution trace collected for each crash before and after compression with the zip utility. As can be seen from the table, the size of the traces, especially after compression, is almost negligible.

TABLE III. UNCOMPRESSED (SZ) AND COMPRESSED (ZSZ) EXECUTION TRACE SIZE.

Subject      SZ (kb)    ZSZ (kb)
Calc Bug1    3.50       0.49
Calc Bug2    1.60       0.40
bc           12.00      0.46
MDSL Bug1    1.20       0.56
MDSL Bug2    1.30       0.56
MDSL Bug3    2.90       0.60
MDSL Bug4    3.30       0.68
MDSL Bug5    0.66       0.46
PicoC        8.40       0.55
Lua Bug 1    75.92      2.37
Lua Bug 2    62.40      1.77

With respect to the overhead imposed by SBFR, which is the topic of RQ2, Table III shows that the size of the collected execution data (call sequences, in this case) is very small. In the worst case, for Lua Bug 1, the size of the uncompressed trace is 75.92Kb (2.37Kb compressed). Overall, on average, the size of the uncompressed trace is 15.74Kb, while the average compressed trace size is 0.8Kb.
Table IV reports execution times with and without instrumentation. The execution time overhead ranges between 2.8%
and 16.4%. Results have been obtained using an implementation that relies on buffering to minimize the number of disk
writes. In practice, the size of the execution data, consisting
only of call sequences, is usually small enough to be kept
entirely in memory during a program execution. Since a trace
is dumped to file only upon crash, for normal (non-failing)
executions the entire trace can be kept in memory, and no
disk write operation is required.
TABLE IV. TEST SUITE EXECUTION TIME BEFORE AND AFTER INSTRUMENTATION.

Subject    ET (sec)    ET' (sec)    ΔET %
Calc       4.28        4.47         4.4%
bc         7.57        8.81         16.4%
MDSL       15.97       16.64        4.2%
PicoC      1.00        1.11         11%
Lua        1.38        1.42         2.8%
As Table IV shows, the execution time overhead imposed by SBFR's instrumentation (added to collect dynamic execution traces) is also acceptable for all five subjects considered, with an average overhead of about 8%. Moreover, we expect these results to represent worst-case scenarios, for several reasons. First, all of these applications are processing-intensive applications with no interaction with the user. The overhead would typically decrease dramatically in the case of interactive applications, for which idle time is dominant. Second, these are for the most part very short executions, where the fixed cost of the instrumentation's initialization is not amortized over time. Third, it is always possible to sample and collect only partial execution data [3]. Finally, we use an unoptimized instrumentation; in our experience, sophisticated instrumentation can considerably reduce the time overhead imposed on the instrumented programs. Nevertheless, we plan to further reduce the overhead of collecting field data by incorporating different sophisticated techniques, including the ones mentioned above.
In summary, we can answer RQ2 in a positive way. According to our results, SBFR imposes almost negligible
space overhead and acceptable time overhead in all cases
considered.
D. Discussion
As Table II shows, SBFR with 80/20 seeding was able to reproduce all failures but three: the failures of PicoC and Lua. Grammar-based test case generation using RND (80/20), conversely, was able to reproduce only one of the eleven failures, Bug3 of MDSL, which both SBFR and RND are able to reproduce with a probability of 1. After further investigation of the results, we discovered that this failure is relatively easy to reproduce, as it is triggered by an input MDSL program that calls a method on an undeclared object, which is automatically initialised to null. BugRedux was unable to reproduce any of the considered failures within 72 hours.
With respect to the role of input seeding (RQ3), as can be seen from Table V, for two of the three bugs the FRP of SBFR has significantly improved with the aid of the learned grammar, while RND is still not able to reproduce any of the three bugs using the stochastic grammar. As we showed in Table II, these three failures are particularly difficult to reproduce using initialization with the 80/20 rule. For instance, the failure in PicoC is a segmentation fault caused by an incorrect use of pointers. The test case that triggers the failure, from the original bug report, contains the following statements: int n = 5; int *k; k = &n; **k = &n; In particular, the failure is caused by the last assignment statement. Assignment statements, especially those associated with complex expressions, involve deeply nested and recursive grammar definitions. As a result, generating such types of statements from a grammar using randomized techniques is quite difficult, and the derivation process for such constructs either stops prematurely or explodes exponentially. With learning, these kinds of constructs can be easily generated in the initial population. GP operators can then make use of these basic constructs, by manipulating and exchanging them, to evolve the desired trees with the constructs necessary for reproducing the failure at hand.
Let us consider a specific example in which SBFR is
successful and RND fails. The failure in bc is a segmentation
fault that happens when performing memory allocation under
very specific circumstances [18]; the failure is triggered by
an instruction sequence that allocates at least 32 arrays and
declares a number of variables higher than the number of allocated arrays. SBFR successfully recreates the input sequence
that leads to this failure, while RND is unable to reproduce the
failure. In this case, coverage of the grammar productions is not enough to reproduce the failure, as the failure requires an input with very specific characteristics. Conversely, by using the call sequence as guidance during the search for test cases,
SBFR can successfully reproduce the failure.
Lua Bug 1 is not reproduced by either SBFR or RND, even with learning. This bug involves specific invocations
of built-in functions of the language (in particular, calls to
print and load). Such types of input are very difficult to
generate from the grammar alone, because they depend on the
identifiers instantiating the grammar tokens, in addition to the
input structure. We have currently implemented a simple token
instantiation strategy in SBFR, where a pool of random token
instances is first generated; later, during test case generation
and evolution, only token instances from the pool are used for
newly created tokens. While this strategy works well for bugs
that involve only user-defined functions, it fails when built-in or library functions must be called. We plan to extend SBFR with a token instantiation strategy that augments the pool with built-in and library identifiers, to be kept only if they contribute to increasing the fitness values.
When considering RQ3, based on our results we can conclude that for cases where the grammar of the SUT contains complex structures, learning a stochastic grammar from a corpus of existing inputs can substantially improve the effectiveness of search-based failure-reproduction techniques.

E. Threats to Validity

The main threats to validity for our results are internal, construct, conclusion, and external validity threats [19].

Internal validity threats concern factors that may affect a dependent variable and were not considered in the study. In our case, different grammar-based test case generators may have different failure reproduction performance. We have chosen RND (with the 80/20 or learned stochastic grammar), since it is representative of state-of-the-art tools for random, grammar-based test case generation. Further experiments using other generators are necessary to increase our confidence in the results.

Construct validity threats concern the relationship between theory and observation. We have carefully chosen the experimental measures used to answer RQ1, RQ2, and RQ3. In particular, the metrics used in the evaluation (FRP, FIT, SZ, ZSZ, ET) are direct measures of the effects under investigation. Moreover, all these metrics have been measured objectively, using tools.

Conclusion validity threats concern the relationship between the treatment (SBFR vs. grammar-based test generation) and the measured performance. To address this threat, we have drawn conclusions only when performance differences were reported to be statistically significant at level 0.05 by the Wilcoxon test.

External validity threats are related to the generalizability of the results. We considered five subjects and eleven failures, with three subjects involving a moderately complex grammar and two subjects involving a fairly complex grammar. Generalization to other subjects should be done with care, especially if the associated grammar is highly complex. We plan to replicate our experiment on more subjects to increase our confidence in the external validity and generalizability of our findings.

VI. RELATED WORK
In this section, we focus on two research topics that are
closely related to our approach: test input generation and field
failure reproduction.
A. Test Input Generation
Symbolic execution [20] is a systematic approach for
generating test inputs that traverse as many different control
flow paths as possible (all paths, asymptotically). The dramatic
growth in the computational power of today’s computers,
together with the availability of increasingly powerful decision procedures, has resulted in a renewed interest in using
symbolic execution for test-input generation [21, 22, 23, 24].
Despite these recent advances, however, symbolic execution is
still an inherently limited technique, mostly due to the path
explosion problem (i.e., the virtually infinite number of paths
in the code), the environment problem (i.e., the challenges
involved with handling interactions between the code and its
environment, such as external libraries), and the limitations of
constraint solvers in handling complex constraints and theories.
Application of symbolic execution to programs with complex, structured inputs that must adhere to a formal language specification is a more recent topic of investigation. Existing approaches to this problem (e.g., [16, 17]) rely on the creation of symbolic grammar tokens associated with the non-terminals in the grammar productions (i.e., all tokens returned by the lexer are replaced by symbolic variables). These approaches propagate such symbolic grammar tokens, collect path constraints on them, and finally solve these path constraints to generate concrete tokens as input. While promising, these approaches also suffer from the same limitations as traditional symbolic execution.
An alternative type of test input generation technique relies on search-based algorithms, and in particular genetic algorithms (see Section II) [25]. The fitness function used in these approaches typically accounts for the degree of coverage achieved or for the distance from the coverage target of each test case (e.g., [26, 25]). The main limitation of search-based techniques, when compared to symbolic execution, is that they can get stuck in local optima and can only be indirectly (i.e., through the fitness function) guided towards a target of interest. However, search-based techniques have the great advantage that they can scale to much larger programs and are not affected by the environment problem or by the complexity of the path constraints.
There are also grammar-based test data generation techniques that are mainly focused on systematic enumeration of sentences from a given grammatical specification (e.g., [27]). While these techniques may work relatively well for achieving a certain level of coverage (mainly of the grammar productions), their applicability for generating test cases that reach specific targets, as in field failure reproduction, is limited.
B. Field Failure Reproduction
One type of technique that can be used for field failure reproduction relies on capturing and recording program behaviors by monitoring or sampling field executions (e.g., [5, 28, 6]). These techniques, however, tend to either record too much information to be practical or too little information to be effective.
For this reason, researchers started investigating more sophisticated approaches to reproduce field failures using more
limited information. Some debugging techniques, for instance,
leverage weakest-precondition computation to generate inputs
that can trigger certain types of exceptions in Java programs
(e.g., [29, 30, 31]). Although potentially promising, these
approaches tend to handle limited types of exceptions and operate mostly at the module level. SherLog [32] and its follow-up work LogEnhancer [33] use runtime logs to reconstruct and infer paths close to logging statements to help developers identify bugs. These techniques have been shown to be effective, but they aim to highlight potentially faulty code, rather than
synthesizing failing executions.
ReCrash [8] records partial object states at the method
level dynamically to recreate an observed crash. It inspects
the call stack (collected upon crash) at different levels of
stack depth and tries to call each method in the stack with
parameters capable of reproducing the failure. Although this
approach can help reproduce a field failure, it either captures
large amounts of program states, which makes it impractical,
or reproduces the crash in a shallow way, at the module or
even method level, which has limited usefulness (e.g., making
a method fail by calling it with a null parameter does not
provide useful information for the developer, who is rather
interested in knowing why a null value reached the method).
Both ESD [34] and CBZ [35] leverage symbolic execution
to generate program inputs that reproduce an observed field
failure. Specifically, ESD aims at reaching the point of failure
(POF), whereas CBZ improves ESD by reproducing executions
that follow partial branch traces, where the relevant branches
are identified by different static and dynamic analyses. However, as some of the authors have shown in a previous
paper [3], POFs and partial traces are unlikely to be successful
for some failures.
Similar to ESD and CBZ, BugRedux [3] is a general
approach for synthesizing, in-house, an execution that mimics
an observed field failure. BugRedux implements a guided
symbolic execution algorithm that aims at reaching a sequence
of intermediate points in the execution. Although the empirical
evaluation of BugRedux has shown that it can reproduce real-world field failures effectively and efficiently, given a suitable
set of field execution data, the approach is based on symbolic
execution and suffers from the inherent problems of these kinds
of techniques (see Section VI-A). Moreover, these approaches
do not rely on any grammar as input, which makes them
ineffective on programs with complex, structured grammar-based input.
Another approach, RECORE [9], applies genetic algorithms to synthesize executions from crash call stacks. However, the current empirical evaluation of RECORE focuses
on unit-level, partial executions (i.e., executions of standalone
library classes), so it is unclear whether the approach would be
able to reproduce complete, system-level executions. Failures
in library classes usually result in shallow crash stacks, and
in our experience execution synthesis approaches based on
symbolic execution also work quite well in these cases. At the
system level, a potential fundamental limitation of RECORE
is the limited information available in crash stacks. In our previous empirical study with BugRedux [3], we observed that it is difficult to reproduce system-level field failures using only the information provided in crash stacks. In addition, the fitness function employed by RECORE is quite different from the one used in SBFR. RECORE's fitness function relies on aligning the stack traces and minimizing the object distance computed from the values in the stack. For system-level failure reproduction, the guidance gained from the distances between the objects (which is normalized to a value in [0,1) and averaged over all variables) is (in our experience) minimal, while the exposure of sensitive user data from these variables could be generally unacceptable. While value distances could be easily integrated into SBFR's fitness function (e.g., by collecting parameter values as part of the call sequences [36]), in our experience this does not result in a gain in failure reproduction power significant enough to justify the potential exposure of sensitive data.

VII. CONCLUSIONS AND FUTURE WORK

We have presented SBFR, a technique that leverages genetic programming to generate complex and structured test inputs capable of reproducing failures observed in the field. SBFR evolves a population of candidate failure-inducing inputs by means of genetic operators that manipulate parse-tree representations of the inputs. Evolution is guided by a fitness function that measures the distance between the execution trace (call sequence) of the observed failure and those of the generated test cases.

In our empirical evaluation, SBFR widely outperformed random grammar-based test case generation, as well as BugRedux, our field failure reproduction technique based on guided symbolic execution. For subjects with moderately complex grammars that describe the structured input, no stochastic grammar learning is needed to produce the initial population evolved by SBFR. For subjects involving more complex grammars (e.g., a program that accepts as input a large subset of the C language), our results show that the learning component in our approach can dramatically improve the effectiveness of SBFR. Overall, SBFR was able to successfully reproduce 10 out of the 11 failures considered, while a purely random technique was able to reproduce only 1 of the failures, and BugRedux none of them.

In future work, we will (1) conduct additional empirical studies on programs with highly complex input grammars (e.g., JavaScript), (2) investigate selective instrumentation techniques to further reduce the overhead imposed by SBFR without degrading its performance, and (3) investigate hybrid approaches to field failure reproduction that combine the strengths of symbolic execution and genetic programming and can handle a broader class of programs than the two approaches in isolation.
ACKNOWLEDGEMENTS

This work was partially supported by NSF awards CCF-1320783, CCF-1161821, and CCF-0964647, and by funding from Google, IBM Research, and Microsoft Research.
REFERENCES
[1] F. M. Kifetew, W. Jin, R. Tiella, A. Orso, and P. Tonella,
“SBFR: A search-based approach for reproducing failures of
programs with grammar based input.” in Proceedings of the 28th
IEEE/ACM International Conference on Automated Software
Engineering (ASE), 2013.
[2] T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schröter,
and C. Weiss, “What Makes a Good Bug Report?” IEEE
Transactions on Software Engineering, vol. 36, no. 5, pp. 618–
643, Sep. 2010.
[3] W. Jin and A. Orso, “Bugredux: Reproducing field failures
for in-house debugging,” in Proc. of the 34th International
Conference on Software Engineering (ICSE), 2012, pp. 474–
484.
[4] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani,
“HOLMES: Effective Statistical Debugging via Efficient Path
Profiling,” in ICSE 2009, 2009, pp. 34–44.
[5] J. Clause and A. Orso, “A Technique for Enabling and Supporting Debugging of Field Failures,” in ICSE 2007, 2007, pp.
261–270.
[6] “The Amazing VM Record/Replay Feature in VMware Workstation 6,” http://communities.vmware.com/community/vmtn/cto/steve/blog/2007/04/18/the-amazing-vm-recordreplay-feature-in-vmware-workstation-6, Apr. 2012.
[7] L. Jiang and Z. Su, “Context-aware Statistical Debugging: From
Bug Predictors to Faulty Control Flow Paths,” in Proceedings
of the 22nd IEEE/ACM International Conference on Automated
Software Engineering, 2007, pp. 184–193.
[8] S. Artzi, S. Kim, and M. D. Ernst, “ReCrash: Making Software
Failures Reproducible by Preserving Object States,” in Proceedings of the 22nd European Conference on Object-Oriented
Programming, 2008, pp. 542–565.
[9] J. Rößler, A. Zeller, G. Fraser, C. Zamfir, and G. Candea,
“Reconstructing core dumps,” in Proc. of the 6th International
Conference on Software Testing (ICST), 2013.
[10] R. I. McKay, N. X. Hoai, P. A. Whigham, Y. Shan, and
M. O’Neill, “Grammar-based genetic programming: a survey,”
Genetic Programming and Evolvable Machines, vol. 11, no. 3-4,
pp. 365–396, May 2010.
[11] M. O’Neill and C. Ryan, “Grammatical evolution,” Evolutionary
Computation, IEEE Transactions on, vol. 5, no. 4, pp. 349–358,
Aug. 2001.
[12] K. Lari and S. J. Young, “The estimation of stochastic context-free grammars using the inside-outside algorithm,” Computer
speech & language, vol. 4, no. 1, pp. 35–56, 1990.
[13] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction
to Algorithms. MIT Press, 1990.
[14] M. O’Neill, E. Hemberg, C. Gilligan, E. Bartley, J. McDermott,
and A. Brabazon, “Geva: grammatical evolution in java,” ACM
SIGEVOlution, vol. 3, no. 2, pp. 17–22, 2008.
[15] D. Grune and C. J. H. Jacobs, Parsing techniques: a practical
guide. Chichester, England: Ellis Horwood Limited, 1990.
[16] R. Majumdar and R.-G. Xu, “Directed test generation using
symbolic grammars,” in Proceedings of the 22nd IEEE/ACM
International Conference on Automated Software Engineering
(ASE), 2007, pp. 134–143.
[17] P. Godefroid, A. Kiezun, and M. Y. Levin, “Grammar-based
whitebox fuzzing,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), 2008, pp. 206–215.
[18] S. Lu, Z. Li, F. Qin, L. Tan, P. Zhou, and Y. Zhou, “BugBench:
Benchmarks for Evaluating Bug Detection Tools,” in Workshop
on the Evaluation of Software Defect Detection Tools, 2005.
[19] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell,
and A. Wesslén, Experimentation in Software Engineering - An
Introduction. Kluwer Academic Publishers, 2000.
[20] J. C. King, “Symbolic Execution and Program Testing,” Communications of the ACM, vol. 19, no. 7, pp. 385–394, 1976.
[21] C. Cadar, D. Dunbar, and D. Engler, “KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008, pp. 209–224.
[22] W. Visser, C. S. Pǎsǎreanu, and S. Khurshid, “Test Input Generation with Java PathFinder,” SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 97–107, 2004.
[23] K. Sen, D. Marinov, and G. Agha, “CUTE: A Concolic Unit Testing Engine for C,” in Proceedings of the 10th European Software Engineering Conference and 13th ACM SIGSOFT Symposium on the Foundations of Software Engineering, 2005, pp. 263–272.
[24] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed Automated Random Testing,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005, pp. 213–223.
[25] P. McMinn, “Search-based software test data generation: a survey,” Softw. Test. Verif. Reliab., vol. 14, no. 2, pp. 105–156, 2004.
[26] M. Harman and P. McMinn, “A Theoretical and Empirical Study of Search-Based Testing: Local, Global, and Hybrid Search,” IEEE Transactions on Software Engineering, vol. 36, no. 2, pp. 226–247, 2010.
[27] R. Lämmel and W. Schulte, “Controllable combinatorial coverage in grammar-based testing,” in Testing of Communicating Systems, ser. Lecture Notes in Computer Science, M. Uyar, A. Duale, and M. Fecko, Eds. Springer Berlin / Heidelberg, 2006, vol. 3964, pp. 19–38.
[28] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalable Statistical Bug Isolation,” in PLDI 2005, 2005, pp. 15–26.
[29] S. Chandra, S. J. Fink, and M. Sridharan, “Snugglebug: A Powerful Approach to Weakest Preconditions,” in Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009, pp. 363–374.
[30] M. G. Nanda and S. Sinha, “Accurate Interprocedural Null-Dereference Analysis for Java,” in Proceedings of the 31st International Conference on Software Engineering, 2009, pp. 133–143.
[31] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata, “Extended Static Checking for Java,” in Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2002, pp. 234–245.
[32] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy, “SherLog: Error Diagnosis by Connecting Clues from Run-time Logs,” in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010, pp. 143–154.
[33] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage, “Improving Software Diagnosability via Log Enhancement,” in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 3–14.
[34] C. Zamfir and G. Candea, “Execution Synthesis: A Technique for Automated Software Debugging,” in Proceedings of the 5th European Conference on Computer Systems, 2010, pp. 321–334.
[35] O. Crameri, R. Bianchini, and W. Zwaenepoel, “Striking a New Balance Between Program Instrumentation and Debugging Time,” in Proceedings of the 6th European Conference on Computer Systems, 2011, pp. 199–214.
[36] F. M. Kifetew, “A search-based framework for failure reproduction,” in Proceedings of the 4th International Conference on Search Based Software Engineering. Springer-Verlag, 2012, pp. 279–284.