Proceedings of ISEC '12, Feb. 22-25, 2012 Kanpur, UP, India
A Static Technique for Fault Localization Using Character
N-Gram Based Information Retrieval Model
Sangeeta Lal and Ashish Sureka
Indraprastha Institute of Information Technology
(IIIT-Delhi), India
{sangeeta, ashish}@iiitd.ac.in
ABSTRACT
Bug or Fault localization is a process of identifying the specific location(s) or region(s) of source code (at various granularity levels such as the directory path, file, method or statement) that is faulty and needs to be modified to repair the
defect. Fault localization is a routine task in software maintenance (corrective maintenance). Due to the increasing size
and complexity of current software applications, automated
solutions for fault localization can significantly reduce human effort and software maintenance cost.
We present a technique (which falls into the class of static
techniques for bug localization) for fault localization based
on a character n-gram based Information Retrieval (IR) model.
We frame the problem of bug localization as a relevant document(s) search task for a given query and investigate the
application of character-level n-gram based textual features
derived from bug reports and source-code file attributes. We
implement the proposed IR model and evaluate its performance on datasets downloaded from two popular open-source
projects (JBoss and Apache).
We conduct a series of experiments to validate our hypothesis and present evidence demonstrating that the proposed approach is effective. The accuracy of the proposed
approach is measured in terms of the standard and commonly used SCORE and MAP (Mean Average Precision)
metrics for the task of fault localization. Experimental results reveal that the median value of the SCORE metric
for the JBoss and Apache datasets is 99.03% and 93.70% respectively. We observe that for 16.16% of the bug reports in
the JBoss dataset and for 10.67% of the bug reports in the
Apache dataset, the average precision value (computed at
all recall levels) is between 0.9 and 1.0.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
D.2.9 [Software Engineering]: Management—Productivity, Programming teams; H.3.1 [Content Analysis and
Indexing]: Dictionaries, Indexing methods; H.3.3 [Information Search and Retrieval]: Retrieval Models
General Terms
Algorithms, Experimentation, Measurement
Keywords
Bug Localization, Mining Software Repositories (MSR), Information Retrieval (IR), Automated Software Engineering
(ASE)
1. INTRODUCTION
Bug or Fault localization is a process of identifying the
specific location(s) or region(s) of source code (at various
granularity levels such as the directory path, file, method or
statement) that is faulty and needs to be modified to repair
the defect. Fault localization is a routine task in software
maintenance (corrective maintenance). Due to the increasing size and complexity of current software applications, automated solutions for fault localization can significantly reduce human effort and software maintenance cost.
The aim of the research study presented in this paper is to
investigate novel text mining based approaches to analyze
bug databases and version archives to uncover interesting
patterns and knowledge which can be used to support developers in bug localization. We propose an IR based approach
(framing the problem as a search problem: retrieve relevant
documents in a document collection for a given query) to
compute the semantic and lexical similarity between the title
and description of bug reports and the file-name and path-name
of source-code files by performing a low-level character-level
n-gram based analysis. An n-gram is a subsequence of
N contiguous items within a sequence of items. Word n-grams represent a subsequence of words and character n-grams represent a subsequence of characters. For example,
various word-level bi-grams (N=2) and tri-grams (N=3) in
the phrase Mining Software Repositories Research are: Mining Software, Software Repositories, Repositories Research,
Mining Software Repositories and Software Repositories Research. Similarly, various character-level bi-grams and tri-grams in the word Software are: So, of, ft, tw, wa, re and Sof,
oft, ftw, twa, war, are respectively. The paper by Peng et
al. lists the advantages (such as natural-language independence, robustness towards noisy data and effective handling
of domain-specific term variations) of character-level n-gram
language models for language-independent text categorization tasks [11].
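To make the two representations concrete, the following minimal Python sketch (an illustration of ours, not code from the proposed system) enumerates word-level and character-level n-grams for the examples above; the function names are our own.

```python
def word_ngrams(text, n):
    """Return word-level n-grams as tuples of consecutive words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return character-level n-grams as substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_ngrams("Mining Software Repositories Research", 2))
# [('Mining', 'Software'), ('Software', 'Repositories'), ('Repositories', 'Research')]
print(char_ngrams("Software", 2))
# ['So', 'of', 'ft', 'tw', 'wa', 'ar', 're']
print(char_ngrams("Software", 3))
# ['Sof', 'oft', 'ftw', 'twa', 'war', 'are']
```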
# | Study | Bug Reports | Source Code | Granularity | IR Model | Case Study
1 | Antoniol, 2000 [1] | Maintenance requests (excerpt from the change log files) | Source code or higher level documents | Class level | Vector space model and stochastic language model | LEDA C++ class library
2 | Canfora, 2005 [2] | Title & description of new and past change requests | Revision comments | File level | Probabilistic model described in [Jones, 2000] | Mozilla Firefox and KDE (kcalc, kpdf, kspread, koffice)
3 | Marcus, 2005 [8] | Specific keywords in change requests | Source code | Class level | Latent semantic indexing, dependency search | Eclipse, Mozilla, OpenOffice
4 | Poshyvanyk, 2006 [12] | Bug description | Source code elements | Class level and method level | SBP ranking and LSI based | Mozilla
5 | Lukins, 2008 [6] | Manual extraction of keywords from title and description | Comments, identifiers and string literals | Method level | LDA based | Rhino, Eclipse and Mozilla
6 | Fry, 2010 [4] | Title, description, stack trace, operating system | Class names, method names, method bodies, comments, literals | File level | Word-based cosine similarity vector comparison | AOI 3D modeling studio, Doxygen
7 | Moin, 2010 [9] | Title & description | Source file hierarchy | Revision path level | SVM classifier | -
8 | Nichols, 2010 [10] | Bug description of old and new bugs | Identifier names, string literals | Method level | LSI based | Rhino

Table 1: Previous approaches for bug localization using static analysis and information retrieval models. The papers are listed in chronological order and the traditional methods are classified along several dimensions.
Automated solutions for the software engineering task of
fault localization using information retrieval models (static
analysis based techniques, in contrast to dynamic analysis
which requires program execution) constitute an area that has attracted several researchers' attention. Rao et al. [13] perform a survey and comparison of a few IR based models for
fault localization. We review traditional techniques and
present some of the main points relevant to this work in
Table 1. As displayed in Table 1, we arrange previous work
in a chronological order and mention the key features of
previous methods. We study previous methods for IR based
bug localization in terms of the IR model, granularity level,
experimental dataset (case study) and the bug reports and
source code elements used for comparisons. For example,
Antoniol et al. perform experiments on LEDA C++ Class
library and work at Class level granularity whereas Lukins
et al. perform experiments on Rhino, Eclipse and Mozilla
dataset and work at Method level granularity (refer to Table
1).
The approach presented in this paper departs from previous approaches for bug localization in that we make use of a low-level character-level representation, whereas all the previous
approaches are based on word-level analysis. We hypothesize that the inherent advantages of character-level analysis make it suitable for the task of automating bug localization, as
some of the key linguistic features present in bug report attributes and source-code attributes can be better captured
using character n-grams than with word-level analysis.
The performance of a character-level representation for the
task of fault localization is an open research question. The
study presented in this paper attempts to advance the state-of-the-art in IR based fault localization techniques and, in
the context of the related work, makes the following unique contributions:
1. A novel method for automated bug localization (based
on the information contained in the bug title and description expressed in free-form text and the source code file-name and path-name) using a character-level n-gram
model for text similarity matching. While the method
presented in this paper is not the first approach to
IR based bug localization, this study is the first to
investigate the performance of character-level n-gram
language models for the fault localization problem.
This study is the first to investigate sub-word features (slices of n characters within a word) and a bag-of-characters document representation model (rather
than the bag-of-words model of previous approaches)
for the task of static analysis and IR based bug localization.
2. An empirical evaluation of the proposed approach on
real-world and publicly available datasets from two popular open-source projects (JBoss and Apache). We experiment with different configuration parameters (such
as weights for the title-filename, title-pathname, description-filename and description-pathname comparisons and
length normalization) and present results throwing light
on the performance of the approach as well as insights
on the predictive power of different variables in the
proposed textual similarity function.

Figure 1: High-level solution architecture depicting the text processing pipeline, various modules and their connections (textual feature extraction of character n-grams from the bug report title and description and from the source code path and file name, similarity computation, and the ranked top-K search results)
2. SOLUTION APPROACH
Figure 1 presents the high-level architecture of the proposed solution. The system consists of three components:
(1) Character N-gram Based Feature Extractor, (2) Similarity Computation and (3) Rank Generator. The feature extractor module extracts all the character N-grams (of length
3 to 10) from bug reports (title and description) and source
code files (file-name and file-path). The similarity computation module calculates the final score between a bug report
and a source code file as described in the following subsection.
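To make the extraction step concrete, the following minimal Python sketch (our own illustration under stated assumptions, not the module's actual implementation) builds a bag of character n-grams of lengths 3 to 10 for each of the four fields; the example strings, including the directory path, are hypothetical.

```python
from collections import Counter

def char_ngram_bag(text, n_min=3, n_max=10):
    """Build a bag (multiset) of character n-grams with lengths n_min..n_max."""
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            bag[text[i:i + n]] += 1
    return bag

# Bug report side (hypothetical title and description text):
title_bag = char_ngram_bag("Incorrect proxy lookup")
desc_bag = char_ngram_bag("The invoker HA proxies fail after restart")

# Source code side (hypothetical file name and directory path):
fname_bag = char_ngram_bag("JRMPInvokerProxyHA.java")
path_bag = char_ngram_bag("org/jboss/invocation/jrmp")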
In contrast to word-level n-gram textual features, character-level n-gram feature extraction provides some unique advantages which align with the task of bug localization. Each
advantage in the following list is supported by real-world
examples (derived from manual analysis and visual
inspection of JBoss and Apache bug reports) demonstrating the effectiveness of the proposed solution in the bug localization
domain. Table 2 presents examples demonstrating the advantages of the character n-gram based analysis over word-based analysis.
1. Ability to match concepts present in source code and
string literals: Bug reports frequently contain source
code segments and system error messages which are not
natural language text. Consider bug id JBAS-4649, which
modifies the file "JRMPInvokerProxyHA". The bug report
contains the terms "invoker HA proxies". The character n-gram based approach performs matching at the character
sequence level, and concepts embedded in file names
are matched with concepts present in bug reports.
2. Match term variations to a common root: For example,
bug id JBAS-1862 contains the term "Interceptors" and
modifies a file whose filename contains the term "Interceptor". The terms interceptor, interceptors and interception are morphological variations of the term intercept. A word-level string matching algorithm will require a stemmer function to map all these morphological variations to a common root to produce the desired result. Character-level analysis can detect such
concept matches without requiring a word stemmer.
Some more examples derived from the dataset are:
(Call,Caller,Calling), (Manage, Manager, Managed),
(Suffix, Suffixes), (Transaction, Transactions), (Proxy,
Proxies), (Exception, Exceptions), (Enum, Enums),
(Marshel,UnMarshel), (Unit, Units), (Connection, Connections, Connector), (Timer, Timers), (Authenticated,
Unauthenticated), (Load, Loads, Loader), (Wrapper,
Wrapped), (Starting, Startup), (Cluster, Clustering),
(Pool, Pooling), (resend, resent), (Invoke, Invoker).
3. Ability to match misspelled words: Consider bug JBAS-1552, which contains the misspelled word "Managment".
This bug modifies several files present in the "Management" directory. These two words share several character N-grams such as "Man", "ana", "nag", "men",
"ent", "Mana", "anag", "ment", "Manag", etc., and hence
a character n-gram based algorithm will be able to
give a good similarity score between these two words
without requiring a language-specific spell-checker (see the short sketch following Table 2).
A few more motivating examples derived from the evaluation dataset are: (behaiviour, behavior), (cosist, consist), (releated, related), (Releation, Relation), (chekc,
check), (posible, possible), (Becouse, Because), (Managment, Management), (Princpal, Principal), (Periode, Period).
4. Ability to match words and their short-forms: Bug reporters frequently use short-forms and abbreviations in
bug reports, such as spec for specification and app for application. Character-level approaches are robust
against the use of such short-forms, because a short-form
shares several character N-grams with its expanded
form. We found various such examples in our dataset,
such as: (Config, Configuration), (Auth, Authentication), (repl, replication), (impl, implementing), (Spec,
Specs, Specification), (Attr, Attribute), (Enum, Enumeration), (Sub, Subject).
5. Ability to match words joined using special characters:
Terms in bug reports are sometimes joined using various special characters. These special characters are
used by bug reporters in different contexts, such as inserting emphasis ("ScopedSet...", [TimerImpl]) or joining compound words such as "module-option". Word-level matching requires knowledge of such facts and
will require a domain-specific tokenization (text preprocessing) tool. In contrast, character-level analysis
techniques are able to detect concept matches because
of partial matching.

S. No. | File/Directory Name | Title/Description | Description
1 | local, HALocalManagedConnectionFactory.java | <-local-tx-datasource> | A word based technique will require parsing of "<-local-" and a camel case break to match the file "HALocalManagedConnectionFactory" and the directory "local".
2 | cmp | cmp2 | To match "cmp" (the desired directory) with "cmp2", a word based technique will need to remove the digit from "cmp2".
3 | EjbModule | EjbModule.addInterceptor | "EjbModule.addInterceptor" is a source code fragment giving information about the file and method that need to be fixed, but to match it with the desired file (EjbModule) a word level technique will need to split it on the dot (.) to separate the method (addInterceptor) from the file name (EjbModule).
4 | JDBCEJBQLCompiler | EJBQLToSQL92Compiler | These two strings share many matching character N-grams, but it is still difficult for a word based technique to match them, as most stemming or parsing algorithms will not be able to extract the individual concepts.
5 | ScopedSetAttributeUnitTestCase | "ScopedSet..." tests | There exists considerable overlap between these two strings, but a word based technique will still require careful parsing of the word "ScopedSet", a camel case break on the string "ScopedSetAttributeTestCase" and stemming of the word "Tests" to match them.
6 | TimerImpl | [TimerImpl] | To match these two strings a word based technique requires knowledge of "[" and "]" and has to parse accordingly.
7 | ModuleOptionContainer | <jaas:module-option name = "unauthenticatedIdentity"> guest </jaas:module-option> | Column 3 shows a fragment of a configuration file present in one of the bug reports. Configuration files require a different parsing mechanism; in this case it would require removal of ":", "-" and the other words concatenated with "module-option" to match the similar concept present in the source code file and the bug report.
8 | class-loading | class loader | A word based technique will need to remove the hyphen ("-") from the file name and also perform stemming on loader and loading to match the file name ("class-loading") and the strings ("class", "loader").
9 | UsernamePasswordLoginModule | Password/User | In this case a word based technique not only needs to perform tokenization based on "/" but also needs to match "User" and "Username".

Table 2: Illustrative examples of concepts present in bug reports and file-name and path-name (derived by examining the experimental dataset) demonstrating the advantages of the character n-gram based analysis over word-based analysis
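The following small Python sketch (again an illustration of ours, not the system's code) makes advantages 3 and 4 concrete: a misspelled word or a short-form shares many character n-grams with its correct or expanded form, so a character-level comparison relates them without a spell-checker or stemmer.

```python
def char_ngrams(text, n):
    """Set of character n-grams of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def shared(a, b, n):
    """Character n-grams common to both strings."""
    return sorted(char_ngrams(a, n) & char_ngrams(b, n))

print(shared("Managment", "Management", 3))
# ['Man', 'ana', 'ent', 'men', 'nag']
print(shared("Config", "Configuration", 4))
# ['Conf', 'nfig', 'onfi']
```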
2.1 Similarity Function

SIM(U, V) = \frac{\sum_{u \in U} \sum_{v \in V} Match(u, v) \times Length(u)}{|U| \times |V|}   (1)

Match(u, v) = \begin{cases} 1 & \text{if } u = v \\ 0 & \text{otherwise} \end{cases}   (2)

|U| = \sqrt{f_{u_1}^2 + f_{u_2}^2 + \cdots + f_{u_n}^2}   (3)

SIMSCORE(BR, F) = W_1 \cdot SIM(T\text{-}F) + W_2 \cdot SIM(T\text{-}P) + W_3 \cdot SIM(D\text{-}F) + W_4 \cdot SIM(D\text{-}P)   (4)

Label | Norm. | W(T-F) | W(T-P) | W(D-F) | W(D-P)
SIM-EQ | N | 0.25 | 0.25 | 0.25 | 0.25
SIM-TF | N | 1 | 0 | 0 | 0
SIM-TP | N | 0 | 1 | 0 | 0
SIM-DF | N | 0 | 0 | 1 | 0
SIM-DP | N | 0 | 0 | 0 | 1

Table 4: Five different configurations of weights (for each of the four comparisons: T-F, T-P, D-F and D-P) to study the predictive power and influence of each of the four possible comparisons between bug report attributes and source-code file attributes
We define a similarity function for computing the textual
similarity or relatedness between a bug report (the query, represented as a bag of character n-grams) and a source-code file
(a document in the document collection, also represented as a bag of
character n-grams). Equation 1 is the formula for computing the similarity between two documents in the proposed
character n-gram based IR model. The function contains
several tuning parameters, as a result of which multiple
implementations of the similarity function are possible. We
experiment with different values of the configuration parameters to learn the impact of the various variables used for computing textual similarity.
Label | Description | JBOSS | APACHE
A | Start Issue ID | 1 (JBAS-1) | 1 (GERONIMO-1)
B | Last Issue ID | 8879 (JBAS-8879) | 6143 (GERONIMO-6143)
C | Reporting Date Start Issue | 2004-11-7 | 2003-08-20
D | Reporting Date Last Issue | 2011-2-16 | 2011-09-06
E | Number of Issue Reports Downloaded | 8072 | 4092
F | Number of Issue Reports Unable to Download | 807 | 2051
G | Number of Issue Reports (Status = CLOSED/RESOLVED, Resolution = DONE and containing SVN commit) | 2433 | 1915
H | Number of Issue Reports of Type = BUG and in Set G | 1114 | 1269
I | Number of Issue Reports in Set H containing at least one modified Java file | 1090 | 939
J | Number of Issue Reports in Set I belonging to a specific version | 576 (Version = 4.x) | 736 (Version = 1.x, 2.x, 3.x)
K | Number of Issue Reports in Set J for which the modified file is found in the specified version | 569 | 637

Table 3: Details regarding the experimental dataset from JBoss and Apache projects
Figure 2: Distribution of the number of files modified per bug report in the JBOSS dataset (X-axis: number of files modified, from 1 to >= 7; Y-axis: percentage of bug reports)
Let U and V represent vectors (bags) of character n-grams. For example, U can be the bag of character n-grams derived from the
title of the bug report and V can represent the bag of character n-grams derived from the source code file-name. Similarly,
character n-grams from the bug report description and the directory
path can be assigned as U and V respectively. The numerator of SIM(U, V) compares every element in U with
every element in V and measures the number of matches.
The value of n (the length of the n-gram) is added to the cumulative sum in case of a
match (character n-grams being exactly equal). The higher
the number of matches, the higher the similarity score.
Matches on longer strings contribute more towards
the sum (as the length of the string is added) than matches on
shorter strings.
The denominator of SIM (U, V ) is the length normalizing
factor. The normalization factor in the denominator is used
to remove biases related to the length of the title and description in bug reports as well as the file-name and pathname for source-code files. We perform experiments with
and without normalization to study the influence of length
normalization on the overall accuracy of the retrieval model.
The variable f_{u_i} represents the frequency of an n-gram (u_i) in
the bag of character n-grams. SIMSCORE(BR, F) represents the similarity score between a bug report and a
source-code file. As shown in Equation 4, the overall similarity is computed as a weighted average of the similarity
between the title and description of the bug report and the
path-name and file-name of the source-code file. T represents the title of the bug report, D represents the description of the bug
report, F represents the source-code file-name and P represents
the file path-name. SIM(T-F) represents the character n-gram
textual similarity between the title (bug report) and the file-name (source code file) computed using Equation 1. W1, W2, W3 and
W4 are weights (tuning parameters to increase or decrease
the importance of each of the four similarity comparisons)
such that their sum is equal to 1.
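A minimal Python sketch of Equations 1-4 is given below. It follows the definitions in this section (an exact n-gram match contributes the n-gram's length to the numerator, the optional Euclidean length normalization of Equation 3 appears in the denominator, and Equation 4 combines the four comparisons with weights); the function names and example usage are our own illustrative choices rather than the authors' implementation.

```python
import math
from collections import Counter

def char_ngram_bag(text, n_min=3, n_max=10):
    """Bag of character n-grams (lengths n_min..n_max) with occurrence counts."""
    bag = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            bag[text[i:i + n]] += 1
    return bag

def norm(bag):
    """Eq. 3: square root of the sum of squared n-gram frequencies."""
    return math.sqrt(sum(f * f for f in bag.values()))

def sim(u, v, normalize=True):
    """Eq. 1: every exact n-gram match contributes the n-gram's length."""
    matched = sum(u[g] * v[g] * len(g) for g in u if g in v)
    if normalize:
        return matched / (norm(u) * norm(v)) if u and v else 0.0
    return matched

def simscore(title, desc, fname, path, weights=(0.25, 0.25, 0.25, 0.25)):
    """Eq. 4: weighted combination of the four bug-report/source-file comparisons."""
    w1, w2, w3, w4 = weights
    t, d = char_ngram_bag(title), char_ngram_bag(desc)
    f, p = char_ngram_bag(fname), char_ngram_bag(path)
    return (w1 * sim(t, f) + w2 * sim(t, p) +
            w3 * sim(d, f) + w4 * sim(d, p))
```

Ranking a bug report against the file collection then amounts to computing this score for every file and sorting the files by the score.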
3. EVALUATION METRICS
We measure the performance (predictive accuracy) of the
proposed approach using two evaluation metrics: SCORE
and MAP (Mean Average Precision). SCORE and MAP
are employed as evaluation metrics by previous studies on fault localization using Information Retrieval based
techniques [3][4][5].
3.1 SCORE
SCORE metric denotes the percentage of documents in
the repository (search space) that need not be investigated
by the bug-fixer for the task of fault localization. Accuracy
and SCORE are directly proportional. A higher SCORE
value means higher accuracy. Assume file level granularity
and consider a situation where the size of the search space
is equal to 1000 files. A SCORE of 80% (or 0.80) means
that the fault resides in 20% (200 files in a search space of
1000 files) of the document (file) collection. The mathematical
formulation of the SCORE metric is presented in Equation 5,
where m is the number of files that a bug modifies, N is the
total number of files present in the search space, and f_{iRank}
is the rank assigned to the i-th file. SCORE is a standard and
common evaluation metric for measuring the effectiveness
of bug-localization approaches [3][4][5].
SCORE = \left(1 - \frac{\max(f_{1Rank}, f_{2Rank}, \ldots, f_{mRank})}{N}\right) \times 100   (5)
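A direct transcription of Equation 5 into Python (our own sketch, not the authors' code) is shown below, reproducing the 1000-file example discussed above.

```python
def score(ranks_of_modified_files, search_space_size):
    """Eq. 5: percentage of the search space that need not be inspected."""
    worst_rank = max(ranks_of_modified_files)
    return (1 - worst_rank / search_space_size) * 100

# A bug whose worst-ranked modified file is at position 200 out of 1000 files:
print(score([12, 57, 200], 1000))   # 80.0
```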
3.2 MAP

Mean Average Precision or MAP (computed for a set of
queries, i.e., for the set of bug reports in the evaluation
dataset) is equal to the mean of the Average Precision (AP)
scores for each query in the experimental dataset. The proposed IR model computes the similarity score between the
query and each file in the document collection and
returns a ranked list of files based on the computed numeric
score. Average precision consists of computing the precision of the system at the rank of every relevant document
retrieved. MAP is a well known metric to measure the retrieval
performance of IR systems [13][7]. Equation 6 presents the general formula for AP, wherein P_t denotes the precision at the t-th relevant file retrieved and M denotes the total number of relevant files for a bug:

AP = \frac{P_1 + P_2 + \cdots + P_M}{M}   (6)

For example, consider a bug report (query) which modifies three
files (the number of relevant files is equal to three). Assume the
system assigns these files ranks 2, 4 and 7. In this example, the AP is
calculated as shown in Equation 7. MAP is calculated by
taking the mean of the AP values for all the bug reports in the
dataset.

AP = \frac{\frac{1}{2} + \frac{2}{4} + \frac{3}{7}}{3}   (7)
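Equations 6 and 7 can likewise be transcribed directly; the sketch below (ours) computes AP from the ranks of the relevant files and MAP as the mean over all bug reports.

```python
def average_precision(relevant_ranks):
    """Eq. 6: mean of the precision values measured at each relevant file's rank."""
    ranks = sorted(relevant_ranks)
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / len(ranks)

def mean_average_precision(per_bug_ranks):
    """MAP: mean of the AP values over all bug reports in the dataset."""
    return sum(average_precision(r) for r in per_bug_ranks) / len(per_bug_ranks)

# The worked example of Eq. 7: relevant files retrieved at ranks 2, 4 and 7.
print(round(average_precision([2, 4, 7]), 3))   # 0.476
```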
4. EXPERIMENTAL DATASET
In order to validate our hypothesis, we perform an empirical evaluation on datasets downloaded from the issue tracking
systems of two popular open source projects: JBOSS (https://issues.jboss.org) and
APACHE GERONIMO (https://issues.apache.org/jira/browse/GERONIMO). The datasets are publicly available,
as a result of which the experiments performed in this work
can be replicated for further improvements in the technique
and compared with other approaches. We conduct experiments on datasets belonging to two open source
projects to remove any project-specific bias and investigate
the generalization power of the solution. Table 3 displays
the details of the experimental dataset. We implemented a
crawler using the JIRA API (http://docs.atlassian.com/software/jira/docs/api/latest/) and downloaded the relevant data
(bug report metadata, title, description, modified files, etc.).
In JIRA, the subversion commits (revision id, files changed,
log messages, timestamp, author id) are integrated with the
issue tracker and hence explicit traceability links between
the changed files and the associated bug reports are available.
Table 3 shows details regarding the experimental dataset
downloaded from JBoss and Apache open source projects.
We extract 569 bug reports from the JBoss project and 637
bug reports from the Apache project satisfying certain conditions. As shown in Table 3, one of the conditions is
that the issue report should be of type BUG (as the focus
of the work is fault localization) and not another type of issue such as a feature request, task or sub-task. Furthermore,
we extract bug reports which are CLOSED and fixed. In
the JIRA issue tracking system, the issue tracker and the SVN
commits are integrated and hence the ground-truth (all the
files modified for a bug report) is available. The ground-truth (actual value) is compared with the predicted value to
compute the effectiveness of the proposed IR model. Bug
reports have a field called affected version in which the
bug reporter specifies the version number of the application
that caused the failure. We download bug reports belonging to specific versions (for which sufficient bug reports were
available for experimental purposes) for JBoss and Apache
(refer to Table 3 for details). We download the source code
for various versions mentioned in the dataset as the fault localization is performed on the specific version on which the
bug was reported. For example, the JBoss dataset consists
of a total of 20 versions. In our case, the size of the search
space is equal to the number of Java files in the software. For
the 20 versions in the JBoss dataset, the minimum number of files (size of the search space) is 6212, the maximum
is 9202 and the average is 7947.7. For the Apache dataset,
28 versions are downloaded (the minimum size of the search space
is 1428, the maximum is 3182 and the average is 2305).
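The selection steps of Table 3 (sets G through J) amount to a simple filter chain over the downloaded issue records. The sketch below is a hypothetical illustration: the field names (status, resolution, modified_files and so on) are placeholders for whatever the crawler stores, not the actual JIRA schema.

```python
def select_bug_reports(issues, wanted_versions):
    """Filter downloaded issue records following the conditions of Table 3 (sets G-J).
    Each issue is assumed to be a dict with placeholder field names."""
    selected = []
    for issue in issues:
        if issue["status"] not in ("CLOSED", "RESOLVED"):          # set G
            continue
        if issue["resolution"] != "DONE" or not issue["svn_commits"]:
            continue
        if issue["type"] != "BUG":                                  # set H
            continue
        java_files = [f for f in issue["modified_files"] if f.endswith(".java")]
        if not java_files:                                          # set I
            continue
        if not set(issue["affected_versions"]) & set(wanted_versions):  # set J
            continue
        selected.append(issue)
    return selected
```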
5. EXPERIMENTAL RESULTS
Figures 3 and 4 display descriptive statistics using box plots (five-number summary: minimum value, lower quartile
(Q1), median (Q2), upper quartile (Q3) and maximum value)
of the SCORE values under five different tuning parameters for
the JBOSS and APACHE experimental datasets. In addition
to the 25th, 50th and 75th percentiles, Figures 3 and 4 also
reveal the degree of dispersion or spread in the SCORE
values. We experiment with five tuning parameters (SIM-EQ, SIM-TF, SIM-TP, SIM-DF and SIM-DP, as described
in Table 4) to throw light on the predictive power of the
four predictor variables (title and description in the bug report
with filename and path in the version control system). Figures 3
and 4 reveal the five-number descriptive statistics and the distribution of the SCORE values for each of the five tuning
parameters.

Figure 3: A box-and-whisker diagram displaying the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum SCORE values for the proposed IR model with five different tuning parameters (similarity function not normalized) to study the predictive power of each of the four feature combinations (T-F, T-P, D-F and D-P) [JBOSS Dataset]

Figure 4: A box-and-whisker diagram displaying the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum SCORE values for the proposed IR model with five different tuning parameters (similarity function not normalized) to study the predictive power of each of the four feature combinations (T-F, T-P, D-F and D-P) [APACHE Dataset]

Figure 5: Distribution of average precision values for the set of bug reports in the JBOSS evaluation dataset. The average precision value is computed at all recall levels (ranks at which each relevant document is retrieved).

Figure 6: Distribution of average precision values for the set of bug reports in the APACHE evaluation dataset. The average precision value is computed at all recall levels (ranks at which each relevant document is retrieved).

Figure 7: Precision-Recall Curve (average precision at various average recall values) for the normalized as well as the not-normalized setting [JBOSS Dataset]

Figure 8: Precision-Recall Curve (average precision at various average recall values) for the normalized as well as the not-normalized setting [APACHE Dataset]
We observe that for the SIM-EQ (the weight for all the four
predictor variables being equal or 0.25 each) experimental
setting, the median value for the SCORE metric is 99.03%
for the JBOSS dataset (refer to Figure 3). This indicates
that for 50% of the bug reports in the JBOSS dataset, the
proposed solution was able to decrease the search space by
99.03% (which means that the bug fixer needs to investigate about 1% of the source files to localize the bug). The
median value for the SCORE metric (SIM-EQ experimental
setting) is 93.70% for the APACHE GERONIMO project.
The box-and-whisker diagram of Figure 3 reveals that the
textual similarity between the bug report description and
the file-name (JBOSS dataset) is a relatively better predictor
(Q1: 70.38%, Q2: 97.03%, Q3: 99.92%) than the other
three predictors. Figures 3 and 4 show that the bug report description is a better predictor than the bug report title. Figures
3 and 4 also indicate that for 25% of the bug reports (using
the SIM-EQ experimental setting) the value of the SCORE metric
is above 99.93% and 99.32% for JBOSS and APACHE respectively.
While Figures 3 and 4 present descriptive statistics on the
predictive accuracy of the method using the SCORE metric,
Figures 5 and 6 present the performance results using the
average precision metric. We measure the effectiveness of
our approach using both metrics to provide multiple
perspectives. The histograms in Figures 5 and 6 display the
distribution of average precision values for the set of bug
reports in the JBOSS and APACHE evaluation datasets respectively. The average precision value is computed at all
recall levels (ranks at which each relevant document is retrieved). Consider a situation where a bug report modifies
three files. Assume that the search result ranks of
the files are 5, 10 and 30, respectively. The average precision (at each recall point) in this case will be: [1/5 + 2/10 +
3/30]/3 = 0.167. The similarity function is not normalized
3/30]/3 = 0.167. The similarity function is not normalized
and the four weights (T-F, T-P, D-F, D-P) are all equal to
0.25 (SIM-EQ configuration parameter as described in Table 4). The X-axis denotes the range (0.3 means from 0.2 to
0.3). Figure 5 reveals that for 16.16% of the bug reports in
the JBOSS dataset, the average precision value is between
0.9 to 1.0. Figure 6 shows that for 10.67% of the bug reports in the APACHE dataset, the average precision value
is between 0.9 to 1.0.
Figure 7 and 8 shows the precision recall curve (for predefined values of Nr or Top K rank) for the JBoss and Apache
dataset respectively. The precision values in Figure 5 and
6 is computed at each recall point (the average precision
value is reported) whereas in Figure 7 and 8 the precision
and recall values are computed for pre-defined values of Top
K (where K = 10, 25, 50, 100 and 250). For example, if
2 out of 5 relevant documents (impacted source code files)
are retrieved amongst the Top 50 search results (Nr=50),
then the precision at Nr=50 is 2/50 = 0.04 and the recall
is 2/5 = 0.4. We conduct experiments to compute the precision and recall value for each bug report at all the five
pre-defined Nr (JBoss and Apache dataset, normalized and
not-normalized setting) and then compute the average precision and recall for various Nr values (SIM-EQ setting). As
illustrated in Figure 7 and 8, the observed trend is that the
recall value increases as Nr increases and the average precision declines. We observe that the average precision and
recall is 0.060 and 0.387 respectively with not-normalization
setting for the JBoss dataset. The average precision and recall is 0.019 and 0.627 respectively with normalization setting for the Apache dataset. Figure 7 and 8 reveals that the
effectiveness of the system (in terms of precision and recall)
is more in length normalization setting in comparison to the
not-normalization setting. While Figure 7 and 8 is a plot of
the average values, Figure 9 and 10 are scatter plots which
each display the precision and recall for all the bug reports.
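For completeness, the precision and recall at a fixed cut-off Nr used in Figures 7-10 can be computed as in the following sketch (ours), which reproduces the 2-out-of-5 example given above; the file names are hypothetical.

```python
def precision_recall_at_k(ranked_files, relevant_files, k):
    """Precision and recall when only the top k search results are inspected."""
    retrieved = set(ranked_files[:k])
    hits = len(retrieved & set(relevant_files))
    return hits / k, hits / len(relevant_files)

# 2 of the 5 impacted files appear in the top 50 results:
ranked = ["A.java", "B.java"] + ["other%d.java" % i for i in range(200)]
relevant = ["A.java", "B.java", "C.java", "D.java", "E.java"]
print(precision_recall_at_k(ranked, relevant, 50))   # (0.04, 0.4)
```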
Figure 9: Scatter plot displaying the precision and recall values for each bug report (fixed and pre-defined values of Nr or Top K search results) [JBOSS Dataset]

Figure 10: Scatter plot displaying the precision and recall values for each bug report (fixed and pre-defined values of Nr or Top K search results) [APACHE Dataset]

Figure 11: Percentage of bug reports in the JBOSS and APACHE datasets in which at least one of the modified file(s) is retrieved at Rank 1 to Rank 5

Figure 11 presents another complementary perspective to
examine the effectiveness of the proposed bug localization
method. Figure 11 displays a histogram in which the X-
axis represents document rank 1 to 5 (in the search results)
and the Y-axis represents percentage of bug reports in the
JBOSS and APACHE datasets. We observe that the percentage of bug reports in the JBOSS and APACHE datasets
in which at least one of the modified file(s) is retrieved at
Rank 1 is 40% and 26% respectively. This shows that there
is a significant (above 25%) percentage of bug reports for
which the solution was able to retrieve at least one of the
files in the impact-set as the first search result (Rank 1).
6. CONCLUSIONS
We conclude that the degree of textual similarity between the title
and description of bug reports and the source code file-names
and directory paths can be used for the task of fault localization. Experimental evidence demonstrates that low-level character n-gram based textual features are effective
and have some unique advantages (robustness towards noisy
text, natural language independence, ability to match concepts present in programming languages and string literals)
in comparison to word-level features. We perform experiments on publicly available real-world datasets from two popular open-source projects (JBoss and Apache) and investigate the overall effectiveness of the proposed model as well
as the impact of length normalization and of each of the four predictor variables (title-filename, description-filename, title-path, and description-path). Empirical evidence reveals that
length normalization improves average precision and recall
and that the description-filename pair has more predictive power
than the other three pairs. The results (measured in
terms of the SCORE and MAP metrics) of the static analysis technique for the problem of fault localization using a character
n-gram based information retrieval model are encouraging.
7. ACKNOWLEDGEMENT
This research was supported by The Department of Science and Technology (DST) FAST Grant awarded to Dr.
Ashish Sureka.

8. REFERENCES
[1] G. Antoniol, G. Canfora, G. Casazza, and A. de Lucia.
Identifying the starting impact set of a maintenance
request: A case study. In Proceedings of the
Conference on Software Maintenance and
Reengineering, CSMR ’00, pages 227–, 2000.
[2] Gerardo Canfora and Luigi Cerulo. Impact analysis by
mining software and change request repositories. In
Proceedings of the 11th IEEE International Software
Metrics Symposium, pages 29–, 2005.
[3] Holger Cleve and Andreas Zeller. Locating causes of
program failures. In Proceedings of the 27th
international conference on Software engineering,
ICSE ’05, pages 342–351, 2005.
[4] Zachary P Fry. Fault localization using textual
similarities. MCS Thesis, University of Virginia, 2010.
[5] James A. Jones and Mary Jean Harrold. Empirical
evaluation of the tarantula automatic
fault-localization technique. In Proceedings of the 20th
IEEE/ACM international Conference on Automated
software engineering, ASE ’05, pages 273–282, 2005.
[6] Stacy K. Lukins, Nicholas A. Kraft, and Letha H.
Etzkorn. Source code retrieval for bug localization
using latent dirichlet allocation. In Proceedings of the
2008 15th Working Conference on Reverse
Engineering, pages 155–164, 2008.
[7] Ellen M. Voorhees. Variations in relevance judgments
and the measurement of retrieval effectiveness. In
Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in
information retrieval, SIGIR ’98, pages 315–323, 1998.
[8] Andrian Marcus, Vaclav Rajlich, Joseph Buchta,
Maksym Petrenko, and Andrey Sergeyev. Static
techniques for concept location in object-oriented
code. In Proceedings of the 13th International
Workshop on Program Comprehension, pages 33–42,
2005.
[9] Amir Moin and Mohammad Khansari. Bug
localization using revision log analysis and open bug
repository text categorization. In Pär Ågerfalk,
Cornelia Boldyreff, Jesús González-Barahona, Gregory
Madey, and John Noll, editors, Open Source Software:
New Horizons, volume 319 of IFIP Advances in
Information and Communication Technology, pages
188–199. Springer Boston, 2010.
[10] Brent D. Nichols. Augmented bug localization using
past bug information. In Proceedings of the 48th
Annual Southeast Regional Conference, ACM SE ’10,
pages 61:1–61:6, 2010.
[11] F. Peng, D. Schuurmans, and S. Wang. Language and
task independent text categorization with simple
language models. In Proceedings of the 2003 Conference
of the North American Chapter of the Association for
Computational Linguistics on Human Language
Technology, NAACL '03, pages 110–117, 2003.
[12] Denys Poshyvanyk, Andrian Marcus, Vaclav Rajlich,
Yann-Gael Gueheneuc, and Giuliano Antoniol.
Combining probabilistic ranking and latent semantic
indexing for feature identification. In Proceedings of
the 14th IEEE International Conference on Program
Comprehension, pages 137–148, 2006.
[13] Shivani Rao and Avinash Kak. Retrieval from
software libraries for bug localization: a comparative
study of generic and composite text models. In
Proceeding of the 8th working conference on Mining
software repositories, MSR ’11, pages 43–52, 2011.