
2014 IEEE CSMR-WCRE Conference

On the Use of Positional Proximity in IR-Based Feature Location

Emily Hill
Department of Computer Science
Montclair State University
Montclair, NJ, USA
[email protected]

Bunyamin Sisman, Avinash Kak
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN, USA
{bsisman, kak}@purdue.edu
Abstract—As software systems continue to grow and evolve, locating code for software maintenance tasks becomes increasingly
difficult. Recently proposed approaches to bug localization and
feature location have suggested using the positional proximity of
words in the source code files and the bug reports to determine
the relevance of a file to a query. Two different types of
approaches have emerged for incorporating word proximity and
order in retrieval: those based on ad-hoc considerations and those
based on Markov Random Field (MRF) modeling. In this paper,
we explore using both these types of approaches to identify over
200 features in five open source Java systems. In addition, we
use positional proximity of query words within natural language
(NL) phrases in order to capture the NL semantics of positional
proximity. As expected, our results indicate that the power of these approaches varies from one dataset to another. However, the variations are larger for the ad-hoc positional-proximity-based approaches than for the MRF-based approach. In other words, the feature location results are more consistent across the datasets with MRF-based modeling of the features.
Index Terms—feature location, source code search, software
maintenance
I. INTRODUCTION
As software systems continue to grow and evolve, locating
code for software maintenance tasks becomes increasingly
difficult. Feature or concern location techniques can be used
to identify locations in source code that may be relevant
to maintenance tasks [1], and numerous IR-based feature
location approaches have been proposed [2], [3], [4], [5],
[6], [7], [8], [9], [10]. Recently proposed approaches to bug
localization and feature location have suggested using the
positional proximity of words in the source code files and the
bug reports to determine the relevance of a file to a query [11],
[4]. In this paper, we take a systematic look at the impact
different varieties of positional proximity information have on
feature location effectiveness.
In this work, we interpret positional proximity within a
window of terms as well as within natural language (NL)
phrases. Specifically, we apply the formal Markov Random
Field (MRF) framework to calculate how frequently query
words occur within a window of terms in a method body. As a
simple ad-hoc approximation to this technique, we also define
a “window” to be a program structure statement delimited
by semicolons or curly braces. Finally, we also account for
positional proximity within NL phrases extracted from method
calls. For instance, the method call cart.add(item) encodes the
NL phrase “add item to cart”. The final approach looks for
occurrences of the query terms within a NL phrase extracted
from a method call or signature.
Our work explores using both these types of approaches to
identify over 200 features in five open source Java systems.
As expected, the results indicate that the power of these
approaches varies from one dataset to another. However, the
variations are larger for the ad-hoc positional-proximity-based approaches than for the MRF-based approach. In other words, the feature location results are more consistent across the datasets with MRF-based modeling of the features.
II. POSITIONAL PROXIMITY APPROACHES
A. Markov Random Fields (MRF)
In prior work, we have applied positional proximity using
Markov Random Fields (MRF) to improve the effectiveness
of bug localization [12]. MRF modeling was used previously
by Metzler and Croft [13] as a means of improving the performance of IR algorithms for retrieval from text corpora. As presented in [12], [13], with MRF one models the inter-term dependencies for retrieval by constructing a dependency
graph G that contains one node for the method being evaluated
for its relevance to the query, with the other nodes representing
the query terms. Typically, we denote the node that stands for
the method by m and the other nodes by the query terms
Q = {q_1, q_2, ..., q_{|Q|}}. Assuming that such a graph has the cliques C_1, C_2, ..., C_K, the joint distribution P(Q, m) over the method m and the query terms is uniquely defined as

P(m, Q) = \frac{1}{Z} \prod_{k=1}^{K} \phi(C_k) \overset{rank}{=} \sum_{k=1}^{K} \log(\phi(C_k))    (1)
where ψ(C_k) = log(φ(C_k)) is a potential function defined over the cliques of the graph and Z is the normalization constant. This formulation of the dependency between a method
and the query terms can be used to assign retrieval scores to
the methods with respect to a given query while taking into
account the order and the proximity of the terms in the query
and in the source code. This is done by assuming the Markov
property that given a method, the probability of a term depends
only on the neighboring terms. As to what these neighboring
terms are, that is dictated by the inter-term connectivities used
in the graph.
If there are no inter-term edges in the graph, that is tantamount to saying that we do not care about any dependencies between the terms. Referred to as the full-independence (FI) assumption, this is the same as the bag-of-words (BOW) assumption in IR. At the other end of the spectrum, we can assume that there is an edge between every pair of terms in the graph. That is referred to as the full-dependency (FD) assumption. Under the FD assumption, the relevance of a method to a query is measured by matching the frequencies of all possible pairings of the query terms against the frequencies of the same pairings in the method.
There are obviously many possibilities between the FI and the FD assumptions, the most popular being sequential-dependency (SD) modeling, because its semantics best capture word order and proximity in natural language and because of its computational efficiency. The SD assumption is captured by assuming an edge between consecutive pairs of nodes that represent the query terms in the graph. The SD assumption is a more general way of capturing the spatial proximity assumption used in [11].
With that general introduction to MRF modeling, for the
discussion in this paper, m would stand for a method whose
relevance is being evaluated vis-a-vis a given query Q. The FI
and SD assumptions are shown diagrammatically in Figure 1
for the case of a query that has just three terms. With the FI
variant, the graph consists of only 2-node cliques that contain a method and a query term. Associating the following potential function with each clique in Figure 1(a) results in the well-known Dirichlet Language Model (DLM):
\psi_{FI}(q_i, m) = \lambda_{FI} \log\left( \frac{tf(q_i, m) + \mu P(q_i \mid C)}{|m| + \mu} \right)    (2)
where we have used Dirichlet smoothing to account for the missing query terms in the method under consideration [14]. P(q_i|C) denotes the probability of the term q_i in the whole collection, tf(q_i, m) is the term frequency of q_i in method m, |m| denotes the length of the method in terms of the total number of tokens it contains, and µ is the Dirichlet smoothing parameter.
The SD variant, on the other hand, has 3-node cliques that contain two consecutive query terms and the method, in addition to the 2-node cliques that are common to both modeling approaches. We now employ the following potential function for the 3-node cliques that contain a method m and two consecutive query terms q_{i-1} and q_i:
\psi_{SD}(q_{i-1}, q_i, m) = \lambda_{SD} \log\left( \frac{tf_W(q_{i-1} q_i, m) + \mu P(q_{i-1} q_i \mid C)}{|m| + \mu} \right)    (3)
where tf_W(q_{i-1} q_i, m) is the number of times that the terms q_{i-1} and q_i appear in the same order as in the query within a window of length W ≥ 2 in the method. Note that the model constant λ_{FI} has no impact on the rankings under the FI assumption. However, we use this parameter in the SD model together with λ_{SD} to combine the scores obtained with the 2-node cliques and the 3-node cliques by enforcing λ_{FI} + λ_{SD} = 1 [12].

[Figure 1. Markov networks for the dependence assumptions for a method m and a query with three terms: (a) Full Independence, (b) Sequential Dependence.]
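To make the scoring concrete, the following sketch evaluates the 2-node and 3-node clique potentials of Equations (2) and (3) for a single method. It is a minimal illustration under simplifying assumptions, not the implementation used in [12]; in particular, the collection object with its unigram_prob and bigram_prob helpers is an assumed interface for the collection statistics P(q_i|C) and P(q_{i-1} q_i|C).

```python
import math
from collections import Counter

def count_ordered_pair(tokens, w1, w2, W):
    """Count occurrences of w1 followed by w2 (in that order) within
    a window of W tokens, i.e., tf_W(w1 w2, m) from Eq. (3)."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok == w1 and w2 in tokens[i + 1:i + W]:
            hits += 1
    return hits

def mrf_sd_score(query, method_tokens, collection, mu=1000.0,
                 lam_fi=0.8, lam_sd=0.2, W=4):
    """Relevance of one method to a query under the SD assumption:
    Eq. (2) summed over single query terms plus Eq. (3) summed over
    consecutive query-term pairs (lam_fi + lam_sd must equal 1).
    Assumes the collection probabilities are non-zero for query terms."""
    m_len = len(method_tokens)
    tf = Counter(method_tokens)
    score = 0.0
    for q in query:  # 2-node cliques (Eq. 2)
        p = collection.unigram_prob(q)            # P(q_i | C), assumed helper
        score += lam_fi * math.log((tf[q] + mu * p) / (m_len + mu))
    for q_prev, q in zip(query, query[1:]):       # 3-node cliques (Eq. 3)
        tf_w = count_ordered_pair(method_tokens, q_prev, q, W)
        p = collection.bigram_prob(q_prev, q)     # P(q_{i-1} q_i | C), assumed
        score += lam_sd * math.log((tf_w + mu * p) / (m_len + mu))
    return score
```

Setting lam_sd to 0 recovers the DLM baseline, since the 3-node terms then contribute nothing to the ranking.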
B. Program Structure
Sometimes simple is better. Our next positional proximity
approach simply looks for co-occurring query words within
the same statement or comment (leading method comments
are processed as one contiguous block). We approximate a
statement as a line of code ending with a semicolon or curly
brace ([;{}]), ignoring line breaks. If a statement contains at
least two query words, then the number of occurrences of each
query word is incremented. At the end of each method, these
frequencies are used to calculate a tf-idf score.
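A minimal sketch of this counting step, assuming the method body arrives as a raw source string and the query words are already lower-cased; the function name and regular expressions are our own illustration:

```python
import re

def statement_cooccurrence_counts(method_source, query_words):
    """Approximate a statement as text ending in ';', '{', or '}'
    (ignoring line breaks) and, for every statement containing at
    least two query words, add each query word's occurrences in that
    statement to its running count."""
    counts = {w: 0 for w in query_words}
    for stmt in re.split(r"[;{}]", method_source):
        # Simplified tokenization; the full approach applies camel-case
        # splitting and software-specific stemming first.
        tokens = re.findall(r"[A-Za-z]+", stmt.lower())
        present = {w for w in query_words if w in tokens}
        if len(present) >= 2:
            for w in present:
                counts[w] += tokens.count(w)
    return counts
```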
Based on our experience working with tf-idf for the purposes of feature and concern location, we do not use the
default configuration of local, global, and length normalization that was most successful in natural language document
retrieval in Salton’s original comparison [15]. Instead, we use
a variation of tf-idf. Given a method m and query Q:
tfidf(m, Q) = \sum_{i=1}^{|Q|} \left(1 + \ln(tf(q_i, m))\right) \cdot \ln\frac{N}{df(q_i, C)}    (4)
where tf(q_i, m) is the term frequency of query word q_i in method m, df(q_i, C) is the number of methods in the program containing q_i, and N is the number of methods in the program. Before calculating word frequencies, we apply camel-case splitting rules and a stemmer specialized for software [16].
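Given per-method counts, the score of Eq. (4) is a few lines; this sketch assumes df is a precomputed map from term to document frequency:

```python
import math

def tfidf_score(query_words, tf_counts, df, num_methods):
    """The tf-idf variant of Eq. (4): sublinear term-frequency scaling
    (1 + ln tf) times inverse document frequency ln(N / df)."""
    score = 0.0
    for q in query_words:
        tf = tf_counts.get(q, 0)
        if tf > 0 and df.get(q, 0) > 0:
            score += (1.0 + math.log(tf)) * math.log(num_methods / df[q])
    return score
```

With the statement-level counts from the previous sketch supplied as tf_counts, this yields the STMT score; with whole-method counts it is the TF-IDF baseline.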
We also include a second variant on this approach that only
includes tf-idf values if more than one query word appears
in a method. The purpose of these approaches is to confirm
whether any positional proximity information, no matter how
crude, improves over baseline search techniques that do not
account for positional proximity.
C. Natural Language Semantics
In prior work [4], a positional proximity search technique
was proposed, based on the natural language (NL) semantics
of phrases occurring in source code structure. Phrasal concepts are concepts expressed as groups of words, such as noun phrases (e.g., ‘cell attributes for DB’) or verb phrases (e.g., ‘get current task from list’); they have been used to improve search accuracy [4] by weighting occurrences of query words that fall within the same phrasal concept more highly. In this approach, we automatically
extract phrasal concepts for a given method signature or call
using the Software Word Usage Model (SWUM) [17]. SWUM
automatically identifies NL phrase structures such as actions
and themes from arbitrary method calls and signatures. When extracting information from method calls, information about both the actual and the formal parameters is used (whereas signatures only carry formal parameter information). For example,
given the signature addToList(Item i), SWUM would identify
the action as “add” and the theme as “item”.
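As a rough, purely illustrative sketch of the kind of structure SWUM recovers (SWUM itself uses far richer part-of-speech and program-structure analysis), consider the following toy extraction of an action and a theme from a signature; all names here are our own:

```python
import re

def split_camel(identifier):
    """Split a camelCase identifier into lower-case words."""
    return [w.lower()
            for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", identifier)]

def toy_action_theme(signature):
    """Crude illustration only: treat the leading verb of the method
    name as the action and the first parameter's type name as the
    theme. SWUM's actual analysis is far more sophisticated."""
    name, params = signature.split("(", 1)
    words = split_camel(name.strip())
    action = words[0]                               # 'add' from addToList
    theme = None
    param = params.rstrip(")").strip()
    if param:
        theme = split_camel(param.split()[0])[-1]   # 'item' from Item i
    return action, theme

# toy_action_theme("addToList(Item i)") -> ("add", "item")
```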
III. CASE STUDY
In this study, we investigate the impact of positional proximity on search effectiveness by comparing 8 approaches: MRF-SD (the SD variant) and its baseline DLM (described in Section II-A); a statement-based proximity approach (STMT) and a method-based proximity approach (MTHD), with their baseline TF-IDF (see Section II-B); two NL phrase-based techniques (NL-Sig and NL-Body); and a hybrid approach from prior work (SWUM) [4]. NL-Sig uses phrasal concepts extracted from method signatures only, while NL-Body uses phrasal concepts extracted from method calls in the entire method body. SWUM is a hybrid approach that combines NL-Sig with TF-IDF applied to the method body.
A. Subject Features and Queries
To compare the approaches, we need a ground-truth (i.e.,
gold) set of queries and features for which to search. In
this study, we use two sets of subject features and human-formulated queries: a set of action-oriented features from four
Java programs, and a larger set from the documentation of a
single Java program.
1) Action-oriented feature set: The first set comprises 8 of
9 features and queries from a previous concern location study
of 18 developers searching for action-oriented concerns [9].
One of the techniques in the study, Google Eclipse Search
(GES) [18], uses keyword-style queries suitable for input to
the search techniques used in this study. For one feature no
subject was able to formulate a query returning any relevant
results, leaving us with 8 features in our study. For each
feature, 6 developers interacted with GES to formulate a query,
resulting in a total of 48 queries, 29 of which are unique.
The features are mapped at the method level, and contain
between 5–12 methods each. The programs contain 23–75
KLOC, with 1500–4500 methods [9]. Note that the average number of words per query is only 1.792, with queries ranging in length from 1–5 words.
2) Documentation-based concern set: The second set consists of 215 documented features, drawn from a set of 415 [19], for Rhino, a 45 KLOC JavaScript/ECMAScript interpreter and compiler. Each feature maps to a subsection of the
documentation, which is used as the feature description. The
number of program elements (methods and fields) in each of
the 415 features varies from 1 to 334. For the purposes of
this study, we only consider features containing at least 10
program elements and no more than 110 elements, leaving us
with 215 concerns.
To obtain human-formulated queries for the set, we asked 8
volunteer software developers familiar with Java programming
to read the documentation for a subset of 80-81 concerns.
The developers had varying levels of programming and industry experience. The subjects were asked to formulate a
query containing the words they thought were relevant to the feature; that is, the first query they would type into a search engine such as Google. They could
include specific identifiers as keywords if those were listed in
the documentation. The developers were randomly assigned
blocks of features such that 3 different subjects formulated
queries for each feature, yielding a total of 645 concern-query
combinations. It should be noted that these queries are slightly longer, with an average length of 4.009 words, ranging in length from 1–9 words.
B. Methodology
Search effectiveness is typically calculated in terms of precision, which measures the proportion of relevant documents
retrieved out of all of the documents returned, and recall,
which measures the proportion of relevant documents retrieved
out of all the relevant documents for a given query. Precision
and recall are typically calculated at a particular rank. In
traditional IR, rank-based precision/recall measurements are
converted into the calculation of Average Precision (AP) [20],
[21], which is the area under the precision-recall curve that
results from the individual precision and recall values at different ranks. When AP is averaged over the set of all queries,
we get what is known as the Mean Average Precision (MAP),
which signifies the proportion of the relevant documents that
will be retrieved on average for a given query. A MAP of
0.2 means that, on average, a retrieval for a query returns one
relevant document for every 5 documents retrieved. The higher
the MAP, the more effective the retrieval algorithm.
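For concreteness, AP and MAP can be computed as follows; this is the standard textbook formulation [20], not code from the study:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average Precision: the mean of the precision values at each rank
    where a relevant document is retrieved, over all relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```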
While MAP values capture the effectiveness of a search
technique across all relevant ranks, in practice, developers
searching software want to see relevant results right at the
top of the list. We use the precision at rank 1 (P1) to capture
search effectiveness at the top of the list. It should be noted
that due to multiple documents having the same score, there
may be multiple methods at the top rank. We apply Tukey’s
Honest Significant Differences (HSD) to determine significant
differences between the techniques [22].
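A sketch of how such an analysis might be run, assuming per-query AP values are gathered for each technique; it uses the Tukey HSD implementation in the statsmodels Python package, which may differ from the exact setup used in the study:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def tukey_hsd(scores_by_technique, alpha=0.05):
    """scores_by_technique: dict mapping technique name to a list of
    per-query effectiveness values (e.g., AP). Returns the pairwise
    Tukey HSD results table."""
    values, groups = [], []
    for technique, scores in scores_by_technique.items():
        values.extend(scores)
        groups.extend([technique] * len(scores))
    return pairwise_tukeyhsd(values, groups, alpha=alpha)
```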
C. Threats to Validity
Because our focus is on Java software, these results may not generalize to searching other programs or programs written in other programming languages. Slight tweaks to the configurations
used in the study may result in slight differences in the results.
However, these results provide a relative baseline from which
to guide further investigations.
D. Results
Figures 2 and 3 show box plots of the MAP and precision at rank 1 results for the 8 techniques in the study, sorted by decreasing mean.

[Figure 2. Box plots of MAP values for (a) the AOC feature set and (b) the Rhino feature set, sorted by decreasing mean. Techniques labelled with the same line are not significantly different at α = 0.05.]

[Figure 3. Box plots of precision at rank 1 (P1) values for (a) the AOC feature set and (b) the Rhino feature set, sorted by decreasing mean. Techniques labelled with the same line are not significantly different at α = 0.05.]

The box represents the inner 50% of
the data, the heavy middle line represents the median, the
plus represents the mean, and ‘◦’ indicates outliers. In each
figure, the lines under the technique labels are used to indicate
statistical significance: techniques connected by a line are not
significantly different at α = 0.05.
As can be seen from the relative ordering of the techniques, the best techniques vary widely between data sets. On the action-oriented feature set (AOC), SWUM performs the best, followed closely by MRF-SD and DLM. Recall that SWUM is a hybrid
approach that combines NL-Sig with TF-IDF. Although the
means for DLM and MRF-SD are not significantly different,
MRF-SD (i.e., positional proximity) offers a clear advantage
over DLM for 75% of the data set. We can see that the simplistic statement-level and method-level proximity approaches perform worse than all the other techniques, including TF-IDF.
The documentation-based Rhino features offer a different
perspective. In Figure 2(b) we see that NL-Body, MRF-SD,
DLM, MTHD, and to some extent TF-IDF all perform similarly well. The means and medians are quite close, with the major differences showing in the distribution of effectiveness from the top 25% to the upper 50% of the data. For instance, TF-IDF has fewer queries with MAP values above 0.6, and more in the 0.2–0.4 range, than the other techniques.
The precision values at rank 1 tell a similar story in
Figures 3(a) and 3(b). Of particular note is the success of the
hybrid technique, SWUM, on the AOC feature set. On this set,
over 75% of the queries have 100% precision at the top rank,
implying that the top ranked documents are overwhelmingly
relevant (but as the MAP values show, the remaining relevant
documents may not be so highly ranked).
Across both diverse data sets, MRF-SD and DLM perform consistently well. When analyzed separately, positional
proximity plays a role in the most successful techniques, but
different types of positional proximity are most effective. For
example, in the AOC features, semantic information in NL-Sig
combined with TF-IDF in the SWUM approach significantly
outperforms the NL-Body, STMT, and MTHD approaches.
In contrast, for the Rhino set, SWUM and NL-Sig perform
significantly worse than the other approaches. Although the
presence of multiple query words is important for relevance,
it is not yet clear whether positional proximity is the best way
to capture that information for the problem of feature location.
IV. RELATED WORK
Traditional feature location approaches apply bag-of-words (BOW) techniques using a variety of information retrieval
mechanisms, such as LSI [5], LDA [23], ICA [24], FCA [8] or
by applying automated query reformulations [2]. A full survey
has been recently published [1].
In contrast, FindConcept [9] and SWUM [4] take advantage
of natural language (NL) positional proximity by utilizing
the semantics of words within phrasal concepts. Recent approaches have applied a multi-faceted approach to feature location [10], and taken advantage of structural dependencies [6].
V. CONCLUSION
In this paper, we explored using multiple types of positional
proximity applied to the problem of feature location. We compared the formal Markov Random Field (MRF) framework, along with its well-known variant, the Dirichlet Language Model (DLM), against natural-language-based semantic positional proximity and other ad-hoc approaches based on tf-idf.
In a study of over 200 features across 5 open source Java
systems, our results indicate that MRF, and its simple variant
DLM, are more consistently effective across multiple data sets.
The presence of multiple query words is important for
relevance, but it is not yet clear whether positional proximity
is the best way to capture that information for the problem
of feature location. Prior work has shown that MRF’s positional proximity significantly outperforms DLM’s simplified
model for the problem of bug localization applied at the file
level [12]. From this initial study, we see that the problem
of feature location at the method level does not show the
same consistent improvement from positional proximity as bug
localization. This is likely due to the differences in the problem
domain. Bug localization queries are typically much longer
than 2-4 keywords, often derived from the title or description
of the bug report. In addition, working at the file versus the method level of granularity significantly changes the number of words in a document. Given the wide disparity in effective techniques for each data set, more research is needed into how the best IR-based feature location approaches can be selected in advance for a particular problem domain.
REFERENCES
[1] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, “Feature location
in source code: A taxonomy and survey,” J. Software Maintenance &
Evolution: Research and Practice, vol. 25, no. 1, pp. 53–95, Jan 2013.
[2] S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. De Lucia, and
T. Menzies, “Automatic query reformulations for text retrieval in
software engineering,” in Proc. 2013 International Conference on
Software Engineering, 2013, pp. 842–851.
[3] E. Hill, L. Pollock, and K. Vijay-Shanker, “Automatically capturing
source code context of NL-queries for software maintenance and reuse,”
in Proc. 31st International Conference on Software Engineering, 2009,
pp. 232–242.
[4] ——, “Improving source code search with natural language phrasal
representations of method signatures,” in Proc. 26th IEEE International
Conference on Automated Software Engineering, short paper, 2011, pp.
524–527.
[5] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, “An information
retrieval approach to concept location in source code,” in Proc. 11th
Working Conference on Reverse Engineering, 2004, pp. 214–223.
[6] M. Petrenko and V. Rajlich, “Concept location using program
dependencies and information retrieval (DepIR),” Inf. Softw. Technol.,
vol. 55, no. 4, pp. 651–659, 2013.
[7] X. Peng, Z. Xing, X. Tan, Y. Yu, and W. Zhao, “Improving feature
location using structural similarity and iterative graph mapping,” J.
Syst. Softw., vol. 86, no. 3, pp. 664–676, 2013.
[8] D. Poshyvanyk, M. Gethers, and A. Marcus, “Concept location using
formal concept analysis and information retrieval,” ACM Transactions
on Software Engineering and Methodology, vol. 21, no. 4, 2012.
[9] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker, “Using
natural language program analysis to locate and understand action-oriented concerns,” in Proc. 6th International Conference on Aspect-Oriented Software Development, 2007, pp. 212–224.
[10] J. Wang, X. Peng, Z. Xing, and W. Zhao, “Improving feature location
practice with multi-faceted interactive exploration,” in Proc. 2013
International Conference on Software Engineering, 2013, pp. 762–771.
[11] B. Sisman and A. C. Kak, “Assisting code search with automatic query
reformulation for bug localization,” in Proc. 10th Working Conference
on Mining Software Repositories, 2013, pp. 309–318.
[12] B. Sisman and A. Kak, “Exploiting spatial code proximity and order for
improved source code retrieval for bug localization,” Purdue University,
Tech. Rep. TR-ECE-13-12, Oct 2013.
[13] D. Metzler and W. Croft, “A Markov random field model for term dependencies,” in Proc. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 472–479.
[14] B. Sisman and A. Kak, “Incorporating version histories in information
retrieval based bug localization,” in 2012 9th IEEE Working Conference
on Mining Software Repositories, 2012, pp. 50–59.
[15] G. Salton and C. Buckley, “Term-weighting approaches in automatic
text retrieval,” Information Processing and Management, vol. 24, no. 5,
pp. 513–523, 1988.
[16] A. Wiese, V. Ho, and E. Hill, “A comparison of stemmers on source code
identifiers for software search,” in Proc. 2011 16th IEEE International
Conference on Software Maintenance and Reengineering, 2011, pp.
496–499.
[17] E. Hill, “Integrating natural language and program structure information to improve software search and exploration,” Ph.D. Dissertation,
University of Delaware, Aug. 2010.
[18] D. Poshyvanyk, M. Petrenko, A. Marcus, X. Xie, and D. Liu, “Source
code exploration with Google,” in Proc. 22nd IEEE International
Conference on Software Maintenance, 2006, pp. 334–338.
[19] M. Eaddy, T. Zimmermann, K. D. Sherwood, V. Garg, G. C. Murphy,
N. Nagappan, and A. V. Aho, “Do crosscutting concerns cause defects?”
IEEE Trans. Soft. Eng., vol. 34, no. 4, pp. 497–515, 2008.
[20] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. NY, NY, USA: Cambridge University Press, 2008.
[21] M. D. Smucker, J. Allan, and B. Carterette, “A comparison of
statistical significance tests for information retrieval evaluation,” in
Proc. 16th ACM Conference on Information and Knowledge Management, 2007, pp. 623–632.
[22] B. A. Carterette, “Multiple testing in statistical analysis of systems-based information retrieval experiments,” ACM Trans. Inf. Syst., vol. 30,
no. 1, pp. 4:1–4:34, 2012.
[23] S. Lukins, N. Kraft, and L. Etzkorn, “Source code retrieval for bug
localization using latent Dirichlet allocation,” in Proc. 15th Working
Conference on Reverse Engineering, 2008, pp. 155–164.
[24] S. Grant, J. R. Cordy, and D. Skillicorn, “Automated concept location
using independent component analysis,” in Proc. 2008 15th Working
Conference on Reverse Engineering, 2008, pp. 138–142.