-2015
Dear Professor Di Penta and Reviewers
Thank you for the detailed and constructive reviews.
Following the two reviewers’ comments and advice, we significantly revised our
manuscript. Below we provide our responses to the reviewers’ comments. We are
happy to answer any further questions.
We would like to thank you for giving us this opportunity to improve the manuscript.
We look forward to hearing from you.
Yours Sincerely,
Xin, David, Ying, Jafar, Tien, and Xinyu
Responses to Reviewers’ Comments
1. Summary of Changes.
2. Changes to Specific Reviewer Comments
Reviewer #1:
1.1 Comment: The strengths of this paper include the impressive results and the
development of a topic-modeling framework that goes beyond the standard
application of LDA as a black-box topic-modeling tool. However, because of the fact
that most applications of LDA within software engineering use LDA as a black-box
tool, care must be taken in introducing and explaining the new model to the reader.
For example, you may want to add a new section that begins with the standard LDA
model, introduces the features, and builds many standard LDA models as you do in
4.5.2. Explain the shortcoming of this approach from the perspective of what the
model means. Then introduce the multi-feature topic model. Explain how this better
captures the generative model and why it is a preferred model.
Response: Thank you for the advice. We divided the “proposed approach” section into
four sections: “Overall Framework”, “Topic Extraction with LDA”, “Topic Extraction with
MTM”, and “TopicMiner: An Incremental Learning Method”. The new section “Topic
Extraction with LDA” introduces the standard LDA model and how to generate topics
using LDA. We also moved the content of “Inference of MTM” into a technical report,
which can be downloaded from https://github.com/xinxia1986/bugTriaging/.
We also added the following sentence in the “Topic Extraction with MTM” section:
“Note that LDA is a general topic model which does not consider the characteristics of
bug reports. Our MTM leverages the multiple features in bug reports to better generate
topics from the bug reports.”
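To make the new “Topic Extraction with LDA” section concrete, standard LDA topic extraction can be sketched as a tiny collapsed Gibbs sampler. This is an illustrative toy only, not the implementation evaluated in the paper; the corpus, hyperparameters, and iteration count below are made up:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for standard LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # count tables: document-topic, topic-word, and per-topic totals
    ndk = [[0] * K for _ in docs]
    nkw = [defaultdict(int) for _ in range(K)]
    nk = [0] * K
    z = []  # topic assignment of every word occurrence
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = k | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # recover phi (topic-word distributions) and theta (document-topic distributions)
    phi = [{w: (nkw[k][w] + beta) / (nk[k] + V * beta) for w in vocab} for k in range(K)]
    theta = [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    return theta, phi

docs = [["crash", "null", "pointer"], ["menu", "button", "click"],
        ["crash", "segfault", "pointer"], ["click", "menu", "toolbar"]]
theta, phi = lda_gibbs(docs, K=2)
```

As the revised section explains, LDA treats each document in isolation; MTM instead ties the topic distribution to the bug report's feature combination.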
1.2 Comment: Another weakness of this paper is the assumption that a single sample
produced by Gibbs sampling is in fact the topic model. The topic model is the average
of many samples. Given the use of a single sample, there is a threat to the validity of the
results. This threat can be characterized by running the experiment several times and
looking at the amount of change between different runs. On the other hand, a more
sound approach would determine a technique for averaging the results of the Gibbs
sampling. It is not enough to run the sampler for several iterations. This merely
ensures that the sample comes from the correct topic distribution. Collecting several
samples and averaging them will ensure that the topics that are observed represent
the average rather than an outlier in the topic space, which will lead to consistent
results. Be aware that care must be taken when averaging due to the topic
exchangeability problem.
Response: Thank you for the advice. To address this threat, we ran MTM and LDA 10
times and computed the average performance across the 10 runs. We have added
the following sentences in the experiment setup section:
“Moreover, since both MTM and LDA use Gibbs sampling to generate the topics,
randomness is introduced into our approach. To reduce this randomness, we
run MTM and LDA 10 times, and we compute the average performance across the 10
runs.”
We also updated all the experiment results for TopicMiner^{MTM}, TopicMiner^{LDA},
and the other baseline approaches which also use LDA. Moreover, we have added a
section “Stability of TopicMiner^{MTM}” under the “Discussion” section:
“Note that we ran our MTM 10 times, and the average top-1 and top-5 accuracy scores
are computed across the 10 runs. Here, we investigate whether
the performance of TopicMiner^{MTM} varies when we run MTM multiple
times. Figures 10, 11, 12, 13, and 14 present the top-1 and top-5 accuracies for
TopicMiner^{MTM} across the different runs in GCC, OpenOffice, Netbeans, Eclipse,
and Mozilla, respectively. We notice that across the 5 figures, the performance of
TopicMiner^{MTM} is stable, and the differences between runs are small.
Thus, we believe that our TopicMiner^{MTM} is a stable approach, and the randomness
due to Gibbs sampling has little impact on the performance of bug triaging.”
We also added a sentence in the “Conclusions and Future Works” section:
“We also plan to design a better algorithm which runs Gibbs sampling multiple times
and outputs the average topic distribution across the runs, to further reduce
the effect of outliers appearing in a single Gibbs sample.”
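The run-and-average protocol described above can be sketched as follows. Averaging the final accuracy scores, rather than the raw topic matrices, sidesteps the topic exchangeability problem the reviewer notes, since topics from different runs never need to be aligned. The evaluation function below is a hypothetical placeholder standing in for one full seeded MTM training-and-evaluation run; the numbers are made up:

```python
import random
import statistics

def evaluate_once(seed):
    """Hypothetical stand-in for one full MTM run: train with Gibbs
    sampling under the given seed and return the top-1 accuracy."""
    rng = random.Random(seed)
    return 0.65 + rng.uniform(-0.02, 0.02)  # placeholder score

# Run 10 independently seeded trials and report the mean and spread.
runs = [evaluate_once(seed) for seed in range(10)]
mean_acc = statistics.mean(runs)
spread = max(runs) - min(runs)
print(f"top-1 accuracy: {mean_acc:.3f} (spread {spread:.3f} over 10 runs)")
```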
1.3 Comment: For users of TopicMiner, the amount of training data required is of
interest. From the description, it appears that initially the system is trained on 10% of
the bug reports, which varies from 1300 to almost 8000 bugs. Then after each 10%
chunk is predicted, the topics are recomputed using the additional 10% of the data as
training data. Do you observe any improvement in performance as more training data
is added? How many bugs are sufficient for training? Is it dependent on the number
of features (i.e product-component combinations), which should be included in Table
2. Finally, with regards to features, how does your system cope when a new feature is
introduced?
Response: Thank you for the advice. We have added a new research question “RQ3:
What is the effect of varying the amount of training data on the performance of
TopicMiner^{MTM}?”, and added a new section 4.4.3 to answer this research question.
We have added the following paragraphs:
“RQ3: What is the effect of varying the amount of training data on the performance of
TopicMiner^{MTM}?
To evaluate the performance of TopicMiner^{MTM}, we use the longitudinal data
setup. As the number of folds increases, the amount of training data increases. In
this research question, we investigate whether the performance of TopicMiner^{MTM}
increases as the amount of training data increases. To answer this research question,
we present the top-1 and top-5 accuracies for the 10 folds as described in the experiment
setup section.”
“4.4.3 RQ3: Amount of Training Data
Tables 5 and 6 present the top-1 and top-5 accuracies for TopicMiner^{MTM} with
different amounts of training data (fold 0 - fold 9). Note that in our longitudinal data
setup, we divide our data into 11 non-overlapping frames, thus one frame
corresponds to 9.09% (1/11) of the total number of bug reports. In fold 0, the amount
of training data is 9.09% of the total number of bug reports, and in fold 9, the amount
of training data is 90.91% of the total number of bug reports.
From the two tables, we notice that for GCC and Netbeans, in general, the performance of
TopicMiner^{MTM} increases as the amount of training data increases. For
OpenOffice, Eclipse, and Mozilla, in general, the performance of TopicMiner^{MTM}
decreases as the amount of training data increases.
Our collected Eclipse and Mozilla datasets are much larger than the other three datasets;
they contain 82,978 and 86,183 bug reports, and 1,898 and 1,813 candidate fixers,
respectively. As the amount of training data increases, the search space of
candidate fixers grows, and some fixers may leave the community; thus, the
performance of TopicMiner^{MTM} decreases as the amount of training data
increases in Eclipse and Mozilla. Still, the performance of TopicMiner^{MTM} is
acceptable: the top-5 accuracy is above 0.7 for Eclipse and Mozilla across the 10 folds.
For OpenOffice, we notice that in some folds (such as folds 2, 5, and 9), the performance of
TopicMiner^{MTM} decreases compared with the previous folds. We manually
checked the dataset and found that in these folds, a number of new feature combinations are
introduced. For example, the product-component combination ``TestProdct-other''
is introduced in frame 3 (fold 2), which makes TopicMiner^{MTM} recommend
the wrong fixers for this product-component combination.
Moreover, from the two tables, we notice that the number of features (i.e., product-component
combinations) does not have a direct impact on the performance of
TopicMiner^{MTM}. For example, in Mozilla, the number of product-component
combinations is 777, but it achieves the lowest top-1 and top-5 accuracies compared
with the other four datasets. In GCC, the number of product-component
combinations is only 49, yet its top-1 accuracy is ranked 4th and its top-5 accuracy is ranked
3rd among the 5 datasets.”
We also modified Table 2 to include the number of product-component combinations.
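The longitudinal setup described above can be sketched as follows: chronologically sorted reports are split into 11 equal frames, and fold i trains on frames 0..i and tests on frame i+1. This is an illustrative sketch under those stated assumptions, not the paper's evaluation code:

```python
def longitudinal_folds(reports, n_frames=11):
    """Split chronologically sorted bug reports into n_frames equal frames;
    fold i trains on frames 0..i and tests on frame i+1 (10 folds for 11 frames)."""
    size = len(reports) // n_frames
    frames = [reports[i * size:(i + 1) * size] for i in range(n_frames - 1)]
    frames.append(reports[(n_frames - 1) * size:])  # last frame takes the remainder
    folds = []
    for i in range(n_frames - 1):
        train = [r for frame in frames[:i + 1] for r in frame]  # cumulative training set
        test = frames[i + 1]
        folds.append((train, test))
    return folds

# Toy usage: 110 "reports" stand in for a chronologically ordered dataset.
folds = longitudinal_folds(list(range(110)))
```

Fold 0 thus trains on roughly 9.09% of the data and fold 9 on roughly 90.91%, matching the setup quoted above.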
1.4 Comment: Also, is there any fall off in performance for features that don’t have
many bugs? This analysis should also be included to better understand the limitations
of the new approach.
Response: Thank you for the advice. We have added a section in the Discussion section
named “Impact of Different Product-Component Combinations”. We have added the
following paragraphs:
“Some product-component combinations have more bug reports, while others have
fewer. We therefore check whether there is any fall-off in performance for
product-component combinations with fewer bug reports. Tables 7 and 8 present the
first 5 product-component combinations which appear at least 10 times, and the top
5 product-component combinations which appear the most times, in OpenOffice and
Netbeans, respectively. The columns correspond to the name of the product (Product),
the name of the component (Component), the number of times the product-component
combination appears in our collected data (# Comb.), the top-1 accuracy of
TopicMiner^{MTM} for the combination (Top-1), and the top-5 accuracy of
TopicMiner^{MTM} for the combination (Top-5).
From the two tables, we notice that, in general, as the number of bugs in a product-component
combination increases, the top-1 and top-5 accuracies also increase.
For example, in NetBeans, the top-1 and top-5 accuracies for the product-component
combination ``servicepluign-Code'' are 0.20 and 0.50, while the top-1 and top-5
accuracies for the product-component combination ``projects-maven'' are 0.73 and
0.98.”
1.5 Comment: As the paper is revised, attention needs to be paid to expressing
ideas clearly. For example, the notion of features is quite unclear because of your
desire to have a universal model with any number of features, while currently only
two features are used.
Response: Thank you for the advice. To show that our TopicMiner^{MTM} can work with
different features, we have added a new research question “RQ4: What is the
performance of TopicMiner^{MTM} with different input features?”. We have added
the following paragraphs:
“RQ4: What is the performance of TopicMiner^{MTM} with different input features?
By default, we choose product and component as the two input features for MTM. A
bug report also has a number of other features, such as the
reporter. In this research question, we investigate the performance of
TopicMiner^{MTM} with different input features. To answer this research question,
we evaluate the performance of TopicMiner^{MTM} with product only, component
only, reporter only, the product-reporter combination, the component-reporter combination,
and the product-component-reporter combination, and denote them as T^{M_{p}},
T^{M_{c}}, T^{M_{r}}, T^{M_{p,r}}, T^{M_{c,r}}, and T^{M_{p,c,r}}, respectively.”
“7.4.4 RQ4: TopicMiner^{MTM} with Different Input Features
Table 6 presents the top-1 and top-5 accuracies for TopicMiner^{MTM} compared
with T^{M_{p}}, T^{M_{c}}, T^{M_{r}}, T^{M_{p,r}}, T^{M_{c,r}}, and T^{M_{p,c,r}}. We
notice that among the 7 variants, TopicMiner^{MTM} (with the product-component
combination as the input feature) achieves the best performance. Moreover, all of the 7
variants show better performance than the baseline approaches shown in Table 2.”
1.6 Comment: There is an additional oddity that comes from the fact that you really
only have one feature whose value is derived from two features. At least this is the
understanding that I came to.
Response: Thank you for the advice. Yes, in this paper, we use the product-component
combination as the input feature, which is derived from the product and component
features.
1.7 Comment: There are many editorial changes to be addressed that are discussed in
the annotated document. One of these is the graphical model in Figure 7. It doesn’t
seem quite right. In particular, fi doesn’t seem like it should be in the box for m, or if
it is, there is only one fi since each bug report has a single value for product-component.
Beyond that is unclear what the theta and phi matrices contain in the new model. In
traditional LDA, theta is a document-topic matrix while phi is a topic-word matrix.
What information do the matrices in this model represent?
Response: Thank you for the advice. Actually, Fm should be inside the box M. Here,
the box means replication; it is a standard shorthand (plate notation) for representing
a probabilistic graphical model. M is the number of bug reports. For each report, we
have a model with the same structure inside box M. Each report has a set of features,
so we put fi inside box M to denote that there is a set of features for every report. We
use multiple fi in M in order to give a general representation, but it is also fine to use
the vector form Fm to denote it. We changed Figure 8 as follows:
We also added the following paragraphs to explain what the theta and phi matrices
contain in the new model:
“In our MTM model, we have the theta and phi matrices. In the theta matrix, each row is
the topic distribution of a feature combination. This is different from traditional LDA,
which assumes that each document has its own topic distribution. Our model assumes
that each feature combination Fm has a topic distribution. A document’s topic distribution is
then computed based on its feature combination’s topic distribution. The phi matrix
contains the same information as in LDA: it is a topic-word distribution matrix, and each
row in it is the word distribution of one topic.”
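To illustrate how a document's topic distribution can be computed from its feature combination's theta row and the phi matrix, here is one simple sketch: average the per-word topic posteriors p(z = k | w) ∝ theta_F[k] · phi_k[w]. This is an illustrative approximation, not necessarily the exact inference rule in the paper, and all numbers below are made up:

```python
# Toy theta (feature combination -> topic distribution) and
# phi (topic -> word distribution); the values are illustrative only.
theta = {("firefox", "ui"): [0.7, 0.3]}
phi = [
    {"click": 0.5, "menu": 0.4, "crash": 0.1},   # topic 0: UI operations
    {"click": 0.1, "menu": 0.1, "crash": 0.8},   # topic 1: crashes
]

def doc_topic_distribution(feature_comb, words):
    """Average the per-word topic posteriors p(z=k|w) ∝ theta_F[k] * phi_k[w]."""
    prior = theta[feature_comb]
    K = len(prior)
    total = [0.0] * K
    for w in words:
        post = [prior[k] * phi[k].get(w, 1e-12) for k in range(K)]
        s = sum(post)
        for k in range(K):
            total[k] += post[k] / s
    return [t / len(words) for t in total]

dist = doc_topic_distribution(("firefox", "ui"), ["click", "menu", "crash"])
```

A report dominated by UI-operation words ends up with most of its mass on topic 0, even though it also contains a crash-related word.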
Reviewer #2:
2.1 Comment: My main concern is that the paper, in the current state, reads like an
AI paper, not like a software engineering one. For instance, I don’t understand why
there is an extensive explanation of the Gibbs sampling function. It would have been
a lot easier if you just specify the input to that algorithm and the output. I think it is
out of scope to explain in such details the Gibbs estimation algorithm.
Response: Thank you for the advice. To improve the readability of the paper, we
have moved the “Inference of MTM” section into a technical report, which can be
downloaded from https://github.com/xinxia1986/bugTriaging/.
2.2 Comment: I had considerable difficulty in understanding the model. There are
topics and features and terms, but there seem to be too many vectors and
probabilities floating around. The idea of figure 7 is nice, but it does not convey very
much. I strongly recommend that you prepare a UML diagram showing the
relationships between all the concepts you introduce (ie bug reports, experts, topics,
features, words/terms, topic-word vectors, topic assignment vectors, feature-topic
vectors, topic distribution vectors, and perhaps some more).
The topic model is hard to follow. Perhaps some better naming conventions would
help. Why is a feature called fi and not simply f? Why "phi" for the topic-word vector?
Why do features sometimes have the model m as a superscript and sometimes not?
Why "theta" for the feature-topic vector? Why "K" for the number of topics? Why "k"
for a topic? All of these rather poorly chosen names make it just harder to follow
the model. Figure 7 (without any annotations) simply does not help. A UML diagram
of the domain concepts annotated with the names of the vectors would probably help
much more. Also referring to the new bug report as “new” is also confusing.
Response: Thank you for the advice. We changed the graphical model of MTM by using
Fm instead of fi to denote the feature combinations, and we changed the
corresponding paragraphs which use fi.
We also added a figure to explain the relationships between the variables in MTM, and we
added the following paragraphs:
“Figure 9 presents the relationships between the variables in our model and how our model
works. Each bug report $m$ is associated with a feature combination
$\mathbf{F}_m$ and a list of words $w_m$. A set of bug reports is input into the
Multi-feature Topic Model (MTM). There are two sets of parameters in MTM: the
topic-word vectors $\{\phi_1, \phi_2,\cdots, \phi_K\}$ and the feature-topic vectors
$\{\theta_{\mathcal{F}_1}, \theta_{\mathcal{F}_2}, \cdots,\theta_{\mathcal{F}_I}\}$.
They are all unknown and are learned by MTM from the input bug reports.
After learning these parameters, MTM is able to assign a topic to each word; this
assignment vector for bug report $m$ is denoted as $z_m$. In the end, the learned
MTM can be used to recommend potential fixers for a bug report. More details
are introduced in the rest of this paper.”
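The relationships between these concepts can also be sketched as plain data structures, in the spirit of the UML diagram the reviewer requests. The class and field names below are illustrative, not names used in the paper:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FeatureComb = Tuple[str, str]  # e.g. (product, component)

@dataclass
class BugReport:
    """One report m: its feature combination F_m, its words w_m, and, once
    the model is trained, the per-word topic assignments z_m."""
    features: FeatureComb
    words: List[str]
    topic_assignments: List[int] = field(default_factory=list)

@dataclass
class MTMParameters:
    """The two learned parameter sets: phi_k (topic -> word distribution)
    and theta_F (feature combination -> topic distribution)."""
    topic_word: List[Dict[str, float]]
    feature_topic: Dict[FeatureComb, List[float]]

report = BugReport(("firefox", "ui"), ["menu", "click"])
```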
2.3 Comment: I had some problems understanding what you mean by "inferring
topics". What you actually infer are clusters of related terms, and the probabilities of
those terms belonging to those clusters. The clusters correspond to the notion of a
topic, but they lack a name. This could be better explained. In the paper at times you
suggest that you infer the topics themselves, and other times it sounds like that users
must input them (actually they just enter the number of topic clusters).
Response: Thank you for the advice. Yes, what we infer are indeed clusters of related
terms. We added the following sentences in the paper:
“Note that in MTM, a topic is the terminology used to describe a cluster of related words.
Users do not need to input topics at all, as our model is unsupervised, i.e., we do
not need to define the names of the topics (a.k.a. clusters) in advance.”
2.4 Comment: I am missing an insight into why exactly your approach improves on
prior work. Is it due to using probabilities of terms belonging to topics? Is it because
you add the dimension of features (you only use two of them), or is it the machine
learning approach that incremental assigns probabilities?
Response: Thank you for the advice. We have added the following paragraphs in
Section 7.4.1 “RQ1: Accuracy of TopicMiner^{MTM}”:
“Compared with the above baseline approaches, our TopicMiner^{MTM}
recommends bug fixers by using the topic distributions of bug reports instead of the
term distributions of bug reports. The terms in bug reports suffer from synonymy and
polysemy problems, while the topic distributions of bug reports address these
problems by clustering similar terms into topics. Thus, our TopicMiner^{MTM} can
achieve better performance than Bugzie.
Besides, LDA is a general-purpose topic modeling technique, while bug reports are
semi-structured: they contain not only the natural-language description of the bugs,
but also meta-features such as product and component. Leveraging the meta-features
can better capture the topic distributions of bug reports in the same product
and component. Thus, our TopicMiner^{MTM} can achieve better performance
than the bug triaging approaches which use LDA.
Furthermore, our TopicMiner is an incremental learning approach. We update the
model whenever a newly assigned bug report arrives. In this way, our model can adapt
to real-time changes in the open source community, and further improve the
performance of bug triaging.”
2.5 Comment: Page 3 line 58, why do you ignore developers’ comments? I guess bug
triaging is a continuous activity and every time a new element is added to the bug
report, the TopicMiner should re-evaluate the recommended list of developers. I’ve
seen bug reports with very short description but with a long discussion thread. I
assume TopicMiner will fail with such types of reports.
Response: Thank you for the advice. It would indeed be interesting to investigate continuous
bug triaging, but in this paper, we focus on the process of bug triaging when a bug is
submitted to the bug tracking system. Also, following previous studies of bug triaging,
we use the summary and description of bug reports to recommend bug fixers. We
have added the following sentences in the Section “Overall Framework”:
“We ignore any developer discussion since it is not available at the time an assignment
is made. Moreover, previous studies also use the description and summary texts from the
reports to recommend bug fixers [6], [33], [30], [24], [17], [9].”
We also added a sentence in the “Conclusions and Future Works” Section:
“We also plan to consider more content of bug reports, such as discussion comments,
to support continuous bug triaging.”
2.6 Comment: p 4 "K refers to the number of topics which need to be input by end
users." There is a critical typo in this sentence: instead of "need" (ie the topics) you
probably mean "needs" (ie the number of topics). This caused no end of confusion for
me.
Response: Thank you for the advice. We have rephrased the sentence as “K refers
to the number of topics which needs to be input by end users”.
2.7 Comment: Why do you consider only the product and component in the model?
What happens when those features are missing or when they change?
Response: Thank you for the advice. We have added experiments which consider not only
the product and component features, but also the reporter feature. We have
added a new research question “RQ4: What is the performance of TopicMiner^{MTM}
with different input features?”; the added paragraphs are quoted in full in our
response to Comment 1.5 above.
We also added the following sentences in Section “5.1 Modeling a Bug Report”:
“The reason we choose these two features is that developers have to assign values to
these two fields when they submit a bug (i.e., these two fields are not null), and they
are stable (i.e., only a small proportion of bug reports have their product and
component features reassigned before the final bug fixer is assigned). We
analyzed the bug reports of GCC, OpenOffice, Mozilla, and Eclipse, and found that for
86.52%, 85.22%, 79.83%, and 85.33% of the bug reports of the respective software
projects, the product and component fields are finalized before the final bug fixer
is assigned.”
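The way product and component are turned into a single feature combination can be sketched as follows. The dictionary field names are hypothetical; since both fields are mandatory in the bug tracker, no null handling is needed:

```python
from collections import defaultdict

def combination_key(report):
    """Build the product-component feature-combination key for a report."""
    return (report["product"], report["component"])

def group_by_combination(reports):
    """Group bug reports by their feature combination, as MTM's theta
    matrix keeps one topic distribution per combination."""
    groups = defaultdict(list)
    for r in reports:
        groups[combination_key(r)].append(r)
    return groups

# Toy usage with made-up reports.
reports = [
    {"product": "GCC", "component": "c++", "summary": "ICE on template"},
    {"product": "GCC", "component": "c++", "summary": "wrong code generated"},
    {"product": "GCC", "component": "fortran", "summary": "segfault in parser"},
]
groups = group_by_combination(reports)
```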
2.8 Comment: Please give some examples of topics. A running example would help
considerably. (I am guessing that a "topic" corresponds to a cluster of related terms,
and would not easily be identifiable as a "topic" to a human, but you don't state this
clearly, so the reader is left wondering.)
Response: Thank you for the advice. We added a running example to the paper, and
added the following paragraphs:
“Note that in MTM, a topic is the terminology used to describe a cluster of related words.
Users do not need to input topics at all, as our model is unsupervised, i.e., we do
not need to define the names of the topics (a.k.a. clusters) in advance. Our model learns
them automatically. Most of these “topics” are meaningful to a human, based on our
observations and previous works. Our model assumes that the words in a document come
from some underlying topics. We need to discover these underlying topics and the
topic assignments of words. Then, we can represent a document with this discovered
information instead of only the tokens appearing in it.
Our model can be regarded as a simulation of how a developer writes a bug report.
For example, suppose a developer finds a bug in the user interface component of the
product firefox. To write a report describing the bug, he first picks some topics
according to the component user interface and the product firefox. These topics could
include a topic about the user interface with words page, bar, symbol, etc.; a topic
about interface operations with words click, open, close, etc.; a topic about the browser
with words Internet, website, connect, etc.; and some other topics. To write down a
word, the developer first determines which topic he is describing with this
word, then he picks a word from that topic and writes it down. This
process continues until he finishes the report.
These topics are what we intend to learn with our model. Instead of assuming that
each topic contains only a group of words, we assume that each topic is a distribution
over words. The words with high probabilities represent a topic better than the
others. Our model tries to learn these topics automatically. Intuitively, they can also
be regarded as clusters of words.
In reality, the way a developer writes a bug report may differ from what we assume.
However, previous works on topic models have shown that they can learn meaningful
topics automatically. So, in our paper, we design a model in the context of bug
triaging and apply it to a real bug report dataset. Both its performance and the learned
topics show that it is useful.”
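The assumed writing process in this running example can be simulated in a few lines: pick a topic from the feature combination's topic distribution, then pick a word from that topic's word distribution. All distributions below are made-up toy values, not learned parameters:

```python
import random

rng = random.Random(7)

# Illustrative distributions for the (firefox, user interface) combination.
theta_ui_firefox = [0.5, 0.3, 0.2]  # topics: UI, interface operations, browser
phi = [
    (["page", "bar", "symbol"], [0.5, 0.3, 0.2]),
    (["click", "open", "close"], [0.4, 0.4, 0.2]),
    (["internet", "website", "connect"], [0.4, 0.3, 0.3]),
]

def write_report(n_words):
    """Simulate the assumed generative process: sample a topic from the
    feature combination's distribution, then a word from that topic."""
    words = []
    for _ in range(n_words):
        k = rng.choices(range(3), theta_ui_firefox)[0]
        vocab, probs = phi[k]
        words.append(rng.choices(vocab, probs)[0])
    return words

simulated = write_report(6)
```

Inference reverses this process: given the words, MTM recovers the topic assignments and distributions.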
2.9 Comment: In page 5 line 33, the sentence is wrong. MTM extracts the topic
distribution vector from a bug report, not the other way around, right?
Response: Thank you for the advice. Sorry for the confusion. The whole
paragraph describes the generative process of MTM, i.e., how a bug report is generated
under MTM. Thus, the sentence is correct.
Actually, our inference for MTM is based on the generative process of MTM. The
inference process can be viewed as follows: given a bug report with all of its text, infer
the topic distribution of that bug report.
2.10 Comment: p 5 "we set the number of iterations to 500." Why do you pick such
an arbitrary number and not use a fitness function to decide when to stop? Later you
speak of "convergence". Wouldn't it make sense to apply the same principle here?
Response: Thank you for the advice. We added the following sentences in the
paper:
“In this work, following [10], to ensure the convergence of the topic distributions, we set
the number of iterations to 500. We also find that there is little difference when
we set the number of iterations to more than 500 (see Section 8.5).”
We also added a new section named “Impact of Different Numbers of Iterations”:
“By default, we set the number of iterations for MTM to 500. In this section, we also
investigate the performance of TopicMiner^{MTM} with different numbers of
iterations. We vary the number of iterations from 100 to 1,000 in steps of
100. Figures 15, 16, 17, 18, and 19 present the top-1 and top-5 accuracies for
TopicMiner^{MTM} with different numbers of iterations in GCC, OpenOffice,
Netbeans, Eclipse, and Mozilla, respectively. We notice that when the number of
iterations increases from 100 to 500, the performance of TopicMiner^{MTM}
increases; when it increases from 500 to 1,000, the
performance of TopicMiner^{MTM} is stable. In practice, more time is needed to
run MTM with a higher number of iterations. Thus, we recommend that developers
set the number of iterations to 500.”
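The fitness-function alternative the reviewer mentions can be sketched as a generic stopping rule: run the sampler sweep by sweep and stop once the tracked distribution changes by less than a tolerance. The contracting toy `step` below merely stands in for one real Gibbs sweep; it is not MTM code:

```python
def run_until_converged(step, init, max_iters=1000, tol=1e-4):
    """Run `step` (one sweep updating a topic distribution) until the
    largest per-component change falls below `tol`, or max_iters is hit."""
    dist = init
    for it in range(1, max_iters + 1):
        new = step(dist)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new, it
        dist = new
    return dist, max_iters

# Toy sweep that contracts toward a fixed point, standing in for Gibbs sampling.
target = [0.6, 0.3, 0.1]
step = lambda d: [x + 0.5 * (t - x) for x, t in zip(d, target)]
final, iters = run_until_converged(step, [1 / 3, 1 / 3, 1 / 3])
```

A fixed iteration budget (as in the paper, following prior work) trades this adaptive check for predictable running time.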
2.11 Comment: p 6 "we present the inference of topics for a new bug report." Now it
seems you infer the topics themselves. But p 8 "A topic is assigned to a feature
combination if it is assigned to a word inside a document with that feature
combination." So now it seems that the topics are not inferred after all. "In the
inference phase, we infer the topic distribution of a new bug report m_new." In the
end, are topics inferred or not? (Again, I am guessing that the "topics" are just
clusters of terms, and may not correspond to what humans would call topics, but you
must be clear about it.)
Response: Thank you for the comment. The sentence “A topic is assigned to a feature
combination …” actually describes how our Gibbs sampling algorithm works. The
assignment is not done by a human; it is done during the sampling process.
2.12 Comment: p 11 The removal of fixers that appeared less than 10 times and terms
that appear less than 10 times is not justified. Actually this curation will bias the results
in your favour since TopicMiner needs a lot of terms and a lot of bug fixers to get good
results.
Response: Thank you for the advice. We added a new section “Impact of the
Preprocessing of Terms and Fixers” in the paper, and we added the following
paragraphs:
“In the preprocessing of our datasets, we exclude bug fixers who appear fewer than 10
times to reduce noise [33], [9], and we also remove terms which appear fewer than 10
times to reduce noise and speed up the bug triaging process. In this section, we also
investigate the performance of TopicMiner^{MTM} with all of the terms and fixers in
the five datasets, and denote it as TopicMiner^{MTM}_{all}. Table 11 presents the
top-1 and top-5 accuracies of TopicMiner^{MTM}_{all}. On average,
TopicMiner^{MTM}_{all} achieves top-1 and top-5 accuracies of 0.5321 and 0.7736.
We notice that TopicMiner^{MTM} achieves better performance than
TopicMiner^{MTM}_{all}. This is due to the noise introduced when we do not preprocess
the data by removing rarely appearing terms and fixers.”
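The preprocessing step described above can be sketched as a simple frequency filter. The dictionary field names are hypothetical, and the threshold of 10 matches the one quoted above:

```python
from collections import Counter

def filter_rare(reports, min_count=10):
    """Drop reports whose fixer appears fewer than min_count times, and
    remove terms that appear fewer than min_count times overall."""
    fixer_counts = Counter(r["fixer"] for r in reports)
    term_counts = Counter(t for r in reports for t in r["terms"])
    kept = []
    for r in reports:
        if fixer_counts[r["fixer"]] < min_count:
            continue  # rare fixer: exclude the whole report
        terms = [t for t in r["terms"] if term_counts[t] >= min_count]
        kept.append({"fixer": r["fixer"], "terms": terms})
    return kept

# Toy usage: "alice" fixes 12 bugs, "bob" only 3.
reports = ([{"fixer": "alice", "terms": ["crash", "menu"]}] * 12
           + [{"fixer": "bob", "terms": ["crash", "obscure"]}] * 3)
filtered = filter_rare(reports)
```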
2.13 Comment: Actually it would have been interesting to see the incremental results
of each fold of the evaluation and compare different approaches at different time
frames. I would assume that TopicMiner performs badly in the beginning and then
improves as more data is used for training, right?
Response: Thank you for the advice. We have added a new research question “RQ3:
What is the effect of varying the amount of training data to the performance of
TopicMiner^{MTM}?”, and added a new section 4.4.3 to answer this research question.
We have added the following paragraphs:
“RQ3: What is the effect of varying the amount of training data to the performance of
TopicMiner^{MTM}?
To evaluate the performance of TopicMiner^{MTM}, we use the longitudinal data
setup. With the number of folds increase, the amount of the training data increase. In
this research question, we investigate whether the performance of TopicMiner^{MTM}
increases with the amount of training data increase. To answer this research question,
we present the top-1 and top-5 accuracies for the 10 folds as shown in the experiment
setup section.”
“4.4.3 RQ3: Amount of Training Data
Tables 5 and 6 present the top-1 and top-5 accuracies for TopicMiner^{MTM} with
different amounts of training data (fold 0 - fold 9). Note that in our longitudinal data
setup, we divide our data into 11 non-overlapping frames, so one frame corresponds
to 9.09% (1/11) of the total number of bug reports. In fold 0, the amount of training
data is 9.09% of the total number of bug reports, and in fold 9, the amount of training
data is 90.91% of the total number of bug reports.
From the 2 tables, we notice that for GCC and Netbeans, in general, the performance
of TopicMiner^{MTM} increases as the amount of training data increases. For
OpenOffice, Eclipse and Mozilla, in general, the performance of TopicMiner^{MTM}
decreases as the amount of training data increases.
Our collected Eclipse and Mozilla datasets are much larger than the other 3 datasets:
they contain 82,978 and 86,183 bug reports, and 1,898 and 1,813 candidate fixers,
respectively. As the amount of training data increases, the search space for the
candidate fixers grows, and some fixers may leave the community; thus, the
performance of TopicMiner^{MTM} decreases as the amount of training data
increases in Eclipse and Mozilla. Still, the performance of TopicMiner^{MTM} is
acceptable: the top-5 accuracy is above 0.7 for Eclipse and Mozilla in all 10 folds.
For OpenOffice, we notice that in some folds (e.g., folds 2, 5, and 9), the performance
of TopicMiner^{MTM} decreases compared with the previous folds. We manually
checked the dataset and found that in these folds a number of new feature
combinations are introduced. For example, the product-component combination
``TestProdct-other'' is introduced in frame 3 (fold 2), which makes TopicMiner^{MTM}
recommend wrong fixers for this product-component combination.
Moreover, from the two tables, we notice that the number of features (i.e.,
product-component combinations) does not have a direct impact on the performance
of TopicMiner^{MTM}. For example, in Mozilla, the number of product-component
combinations is 777, but it achieves the lowest top-1 and top-5 accuracies compared
with the other 4 datasets. And in GCC, the number of product-component
combinations is only 49, yet its top-1 accuracy is ranked 4th and its top-5 accuracy is
ranked 3rd among the 5 datasets.”
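To make the longitudinal setup concrete, the fold construction can be sketched as follows (a minimal sketch with a hypothetical report count; the real frame sizes differ per dataset):

```python
# Longitudinal data setup: bug reports are sorted chronologically and split
# into 11 non-overlapping frames; fold i trains on frames 0..i and tests on
# frame i+1, so later folds see more training data.
n_reports = 1100  # hypothetical count; each frame holds 1/11 = 9.09%
frame_size = n_reports // 11
frames = [list(range(i * frame_size, (i + 1) * frame_size)) for i in range(11)]

folds = []
for i in range(10):
    train = [r for frame in frames[: i + 1] for r in frame]
    test = frames[i + 1]
    folds.append((train, test))

# Fold 0 trains on 9.09% of the reports, fold 9 on 90.91%.
assert len(folds[0][0]) == 100 and len(folds[9][0]) == 1000
```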
2.14 Comment: The approach assumes that when a bug report is issued, it is final. I
am not sure how the algorithm can consider the change of the elements of the bug
reports.
Response: Thank you for the advice. We added a new section “Impact on the Initial
Assignment of Product and Component” in the paper, and we added the following
paragraphs:
“We analyze the bug reports of GCC, OpenOffice, Netbeans, Mozilla, and Eclipse, and
find that for 86.52%, 85.22%, 38.03%, 79.83% and 85.33% of the bug reports of the
respective software projects, the product and component fields are finalized before
the final bug fixers are assigned. In this section, we also investigate the performance
of TopicMiner^{MTM} using the bug reports with their initial assignment of the
product and component fields, i.e., we use the product and component fields as
assigned when the bug reports are submitted, and we denote this variant as
TopicMiner^{MTM}_{initial}.
Table 12 presents the top-1 and top-5 accuracies of TopicMiner^{MTM}_{initial}.
On average, TopicMiner^{MTM}_{initial} achieves top-1 and top-5 accuracies of
0.5504 and 0.8123. We notice that TopicMiner^{MTM}_{initial} shows similar
performance to TopicMiner^{MTM}. Also, for some projects such as GCC and
Eclipse, TopicMiner^{MTM}_{initial} shows slightly better performance than
TopicMiner^{MTM}. However, in Netbeans, TopicMiner^{MTM} shows better
performance than TopicMiner^{MTM}_{initial}; this is because a large proportion of
bug reports (61.97%) have their product and component reassigned before the final
bug fixers are assigned.”
2.15 Comment:
How are the terms are ordered in the used vectors?
Response: Thank you for the advice. We added the following sentences in the paper:
“These terms are in the same order as in the original bug report. Notice that the term
order does not influence our model as we treat a document as a bag of words.”
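As a minimal illustration of the bag-of-words assumption (with made-up terms), two orderings of the same terms yield an identical representation:

```python
from collections import Counter

# Two term sequences containing the same terms in different orders.
doc_a = ["crash", "ui", "button", "crash"]
doc_b = ["button", "crash", "crash", "ui"]

# Under the bag-of-words assumption only term frequencies matter, so both
# orderings produce an identical representation.
bag_a, bag_b = Counter(doc_a), Counter(doc_b)
assert bag_a == bag_b  # order has no effect
```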
2.16 Comment:
What happens when a certain term can belong to multiple topics?
Response: Thank you for the advice. We added the following paragraph:
“A term can be assigned to multiple topics. The topic assignment of a term is affected
by the other words in the same document. Assume that term w is important to both
the topic UI and the topic interface operation. If w appears in a document with many
words about UI, it is more likely to be assigned to the UI topic. If it appears in a
document with many words related to interface operation, it is more likely to be
assigned to the interface operation topic.”
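This context effect can be illustrated with a toy sketch (all numbers hypothetical): in collapsed Gibbs sampling for LDA, the probability of assigning w to a topic is roughly proportional to the document's current count for that topic times the topic's probability of emitting w.

```python
# Topic-word affinities for the ambiguous term w (hypothetical values):
# on its own, w is equally plausible under both topics.
p_w_given_topic = {"UI": 0.4, "interface_operation": 0.4}

# In collapsed Gibbs sampling, the assignment score of w for a topic is
# proportional to (count of that topic in the document) * P(w | topic).
def assignment_scores(doc_topic_counts):
    return {t: doc_topic_counts[t] * p_w_given_topic[t] for t in p_w_given_topic}

# A document dominated by UI words pulls w towards the UI topic ...
ui_doc = assignment_scores({"UI": 9, "interface_operation": 1})
assert ui_doc["UI"] > ui_doc["interface_operation"]

# ... and a document dominated by interface-operation words does the opposite.
op_doc = assignment_scores({"UI": 1, "interface_operation": 9})
assert op_doc["interface_operation"] > op_doc["UI"]
```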
2.17 Comment: Where did formulas 1 and 3 come from?
Response: Thank you for the advice. These two formulas are derived from the
sampling formula of Gibbs sampling. More details can be found in our technical report:
https://github.com/xinxia1986/bugTriaging/
2.18 Comment: In the generative process of MTM, what are “Dir” and “Mult"?
Response: Thank you for the advice. Dir refers to the Dirichlet distribution, which is
used as the prior distribution for topic models. Mult refers to the multinomial
distribution. We also changed the corresponding paragraphs, replacing “Dir” and
“Mult” with the full names of the distributions.
2.19 Comment:
Page 6 line 53, how many iterations exactly and why?
Response: Thank you for the advice. We added the following sentences in the paper:
“The iteration process terminates after a fixed, large number of iterations. In this
work, following [10], we set the number of iterations to 500.”
2.20 Comment: In algorithm 1, what is “Mult(1/k, …., 1/k)"?
Response: Thank you for the advice. Mult refers to the multinomial distribution.
Mult(1/K, …, 1/K) means sampling a number from 1, 2, …, K with probability 1/K for
each value. In other words, it randomly picks a number from 1, 2, …, K. We put these
details into our technical report.
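A minimal sketch of this uniform initialization (with a hypothetical number of topics K):

```python
import random

K = 7  # number of topics (hypothetical value)

# Mult(1/K, ..., 1/K): draw one value from {1, ..., K}, each with equal
# probability 1/K -- i.e., a uniformly random topic index.
def sample_initial_topic(k):
    return random.choices(range(1, k + 1), weights=[1.0 / k] * k)[0]

samples = [sample_initial_topic(K) for _ in range(1000)]
assert all(1 <= z <= K for z in samples)
```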
2.21 Comment: Typos:
- p 1 line 31, job —> task
- p 1 line 36 "many other [kinds of] useful information"
- p 1 line 42 "a way to deal with synonym[s]"
- p 2 line 56, "and use" —> ", and we use"
- p 3 line 40 "VSM [x do] not"
- p 3 line 2 "different [categorizes -> categories]"
- p 4 line 37 "eventually [assigns -> assign] the new"
- p 5 line 25, not a sentence. Probably replace the fluster with a comma.
- p 9 line 59 "Let Tdf [refers -> refer] to"
- p 10 line 10 "let θ[t] [denotes -> denote] an entry"
- p 11 line 7 "the expertise of these developers [are -> is] hard to predict"
- p 13 line 27, insure —> ensure
Response: Thank you for the advice. We fixed the above typos.
2.21 Comment: Typos in the attachment
Response: Thank you for the advice. We fixed these typos.
2.22 Comment: How do you know Qleg was the best person to fix the bug?
Response: Thank you for the advice. We added a footnote in the paper:
“We check the bug assignment history and commit logs to identify Oleg Krasilnikov as
the bug fixer.”
2.23 Comment: Page 1, Lines 58- 60. Define the features, rewrite the senetcnes
Response: Thank you for the advice. [Xin: not clear how to address this comment]
2.24 Comment: Are these typos yours or in the original report
Response: Thank you for the advice. Yes, these typos are from the original report, and
we use standard notation to identify them.
2.25 Comment: Page 3, Lines 55-57. Are these the only features?
Response: Thank you for the advice. We added a footnote as follows:
“In this paper, by default, we use product and component as the features. We also
investigate other features such as reporter. More details can be found in Section 7.4.”
2.26 Comment: Are product and component really two separate features? For
instance, is it meaningful to discuss the component independently of a product?
Figure is misleading since the Topic Miner has more input then the topic distributions.
For instance the developer and bug information must be inputs during training. Is
anything else?
How do you cope with new topics that emerge over time? It doesn't appear that you
recreate your topic models using your new data...
[Xin: not clear how to address this comment]
2.26 Comment: It sounds like you are interested in a feature-topic matrix analogous
to the document(bug)- topic matrix produced by LDA. Do you have LDA generate the
feature-topic matrix or do you use the bug-topic matrix to produce the feature-topic
matrix?
Response: Thank you for the advice. We add a new section “Topic Extraction with LDA”
into the paper. We also added the following paragraph to make it clear:
“In our MTM model, we have the theta and phi matrices. In the theta matrix, each
row is the topic distribution of a feature combination. This is different from traditional
LDA, which assumes that each document has its own topic distribution; our model
assumes that each feature combination Fm has a topic distribution, and a document’s
topic distribution is then computed based on its feature combination’s topic
distribution. The phi matrix contains the same information as in LDA: it is a topic-word
distribution matrix, in which each row is the word distribution of one topic.”
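The shapes of the two matrices can be sketched as follows (all sizes hypothetical); the key difference from standard LDA is that theta is indexed by feature combinations rather than by documents:

```python
import numpy as np

K, V, F = 5, 100, 3  # topics, vocabulary size, feature combinations (hypothetical)
rng = np.random.default_rng(0)

# theta: one topic distribution per feature combination (F x K). In standard
# LDA, theta instead has one row per document.
theta = rng.dirichlet(np.full(K, 0.1), size=F)

# phi: one word distribution per topic (K x V), exactly as in LDA.
phi = rng.dirichlet(np.full(V, 0.01), size=K)

# A document inherits the topic distribution of its feature combination.
doc_feature_combination = 2
doc_topic_dist = theta[doc_feature_combination]

assert np.allclose(theta.sum(axis=1), 1.0) and np.allclose(phi.sum(axis=1), 1.0)
```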
2.27 Comment: Page 5, Lines 59-60, What is feature combination?
Response: Thank you for the advice. We added the following sentences:
“Feature combination refers to the combination of multiple features. In this paper, by
default, we use the product and component features, and combine them into the
feature combination.”
2.28 Comment: This graphical representation doesn't seem quite right. 1) fi_e should
be fi_I 2) isn't there a theta_fi for each fi? Doesn't that lead to:
Response: Thank you for the advice. We changed the figure as:
2.29 Comment:
Page 5, Line 52. If it is input, why are you deriving it
Response: Thank you for the advice. We mean that TopicMiner uses 𝜃𝑚 as input, but
𝜃𝑚 is derived from MTM. We rephrased the sentence as:
“Since TopicMiner takes as input the topic distribution vector 𝜃𝑚, we also derive
𝜃𝑚 from 𝑧𝑚 by using MTM.”
2.30 Comment: Page 6, Line 9. Is it a new variable?
Response: Thank you for the advice. Voc is not a new variable; it represents the
common vocabulary of the bug report collection. The definition of Voc can be found
on page 6. We also added “a common vocabulary” before “Voc”.
2.31 Comment: Pages 6 – 9, for all the comments in the Section “Inference of MTM”.
Response: Thank you for the advice. As suggested by Reviewer 1 and 2, we move the
whole section into a technical report, which can be found in
https://github.com/xinxia1986/bugTriaging/
2.32 Comment: How many product-component combinations? Isn't that what your
feature actually is?
Response: Thank you for the advice. We modified Table 2 to include the
product-component combinations.
2.32 Comment: Need to study how product components are introduced. What do you
do if a new product-component is introduced?
Response: Thank you for the advice. For a new product-component combination
never seen before, our model falls back to standard LDA.
[Xin: not clear how to address this comment]
2.33 Comment: Note that alpha and beta are standard settings for these variables
with a citation. Where did 11% come from?
Response: Thank you for the advice. We rephrased the sentence as:
“For the training phase of MTM, following a previous study [10], we set the
maximum number of iterations to 500, and the parameters α and β to 50/T (where
T is the number of topics) and 0.1, respectively. By default, we set the number of
topics T to 11% of the number of distinct terms in the training data, since we
empirically find that TopicMiner^{MTM} achieves the best performance under this
setting.”
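As a small sketch of these settings (with a hypothetical vocabulary size):

```python
# Hypothetical vocabulary size of one training dataset.
distinct_terms = 4200

T = round(0.11 * distinct_terms)  # number of topics: 11% of distinct terms
alpha = 50 / T                    # Dirichlet prior alpha = 50/T
beta = 0.1                        # Dirichlet prior beta
max_iterations = 500              # Gibbs sampling iterations, following [10]

assert T == 462
```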
2.34 Comment: Need a transition to indicate these are comparison approaches? Are
they appropriate settings for those approaches?
Response: Thank you for the advice. We added the following sentences:
“We compare TopicMiner^{MTM} with a number of baseline approaches, i.e., Bugzie
[34], LDA-KL [31], LDA-SVM [31], LDA-Activity [24], and Yang et al.’s approach [42].”
And:
“The settings for the baseline approaches are the appropriate settings proposed in
these papers.”
2.35 Comment: Does performance improve as more training data is introduced? An
adopter would need to know the necessary size of the initial training data.
Response: Thank you for the advice. We have added a new research question “RQ3:
What is the effect of varying the amount of training data on the performance of
TopicMiner^{MTM}?”, and added a new Section 4.4.3 to answer this research question.
The added paragraphs are quoted in full in our response to Comment 2.13 above.
2.36 Comment: Page 12, The text should include total time for different numbers of
bug reports. This small number at the individual bug is going to be magnified when
you have a reasonable number of bugs?
[Xin: Cannot address this. Since we use the longitudinal data setup, the number of
bug reports in the training set differs across folds, so we cannot give an overall
time…]
2.37 Comment: Page 12, T Needs more detail. Does this leave out the features? It
doesn't have to...
[Xin: I don’t know what the reviewer means…]
2.38 Comment: Missing the lack of multiple samples threat when using Gibbs
Sampling.
Response: Thank you for the advice. To address this threat, we run MTM and LDA 10
times, and compute the average performance across the 10 runs. We have added
the following sentences in the experiment setup section:
“Moreover, both MTM and LDA use Gibbs sampling to generate the topics, which
introduces randomness into our approach. To reduce this randomness, we run MTM
and LDA 10 times, and we compute the average performance across the 10 runs.”
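The averaging over repeated Gibbs sampling runs can be sketched as follows (the run function is a hypothetical stand-in, with made-up accuracy values, for one full train/evaluate cycle):

```python
import random

# Hypothetical stand-in for one full train/evaluate cycle of MTM (or LDA);
# the real pipeline would rerun Gibbs sampling with a fresh random state and
# return the resulting (top-1 accuracy, top-5 accuracy).
def run_once(seed):
    rng = random.Random(seed)
    return 0.55 + rng.uniform(-0.02, 0.02), 0.82 + rng.uniform(-0.02, 0.02)

runs = [run_once(seed) for seed in range(10)]
avg_top1 = sum(top1 for top1, _ in runs) / len(runs)
avg_top5 = sum(top5 for _, top5 in runs) / len(runs)
assert 0.53 <= avg_top1 <= 0.57 and 0.80 <= avg_top5 <= 0.84
```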
We also changed all the experiment results for TopicMiner^{MTM}, TopicMiner^{LDA},
and the other baseline approaches which also use LDA. Moreover, we have added a
section “Stability of TopicMiner^{MTM}” under the “Discussion” section:
“Notice that our MTM is run 10 times, and the average top-1 and top-5 accuracy
scores are computed across the 10 runs. Here, we would like to investigate whether
the performance of TopicMiner^{MTM} varies when we run MTM multiple times.
Figures 10, 11, 12, 13, and 14 present the top-1 and top-5 accuracies of
TopicMiner^{MTM} across the different runs in GCC, OpenOffice, Netbeans, Eclipse,
and Mozilla, respectively. We notice that across the 5 figures, the performance of
TopicMiner^{MTM} is stable, and the differences between runs are small.
Thus, we believe that our TopicMiner^{MTM} is a stable approach, and the
randomness due to Gibbs sampling has little impact on the performance of bug
triaging.”
We also added a sentence in the “Conclusions and further works” section:
“We also plan to design a better algorithm which runs Gibbs sampling multiple times,
and outputs the average topic distribution across the runs to further reduce the
effect of outliers that appear in a single Gibbs sampling run.”
Reviewer #3:
3.1 Comment: First of all, the authors did not mention at all a very similar work:
Geunseok Yang, Tao Zhang, Byungjeong Lee: Towards Semi-automatic Bug Triage and
Severity Prediction Based on Topic Model and Multi-feature of Bug Reports. COMPSAC
2014: 97-106
In this work - for the first time - the authors introduced the concept of multi-feature
analysis of bug reports. They also analysed the impact of each features on the accuracy
of the bug triaging approach obtaining that "component" is the most important
feature even if the best accuracy is achieved with a combination of several different
features. Thus, the authors have to better explain the differences between TopicMiner
and the approach proposed by Yang and Lee. Which are the peculiarities of
TopicMiner? In addition, the approach by Yang and Lee need to be included as a
further baseline.
Response: Thank you for the advice. We added Yang et al.’s approach as a new
baseline. We added the following paragraphs in RQ1 and also in the related work
section:
“Table 2 compares the performance of TopicMiner^{MTM} with the baselines in
terms of top-1 and top-5 accuracies. From the table, we notice that the improvement
of our method over Bugzie, LDA-KL, LDA-SVM, LDA-Activity, and Yang et al.'s
approach is substantial. Across the 5 projects, TopicMiner^{MTM} on average
improves the top-1 and top-5 prediction accuracies of Bugzie by 128.48% and 53.22%,
LDA-KL by 262.91% and 105.97%, LDA-SVM by 205.89% and 110.48%, LDA-Activity by
377.60% and 176.32%, and Yang et al.'s approach by 59.88% and 13.70%,
respectively.”
In the related work section, we added the following paragraphs:
“Yang et al. also use LDA to extract topics from bug reports, and find the bug reports
related to each topic [41]. For a new bug report, their approach first decides the topics
of the bug report. Then they utilize the multi-feature (i.e., component, product,
priority and severity) to identify similar reports that have the same multi-feature as
the new bug report, and recommend developers based on the similar reports. Our
approach is different from Yang et al.'s approach. First, we design a specific topic
model named MTM which incorporates the multi-feature information into the topic
model, while Yang et al. only use LDA. Second, we also propose an incremental
approach to recommend developers, while Yang et al.'s approach is not incremental
and is based on similar bug reports. Third, Yang et al. use the severity and priority
fields of the bug reports; however, in practice, most bug reports have their severity
and priority values set to the default value [39].”
We also changed Table 3 to include the model construction time, and
recommendation and update time for Yang et al.’s approach:
3.2 Comment: Another problem is related with the readability of the paper. The first
part is really hard to read. A lot of details are provided on MTM. Of course, this is fine.
However, a bird’s eye view of the approach is completely missing. In other words it is
not clear how features are used by MTM. For instance, it is not clear at all why when
several LDA models are built for each group of similar bug reports (where similarity is
computed considering the bug report features) is not enough to achieve the results
achieved with TopicMiner-MTM (see Section 4.5.2). The authors should better
emphasize and describe the peculiarities of TopicMiner-MTM in order to convince the
reader that clustering bug reports based on the selected features and build specific
LDA models for each cluster is not enough to obtain good results. Of course the results
achieved highlight the benefits of TopicMiner-MTM as compared with a naïve
approach based on clustering. But I would like to understand why such results are
achieved.
Response: Thank you for the advice. We also consider a new baseline, named
TopicMiner^{LDA}_{Local}, which clusters bug reports based on the selected features
and builds specific LDA models. We added a new section “TopicMiner^{LDA} and
TopicMiner^{LDA}_{Local}”:
“TopicMiner can be paired with various topic models. To further validate the benefit
of our new topic model MTM, we pair TopicMiner with LDA (TopicMiner^{LDA}) and
compare its performance with TopicMiner^{MTM}. To pair TopicMiner with LDA, we
simply modify the first step of TopicMiner, described in Section 6, to use LDA instead
of MTM. Moreover, we also group bug reports according to their product-component
combination: bug reports with the same product-component combination are put
into the same group. Next, for each group, we use LDA to extract the topic
distributions of its bug reports, and build a TopicMiner model on these topic
distributions. For a new bug report, we first get its product-component combination,
and use the corresponding TopicMiner model to recommend fixers. We refer to this
baseline as TopicMiner^{LDA}_{Local}.
Table 9 presents the top-1 and top-5 accuracies for TopicMiner^{MTM},
TopicMiner^{LDA}, and TopicMiner^{LDA}_{Local}. We notice that
TopicMiner^{MTM} achieves better performance than TopicMiner^{LDA} and
TopicMiner^{LDA}_{Local}. Across the 5 projects, TopicMiner^{MTM} on average
improves the top-1 and top-5 prediction accuracies of TopicMiner^{LDA} by 62.43%
and 21.46%, and TopicMiner^{LDA}_{Local} by 53.87% and 19.67%, respectively.”
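The grouping step of this baseline can be sketched as follows (hypothetical reports, with a placeholder standing in for the actual LDA training):

```python
from collections import defaultdict

# Hypothetical bug reports: (product, component, preprocessed terms).
reports = [
    ("UI", "Button", ["click", "crash"]),
    ("UI", "Button", ["render", "crash"]),
    ("Core", "Parser", ["token", "error"]),
]

# Group bug reports by their product-component combination.
groups = defaultdict(list)
for product, component, terms in reports:
    groups[(product, component)].append(terms)

# Placeholder for fitting one LDA-based TopicMiner model per group; the real
# baseline would run LDA on each group's documents.
def train_local_model(docs):
    return {"num_docs": len(docs)}

local_models = {combo: train_local_model(docs) for combo, docs in groups.items()}

# A new report is routed to the model of its product-component combination.
assert local_models[("UI", "Button")]["num_docs"] == 2
```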
3.3 Comment: I have also some concerns related to the empirical evaluation. The
study is large and generally well conducted. However, the description of the results is
quite arid. The authors just report the numbers achieved without providing any
qualitative analysis aiming at justifying the results achieved. In order words, could be
interesting to provide some examples aiming at describing some scenarios where
TopicMiner is able to overcome the other approaches and also some examples with a
different scenario, where TopicMiner is less accurate than the baseline approaches. In
this way, the authors have the possibility to explain in which circumstances an
approach is better than the others. As a final comment related to this point, I would
like to see also a discussion of negative results. For instance, could be interesting to
provide some explanation on why AC is the techniques that achieved the worst results.
Or, why on Mozilla and OpenOffice the accuracy of all the techniques is sensibly lower
than the accuracy achieved on the other systems.
Response: Thank you for the advice. We added the following paragraphs in RQ1:
“In OpenOffice and Mozilla, we notice that TopicMiner^{MTM} does not perform as
well as in GCC, Netbeans, and Eclipse. We manually checked the datasets and found
that in OpenOffice and Mozilla more developers leave or join the communities, which
increases the difficulty of recommending fixers.
Compared with the above baseline approaches, our TopicMiner^{MTM} recommends
bug fixers by using the topic distributions of bug reports instead of their term
distributions. The terms in bug reports suffer from synonym and polysemy problems,
while the topic distributions of bug reports address these problems by clustering
similar terms into topics. Thus, our TopicMiner^{MTM} can achieve better
performance than Bugzie.
Besides, LDA is a general-purpose topic modeling technique, while bug reports are
semi-structured: they contain not only the natural language description of the bugs,
but also meta-features such as product and component. Leveraging these
meta-features can better capture the topic distributions of bug reports in the same
product and component. Thus, our TopicMiner^{MTM} can achieve better
performance than the bug triaging approaches which use LDA.
Furthermore, our TopicMiner is an incremental learning approach. We update the
model whenever a newly assigned bug report arrives. In this way, our model can
adapt to real-time changes in the open source community, and further improve the
performance of bug triaging.
We notice that LDA-Activity does not work as well as the other baseline approaches.
LDA-Activity creates an activity profile for each developer, and the activity profile
contains the review, assign, and resolve activities of a developer. Since our task is to
recommend a fixer, considering additional activities such as review and assign
increases the size of the set of candidate bug fixers, especially for a large-scale
dataset.
Figures 10 and 11 present two bug reports from OpenOffice. Both bug reports
belong to the product porting and the component code, and are assigned to foskey.
Although the terms in these two bug reports are different, they both describe
configuration bugs. In this case, a topic-modeling-based approach such as
TopicMiner^{MTM} achieves better performance than a term-based approach such
as Bugzie. Moreover, since LDA does not consider the specific topic distributions
under different product-component combinations, the LDA-based approaches do
not work well for these two bug reports. We manually checked the topic distributions
of these bug reports and found that their values for the configuration topic are small.
Our MTM, in contrast, considers the topic distribution of each product-component
combination, and for these two bug reports, the values for the configuration topic
are the largest compared to the other topics. Thus, our TopicMiner^{MTM} can
recommend bug fixers for bug reports that share the same product-component
combination and describe similar topics.
We also notice that our TopicMiner^{MTM} does not work as well as the other
baseline approaches for bug reports whose product-component combinations appear
fewer than 5 times. For example, in fold 0 of Mozilla, there is only one bug report with
product Directory and component LDAP C SDK. Our TopicMiner^{MTM} cannot
recommend a bug fixer for this bug report, while other approaches such as Bugzie and
LDA-SVM can recommend the fixer.”
3.4 Comment: The authors empirically derived the optimal value of the number of
topics for TopicMiner. However, they did not follow the same procedure for the other
LDA-based techniques. It is well-known that the number of topics varies from different
repositories and that is quite difficult to identify a LDA configuration that works well
on different repositories. Thus, also for the other techniques an empirical tuning of
the number of topics is required as well in order to avoid any bias during the
comparison. Note that the number of topics plays a crucial role, as indicated by the
results reported by the authors when tuning the number of topics for TopicMiner.
Similar values for the number of topics can result in quite different levels of accuracy.
For instance, on GCC is possible to increase the accuracy of more than 15% by simple
changing the number of topics from 9% to 11%.
Response: Thank you for the advice. We added more experiments for these
LDA-based approaches, i.e., LDA-KL, LDA-SVM, LDA-Activity, and Yang et al.'s
approach. We added the following paragraphs:
“Moreover, for the other LDA-based approaches (i.e., LDA-KL, LDA-SVM, LDA-Activity,
and Yang et al.'s approach), we also set the number of topics to 11% of the number of
distinct terms in a training dataset. Notice that the number of topics may affect the
performance of these approaches. To answer this research question, we vary the
number of topics from 1% to 15% of the number of distinct terms in a training dataset,
and compare the performance of TopicMiner^{MTM} with LDA-KL, LDA-SVM,
LDA-Activity, and Yang et al.'s approach.”
“Figures 9, 10, 11, 12, and 13 present the top-1 and top-5 accuracies of
TopicMiner^{MTM} compared with LDA-KL (KL), LDA-SVM (SVM), LDA-Activity (AC),
and Yang et al.'s approach (Yang) for various numbers of topics on the 5 datasets. We
notice that our TopicMiner^{MTM} shows the best performance as we increase the
number of topics from 1% to 15% of the number of distinct terms in the training
dataset.
In general, up to a certain point, the performance of TopicMiner^{MTM} increases as
the number of topics increases; after that point, the performance either remains
stable or decreases. In our experiments, setting the number of topics to 11% of the
number of distinct terms achieves the best performance. Also, LDA-KL, LDA-SVM,
LDA-Activity, and Yang et al.'s approach show similar trends to TopicMiner^{MTM} as
the number of topics increases, and these baseline approaches achieve the best
performance when the number of topics is 11% of the number of distinct terms.”
3.5 Comment: In the context of the empirical evaluation, the authors performed the
longitudinal data setup used also in other related studies. However, I would like to see
whether it is worthwhile to consider the whole history of a software system when
training the model. For instance, when considering the final fold, all the bug reports
in the frame 0-9 are used to train the model that is evaluated on the frame 10.
However, as observed in the bug prediction area, it is not always worthwhile to
consider the all history of software system to train the bug prediction model. Is this
consideration valid also for bug triaging? Based on the findings achieved in a recent
published paper
Ramin Shokripour, John Anvik, Zarinah Mohd Kasirun, Sima Zamani:
A time-based approach to automatic bug report assignment.
Journal of Systems and Software 102: 109-122 (2015)
such a consideration seems to hold also in the context of bug triaging. Thus, I would
like to see which are the accuracy of the bug triaging models when the training set is
reduced to just the last two frames or the last frame? I think that such an analysis
could be a further interesting contribution of this paper.
Response: Thank you for the advice. We added a new section “Effect on the Last
Frame” under the “Discussion” section:
“In the previous sections, we use the same longitudinal data setup described in [34]
[9]. Shokripour et al. find that training a prediction model on the whole history of a
software system may degrade performance [30]. Here, we would like to investigate
whether the same holds for bug triaging. In each fold, we reduce the training set to
include only the last frame. For example, in fold 9, we only use the bug reports in
frame 9 to build the prediction model, and test on the bug reports in frame 10.
Table 7 compares the performance of TopicMiner^{MTM} with the baselines in terms
of top-1 and top-5 accuracies. We notice that our TopicMiner^{MTM} still shows
substantial improvement over the baseline approaches. Moreover, we notice that
TopicMiner^{MTM} using only the last frame achieves better performance than
TopicMiner^{MTM} using all of the historical bug reports. On average across the 5
projects, the top-1 and top-5 accuracies for TopicMiner^{MTM} using only the last
frame are 0.5864 and 0.8313, while these scores for TopicMiner^{MTM} using all of
the historical bug reports are 0.5607 and 0.8251, respectively.”