Automatically Mining Software-Based,
Semantically-Similar Words from
Comment-Code Mappings
Matthew J. Howard, Samir Gupta, Lori Pollock, K. Vijay-Shanker
University of Delaware
Computer and Information Sciences
save data
saving data
1
save data
saving data
1
save data
saving data
Missed Code: writeData( )
Reason: Vocabulary Mismatch Problem
„save‟ and „write‟ are synonymous in software
1
save data
saving data
Missed Code: writeData( )
Reason: Vocabulary Mismatch Problem
„save‟ and „write‟ are synonymous in software
Research Question:
Can we automatically mine semantically-similar words from code?
1
Why Mine from Code?
Semantic similarity information helps:
•
•
•
•
Code search
Code clone identification
Code recommendations
…
2
Why Mine from Code?
Semantic similarity information helps:
•
•
•
•
Code search
Code clone identification
Code recommendations
…
However, semantic-similarity techniques for English do not work well for
software [Sridhara et.al., 2008]
Examples:
< undo, restore>
< fire, notify>
< add, register>
2
Problem:
Semantically-Similar Word Mining
Input Code
Miner
Software-Based
Similar Pairs
3
Problem:
Semantically-Similar Word Mining
Input Code
Miner
Software-Based
Similar Pairs
Word Focus: VERBS describe main method action in OOP
We believe developers utilize synonyms of verbs more often than nouns.
3
Problem:
Semantically-Similar Word Mining
Input Code
Miner
Software-Based
Similar Pairs
Word Focus: VERBS describe main method action in OOP
We believe developers utilize synonyms of verbs more often than nouns.
Closest related work: Yang and Tan, 2012
Their Objective:
Generate pairs within a single context
Our Objective:
Generate word pairs relevant across all software
3
Insight of Approach
By programmer convention,
Leading Comments describe the same intent as Method Signatures
Implies a semantic relation between identified actions from both contexts
4
Insight of Approach
By programmer convention,
Leading Comments describe the same intent as Method Signatures
Implies a semantic relation between identified actions from both contexts
Examples
/* Serialize this object to an output stream. */
void writeToFile(OutputStream out) {…}
/* Responds to a breakpoint event. */
void _handleBreakpointEvent(Event e) {…}
4
Insight of Approach
By programmer convention,
Leading Comments describe the same intent as Method Signatures
Implies a semantic relation between identified actions from both contexts
Examples
/* Serialize this object to an output stream. */
void writeToFile(OutputStream out) {…}
/* Responds to a breakpoint event. */
void _handleBreakpointEvent(Event e) {…}
Simplest Case!
4
Challenges
5
Challenges
Determining descriptive comments is non-trivial:
/* Called every n bytes,
where n is defined by
Settings */
Explanatory
5
Challenges
Determining descriptive comments is non-trivial:
/* Called every n bytes,
where n is defined by
Settings */
Explanatory
/* Method used to create
an about menu item for
use in this application*/
Descriptive
5
Challenges
Determining descriptive comments is non-trivial:
/* Called every n bytes,
where n is defined by
Settings */
Explanatory
/* Method used to create
an about menu item for
use in this application*/
Descriptive
/* debug output for
player trackers */
TODO/Descriptive ???
5
Challenges
Determining descriptive comments is non-trivial:
/* Called every n bytes,
where n is defined by
Settings */
Explanatory
/* Method used to create
an about menu item for
use in this application*/
Descriptive
/* debug output for
player trackers */
TODO/Descriptive ???
And locating the main action (if it exists) from a comment is also difficult:
5
Challenges
And locating the main action (if it exists) from a comment is also difficult:
/* This method is called from within
the constructor to initialize the form.
WARNING: Do NOT modify this
code. The content of this method is
always regenerated by the Form
Editor. */
5
Challenges
Abundant data is readily available, thus we focus on mining QUALITY pairs
We favor Precision over Recall
5
Phases of Automation
set of methods
with leading
comments
Identify methods
with descriptive
leading comments
method
signatures
leading
comments
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Phases of Automation
set of methods
with leading
comments
Input
Identify methods
with descriptive
leading comments
method
signatures
leading
comments
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Phases of Automation
set of methods
with leading
comments
Input
Identify methods
with descriptive
leading comments
Phase I
method
signatures
leading
comments
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Phases of Automation
set of methods
with leading
comments
Input
Identify methods
with descriptive
leading comments
Phase I
method
signatures
leading
comments
Phase II
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Phases of Automation
set of methods
with leading
comments
Input
Identify methods
with descriptive
leading comments
Phase I
method
signatures
leading
comments
Phase II
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Phase III
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Phases of Automation
set of methods
with leading
comments
Input
Identify methods
with descriptive
leading comments
Phase I
method
signatures
leading
comments
Phase II
Extract main action
from descriptive
comment
Extract main action
from associated
signature
word pairs
Phase III
Output
Analyze & Rank
comment-code
word pairs
automatically
generated
semantically-similar
word pairs
6
Approach:
Determining Descriptive Comments
Leading
Comment
Phrase
Segmenter
Stanford
POS
Tagger
Descriptive
Phrase
Filter
Descriptive/
Non-Descriptive
Comment
Actions
Approach:
Determining Descriptive Comments
Descriptive
Phrase
Filter
Descriptive Phrase Filtering Examples
/* If there is a camera
bound to this view, copy the
coordinates from it. */
Descriptive
Ignore IF-clauses
(“If…”)
Descriptive/
Non-Descriptive
Approach:
Determining Descriptive Comments
Descriptive
Phrase
Filter
Descriptive Phrase Filtering Examples
/* If there is a camera
bound to this view, copy the
coordinates from it. */
Descriptive
/* Method called by the framework
immediately after RRD update
operating finishes. This method will
synchronize in-memory cache with
the disk content */
Descriptive
Ignore IF-clauses
(“If…”)
Ignore contextual past-tense phrases
(“called from…”, “performed by…”)
Descriptive/
Non-Descriptive
Approach:
Determining Descriptive Comments
Descriptive
Phrase
Filter
Descriptive/
Non-Descriptive
Descriptive Phrase Filtering Examples
/* If there is a camera
bound to this view, copy the
coordinates from it. */
Descriptive
/* Method called by the framework
immediately after RRD update
operating finishes. This method will
synchronize in-memory cache with
the disk content */
/* Overrides
hashCode*/
Non-Descriptive
Descriptive
Ignore IF-clauses
(“If…”)
Ignore contextual past-tense phrases
(“called from…”, “performed by…”)
Language Keywords
(“return”, etc.)
Approach:
Extracting Main Actions from
Comments and Method Signatures
Descriptive Comment:
Identified
Actions
Comment
Action
Filter
Main
Action
8
Approach:
Extracting Main Actions from
Comments and Method Signatures
Descriptive Comment:
Comment
Action
Filter
Identified
Actions
Main
Action
Example
Strategy
/* Internal method for updating the
linkage of this graph. */
Analyze POS Tags
/* Performs deconstruction. */
Vague for Synonym
/* Uncomments all lines between start
and end inclusive. */
Word Frequency in Software
Templates
8
Approach:
Extracting Main Actions from
Comments and Method Signatures
Descriptive Comment:
Comment
Action
Filter
Identified
Actions
Main
Action
Example
Strategy
/* Internal method for updating the
linkage of this graph. */
Analyze POS Tags
/* Performs deconstruction. */
Vague for Synonym
Word Frequency in Software
/* Uncomments all lines between start
and end inclusive. */
Method Signature:
POS tagger for code: Gupta, et. al. at ICPC 2013
Templates
8
Leading Comment
Example of Process
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the specified
page are unique. */
Method Signature
void assertAccessKeyAttrsUnique(Page p){…}
9
Leading Comment
Example of Process
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Method Signature
void assertAccessKeyAttrsUnique(Page p){…}
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are unique.
9
Leading Comment
Example of Process
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Method Signature
void assertAccessKeyAttrsUnique(Page p){…}
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Possible Main Actions:
verifies
attributes
specified
9
Leading Comment
Example of Process
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Method Signature
void assertAccessKeyAttrsUnique(Page p){…}
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Possible Main Actions:
verifies
Good!
attributes
specified
Noun Phrase
Past Tense
9
Example of Process
Leading Comment
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Method Signature
void assertAccessKeyAttrsUnique(Page p){…}
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Possible Main Actions:
verifies
Good!
attributes
specified
Noun Phrase
Main Action:
Past Tense
verifies
9
Example of Process
Leading Comment
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Method Signature
void assert
assertAccessKeyAttrsUnique(Page
AccessKeyAttrsUnique(Pagep){…}
p){…}
Possible Action Terms from Tagger:
assert
access
Possible Main Actions:
verifies
Good!
attributes
specified
Noun Phrase
Main Action:
Past Tense
verifies
9
Example of Process
Leading Comment
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Possible Main Actions:
verifies
Good!
attributes
specified
Noun Phrase
Main Action:
Method Signature
void assert
assertAccessKeyAttrsUnique(Page
AccessKeyAttrsUnique(Pagep){…}
p){…}
Possible Action Terms from Tagger:
assert
access
Main Action:
assert
Past Tense
verifies
9
Example of Process
Leading Comment
/* Many HTML components can have
an accesskey attribute which defines
a hot key for keyboard navigation.
This method verifies that all the
accesskey attributes on the
specified page are unique. */
Descriptive Phrases:
This method verifies that all the accesskey
attributes on the specified page are
unique.
Possible Main Actions:
verifies
Good!
attributes
specified
Noun Phrase
Main Action:
Method Signature
void assert
assertAccessKeyAttrsUnique(Page
AccessKeyAttrsUnique(Pagep){…}
p){…}
Possible Action Terms from Tagger:
assert
access
Main Action:
assert
Past Tense
verifies
Semantically-Similar Pair
< verify, assert >
9
Evaluation:
Can we automatically mine semantically-similar words from code?
10
Evaluation:
Can we automatically mine semantically-similar words from code?
36 open-source Java programs
Approx. 3
million lines of code
112.2K total methods
20.2K methods documented with leading comments
10
Phase Evaluations
How well does our automatic system determine descriptive comments?
How well does our automatic system identify the main action word from a
descriptive comment?
How accurate is the automatic extraction of the main action from a
method signature?
11
Phase Evaluations
How well does our automatic system determine descriptive comments?
87% agreement with annotators
How well does our automatic system identify the main action word from a
descriptive comment?
How accurate is the automatic extraction of the main action from a
method signature?
11
Phase Evaluations
How well does our automatic system determine descriptive comments?
87% agreement with annotators
How well does our automatic system identify the main action word from a
descriptive comment?
95% agreement when 2 annotators agreed with each other
How accurate is the automatic extraction of the main action from a
method signature?
11
Phase Evaluations
How well does our automatic system determine descriptive comments?
87% agreement with annotators
How well does our automatic system identify the main action word from a
descriptive comment?
95% agreement when 2 annotators agreed with each other
How accurate is the automatic extraction of the main action from a
method signature?
91% agreement with 1+ annotators
11
Evaluation of Mined
Semantically-Similar Pairs
How well do our word-pairs reflect reasonable semantically-similar words
in computer science?
How well do we recall word-pairs from a human-annotated gold set?
12
Evaluation of Mined
Semantically-Similar Pairs
How well do our word-pairs reflect reasonable semantically-similar words
in computer science?
Frequency
Threshold
%Mined Pairs
Humans Accepted
5-6
49%
7-9
77%
10+
79%
How well do we recall word-pairs from a human-annotated gold set?
12
Evaluation of Mined
Semantically-Similar Pairs
How well do our word-pairs reflect reasonable semantically-similar words
in computer science?
Frequency
Threshold
%Mined Pairs
Humans Accepted
5-6
49%
7-9
77%
10+
79%
How well do we recall word-pairs from a human-annotated gold set?
19/24 pairs recalled from [Sridhara et. al., 2008]
12
Sample Mined Pairs
Add – Register
Export – Dump
Fire – Notify
Find – Search
Finalize – Clean
Parse – Read
Handle – Respond
Fill – Draw
Clone – Create
Init – Initialize
Undo – Restore
Quit – Close
Paint – Draw
Display – Print
Reset - Clear
Do not appear in WordNet!
Matthew J. Howard, Samir Gupta, Lori Pollock, K. Vijay-Shanker
This material is based upon work supported by the National Science Foundation under Grant No. CCF 0915803
© Copyright 2026 Paperzz