Mining Source Code Change History
for Program Understanding
Chadd Williams
University of Maryland
Problem
How much do you know about your 10-year-old code base?
– What types of bugs have been most common?
Implicit rules build up over time
– What do you do with a return value from a
function?
– Didn’t someone rewrite the matrix objects?
• how do you apply a transformation to an
image now?
Failure to understand implicit rules leads
to bugs
– 32% of bugs detected during maintenance [1]
[1] Matsumura, T., Monden, A., Matsumoto, K., The Detection of Faulty Code Violating
Implicit Coding Rules, IWPSE ’02
Source Code Change History
We can discover important properties
of the code by looking at code changes
– every change is committed
– changes highlight misunderstood code
– changes highlight new code
Studying each commit gives fine-grained
knowledge
– how quickly does a property emerge?
– how fast is a property adopted?
– how often is it used later?
Applications
Bug finding
– what types of bugs have been fixed in the past?
– what functions were involved?
– Return Value Check Bug Finder
Code writing
– how do we use that API?
– how do we access that data
structure?
– Function Usage Pattern Miner
open(f)
tmp = cnt = 0
while (cnt < sz && tmp != -1)
    tmp = read(f, sz)
    if (tmp != -1)
        cnt += tmp
close(f)
Return Value Check Bug
Returning error code and valid data
from a function is a common C idiom
– the return value should be
checked before being used
– lint checks for this error
This type of bug pattern
has a high false positive
rate
– no error value returned
– no useful return value
int foo(){
    …
    if( error ){
        return error_code;
    }
    …
    return data;
}
…
value = foo();
newPosition += value; // ???
Build a bug checker
– improve its results with data from CVS
Goal
Which warnings are most likely true errors?
– where has the source code been changed to add
such a check?
– look at each revision of each file in CVS
– flag a function as involved in a return value
check in the CVS repository
Before the CVS commit:
    value = foo();
    newPosition += value; // ???
After the CVS commit:
    value = foo();
    if( value != Error ) // Check
        newPosition += value;
Produce a ranking of the errors
– group warnings by called function
– rank functions that most likely need their
return value checked higher
HistoryAware Ranking
Split functions into two groups
– flagged with a likely bug fix in a commit
– not flagged with a likely bug fix in a commit
Rank by how often the function’s return value
is checked in the latest version
– current context
[Figure: two ranked lists. Functions flagged with a likely bug fix in CVS, ranked by current context data from 0.99 down to 0.10; functions not flagged with a likely bug fix in CVS, ranked by current context data from 0.99 down to 0.51.]
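The two-group ordering can be sketched as a sort comparator. This is a minimal sketch, not the tool's actual implementation: the `func_stat` record and its field names are hypothetical, and it assumes the flagged group is placed ahead of the unflagged group, with each group ordered by how often the return value is checked in the latest version.

```c
#include <stdlib.h>

/* Hypothetical record for one called function; field names are illustrative. */
struct func_stat {
    const char *name;
    int flagged_in_cvs;      /* 1 if some commit added a return value check */
    double checked_fraction; /* share of current call sites that check the result */
};

/* Order: flagged functions first, then by checked_fraction, highest first. */
static int history_aware_cmp(const void *a, const void *b)
{
    const struct func_stat *x = a;
    const struct func_stat *y = b;
    if (x->flagged_in_cvs != y->flagged_in_cvs)
        return y->flagged_in_cvs - x->flagged_in_cvs;
    if (x->checked_fraction < y->checked_fraction) return 1;
    if (x->checked_fraction > y->checked_fraction) return -1;
    return 0;
}

/* Sort the functions in place into the assumed HistoryAware order. */
void history_aware_rank(struct func_stat *funcs, size_t n)
{
    qsort(funcs, n, sizeof *funcs, history_aware_cmp);
}
```

A naive ranking, by contrast, would sort on `checked_fraction` alone and ignore the CVS flag.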
Case Studies
Does the HistoryAware ranking push likely
bugs to the top?
Apache web server
– 1,129 C source files
– 41,000 CVS commits
Wine, OSS Windows API
– 3,099 C source files
– 70,000 CVS commits
Compare HistoryAware Ranking to Naïve
Ranking
– current context
Inspection criteria for warnings
– functions flagged with a bug fix in a commit
– functions with return value checked >50% in
current context
Results - Apache
                                    Warnings   Likely Bugs   False Positive Rate
CVS Bug Fix flagged functions            284           101                   64%
Non-CVS Bug Fix flagged functions        283            70                   75%
Total                                    567           171                   70%
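The false positive rates in the table are consistent with a simple reading: the share of inspected warnings that were not judged likely bugs. The slides do not define the rate, so this is an assumption; a minimal sketch with a hypothetical helper name:

```c
/* False positive rate as a percentage: the share of warnings that were
 * not judged likely bugs. Returns 0 when there are no warnings. */
double false_positive_rate(int warnings, int likely_bugs)
{
    if (warnings <= 0)
        return 0.0;
    return 100.0 * (double)(warnings - likely_bugs) / (double)warnings;
}
```

For the CVS-flagged row, `false_positive_rate(284, 101)` gives about 64.4, which rounds to the 64% shown.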
Statistical Significance
– A chi-square test finds the difference between the false positive rate of the CVS bug fix flagged functions and that of functions checked > 50% of the time in the current context to be significant
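The slides do not say which variant of the chi-square test was used. As an illustration only, the sketch below computes the Pearson statistic for a 2x2 table without continuity correction, assuming the two groups are the two rows of the Apache table (likely bugs versus false positives in each).

```c
/* Pearson chi-square statistic for a 2x2 table, no continuity correction:
 *              likely bug   false positive
 *   group 1        a              b
 *   group 2        c              d
 * Returns 0 if any row or column total is zero. */
double chi_square_2x2(double a, double b, double c, double d)
{
    double n = a + b + c + d;
    double denom = (a + b) * (c + d) * (a + c) * (b + d);
    double diff = a * d - b * c;
    if (denom == 0.0)
        return 0.0;
    return n * diff * diff / denom;
}
```

With the Apache counts, `chi_square_2x2(101, 183, 70, 213)` comes out at roughly 7.9, above the 3.84 critical value for one degree of freedom at the 0.05 level.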
[Figure: Apache. Precision (y-axis) versus Inspected Warnings (x-axis), Naive Ranking compared with HistoryAware Ranking.]
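One plausible reading of the plotted metric is precision over the first k inspected warnings: the fraction of those warnings judged to be likely bugs. The slides do not spell this out, so the definition and the function name below are assumptions.

```c
#include <stddef.h>

/* is_bug[i] is nonzero if the i-th warning in ranked order was judged a
 * likely bug. Returns the fraction of likely bugs among the first k
 * warnings (k is clamped to n; 0 is returned when nothing is inspected). */
double precision_at_k(const int *is_bug, size_t n, size_t k)
{
    size_t hits = 0;
    if (k > n)
        k = n;
    if (k == 0)
        return 0.0;
    for (size_t i = 0; i < k; i++)
        hits += is_bug[i] ? 1 : 0;
    return (double)hits / (double)k;
}
```

A ranking that pushes likely bugs to the top keeps this value high for small k.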
Results - Wine
                                    Warnings   Likely Bugs   False Positive Rate
CVS Bug Fix flagged functions            778           260                   67%
Non-CVS Bug Fix flagged functions       1537           285                   81%
Total                                   2315           545                   76%
Statistical Significance
– A chi-square test finds the difference between the false positive rate of the CVS bug fix flagged functions and that of functions checked > 50% of the time in the current context to be significant
[Figure: Wine. Precision (y-axis) versus Inspected Warnings (x-axis), Naive Ranking compared with HistoryAware Ranking.]
Function Usage Pattern Miner
System-specific rules that source code
must follow
Function Usage Pattern
– how functions are invoked with respect to each
other in the source code
Called After:
    HDC hdc = BeginPaint( hwnd, &ps );
    if( hdc )
        DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );
Conditionally Called After:
    mdi = HeapAlloc(GetProcessHeap());
    if (!mdi)
        HeapFree(GetProcessHeap(), 0, cs);
Find new instances of patterns added to
the source code
Our Tool
Analyze each revision of each file
– record instances of the function usage patterns
Find new instances of the patterns
– instances of a pattern in a revision of a file
where that instance was not found in the
revision immediately prior
– per file, not per function
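Finding new instances amounts to a per-file set difference between a revision and the one immediately before it. A minimal sketch, assuming each instance is reduced to a pair of function names and that instances are treated as a set per file; the `instance` struct and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pattern instance: `second` is called after `first`. */
struct instance {
    const char *first;
    const char *second;
};

static int same_instance(const struct instance *a, const struct instance *b)
{
    return strcmp(a->first, b->first) == 0 &&
           strcmp(a->second, b->second) == 0;
}

/* Copy into `out` every instance of the current revision that is absent
 * from the prior revision of the same file; returns how many were copied.
 * `out` must have room for n_cur entries. */
size_t new_instances(const struct instance *prev, size_t n_prev,
                     const struct instance *cur, size_t n_cur,
                     struct instance *out)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_cur; i++) {
        int found = 0;
        for (size_t j = 0; j < n_prev && !found; j++)
            found = same_instance(&cur[i], &prev[j]);
        if (!found)
            out[n_out++] = cur[i];
    }
    return n_out;
}
```

Because the comparison is per file, code moved between functions inside one file does not count as a new instance under this sketch.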
Filtering
Lots of instances identified in the Wine
software repository
– 50 million
Preliminary filtering heuristic
– only look at pairs of functions that are
separated by no more than 10 source lines
• minimal control flow information computed
– many APIs contain functions that are called in
quick succession
– error handling code is close to the error
producing function
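The distance heuristic can be sketched as a scan over the call sites of one file. The `call_site` struct and function name are hypothetical, and the sketch ignores control flow entirely, whereas the slide notes that minimal control flow information is computed.

```c
#include <stddef.h>

#define MAX_LINE_GAP 10 /* from the slide: no more than 10 source lines apart */

/* Hypothetical call site: a callee name and the source line of the call. */
struct call_site {
    const char *callee;
    int line;
};

/* Count ordered pairs (earlier call, later call) in one file whose calls
 * are at most MAX_LINE_GAP lines apart. `calls` must be sorted by line. */
size_t count_close_pairs(const struct call_site *calls, size_t n)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (calls[j].line - calls[i].line > MAX_LINE_GAP)
                break; /* sorted input: later calls are even farther away */
            pairs++;
        }
    }
    return pairs;
}
```

The early `break` keeps the scan close to linear when calls are spread out, which matters with instance counts in the millions.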
Transitive Patterns
Called After may be a transitive pattern
– mined only as a binary pattern
– transitivity allows larger patterns to be built
– may need to add more context information
Patterns Identified
– SelectObject called after BeginPaint
– SetTextColor called after SelectObject
– TextOutA called after SetTextColor
– DeleteObject called after TextOutA
– EndPaint called after DeleteObject
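Chaining binary patterns into a larger one can be sketched by following "called after" links from a starting function. This is an illustration under simple assumptions (each function has at most one successor worth following, and no extra context is tracked); the `pattern` struct and `build_chain` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical binary pattern: `second` is called after `first`. */
struct pattern {
    const char *first;
    const char *second;
};

/* Starting at `start`, repeatedly follow the first pattern whose `first`
 * matches the current function. Stores the visited names in `chain` (up to
 * `cap` entries) and returns the chain length. Stopping at `cap` means a
 * cycle in the patterns cannot loop forever. */
size_t build_chain(const struct pattern *pats, size_t n_pats,
                   const char *start, const char **chain, size_t cap)
{
    size_t len = 0;
    const char *cur = start;
    while (cur != NULL && len < cap) {
        const char *next = NULL;
        chain[len++] = cur;
        for (size_t i = 0; i < n_pats && next == NULL; i++)
            if (strcmp(pats[i].first, cur) == 0)
                next = pats[i].second;
        cur = next;
    }
    return len;
}
```

Fed the five patterns listed above and started at `BeginPaint`, the sketch walks through all six functions and ends at `EndPaint`.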
Preliminary Case Study
Mined Wine CVS repository
– 2,175 unique patterns added to the code 10 or
more times
– 65 unique patterns added 100 or more times
Different categories of function pairs
– Debug functionality
– Heap management
– Paired functionality
– Error Handling
wine_tsx11_lock();
XInternAtoms(thr_dis(), names, cnt, 0, atoms );
wine_tsx11_unlock();

if (RegOpenKeyA(HKEY, name, &key)) {
    TRACE(message);
    RegCloseKey(key);
    SetLastError(NOT_FOUND);
Called After Pattern
1,253 unique patterns added 10 or more
times
Obvious patterns
– serve to validate our results
Surprising patterns
– point to interesting relationships between functions

New Instances
Category               > 99   99 - 25   24 - 10
Debug                    17        80       278
Heap                     14        16        16
GUI                       3        22       271
Paired Functionality      0         8        39
Error Handling            0         9        30
wndClass.hCursor = LoadCursorA (0, (LPSTR)IDC_ARROW);
RegisterClassA (&wndClass);
RtlDeleteCriticalSection(&det->waiters_count_lock);
…
HeapFree(GetProcessHeap(), 0, det);
Conditionally Called After
922 unique patterns added 10 or more times
Error handling code
– conditionally report error
– which functions need errors handled
Debug code
– conditionally call a debug function

New Instances
Category               > 99   99 - 25   24 - 10
Debug                    14        95       341
Heap                      7         8        11
Paired Functionality      0         6        26
Error Handling            0         3        34
if (!(hModule = LoadLibraryExA(fileName, 0, LLDF)))
    WINE_ERR("LoadLibraryExA (%s) failed, %ld\n", fileName, GetLastError());
RtlHeapFree Called After RtlHeapAlloc
Value: 8
dlls/kernel/heap.c
dlls/ntdll/loader.c
Future Work
Apply our tool to more projects
Track removed usage patterns
Better filtering heuristic
– control flow based
– data flow based
How do we use the patterns we find?
– documentation
– feed patterns to static source code checkers to find violations
    hdc = BeginPaint( hwnd, &ps );
    if( hdc )
        DrawIcon( hdc, x, y, hIcon );
    EndPaint( hwnd, &ps );
Demo
Demo of the visualization tool tomorrow