CS590 Z
Matching Program Versions
Xiangyu Zhang
Problem Statement
Suppose a program P’ is created by modifying P.
Determine the difference between P and P’: for an
artifact c’ in P’, decide whether c’ belongs to the difference;
if not, find the correspondence of c’ in P.

• Static mapping
• Non-trivial
    • Name comparison?
    • What if
• Clone analysis, comparison checking
Motivations

Validate compiler transformations

Facilitate regression testing

Reverse obfuscation

Information propagation

Debugging

Code plagiarism detection

Information Assurance
Approaches
Static Approaches
• Entity name based
• String based (MOSS)
• AST based (DECKARD)
• CFG based (JDIFF)
• PDG based (PDIFF)
• Binary based (BMAT)
• Log based (editor plugin, comparison checking)
Dynamic Approaches (not today)
Static Approaches
Entity name matching
• Model a function/field as tuples
• Coarse grained matching
String matching
• Diff (CVS, Subversion)
• Longest common subsequence (LCS)
    • Available operations are addition and deletion
    • Matched pairs cannot cross one another
    • Programs are far more complicated than strings
        • Copy, paste, move
• CP-Miner (scales to Linux kernel clone detection)
    • Frequent subsequence mining
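The LCS-based matching used by diff-style tools can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool; the function name `lcs_match` and the sample lines are hypothetical. It shows why matched pairs never cross: a line that merely moved is reported as one deletion plus one addition.

```python
def lcs_match(old, new):
    """Return index pairs (i, j), old[i] == new[j], forming one longest common subsequence."""
    n, m = len(old), len(new)
    # length[i][j] = LCS length of old[i:] and new[j:]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the top-left corner to recover the matched pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1  # old[i] is treated as deleted
        else:
            j += 1  # new[j] is treated as added
    return pairs


# Hypothetical example: the first two statements were swapped.
old = ["a = 1", "b = 2", "print(a)", "print(b)"]
new = ["b = 2", "a = 1", "print(a)", "print(b)"]
pairs = lcs_match(old, new)
print(pairs)  # [(1, 0), (2, 2), (3, 3)]
```

Only three of the four lines are matched: "a = 1" exists in both versions, but pairing it would cross the pair for "b = 2".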
MOSS
Code plagiarism detection
• It also handles other digital contents
Challenges
• White space (variable name)
• Noise (“the”, “int i”)
• Order scrambling (paragraph reorders)
Problem statement
• Given a set of documents, identify substring matches that satisfy two properties:
    • If there is a substring match at least as long as the guarantee threshold t, then this match is detected;
    • Do not detect any matches shorter than the noise threshold, k.
MOSS
k-gram

• A contiguous substring of length k
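As a small illustration, the k-grams of a text can be enumerated as below. The normalization step (lowercasing and dropping non-alphanumeric characters) is an assumption added to address the white-space challenge, and the helper names and sample sentence are hypothetical.

```python
def normalize(text):
    """Drop whitespace and punctuation and lowercase the rest (an assumed preprocessing step)."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgrams(text, k):
    """Return every contiguous substring of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


sample = normalize("A do run run run, a do run run")
grams = kgrams(sample, 5)
print(sample)      # adorunrunrunadorunrun
print(grams[:3])   # ['adoru', 'dorun', 'orunr']
print(len(grams))  # 17
```

A text of length n has n - k + 1 k-grams, so almost every position starts one.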
MOSS
Incremental hashing

• Hashing strings of length k is expensive for large k.
• “rolling” hash function
    • The (i+1)th k-gram hash = F (the ith k-gram hash, …)
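One common choice for such a rolling function is a polynomial (Karp-Rabin style) hash. The sketch below assumes that choice; the slide does not say which F is used, and `BASE`, `MOD`, and the function names are illustrative. Each new hash is derived from the previous one in constant time by removing the leading character and appending the next one.

```python
BASE = 257            # assumed radix, illustrative only
MOD = (1 << 61) - 1   # assumed modulus, illustrative only


def direct_hash(s):
    """Hash a string from scratch: O(len(s)) work."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hash every k-gram of text, reusing the previous hash at each step."""
    if k <= 0 or len(text) < k:
        return []
    top = pow(BASE, k - 1, MOD)  # weight of the leading character
    h = direct_hash(text[:k])
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % MOD  # drop the leading character
        h = (h * BASE + ord(text[i])) % MOD     # shift and append the new one
        hashes.append(h)
    return hashes


text = "adorunrunrunadorunrun"
rolled = rolling_hashes(text, 5)
```

The rolled values agree with hashing each 5-gram independently, but the total cost is O(n) instead of O(n·k).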
MOSS
Fingerprint selection

• A subset of hash values
• Our goals: find all matching substrings of length at least t; ignore matches shorter than k
• Every t-th hash value
• Hash values that are 0 mod p
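The two simple selection strategies can be compared with a short sketch. The hash sequence is made up for illustration, and the function names are hypothetical.

```python
def select_every_nth(hashes, n):
    """Keep the hash at every n-th position (depends on absolute position)."""
    return [(i, h) for i, h in enumerate(hashes) if i % n == 0]


def select_mod_p(hashes, p):
    """Keep the hashes that are 0 mod p (depends only on content)."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


# Hypothetical sequence of k-gram hashes.
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]

by_position = select_every_nth(hashes, 4)
by_value = select_mod_p(hashes, 4)
print(by_position)  # [(0, 77), (4, 98), (8, 8), (12, 77), (16, 98)]
print(by_value)     # [(8, 8), (9, 88)]
```

Position-based selection breaks as soon as one character is inserted, because every later position shifts. Value-based selection is stable under shifts but gives no bound on the gap between selected hashes: here positions 0 through 7 and 10 through 16 contribute nothing, so a long match could go undetected.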
MOSS
Winnowing

• Observation: given a sequence of hashes h1,…,hn, if n>t-k, then at least one of the hi must be chosen
• Have a sliding window with size w=t-k+1
• In each window select the minimum hash value, breaking ties by selecting the rightmost occurrence.
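The window rule above can be sketched in a few lines. The function name and the hash values are hypothetical; with k = 5 and t = 8 the window size is w = t - k + 1 = 4.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints: the rightmost minimum of each window of size w."""
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:  # record each selected position only once
            fingerprints.append((pos, smallest))
            last_pos = pos
    return fingerprints


# Hypothetical sequence of k-gram hashes.
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
selected = winnow(hashes, 4)
print(selected)  # [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Every window of 4 consecutive hashes contains at least one selected position, which is what provides the guarantee for matches of length at least t. A sequence shorter than w yields no fingerprints in this sketch.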
MOSS
Algorithm

• Build an index mapping fingerprints to locations for all documents.
• Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
• Sort (d,d1,fx), (d, d2,fy) by the first two elements.
• Matches between documents are rank-ordered by size (number of fingerprints)
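These four steps can be sketched as below. This is a simplified illustration under stated assumptions: fingerprints are given as plain hash values per document, locations inside documents are omitted, and the document names and values are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def rank_matches(fingerprints):
    """fingerprints: dict mapping document name -> iterable of fingerprint hashes."""
    # Step 1: index every fingerprint to the documents containing it.
    index = defaultdict(set)
    for doc, fps in fingerprints.items():
        for fp in set(fps):
            index[fp].add(doc)
    # Step 2: look up each document's fingerprints, producing (d, d', f) triples.
    triples = []
    for doc, fps in fingerprints.items():
        for fp in set(fps):
            for other in index[fp]:
                if doc < other:  # report each unordered pair once
                    triples.append((doc, other, fp))
    # Step 3: sort by the first two elements so triples of one pair are adjacent.
    triples.sort(key=lambda t: (t[0], t[1]))
    # Step 4: rank document pairs by the number of shared fingerprints.
    sizes = [(pair, len(list(group)))
             for pair, group in groupby(triples, key=lambda t: (t[0], t[1]))]
    sizes.sort(key=lambda item: (-item[1], item[0]))
    return sizes


docs = {"a.c": [17, 8, 39, 52], "b.c": [17, 8, 39, 90], "c.c": [8, 61]}
ranking = rank_matches(docs)
print(ranking)
# [(('a.c', 'b.c'), 3), (('a.c', 'c.c'), 1), (('b.c', 'c.c'), 1)]
```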
MOSS
Advantages
• Guaranteed to detect any substring match of length at least t
Limitations
• Minor edits defeat MOSS.
    • x= a*b + c vs. z= c + a*b
• Insertion, deletion
AST based matching
[YANG, 1991, Software Practice and Experience]

• Given two functions, build the ASTs
• Match the roots
• If the roots match, apply LCS to align the subtrees
• Continue recursively
Fragile
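The recursive root-then-LCS scheme can be sketched as follows. This is an illustration in the spirit of the steps above, not Yang's implementation; trees are encoded as hypothetical `(label, children)` tuples and the function name is an assumption.

```python
def match_size(a, b):
    """Number of matched node pairs between trees a and b, each a (label, children) tuple."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ: nothing below them is matched
    n, m = len(kids_a), len(kids_b)
    # Weighted LCS over the two child sequences; the weight of pairing two
    # children is the size of their own recursive match.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + match_size(kids_a[i - 1], kids_b[j - 1]),
            )
    return 1 + best[n][m]


def leaf(name):
    return (name, [])


# x = a*b + c   versus   x = c + a*b
product = ("*", [leaf("a"), leaf("b")])
old = ("=", [leaf("x"), ("+", [product, leaf("c")])])
new = ("=", [leaf("x"), ("+", [leaf("c"), product])])

print(match_size(old, old))  # 7
print(match_size(old, new))  # 6
print(match_size(("if", [old]), ("while", [old])))  # 0
```

Swapping the operands of `+` leaves `c` unmatched because aligned children cannot cross, and a changed root discards the whole subtree even when its contents are identical. Both effects illustrate why the approach is fragile.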
DECKARD (ICSE 2007)
Advantages
• Scalability
• Insensitive to minor structural changes such as reordering, insertion, deletion
Limitations
• Structural similarity only
• Insertions that incur structural change
CFG matching
Hammock graph (JDIFF, ASE 2004)
• Match classes by names
• Match fields by types
• Match methods by signatures
• Match instructions in methods by hammock graphs
A hammock is a single entry single exit subgraph of a CFG.
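The hammock definition can be checked mechanically. The sketch below assumes one common convention: the entry node is inside the subgraph and the exit node is just outside it. The function name, the edge-list encoding, and the sample CFG are hypothetical and are not taken from JDIFF.

```python
def is_hammock(edges, nodes, entry, exit_node):
    """Check that `nodes` is entered only at `entry` and left only toward `exit_node`."""
    nodes = set(nodes)
    if entry not in nodes or exit_node in nodes:
        return False
    for src, dst in edges:
        if src not in nodes and dst in nodes and dst != entry:
            return False  # a second way into the subgraph
        if src in nodes and dst not in nodes and dst != exit_node:
            return False  # a second way out of the subgraph
    return True


# Hypothetical CFG of an if/else: node 2 is the condition, 3 and 4 the branches.
cfg = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5), (5, 6)]

print(is_hammock(cfg, {2, 3, 4}, entry=2, exit_node=5))  # True
print(is_hammock(cfg, {3, 4}, entry=3, exit_node=5))     # False
```

The whole if/else is a hammock, while the two branches taken together are not, because control can enter at node 4 as well as at node 3.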
CFG matching
Pros
• Orthogonal
    • Can be combined with other matching techniques
• Simple
Cons
• Coarse grained matching only
    • Not good at clone detection
• In case of code transformation
Semantic Based Matching

Using PDG (SAS’01)
Semantic Based
Pros
• Matches non-contiguous, intertwined, and reordered code
• Insensitive to code transformations
Cons
• Scalability
    • Points-to analysis
• Starting from a matching pair seems to be a problem
Wrap Up
For clone detection
• Maybe structural / text similarity is a good idea
For whole program matching / method matching with code transformations
• Semantic based is more appropriate
Scalability
• PDG < CFG | AST < STRING < NAME