Analysis of programming code similarity by using

Performance evaluation of plagiarism detection method based on the intermediate language
Vedran Juričić
Tereza Jurić
Marija Tkalec
Plagiarism detection method

Method for detecting plagiarism in source code for .Net languages
  C#
  Visual Basic.Net
  C++
  …
Identify similar code fragments
Determine similarity between source files
Based on intermediate language
Plagiarism detection
First

 1.  using System.Text;
 2.  namespace Test
     {
 3.      class Math
         {
 4.          public double GetMaximum(double[] Input)
             {
 5.              double result = Input[0];
 6.              foreach (double temp in Input)
                 {
 7.                  if (temp>result)
 8.                      result = temp;
                 }
 9.              return result;
             }
         }
     }

Second

 1.  using System.Text;
 2.  namespace Test
     {
 3.      class Math
         {
 4.          public double GetMaximum(double[] Input)
             {
 5.              double result = Input[0];
 6.              for (int i=0;i<Input.Length;i++)
                 {
 7.                  if (Input[i]>result)
 8.                      result = Input[i];
                 }
 9.              return result;
             }
         }
     }

Similarity = Number of overlapping lines / Total number of lines
= 6 / 9 ≈ 66.67%
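
The line-overlap measure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method of any tested system: the listings are compared position by position after collapsing whitespace, braces are not counted as lines, and the names `line_similarity`, `FIRST` and `SECOND` are hypothetical.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace so that indentation does not matter."""
    return " ".join(line.split())


def line_similarity(first: list[str], second: list[str]) -> float:
    """Share of line positions at which both listings hold the same statement."""
    total = max(len(first), len(second))
    if total == 0:
        return 0.0
    overlapping = sum(
        1 for a, b in zip(first, second) if normalize(a) == normalize(b)
    )
    return overlapping / total


# The nine numbered lines of the two listings (braces are not counted).
FIRST = [
    "using System.Text;",
    "namespace Test",
    "class Math",
    "public double GetMaximum(double[] Input)",
    "double result = Input[0];",
    "foreach (double temp in Input)",
    "if (temp>result)",
    "result = temp;",
    "return result;",
]
SECOND = [
    "using System.Text;",
    "namespace Test",
    "class Math",
    "public double GetMaximum(double[] Input)",
    "double result = Input[0];",
    "for (int i=0;i<Input.Length;i++)",
    "if (Input[i]>result)",
    "result = Input[i];",
    "return result;",
]

print(f"{line_similarity(FIRST, SECOND):.2%}")  # prints 66.67%
```

Lines 6 to 8 differ because the loop was rewritten, so six of the nine lines overlap.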
But…
First

 1.  using System.Text;
 2.  namespace Test
     {
 3.      class Math
         {
 4.          public double GetMaximum(double[] Input)
             {
 5.              double result = Input[0];
 6.              foreach (double temp in Input)
                 {
 7.                  if (temp>result)
 8.                      result = temp;
                 }
 9.              return result;
             }
         }
     }

Second

 1.  using System;
 2.  namespace OtherTest
     {
 3.      class MyClass
         {
 4.          public double ReturnMaximum(double[] Array)
             {
 5.              double current = Array[0];
 6.              for (int j=0;j<Array.Length;j++)
                 {
 7.                  if (Array[j]>current)
 8.                      current = Array[j];
                 }
 9.              return current;
             }
         }
     }

Similarity = Number of overlapping lines / Total number of lines
= 0 / 9 = 0.00%
Problems

  Modification of variable names, types, constants
  Modification of class member definitions
  Line and command reordering
  …

Solution

  Detailed analysis
  Complex preprocessing
  For each supported language
Our solution

Convert from source language to low-level language (Common Intermediate Language)
By using existing tools
  Compiler
  Disassembler
Tools exist for all .Net languages
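
The compile-then-disassemble step can be sketched as two command lines built in Python. The slides do not name the tools, so this assumes the C# command-line compiler `csc` and the IL disassembler `ildasm`; the function name `build_cil_commands` and the file names are hypothetical. The sketch only builds the commands and does not run them.

```python
from pathlib import Path


def build_cil_commands(source: Path, workdir: Path) -> list[list[str]]:
    """Return the compiler and disassembler invocations for one C# file.

    Assumed tools: csc (C# compiler) and ildasm (IL disassembler).
    """
    assembly = workdir / (source.stem + ".dll")
    listing = workdir / (source.stem + ".il")
    # Step 1: compile the source file into a library assembly.
    compile_cmd = ["csc", "/nologo", "/target:library", f"/out:{assembly}", str(source)]
    # Step 2: dump the assembly as a textual CIL listing.
    disassemble_cmd = ["ildasm", str(assembly), f"/out={listing}"]
    return [compile_cmd, disassemble_cmd]


for command in build_cil_commands(Path("Math.cs"), Path("out")):
    print(" ".join(command))
```

On a machine with both tools installed, each command could be passed to `subprocess.run(command, check=True)`. For other .Net languages only the compiler step would change.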
Our solution
C# language

using System.Text;
namespace Test
{
    class Math
    {
        public double GetMaximum(double[] Input)
        {
            double result = Input[0];
            foreach (double temp in Input)
            {
                if (temp>result)
                    result = temp;
            }
            return result;
        }
    }
}

C# compiler

Common Intermediate Language

.method public hidebysig instance float64
        GetMaximum(float64[] Input) cil managed
{
  // Code size       61 (0x3d)
  .maxstack  2
  .locals init (float64 V_0, float64 V_1, float64 V_2,
           float64[] V_3, int32 V_4, bool V_5)
  IL_0000:  nop
  IL_0001:  ldarg.1
  IL_0002:  ldc.i4.0
  IL_0003:  ldelem.r8
  IL_0004:  stloc.0
  IL_0005:  nop
  IL_0006:  ldarg.1
  IL_0007:  stloc.3
  …
  IL_0037:  ldloc.0
  IL_0038:  stloc.2
  IL_0039:  br.s       IL_003b
  IL_003b:  ldloc.2
  IL_003c:  ret
} // end of method C::GetMaximum
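
In the CIL listing the local variables appear only as indexed slots (`V_0`, `V_1`, …), so renaming `result` in the source leaves the instruction stream unchanged. The Python sketch below extracts the opcode sequence from such a listing. The slides do not describe how ILMatch processes the listing, so this is an assumed preprocessing step for illustration only; `opcode_sequence` and `SAMPLE` are hypothetical names.

```python
import re

# Matches disassembler lines such as "  IL_0003:  ldelem.r8" and captures the opcode.
IL_LINE = re.compile(r"^\s*IL_[0-9a-fA-F]{4}:\s+(\S+)")


def opcode_sequence(cil_text: str) -> list[str]:
    """Return the opcodes of a CIL listing in order, ignoring labels and operands."""
    opcodes = []
    for line in cil_text.splitlines():
        match = IL_LINE.match(line)
        if match:
            opcodes.append(match.group(1))
    return opcodes


# First instructions of the GetMaximum listing.
SAMPLE = """\
  .maxstack  2
  IL_0000:  nop
  IL_0001:  ldarg.1
  IL_0002:  ldc.i4.0
  IL_0003:  ldelem.r8
  IL_0004:  stloc.0
"""

print(opcode_sequence(SAMPLE))
# prints ['nop', 'ldarg.1', 'ldc.i4.0', 'ldelem.r8', 'stloc.0']
```

Two opcode sequences obtained this way could then be passed to any sequence comparison.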
Plagiarism detection system

Evaluate the performance
Analyze and compare behavior to the most commonly used plagiarism detection systems:
  MOSS
  JPlag
  CodeMatch
Tested systems

MOSS
  Developed in 1994
  Commonly used in computer science faculties
  Supports 26 programming languages
JPlag
  Developed in 1996
  Commonly used in education
  Supports C, C++, C# and Java
Tested systems

CodeMatch
  Developed in 2003
  Commercial software
  Supports 26 languages
ILMatch (our system)
  Developed in 2010
  Supports all .Net languages (currently 59 languages)
Testing

6 test categories
50 test cases covering common code modification techniques
Evaluation methods
  Precision, recall
  F-measure
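
The evaluation measures can be sketched in Python using their standard definitions. The slides do not say which F-measure variant was used, so the balanced form (F1) is assumed here. The counts in the example are hypothetical and are not results from the evaluation.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of reported plagiarism pairs that really are plagiarism."""
    reported = true_pos + false_pos
    return true_pos / reported if reported else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of real plagiarism pairs that the system reported."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 pairs correctly reported, 2 false alarms, 2 missed pairs.
p = precision(true_pos=8, false_pos=2)
r = recall(true_pos=8, false_neg=2)
print(f"precision={p:.2f} recall={r:.2f} F={f_measure(p, r):.2f}")
```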
Results
[Chart: highest F-measures for MOSS, JPlag, CodeMatch and ILMatch]
Positive

No impact
  User comments
  Code formatting
  Modification of variable and class names
  Modification of class members
  Changing data types
Some impact
  Replacing expressions and loops
  Rewriting code in a different language
Further work

Significant impact
  Reordering operands
  Reordering class members
  Adding redundant statements and variables
Improvements in comparison algorithm