Indian Journal of Science and Technology, Vol 9(33), DOI: 10.17485/ijst/2016/v9i33/95880, September 2016
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
Detection of Malware of Code Clone using String
Pattern Back Propagation Neural Network Algorithm
Simarleen Kaur and Arvinder Kaur*
Computer Science and Engineering Department
Chandigarh University, Mohali - 140413, Punjab, India
Abstract
Background/Objectives: Malware is evolving at a rapid pace, so its identification is a vital concern in a world where information technology is emerging quickly. This paper focuses on improving the performance parameters of malware detection in source code clones using a proposed clone detection algorithm. Methods/Statistical Analysis: Earlier approaches did not consider data types and variables during clone detection. To detect malware in code clones and achieve better results, this work implements a clone detection algorithm, the 'String Pattern Back Propagation Neural Network', which determines the code clones and matches them against malware signatures in a repository in order to compute the performance parameters. Findings: Malware identification is carried out on Java projects using different window sizes (20, 40). The source code files pass through a modularization phase that extracts functions from the different classes. Code clones are then determined by applying the implemented algorithm and evaluated against the malware signatures. The employed approach was observed to give better performance, with a high accuracy of 96.97%, and proved to be deterministic and efficient. The paper provides an overview of the state of the art and focuses on enhanced performance in terms of precision, recall and F-measure for the Java language, where data types, variables and comments in the application are also taken into account when detecting code clones, in contrast to existing research on malware binaries. Applications/Improvements: To handle the tremendous range of malicious code, the approach can be applied to multiple languages to detect the number of clones in an application or a system and achieve better outcomes.
Keywords: Code Clone, Clone Detection Algorithm, Malware, Malware Analysis, Reverse Engineering
1. Introduction
Nowadays, software is important for almost every system, as it is used for many purposes. It collects details and performs various functions related to e-learning, mobile banking, education, etc.
Software Engineering is further classified as:
• Reverse Engineering
• Forward Engineering
*Author for correspondence
Reverse Engineering, also known as Back Engineering, retrieves knowledge or design details from an existing product or application and then regenerates a new product or application based on the retrieved information. It essentially involves taking something apart, such as a computer program or a mechanical device, and analyzing and studying its components in detail. This practice, long used in older industries, is now used in the computer era for both hardware and software. In software reverse engineering, the machine code of a program is reversed back into the source code that was written in some programming language. Reverse engineering includes the steps [1] mentioned below:
1. The components of the system are identified and their relationships are found.
2. The system is represented in another form.
3. The system is then represented in physical form.
Forward engineering is the traditional process of moving from high-level abstractions and designs to the lower-level implementation of a system, proceeding step by step towards the goal. In the software domain, Chikofsky and Cross [2] define reverse engineering as "the process of analyzing a subject system to create representations of the system in another form or at a higher level of abstraction". Analyzing results or outcomes to gain a better understanding of software products leads to the process of reverse engineering.
The term reverse engineering [1] refers to the analysis of a software system for a) the identification of the components of the software product and their correlation or association, b) the creation of a representation of the system in another form or at a higher level of abstraction, and c) the creation of a visible representation of the components of that software product. Hence, software reverse engineering yields deep understanding, knowledge and real facts about the software system. The initial stage of reverse engineering begins with a software system or legacy source code, as depicted in Figure 1. The existing source code is modified and restructured to give clean source code. The subsequent engineering activities are performed on the clean source code, which makes the source code easier to read and understand. Extraction of abstract details is the elementary part of the reverse engineering process: meaningful information is fetched from the software system after analyzing the source code.
1.1 Data Flow Diagrams
Reverse engineering starts from a lower level of abstraction, namely source code analysis, and works towards a higher level of abstraction (software requirement specifications, design documents or UML diagrams). The data flow diagrams (DFDs) for the reverse engineering techniques used to analyze malware projects and determine code clones are as follows:
The DFD at level 0 (Figure 2) depicts the basic functions and methods of reverse engineering. First, the program specification and its requirements need to be retrieved by using the functions of reverse engineering. In the final stage, the specifications are extracted, with the pertinent software as input and the requirement specification as output.
Figure 2. Level 0 DFD.
In the level 1 DFD (Figure 3), there is a repository for storing all the essential abstraction details extracted by applying reverse engineering techniques. First, the relevant software is needed. Then the source code is analyzed, extracted and parsed to fetch functions, classes and other string details. The database holds an indexed data set of the extracted functions, represented in the required format. Retrieval functions use the pertinent software to acquire the program specification, design document and other software requirements.
Figure 3. Level 1 DFD. (Diagram elements: Pertinent Software Product, Artifacts Recognition, Extractors, Compiler, Reverse Engineering Techniques, Repository, Analyzers & Visualizers, Debugger, Program Specification.)
Figure 1. Evolutionary Development of Reverse Engineering
In the level 2 DFD (Figure 4), source code and design files are merged and passed into the analysis phase. First, the existing software is needed. The user goes to the analysis phase or interface of the source code, which parses the file set and the semantics of the files. Any one of many tools can be used to work on the source code of the pertinent software; some existing reverse engineering tools that can be used are listed. Those tools connect to the database, and reverse engineering techniques are then applied to the software code.
Figure 4. Level 2 DFD.
1.2 Code Clone in Software
Code clones, or simply clones, are the terms usually used for sequences of duplicate code, and the automated process of determining the redundancy or duplication in source code is called clone detection.
1.3 Clone Definition
The revolution in information technology has led large-scale software projects to contain significant code duplication, which is the outcome of copy and paste activities. Code cloning hinders the software maintenance process and degrades the quality of software. A code clone is a segment of code that is identical to some other portion of code located in the source file. In software cloning, copy and paste operations [3] are performed widely, with modifications made to the source code files at both lower and higher levels. As a result, mirror replicas of this code are formed, which are called code clones. Research in this domain has shown that code cloning has a great effect on the maintenance phase of the software life cycle. The problem of detecting duplicated code still persists. A major concern is to explore various clone detection tools and techniques in order to remove software clones.
Clones can be categorized based on textual and functional similarity. Code fragments are said to be similar if they have identical text in the source code or carry similar functionality. The first class of clone is the obvious one: a fragment that is copied and pasted into some other location.
Type 1: Exact Clones - Identical code segments except for variations in whitespace, layout and comments.
Type 2: Renamed Clones - Syntactically identical code fragments except for variations in literals, identifiers and layout. They also allow the variations of exact clones.
Type 3: Gapped Clones - Copied fragments with later modifications such as added, removed or changed statements. In addition, they allow the variations of renamed clones.
Type 4: Semantic or Logical Clones - Code fragments that perform the same computation but are implemented through different syntactic variants.
1.4 Malware
Malware remains a treacherous and constant threat, and its success has spawned a host of improved identification and interception methodologies. Security tools such as virus scanners search for characteristic byte sequences to identify malicious code. The quality of a detector is determined by the detection techniques it employs. A good malware detection technique must be able to pinpoint malicious code that is concealed or embedded in the original code, and should have some ability to recognize as yet unknown malware.
Malicious software [4,5], commonly known as malware, refers to software specifically designed to enter a computer system without the owner's informed consent in order to gain unauthorized access to system resources and perform malicious activities. It consists of code inserted by its developer to disrupt operations and gather vital information, leading to loss of privacy, exploitation and other abusive behavior. The term is coined by merging the words 'malicious' and 'software', and it covers computer viruses, spyware, botnets, rootkits, trojans and many more. The history of malicious code began with the 'computer virus', a term [6] first established by Cohen in 1983. Malicious software is the biggest threat in today's digitized world, as it continues to grow at an alarming rate and evolve in complexity. Malware development has become a multibillion dollar industry in recent years. Malicious software operations have also become more and more covert, making their detection more challenging.
Malicious software is a catch-all term for software that is employed to cause intentional harm to a computer system. Confidentiality, integrity and availability are the three elements affected by malware [7]. Malware can be categorized into various types, such as viruses, trojans, rootkits, backdoors, spyware, worms, adware and botnets, on the basis of infection mechanism and behavior. These classes of malware are briefly described below:
1.5 Virus
A computer virus is a type of malware program that replicates by attaching itself to other programs with harmful intent. It gets appended to executable files and relies on human participation in order to replicate. Malicious code of this kind is man-made, written deliberately by its authors. A virus is a piece of code that is injected into a system without the permission of the owner. Viruses are self-replicating programs [8] that can cause major destruction within the host machine without the owner's consent.
1.6 Trojan Horse
A Trojan horse is a program that appears to be benign but performs malicious activity. The term is derived from the famous story of the Trojan horse in Greek mythology. It is a non-replicating type of malware [9,10] that drops a malicious payload while appearing to perform a desired operation, which it does not actually perform. The aim of an application containing Trojan malware is to steal the user's confidential and sensitive data, such as passwords, without the knowledge of the owner of the system, i.e. through unauthorized access. One such Trojan, the SMS Trojan, reportedly affected most Android devices in 2012. Its aim is to send SMS messages to premium rate numbers without the permission of the user, causing financial loss to the user. OpFake [36] is the most famous Trojan-SMS and ranks second in terms of popularity. It is named after the Opera Mini mobile browser because it poses as a downloader for it. Well-known examples of Android Trojans are Ackposts, Acnetdoor, Adsms, etc.
1.7 Worms
A computer worm is a self-replicating program that is capable of sending copies of itself to other nodes or computer systems by accessing the network invisibly, without any user intervention [5]. Worms consume bandwidth, causing harm to the entire network. Unlike viruses, worms do not need the support of a host file. Sasser, My Doom, Blaster, Melissa, etc. are some examples of worms. An application carrying worm malware spreads the infection to all the devices connected to the infected device, either through the network or through removable media. The worm does this by creating a similar or exact copy of itself on the connected devices [11]. Android Obad OS is an example of a Bluetooth worm.
1.8 Botnet
An application carrying this malware makes the user's device available for control by a remote server without the consent of the user. Once the user's device comes under the control of the remote server, this can lead to attacks such as transferring sensitive information to the remote server, automatically downloading malicious applications onto the device, service attacks, etc. Gemini and Beanroot [12] are some examples of Android botnets. A bot is remotely controlled autonomous software, usually a zombie program that is controlled through some network infrastructure. Botnets are generally categorized into three kinds [35]: centralized structure, decentralized structure and hybrid structure.
1.9 Aggressive Adware
Adware, short for advertising-supported software, is a kind of malware that delivers advertisements automatically. Common instances of adware include pop-up ads on websites or flashing on screens, and advertisements displayed by software after the installation of malicious software. Activities performed by this malware include stealing bookmarks, sending unnecessary notifications, creating shortcuts on the screen, etc. Aggressive adware is known for sending unnecessary ads to the device, which hinders efficient usage of the device.
1.10 Ransomware
Ransomware is a kind of malware that essentially holds a computer system captive while demanding a ransom. It is a kind of "scareware", as it forces a person to pay a certain amount or fee by scaring them. Its strategy is either to lock the entire device, or to lock some files with a password or encrypt files on the hard drive, restricting the user's access until a ransom amount is paid through an online payment mode. This causes huge financial loss to users, as they get scared and pay the demanded amount.
1.11 Trapdoor/Backdoor
Trapdoor [10] is a collective term for a program that bypasses the security check. This malware permits a malicious user to perform operations on the affected computer that can undermine its normal operation. These operations can be very harmful and pose a serious threat to the system: they allow the attacker to destroy relevant information, capture secure and private data, and delete files on the hard disk. Backdoor examples include Bionet and Orifice.
1.12 Rootkits
Rootkits are developed to take control of an infected machine by acquiring administrator access to the system. A rootkit merges the behavior of a Trojan horse and a backdoor [9]: it exhibits Trojan behavior by substituting the original version of a file with an infected copy, and backdoor behavior by allowing attackers to access a system remotely. Unlike Trojans and backdoors, it additionally modifies programs of the operating system. Rootkits are further divided into two types on the basis of operating environment: User Mode and Kernel Mode.
1.13 Spyware
Spyware is meant to monitor and collect personal information about the logged-in user: the pages visited, specific email addresses or websites, key presses, private transactions and so on. It is likely to enter the computer system when trial software is installed after downloading, or when the system is not in use.
There is a long list of malware, and this list expands with each passing year. Security researchers are working hard to overcome this alarming situation. Certain efforts have been made, such as improving the GUI, providing warnings to the user, removing malicious applications from official app stores, etc. Despite all these efforts, malware on users' devices is continuously evolving. The sources of malware spread are social networks, pirated software, removable media, emails and websites.
1.14 Malware Detection Techniques
Malware detection techniques are beneficial for shielding a computer system from various types of infection and protecting it from the loss of secret or private information. They are used to detect malware, which is increasing at an enormous rate. Malware has increased exponentially since 2005, and Webroot anticipated 100% malware growth in 2016. The techniques can be classified into three types: a) signature-based detection, b) anomaly-based detection and c) specification-based detection. Figure 5 illustrates the relationship between the various types of malware detection techniques. Each of the detection techniques can employ one of three different approaches: static, dynamic, or hybrid.
Figure 5. Malware analysis and detection.
1.14.1 Signature-based Detection
Signature-based detection maintains a database of malware signatures and detects malware by comparing samples with the patterns stored in that repository. Ideally, a signature should be capable of identifying any kind of malware exhibiting the malicious behavior specified by the signature. These signatures are generated by examining the disassembled code of malware binaries: the disassembled code is analyzed and features are extracted, which are then used to construct a database of malware families [13]. The main advantage of signature-based malware detection is that it identifies known instances of malware efficiently. On the other hand, it cannot detect zero-day attacks, i.e. new and unknown malware instances, because no signatures are available for them yet.
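As a rough illustration of this idea, and not the implementation used in this work, signature matching can be sketched as a lookup of code fingerprints in a repository of known malicious patterns. The function names, the whitespace-only normalization and the sample snippet are hypothetical choices made for the example.

```python
import hashlib


def fingerprint(code: str) -> str:
    """Return a SHA-256 fingerprint of the code with whitespace collapsed."""
    # Hypothetical normalization: only runs of whitespace are collapsed.
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_signature_db(known_malicious_snippets):
    """Store one fingerprint per known malicious snippet (the 'repository')."""
    return {fingerprint(snippet) for snippet in known_malicious_snippets}


def matches_signature(code: str, signature_db) -> bool:
    """True if the code's fingerprint is present in the signature repository."""
    return fingerprint(code) in signature_db


# Illustrative data only: a snippet resembling premium-rate SMS abuse.
KNOWN = ["sendTextMessage(premiumNumber, null, body, null, null);"]
db = build_signature_db(KNOWN)

same_code_reformatted = "sendTextMessage(premiumNumber,   null, body,\n    null, null);"
renamed_variant = "sendTextMessage(target, null, body, null, null);"

print(matches_signature(same_code_reformatted, db))
print(matches_signature(renamed_variant, db))
```

Because the fingerprint is exact, the variant with a renamed variable is not matched, which mirrors the limitation on unknown instances noted above.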
1.14.2 Anomaly-based Detection
Anomaly-based detection, also known as behavior-based detection, relies on behavior analysis of known and unknown malware. It operates in two phases: a training (learning) phase and a detection (monitoring) phase. A key benefit of anomaly-based detection is its ability to detect zero-day attacks [7]. As defined in [14], zero-day attacks are attacks that are previously unknown to the malware detector, analogous to zero-day exploits. The two basic drawbacks of this technique are its high false alarm rate and the complexity of determining which features should be learned in the learning phase.
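The two phases can be pictured with a deliberately simple model. The sketch below assumes a single numeric feature (for example, system calls per second, a hypothetical choice) and a three-sigma threshold; real detectors learn far richer behavior profiles, so this is only an illustration of the training/detection split.

```python
from statistics import mean, pstdev


def train(normal_observations):
    """Training phase: learn a baseline (mean, standard deviation) from benign runs."""
    return mean(normal_observations), pstdev(normal_observations)


def is_anomalous(value, model, k=3.0):
    """Detection phase: flag values more than k standard deviations from the baseline."""
    mu, sigma = model
    return abs(value - mu) > k * sigma


# Illustrative measurements only (hypothetical feature values from benign runs).
baseline = train([10, 12, 11, 9, 10, 11])

print(is_anomalous(11, baseline))   # value close to the learned baseline
print(is_anomalous(40, baseline))   # value far from the learned baseline
```

A benign program that briefly behaves unusually would also be flagged by such a threshold, which is one way the high false alarm rate arises.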
1.14.3 Specification-based Detection
Specification-based detection is a derivative of anomaly-based detection that tries to overcome the high false alarm rate typically associated with most anomaly-based detection techniques. Specification-based techniques [15] leverage a program specification or rule set describing what is valid behavior, in order to judge the behavior of suspected malicious code or security-critical programs. Programs or applications that violate the specification are considered anomalous and, usually, malicious. The technique consists of examining program executions and identifying deviations of behavior from the given specification. It is similar to behavior-based detection, but is based on manually written specifications rather than relying on machine learning approaches. The potential benefit is that it can detect both known and unknown malware instances. Its drawback is that writing a detailed specification is time consuming.
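A minimal sketch of this idea follows, assuming a hand-written rule set that lists which targets each action may touch. The action names, targets and trace are invented for the example and do not come from any particular tool.

```python
# Hypothetical, manually written specification: allowed targets per action.
SPEC = {
    "read_file": {"/app/config.json"},
    "connect": {"api.example.com"},
}


def find_violations(trace, spec):
    """Return the (action, target) events that the specification does not allow."""
    return [
        (action, target)
        for action, target in trace
        if target not in spec.get(action, set())
    ]


# Illustrative execution trace of a monitored program.
observed = [
    ("read_file", "/app/config.json"),
    ("connect", "api.example.com"),
    ("send_sms", "premium-rate-number"),
    ("connect", "unknown.example.net"),
]

violations = find_violations(observed, SPEC)
print(violations)
```

No training data is needed here, but every permitted behavior has to be written down by hand, which is why detailed specifications are costly to produce.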
Malware analysis is the process of examining the purpose and functionality of malware. It can be carried out in three ways: static, dynamic and hybrid analysis. The goal of malware analysis is to understand the functionality of a piece of malware code in order to protect an organization's network. Numerous malware detection algorithms and techniques exist; some of them offer higher efficiency and accuracy than others, at the cost of greater complexity. Selecting one of them based on criteria such as computational time, memory usage, complexity of the algorithm and its precision under different conditions is a challenging task. The purpose of this research is therefore to study and develop a String Pattern Back Propagation Neural Network Algorithm and to compare it with existing research outcomes in terms of accuracy in different scenarios. The survey of different techniques employed by authors is as follows:
The authors of [16] described in detail a neural network based malware detection system.
The authors of [17] described a tool that extracts an object model from the class files of a Java program.
The authors of [18] presented a qualitative comparison and evaluation of the current state of the art in clone detection approaches and tools, and organized the large quantity of information into a coherent conceptual framework. They classified, compared and evaluated the techniques and tools along two different dimensions. First, they categorized and compared approaches based on a number of facets, each of which has a set of (possibly overlapping) parameters or features. Second, they qualitatively evaluated the classified techniques and tools with respect to a taxonomy of editing scenarios designed to model the creation of Type-1, Type-2, Type-3 and Type-4 clones.
Code obfuscation, discussed in [19], prevents reverse engineering of software applications by protecting the key algorithms and data structures of an application from unauthorized access. Some malicious users use the same approach to insert malware into a program or software. The authors proposed an analysis system that detects lexical and string obfuscation in Java malware. Obfuscated and non-obfuscated malware are distinguished by identifying a set of eleven features that characterize obfuscated code, which are then used to train a machine learning classifier. A static analyzer that examines bytecode is used to extract the features, and the chi-squared statistic is used to evaluate the robustness of each feature.
The authors of [20] proposed a technique based on code clone search, in which previously analyzed malware is correlated with new malware so that similarities between them can be identified. This avoids reanalyzing code fragments that have been encountered earlier. They also developed a tool named BinClone for the identification of code clone fragments.
The authors of [21] designated almost-perfect clones as "near-miss" clones and proposed a simple technique for finding them. They used standard lexical comparison tools coupled with language-specific extractors to locate potential clones, so the technique merges compiler-based and lexical methods for finding clones. The extraction of potential clones is the only stage of the system that depends on a particular language. They also described a novel clone detection technique based on the identification and reduction of potential clones using dynamic clustering and code normalization.
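Code normalization, which also underlies the Type 2 (renamed) clone category described in Section 1.3, can be illustrated with a small sketch. It is not the method of [21]: the regular expressions, the partial keyword list and the placeholder tokens ID and LIT are assumptions made for the example, and the comment stripping is naive (it would also cut a "//" that appears inside a string literal).

```python
import re

# Partial keyword list, enough for the example; a real tool would use the full set.
JAVA_KEYWORDS = {"int", "long", "void", "return", "if", "else", "for", "while", "class"}

# String literals, numbers, identifiers/keywords, then any other single symbol.
TOKEN = re.compile(r'"[^"\n]*"|\d+|[A-Za-z_]\w*|\S')
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)


def normalize(code):
    """Drop comments and whitespace; replace identifiers with ID and literals with LIT."""
    tokens = []
    for tok in TOKEN.findall(COMMENT.sub(" ", code)):
        if tok[0] == '"' or tok[0].isdigit():
            tokens.append("LIT")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in JAVA_KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens


original = "int sum(int a, int b) { return a + b; } // add two numbers"
renamed = "int total(int x,int y){\n    return x + y; /* same logic */\n}"
changed = "int sum(int a, int b) { return a - b; }"

print(normalize(original) == normalize(renamed))
print(normalize(original) == normalize(changed))
```

After normalization the renamed copy yields the same token sequence as the original, while the fragment with a changed operator does not, so a plain lexical comparison of the normalized sequences is enough to separate them.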
The authors of [22] developed a Back Propagation Neural Network algorithm, which is one of the familiar approaches in the field of information technology.
The authors of [23] introduced a new attack, called the "shadow attack", which evades current behavior-based malware detectors by partitioning one piece of malware into multiple "shadow processes".
A novel technique for Android security called SCSdroid (System Call Sequence Droid) was proposed in [24]. It works on the thread-grained system call sequences invoked by systems and applications. SCSdroid extracts the common malicious subsequences from the system call sequences of MRAs belonging to the same family. The extracted common subsequences are then used to identify an application under evaluation, without requiring the original application.
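The extraction procedure of SCSdroid is not detailed here, so the following is only a generic illustration of what a common subsequence of two system call sequences looks like, using the textbook longest common subsequence (LCS) dynamic program. The call names in the traces are invented.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of sequences a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to rebuild one optimal subsequence.
    result, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical system call traces of two samples from the same family.
trace_1 = ["open", "read", "connect", "send", "close"]
trace_2 = ["open", "stat", "connect", "send", "exit"]

common = longest_common_subsequence(trace_1, trace_2)
print(common)
```

A subsequence shared by many samples of one family could then serve as a behavioral pattern to look for in an application under evaluation.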
The authors of [25] proposed a technique that discovers triggering relations among network requests and relies on this structural knowledge to investigate theft or to identify malicious activities that are not attributable to any legitimate cause. A triggering relation defines the temporal and causal relationship between two events. To infer triggering relations, they designed and compared rule-based and learning-based methods. They introduced a user-intention based security policy for pinpointing malicious activities based on a triggering relation graph. They used the DARPA dataset and 7 GB of real-world network traffic to evaluate the solution. The results showed that the analysis can successfully detect various malware activities, such as those of spyware.
The authors of [26] proposed a new technique that detects clones in source code in order to enhance the security of a software system. They developed a mining algorithm that takes program structure into account, and defined similarity measures for sequential structured text in order to retrieve similar fragments in source code.
The authors of [27] proposed a technique for the detection of browser extension types by employing the Hidden Markov Model (HMM). They trained multiple HMMs on a variety of compilers and malware generators. The malware samples were then scored against these models, and the samples were grouped into clusters based on the resulting scores. The clustering results were able to categorize malware samples into their appropriate families with good accuracy. Since none of the malware families in the test set were used to generate the HMMs, the research shows that the approach can effectively classify previously unknown malware in some scenarios. This clustering strategy could therefore serve as a useful tool in malware analysis and classification.
The authors of [28] presented a classification architecture called Malware Evaluator, which turned malware encyclopedias such as Trend Micro and Symantec into an automated classifier that clusters species according to taxonomic features and helps in the detection and classification of zero-day attacks, based on better learning and generalization potential than other existing approaches. Their framework treated malware categorization as a supervised learning task, built learning models for taxonomic features with gradient boosting decision trees (GBDTs) as well as support vector machines (SVMs), and visualized malware categorizations with self-organizing maps. They also applied word stemming and stopword removal for feature space reduction, along with a tokenization process that generated the attributes of malware strains. Malware Evaluator revealed that Trojans, Infectors, Backdoors and Worms contribute significantly to the malware community and pose serious threats to the Internet ecosystem. Finally, it helped in defending against risks and recognizing zero-day attacks in real-world scenarios.
The authors of [29] evaluated the ability of static code analysis tools to detect security vulnerabilities. Static analysis of source code is a way of discovering software risks and errors, and software analyzers are also used to discover security vulnerabilities. They implemented scenarios based on the Juliet benchmark test suite, which made it possible to automatically evaluate a large number of test cases covering a wide range of C/C++ and Java weaknesses, to determine the tools' performance both per CWE and across all CWEs, and to manage the examination and assessment of the outcomes. The experimental approach was used to assess the ability of static code analyzers to detect security vulnerabilities.
A novel approach [30] was introduced for retrieving information by employing speech recognition and a neural network. The authors used a Markovian method for updating the sampling weights produced from the input speech. A Kalman filtering technique was then used for feature vector extraction, and classification was done on the basis of rationalizing energy.
In [31], soft computing approaches were extended to evaluate and optimize the performance of procedures. The authors proposed a hybrid technique for controlling and monitoring the arm of flexible robots by employing fuzzy logic algorithms, neural networks and swarm-based optimization algorithms.
In [32], a technique for malware processing is introduced that employs a Bayesian approach and the Nymble algorithm. The Bayesian technique is used to build a secure DTN, and the algorithm is used for the removal of malware.
The authors of [33] proposed a reverse engineering approach that blocks access to resources and grants permission to use them through a requirement-based method.
The authors of [34] addressed mobile security, one of the major domains growing rapidly with mobile technology. To address malware threats in particular, numerous malware analysis methods have been developed to identify, classify and defend against malicious code and mobile threats. The authors proposed a feature-rich hybrid anti-malware system named Andro-Dumpsys. The system leverages volatile memory acquisition for precise malware detection and categorization, and is based on a technique called similarity matching. Andro-Dumpsys can detect malicious applications and classify malware applications into groups of similar behavior based on unique behavior characteristics. The results demonstrated that the system was very reliable and performed well in detecting malware. It also performed well in classifying malware families, with low false positives and false negatives. Furthermore, Andro-Dumpsys makes it possible to find zero-day malware.
2. Proposed Methodology
In this research, Java source code projects and applications
are segregated into their constituent functions across
classes in order to determine the code clones in the source
files. The code clones are identified using the String Pattern
Back Propagation Neural Network Algorithm, whose development
is the goal of this research. These code clones are then used
as signatures and compared with the malware signatures
maintained in the repository in order to detect malware and
measure detection accuracy.
Figure 6. Flow chart of methodology.
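The modularization step, in which functions are separated out of the classes of a project, can be illustrated with a minimal Python sketch. The regular expression and the helper name `extract_methods` are assumptions made for illustration and are not taken from the system described in this paper; a production implementation would more likely rely on a Java parser.

```python
import re

# Simplified pattern for a Java method header: optional modifiers, a return
# type, the method name, a parameter list and an opening brace.
# Illustrative assumption only: it ignores annotations, constructors and
# generic types containing spaces, and it can misfire on constructs such as
# "else if (...) {".
METHOD_HEADER = re.compile(
    r"^\s*(?:(?:public|private|protected|static|final)\s+)*"
    r"[\w<>\[\]]+\s+(\w+)\s*\([^)]*\)\s*\{",
    re.MULTILINE,
)


def extract_methods(java_source: str) -> list[str]:
    """Return the names of the methods found in a Java source string."""
    return METHOD_HEADER.findall(java_source)


sample = """
public class Demo {
    public static int add(int a, int b) {
        return a + b;
    }
    private void log(String msg) {
        System.out.println(msg);
    }
}
"""

print(extract_methods(sample))  # ['add', 'log']
```

Each extracted function can then be treated as one module and handed to the clone detection stage.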
2.1 Methodology
Precision, recall and accuracy are computed from the
results obtained and compared with existing research. The steps followed to determine these measures
for the implemented algorithm, and thereby evaluate the
performance of the developed system, are shown in Figure
6. The methodology is divided into the following steps:
Step 1: First, a user panel is designed using Java
Swing, where the user has two options for uploading
different types of project: a Malware project, which adds
malware patterns, and an Application project.
a) The Malware project is uploaded to store its code
patterns in the database.
b) The Application project is uploaded to check the project for vulnerabilities by searching for code clones with
the implemented “String Pattern Back Propagation NN
Algorithm” and matching them against the malware
signatures.
Step 2: This phase implements the String Pattern Back
Propagation Neural Network algorithm to find clones in the project.
Step 3: In this phase, every module of the project is passed to the algorithm to find malware patterns inside
the project code.
Step 4: Finally, the results are validated and compared
with other algorithms by computing the performance factors.
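Steps 2 and 3 can be sketched in Python as a scan of every file for lines that match a stored signature. This is a minimal sketch of plain token matching under stated assumptions: a line is flagged when its whitespace-separated tokens equal those of a signature position by position, and the names `tokenize`, `find_signature_lines`, `project` and `repository` are hypothetical. It does not model the back propagation component, which the paper does not specify in detail.

```python
def tokenize(line: str) -> list[str]:
    """Split a source line into whitespace-separated tokens."""
    return line.split()


def find_signature_lines(
    files: dict[str, str], signatures: list[str]
) -> dict[str, list[int]]:
    """Map each file name to the 1-based line numbers that equal a signature.

    `files` maps a file name to its source text and `signatures` holds the
    stored malware patterns, one line of code each (both are assumptions).
    """
    signature_tokens = [tokenize(s) for s in signatures]
    hits: dict[str, list[int]] = {}
    for name, source in files.items():  # outer loop over the project files
        matched = []
        for number, line in enumerate(source.splitlines(), start=1):
            tokens = tokenize(line)  # the split line
            # List equality compares the tokens position by position.
            if tokens and any(tokens == sig for sig in signature_tokens):
                matched.append(number)  # remember the matching line number
        if matched:
            hits[name] = matched
    return hits


project = {
    "A.java": "int x = 1;\nRuntime.getRuntime().exec(cmd);\n",
    "B.java": "int y = 2;\n",
}
repository = ["Runtime.getRuntime().exec(cmd);"]

print(find_signature_lines(project, repository))  # {'A.java': [2]}
```

Because the scan stays within a file until all of its lines are read, several matches in the same file are all recorded before the next file is processed.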
2.2 Outline of Algorithm
The following algorithm is implemented in this research
in order to find clones from the repository.
Algorithm: String Pattern Back Propagation Neural
Network Algorithm
Dataset: A project containing Java source files. The
user first uploads the project to the system.
1. begin
2. set x = {r1, r2, ..., rn}, where x is an array of files
3. set i = 0, D = length of array
4. for each i < D
5. read a line from the file, split the line and store it in array list b
6. set k = 0, length = b.length
7. for each k < length, check value[k] == malware[k]
8. if the condition is true, then set flag = true and store the line number in array t
9. if flag = false, return to step 2
10. if i > D, return false
In the first step of the algorithm, the project files are
uploaded to the system. Next, an array of files is created,
a loop over the files is started, and a line from the current
file is read in the fifth step. At the same time, the line is
split and stored in an array named array list b. A loop up to
the length of each line then runs and searches for malware.
If malware is detected, the algorithm continues with the
inner loop and detects further malware in the same file
instead of returning to the outer loop.
3. Results and Discussion
Precision is the fraction of retrieved documents that
are relevant to the query.
Recall, in information retrieval, is the fraction of the
documents relevant to the query that are successfully retrieved.
The objective of the research was to evaluate the proposed
clone detection algorithm in terms of precision, recall
and F-measure. Experiments were conducted on two Java
projects, one consisting of 20 files and the other of 40
files; the number of files is referred to as the window size.
Each project was analyzed using the String Pattern Back
Propagation algorithm, which is implemented as a system.
After applying the algorithm, the code clones were determined and matched against the malware signatures
present in the repository. Precision, recall and F-measure
were then computed using the formulas below:
\[ \text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \]
\[ \text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \]
The F-measure combines precision and recall as their
harmonic mean. The conventional F-measure is calculated as follows:
\[ \text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
All these parameters are used to calculate the performance of the implemented system or proposed algorithm.
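The three measures can be computed directly from the set of relevant items and the set of retrieved items. The following is a minimal Python sketch; the function name `precision_recall_f` and the example clone identifiers are hypothetical and serve only to illustrate the formulas.

```python
def precision_recall_f(relevant: set, retrieved: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) for two sets of items."""
    hits = len(relevant & retrieved)  # items that are relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: four true malware clones, five reported by a detector.
relevant = {"c1", "c2", "c3", "c4"}
retrieved = {"c1", "c2", "c3", "c5", "c6"}

p, r, f = precision_recall_f(relevant, retrieved)
print(round(p, 4), round(r, 4), round(f, 4))
```

Here three of the five reported clones are correct, so precision is 3/5, while three of the four true clones are found, so recall is 3/4.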
Table 1 lists the parameter values obtained by applying
the proposed algorithm to the Java source files for the
detection of malware patterns. The results in the table
show the high accuracy of the implemented approach compared with the existing work, indicating
better outcomes for the system developed in this research.
Hence, the algorithm is efficient for malware detection in source code files.
The data set of two projects associated with the
research is passed through the clone detection approach,
and the results are depicted in Figure 7, which shows that
precision has increased with the newly developed algorithm, indicating that it is efficient and reliable. A
window size of 20 means that the input project consists of
20 source files. For this window size, precision improved from 81% to 97.25% when malware signatures were detected using the String Pattern Back
Propagation NN Algorithm.
Table 1. Comparison of the proposed algorithm with the existing ScalClone approach
Approach                                                   Window Size   Precision   Recall   F-measure (%)
String Pattern Back Propagation Neural Network Algorithm   20            0.9725      0.9670   96.97
String Pattern Back Propagation Neural Network Algorithm   40            0.9706      0.9677   96.92
ScalClone System                                           20            0.81        100      88
ScalClone System                                           40            0.83        100      91
Figure 7. Precision.
Figure 8. Recall.
Figure 9. F-measure.
The first data set is a collection of 20 Java files, for which
computing similar code fragments is easier than for the project with 40 files. To obtain the precision
of the proposed algorithm, the code clones are identified
and malware signatures are detected. Figure 8 illustrates
the recall values for similar code clones, and the F-measure
values of the proposed clone detection system are depicted
in Figure 9. The F-measure is consistently above 95% for
both data sets, higher than that of the existing research.
The F-measure rose from 88 to 96.97 for a window size
of 20, and from 91 to 96.92 for the second data set of 40
source files.
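As a consistency check, the F-measure of the proposed algorithm can be recomputed from the precision and recall values in Table 1 using the harmonic mean formula. The subscripts 20 and 40 are notation introduced here for the two window sizes.

\[ F_{20} = \frac{2 \times 0.9725 \times 0.9670}{0.9725 + 0.9670} = \frac{1.8808}{1.9395} \approx 0.9697 \]
\[ F_{40} = \frac{2 \times 0.9706 \times 0.9677}{0.9706 + 0.9677} = \frac{1.8785}{1.9383} \approx 0.9691 \]

These correspond to about 96.97% and 96.91%, which agrees with the F-measure column of Table 1 (96.97 and 96.92) up to rounding of the reported precision and recall.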
Finally, the results of the proposed system are compared with ScalClone, the existing research approach, in
terms of precision, recall and F-measure. The comparison
indicates that the developed system is effective. The results
of the proposed work, in terms of precision, recall and accuracy after applying the implemented
algorithm, are depicted graphically in Figure 10.
Figure 10. Algorithmic results of proposed work.
4. Conclusion and Future Scope
Malware poses a serious threat to computer security and hampers the normal functioning of a system by
modifying the source code or appending malicious
information to it. Hence, malware detection using code
clones is our major concern. Our main motive is to detect
malware in source code files or in other applications
that pose a threat to computer security or distort computer functioning. Malware is increasing rapidly and
is used by unauthorized users to destroy software or
computer systems. In this paper we have developed the
String Pattern Back Propagation Neural Network Algorithm
to determine code clones in Java projects and match them
against the maintained repository of malware signatures.
The proposed clone detection algorithm is used to detect
malware and hence helps protect the system from malicious
attacks. The algorithm is simple and easy to understand. We
have used it to detect malware, and the results are reported
in this paper. One limitation of the current work is that it
identifies malware in one language only. Future work can
therefore pursue the detection of malware in projects or
applications that use multiple languages.
5. References
1. M. J. Rekoff. On reverse engineering. IEEE Trans.
Systems, Man, and Cybernetics, pages 244–252, March-April 1985.
2. C.K. Roy and J.R. Cordy. A Survey on software clone detection research. Queen's School of Computing TR, 541 : 115,
2007.
3. Bellon, Stefan, Rainer Koschke, Giuliano Antoniol, Jens
Krinke, and Ettore Merlo. “Comparison and evaluation
of clone detection tools.” Software Engineering, IEEE
Transactions on 33, no. 9 (2007): 577-591.
4. Vinod, P., R. Jaipur, V. Laxmi, and M. Gaur. “Survey on
malware detection methods.” In Proceedings of the 3rd
Hackers’ Workshop on Computer and Internet Security
(IITKHACK’09), pp. 74-79. 2009.
5. Singhal, Priyank, and Nataasha Raul. “Malware detection module using machine learning algorithms to assist in
centralized security in enterprise networks.” arXiv preprint
arXiv:1205.3062 (2012).
6. Fred Cohen. Computer Viruses. PhD thesis, University of
Southern California, 1985.
7. Williamson, David. “Deconstructing malware: what it is and
how to stop it.” Information Security Technical Report 9, no.
2 (2004): 27-34.
8. Sridhara, Sudarshan Madenur, and Mark Stamp.
“Metamorphic worm that carries its own morphing
engine.” Journal of Computer Virology and Hacking
Techniques 9, no. 2 (2013): 49-58.
9. Ravula, Ravindar Reddy. “Classification of Malware using
Reverse Engineering and Data Mining Techniques.” PhD
diss., University of Akron, 2011.
10. Annachhatre, Chinmayee, Thomas H. Austin, and Mark
Stamp. “Hidden Markov models for malware classification.” Journal of Computer Virology and Hacking
Techniques 11, no. 2 (2015): 59-73.
11. Mathur, Kirti, and Saroj Hiranwal. “A survey on
techniques in detection and analyzing malware executables.” International Journal of Advanced Research in
Computer Science and Software Engineering 3, no. 4
(2013): 422-428.
12. Zhou, Yajin, and Xuxian Jiang. “Dissecting android malware: Characterization and evolution.” In Security and
Privacy (SP), 2012 IEEE Symposium on, pp. 95-109. IEEE,
2012.
13. Landage, Jyoti, and M. P. Wankhade. “Malware and
Malware Detection Techniques: A Survey.” In International
Journal of Engineering Research and Technology, vol. 2, no.
12 (December-2013). ESRSA Publications, 2013.
14. Shahzad, Khurram, and Steve Woodhead. “A Pseudo-Worm
Daemon (PWD) for empirical analysis of zero-day network worms and countermeasure testing.” In Computing,
Communication and Networking Technologies (ICCCNT),
2014 International Conference on, pp. 1-6. IEEE, 2014.
15. Pandey, Sudhir Kumar, and B. M. Mehtre. “A Lifecycle
Based Approach for Malware Analysis.” In Communication
Systems and Network Technologies (CSNT), 2014 Fourth
International Conference on, pp. 767-771. IEEE, 2014.
16. Saxe, Joshua, and Konstantin Berlin. “Deep neural network
based malware detection using two dimensional binary
program features.” In 2015 10th International Conference
on Malicious and Unwanted Software (MALWARE), pp.
11-20. IEEE, 2015.
17. Jackson, D. and Waingold, A., 2001. Lightweight extraction of object models from bytecode. IEEE Transactions on
Software Engineering, 27(2), pp.156-169.
18. Roy, Chanchal K., James R. Cordy, and Rainer Koschke.
“Comparison and evaluation of code clone detection
techniques and tools: A qualitative approach.” Science of
Computer Programming 74, no. 7 (2009): 470-495.
19. Kumar, Renuka, and Anand Raj Essar Vaishakh. “Detection
of Obfuscation in Java Malware.” Procedia Computer
Science 78 (2016): 521-529.
20. Farhadi, Mohammad Reza, Benjamin Fung, Philippe
Charland, and Mourad Debbabi. “BinClone: detecting code
clones in malware.” In Software Security and Reliability
(SERE), 2014 Eighth International Conference on, pp.
78-87. IEEE, 2014.
21. Cordy, James R., Thomas R. Dean, and Nikita Synytskyy.
“Practical language-independent detection of near-miss
clones.” In Proceedings of the 2004 conference of the
Centre for Advanced Studies on Collaborative research, pp.
1-12. IBM Press, 2004.
22. Cilimkovic, Mirza. “Neural networks and back propagation algorithm.” Institute of Technology Blanchardstown,
Blanchardstown Road North, Dublin 15 (2015).
23. Ma, Weiqin, Pu Duan, Sanmin Liu, Guofei Gu, and
Jyh-Charn Liu. “Shadow attacks: automatically evading
system-call-behavior based malware detection.” Journal in
Computer Virology 8, no. 1-2 (2012): 1-13.
24. Lin, Ying-Dar, Yuan-Cheng Lai, Chien-Hung Chen,
and Hao-Chuan Tsai. “Identifying android malicious
repackaged applications by thread-grained system call
sequences.” Computers & Security 39 (2013): 340-350.
25. Zhang, Hao, Danfeng Daphne Yao, Naren Ramakrishnan,
and Zhibin Zhang. “Causality reasoning about network
events for detecting stealthy malware activities.” Computers
& Security 58 (2016): 180-198.
26. Yoshihisa Udagawa. “A Novel Technique for Retrieving
Source Code Duplication” ICONS 2014 : The Ninth
International Conference on Systems, pp.172-177
27. Annachhatre, Chinmayee, Thomas H. Austin, and Mark
Stamp. “Hidden Markov models for malware classification.” Journal of Computer Virology and Hacking
Techniques 11, no. 2 (2015): 59-73.
28. Chen, Zhongqiang, Mema Roussopoulos, Zhanyan Liang,
Yuan Zhang, Zhongrong Chen, and Alex Delis. “Malware
characteristics and threats on the internet ecosystem.”
Journal of Systems and Software 85, no. 7 (2012): 1650-1672.
29. Goseva-Popstojanova, Katerina, and Andrei Perhinschi.
“On the capability of static code analysis to detect security
vulnerabilities.” Information and Software Technology 68
(2015): 18-33.
30. Sajeer, K., and Paul Rodrigues. “Novel Approach of
Implementing Speech Recognition using Neural Networks
for Information Retrieval.” Indian Journal of Science and
Technology 8, no. 33 (2015).
31. Khoobjo, E. “New Hybrid Approach to Control the
Arm of Flexible Robots by using Neural Networks,
Fuzzy Algorithms and Particle Swarm Optimization
Algorithm.” Indian Journal of Science and Technology 8, no.
35 (2015).
32. Jeyaseelan, WR Salem, and S. Hariharan. “Malware
Detection and Elimination using Bayesian Technique
and Nymble Algorithm.” Indian Journal of Science and
Technology 8, no. 34 (2015).
33. Ahmad, Dar Muneer, and Parvez Javed. “Security
Comparison of Android and IOS and Implementation of
User Approved Security (UAS) for Android.” Indian Journal
of Science and Technology 9, no. 14 (2016).
34. Jang, Jae-wook, Hyunjae Kang, Jiyoung Woo, Aziz
Mohaisen, and Huy Kang Kim. “Andro-dumpsys: anti-malware system based on the similarity of malware
creator and malware centric information.” Computers &
Security (2016).
35. Sathish, Vidhya, and P. Sheik Abdul Khader. “Deployment
of proposed botnet monitoring platform using online malware analysis for distributed environment.” Indian Journal
of Science and Technology 7, no. 8 (2014): 1087.
36. Malik, Sapna, and Kiran Khatter. “System Call Analysis of
Android Malware Families.” Indian Journal of Science and
Technology 9, no. 21 (2016).