Indian Journal of Science and Technology, Vol 9(33), DOI: 10.17485/ijst/2016/v9i33/95880, September 2016
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Detection of Malware of Code Clone using String Pattern Back Propagation Neural Network Algorithm

Simarleen Kaur and Arvinder Kaur*
Computer Science and Engineering Department, Chandigarh University, Mohali - 140413, Punjab, India; [email protected], [email protected]

Abstract
Background/Objectives: Malware is evolving rapidly, so its identification is a vital concern in a world where information technology is emerging quickly. This paper emphasizes the enhancement of performance parameters for malware detection in source code clones using the proposed clone detection algorithm. Methods/Statistical Analysis: Earlier approaches did not consider data types and variables during clone detection. To detect malware in code clones and achieve better results, the approach adopted is the implementation of a clone detection algorithm, 'String Pattern Back Propagation Neural Network', which determines the code clones and matches them with malware signatures in the repository for the computation of performance parameters. Findings: Malware identification is carried out on Java projects with different window sizes (20, 40). The source code files pass through a modularization phase that extracts functions from different classes. Code clones are determined by applying the implemented algorithm and are evaluated against malware signatures. The employed approach was observed to give better performance, with a high accuracy of 96.97%, and hence proved to be deterministic and efficient.
The paper provides an overview of the state of the art and focuses on enhanced performance in terms of precision, recall and F-measure for the Java language, where data types, variables and comments in the application are also given priority when detecting code clones, in contrast to existing research on malware binaries. Applications/Improvements: To handle the tremendous range of malicious code, the approach can be applied to multiple languages to detect the number of clones in an application or a system and achieve better outcomes.

Keywords: Code Clone, Clone Detection Algorithm, Malware, Malware Analysis, Reverse Engineering

*Author for correspondence

1. Introduction

Nowadays, software is important for every system, as it is used for various purposes. It collects details and performs various functions related to e-learning, mobile banking, education etc. Software engineering is further classified as:
• Reverse Engineering
• Forward Engineering

Reverse engineering, also known as back engineering, retrieves knowledge or design details from an existing product or application and then regenerates a new product or application based on the retrieved information. It basically involves taking apart something (a computer program, a mechanical device etc.) so that its components can be analyzed and studied in detail. This practice, long used in older industries, is now used in the computer era for both hardware and software. In software reverse engineering, the machine code of a program is reversed back to the source code that was written in some programming language. Reverse engineering includes the steps [1] mentioned below:
1. The components of the system are identified and their relationships are found.
2. The system is represented in another form.
3. The system is then represented in physical form.

Forward engineering is the traditional process of moving from high-level abstractions and designs to the physical implementation of a system, proceeding step by step towards the intended goal.
In the software domain, Chikofsky and Cross [2] define reverse engineering as "the process of analyzing a subject system to create representations of the system in another form or at a higher level of abstraction". Reverse engineering is thus the analysis of results or outcomes to gain a better understanding of software products. The term reverse engineering [1] refers to the analysis of a software system for the
a) identification of the software product components and their correlation or association,
b) creation of a system representation in another form or at a higher level of abstraction, and
c) creation of a visible component representation of that software product.

Hence, software reverse engineering yields deep understanding, knowledge and real facts about the software system. The initial stage of reverse engineering begins with a software system or legacy source code, as depicted in Figure 1. The existing source code is modified and restructured to give clean source code. Subsequent engineering activities are performed on the clean source code, which is easier to read and understand. Extraction of abstract details is the elementary part of the reverse engineering process: meaningful information is fetched from the software system after analyzing the source code.

1.1 Data Flow Diagrams

Reverse engineering starts from an abstraction process at the lower level, consisting of source code analysis, to reach a higher level of abstraction (software requirement specifications and design documents or UML diagrams). The data flow diagrams for reverse engineering techniques applied to the analysis of malware projects to determine code clones are as follows. The DFD at level 0 (Figure 2) depicts the basic functions and methods of reverse engineering.
First, the program specification and its requirements are retrieved by using the functions of reverse engineering. In the final stage, specifications are extracted, with the pertinent software as input and the requirement specification as outcome.

Figure 1. Evolutionary Development of Reverse Engineering.

Figure 2. Level 0 DFD.

In the level 1 DFD (Figure 3), there is a repository that stores all the essential details of abstraction extracted by applying reverse engineering techniques. First, relevant software is needed. Then the source code is analyzed, extracted and parsed to fetch functions, classes and other string details. The database holds the indexed set of extracted functions, represented in the required format. Retrieval functions use the pertinent software to acquire the program specification, design document and other software requirements.

Figure 3. Level 1 DFD (components shown: Artifacts Recognition, Pertinent Software Product, Extractors, Compiler, Reverse Engineering Techniques, Repository, Analyzers & Visualizers, Debugger, Program Specification).

In the level 2 DFD (Figure 4), source code and design files are merged and passed into the analysis phase. First, existing software is needed. The user goes to an analysis phase or interface of the source code, which parses the file set and the semantics of the files. Any of several existing reverse engineering tools can be used to work on the source code of the pertinent software. Those tools connect to the database, and reverse engineering techniques are then applied to the software code.

Figure 4. Level 2 DFD.

1.2 Code Clone in Software

Code clones, or simply clones, are the terms usually used for sequences of duplicate code, and the automated process of determining redundancy or duplication in source code is called clone detection.

1.3 Clone Definition

The revolution in information technology has resulted in large-scale software projects containing significant code duplication, which is the outcome of copy and paste activities. Code cloning hinders the software maintenance process and degrades the quality of software. A code clone is a segment of code that is identical to some other code portion located in the source files. In software cloning, copy and paste operations [3] are performed widely, with modifications made to the source code files at low and high levels. The resulting replicas of the code are called code clones. Research in this domain has shown that code cloning has a great effect on the maintenance phase of the software life cycle. The problem of detecting duplicated code still persists. A major concern is to explore various clone detection tools and techniques for removing software clones.

Clones can be categorized based on textual and functional similarity. Code fragments are said to be similar if they have identical text in the source code or carry similar functionality. The first class of clone is obviously a clone that is copied and pasted into some other location.

Type 1: Exact Clones - Identical code segments except for variations in whitespace, layout and comments.
Type 2: Renamed Clones - Syntactically identical code fragments except for variations in literals, identifiers and layout, in addition to the variations allowed for exact clones.
Type 3: Gapped Clones - Copied fragments with further modifications such as added, removed or changed statements, in addition to the variations allowed for renamed clones.
Type 4: Semantic or Logical Clones - Code fragments that perform the same computation but are implemented through different syntactic variants.

1.4 Malware
Malware remains a dangerous and constant threat, and its success has spawned a host of improved identification and interception methodologies. Security tools such as virus scanners search for characteristic byte sequences to distinguish malicious code. The quality of a detector is determined by the detection techniques it employs. A good malware detection technique must be able to pinpoint malicious code that is concealed or embedded in the original code and should have some capability to detect as yet unknown malware.

Malicious software [4,5], commonly known as malware, refers to software specifically designed to enter a computer system without the owner's informed consent in order to gain unauthorized access to system resources and perform malicious activities. It consists of code written by its developer to disrupt operations and gather vital information, leading to loss of privacy, exploitation and other abusive behavior. The term is coined by merging the words 'malicious' and 'software' and covers computer viruses, spyware, botnets, rootkits, trojans and many more. The history of malicious code began with the 'computer virus', a term [6] first established by Cohen in 1983. Malicious software is the biggest threat in today's digitized world, as it continues to grow at an alarming rate and to evolve in complexity. Malware production has become a multibillion dollar industry in recent years. Malicious software operations have also become more and more covert, making their detection more challenging.

Malware is a catch-all term for software that is employed to cause harm intentionally to a computer system.
Confidentiality, integrity and availability are the three elements affected by malware [7]. Malware can be categorized into various types such as viruses, trojans, rootkits, backdoors, spyware, worms, adware and botnets on the basis of infection mechanism and behavior. The classes of malware are briefly described below.

1.5 Virus

A computer virus is a type of malware program that replicates by attaching itself to other programs with harmful intent. It gets appended to executable files and needs human participation in order to replicate. Malicious code is man-made, written by its authors. A virus is a piece of code that is injected into a system without the permission of the owner. Viruses are self-replicating programs [8] that can cause major destruction within the host machine without the owner's consent.

1.6 Trojan Horse

A Trojan horse is a program that appears to be benign but performs malicious activity. The term derives from the famous story of the Trojan horse in Greek mythology. It is a non-replicating malware type [9,10] that drops a malicious payload while pretending to perform a desired operation. The aim of Trojan malware is to steal the user's confidential and sensitive data, such as passwords, without the knowledge of the owner of the system, i.e. through unauthorized access. One such Trojan, the SMS Trojan, reportedly affected most Android devices in 2012. It sends SMS messages to premium rate numbers without the permission of the user, causing financial loss to the user. OpFake [36] is a well-known Trojan-SMS, ranked second in terms of popularity. It is named after the Opera Mini mobile browser, as it poses as a downloader for it. Well-known examples of Android Trojans are Ackposts, Acnetdoor, Adsms etc.
1.7 Worms

A computer worm is a self-replicating program that is capable of sending copies of itself to other nodes or computer systems by accessing the network invisibly, without any user intervention [5]. Worms consume bandwidth, causing harm to the entire network. Unlike viruses, worms do not need the support of any host file. Sasser, MyDoom, Blaster, Melissa etc. are some examples of worms. An application carrying worm malware spreads the infection to all devices connected to the infected device, either through the network or through removable media. The worm does this by creating a similar or exact copy of itself on the connected devices [11]. Android Obad OS is an example of a Bluetooth worm.

1.8 Botnet

An application with this malware makes the user's device available to be controlled by a remote server without the consent of the user. Once the device comes under the control of the remote server, it can be used for attacks such as transferring sensitive information to the remote server, automatically downloading malicious applications to the device, service attacks etc. Gemini and Beanroot [12] are some examples of Android botnets. A bot is remotely controlled autonomous software, usually a zombie program controlled over a network infrastructure. Botnets are generally categorized into three kinds [35]: centralized, decentralized and hybrid structures.

1.9 Aggressive Adware

Adware, short for advertising-supported software, is a kind of malware that delivers advertisements automatically. Common instances of adware include pop-up ads on websites or flashing on screens, and advertisements displayed by software after the installation of malicious software. Activities performed by this malware include stealing bookmarks, sending unnecessary notifications, creating shortcuts on the screen etc.
This adware is known for sending unnecessary ads to the device, which hinders efficient usage of the device.

1.10 Ransomware

Ransomware is a kind of malware that essentially holds a computer system captive while demanding a ransom. It is a kind of "scareware", as it scares the person into paying a certain amount or fee. Its strategy is to lock either the entire device or some files with a password until a ransom is paid through an online payment mode. It restricts the user's access to the computer either by locking down the system or by encrypting files on the hard drive. This causes huge financial loss to users, who get scared and pay the demanded amount.

1.11 Trapdoor/Backdoor

Trapdoor [10] is a collective term for a program that bypasses security checks. This malware permits a malicious user to perform operations on the affected computer that can undermine legitimate actions. Such operations are very harmful and pose a serious threat to the system: they allow the attacker to destroy relevant information, capture secure and private data and delete files on the hard disk. Backdoor examples include BioNet and Back Orifice.

1.12 Rootkits

Rootkits are developed to take control of an infected machine by acquiring administrator access to the system. A rootkit merges the behavior of a Trojan horse and a backdoor [9]: it exhibits Trojan behavior by substituting the original version of a file with an infected copy, and backdoor behavior by allowing attackers to access the system remotely. Unlike Trojans and backdoors, it also modifies operating system programs. Rootkits are further divided into two types on the basis of operating environment: user mode and kernel mode.
1.13 Spyware

Spyware is meant to monitor and collect personal information about the logged-in user: the pages visited, specific email addresses or websites, key presses, private transactions and so on. It is likely to enter the computer system when trial software is installed after downloading or when the system is not in use.

There is a long list of malware, and this list is expanding with each passing year. Security researchers are working hard to overcome this alarming situation. Efforts have been made such as improving the GUI, providing warnings to users, removing malicious applications from official app stores etc. Despite all these efforts, the malware on users' devices is continuously evolving. The sources through which malware spreads are social networks, pirated software, removable media, emails and websites.

1.14 Malware Detection Techniques

Malware detection techniques shield the computer system from various types of infection and protect it from loss of secret or private information. They are used to detect malware, whose volume is increasing rapidly. Malware has increased exponentially since 2005, and 100% malware growth in 2016 is anticipated by Webroot. Techniques can be classified into three types: a) signature-based detection, b) anomaly-based detection and c) specification-based detection. Figure 5 illustrates the relationship between the various types of malware detection techniques. Each of the detection techniques can employ one of three different approaches: static, dynamic, or hybrid.

Figure 5. Malware analysis and detection.

1.14.1 Signature-based Detection

Signature-based detection maintains a database of malware signatures and detects malware by comparing code against the patterns stored in that repository.
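As a rough illustration of the repository lookup described here, the following Python sketch stores signatures as hashes of whitespace-normalized code fragments and tests a candidate fragment for membership. The hashing scheme, the function names and the sample fragments are assumptions made for this example; the paper does not specify how signatures are stored or compared.

```python
import hashlib


def fingerprint(fragment):
    """Hash a fragment after collapsing whitespace, so layout changes do not matter."""
    normalized = " ".join(fragment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_signature_db(known_malicious):
    """Build the signature repository as a set of fingerprints (hypothetical layout)."""
    return {fingerprint(fragment) for fragment in known_malicious}


def is_known_malware(fragment, signature_db):
    """Report whether the fragment's fingerprint is already in the repository."""
    return fingerprint(fragment) in signature_db


# Hypothetical repository with a single known-malicious fragment.
db = build_signature_db(["Runtime.getRuntime().exec(cmd);"])
print(is_known_malware("  Runtime.getRuntime().exec(cmd);\n", db))   # same code, different layout
print(is_known_malware("Runtime.getRuntime().exec(other);", db))     # one identifier changed
```

An exact-match lookup of this kind is fast, but a fragment that differs from every stored signature by even one token is not reported.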
Ideally, a signature should be capable of identifying any kind of malware exhibiting the malicious behavior specified by the signature. Signatures are generated by studying the disassembled code of a malware binary: the disassembled code is analyzed and features are extracted, which are then used to construct a database of malware families [13]. The main advantage of signature-based detection is that it identifies known instances of malware efficiently. On the other hand, it cannot detect zero-day attacks, i.e. new and unknown malware instances, because no signatures are yet available for them.

1.14.2 Anomaly-based Detection

Anomaly-based detection, also known as behavior-based detection, analyzes the behavior of known and unknown malware. It operates in two phases: a training (learning) phase and a detection (monitoring) phase. A key benefit of anomaly-based detection is its ability to detect zero-day attacks [7]. As defined in [14], zero-day attacks are attacks that are previously unknown to the malware detector, analogous to zero-day exploits. The two basic drawbacks of this technique are its high false alarm rate and the complexity of determining which features should be learned in the learning phase.

1.14.3 Specification-based Detection

Specification-based detection is a derivative of anomaly-based detection that tries to overcome the typically high false alarm rate associated with most anomaly-based detection techniques. Specification-based techniques [15] leverage a program specification or rule set describing valid behavior in order to capture the intended behavior of security-critical programs.
Programs or applications violating the specification are considered anomalous and, usually, malicious. The technique consists of examining program executions and identifying deviations of behavior from the specification. It is similar to behavior-based detection, but it is based on manually developed specifications rather than relying on machine learning approaches. Its potential benefit is that it can detect both known and unknown malware instances; its drawback is that developing a detailed specification is time consuming.

Malware analysis is the process of examining the purpose and functionality of malware. It can be carried out in three ways: static, dynamic and hybrid analysis. The goal of malware analysis is to understand the functionality of a piece of malware code in order to protect the organization's network.

Numerous malware detection algorithms and techniques exist. Some of them offer greater efficiency and accuracy, at the cost of more complexity than others. Selecting one of them based on criteria such as computational time, memory usage, algorithmic complexity and precision under different conditions is a challenging task. The purpose of this research is therefore to study and develop a String Pattern Back Propagation Neural Network Algorithm and to compare it with existing research outcomes in terms of accuracy in different scenarios.

The survey of the different techniques employed by authors is as follows. In [16], a neural network based malware detection system was described in detail. In [17], a tool was described for extracting an object model from the class files of a Java program. In [18], the authors provided a qualitative comparison and evaluation of the current state of the art in clone detection approaches and tools, and organized the large quantity of information into a coherent conceptual framework.
They also classified, compared and evaluated the techniques and tools along two dimensions. First, they categorized and compared approaches based on a number of facets, each of which has a set of (possibly overlapping) attributes. Second, they qualitatively evaluated the classified techniques and tools with respect to a taxonomy of editing scenarios designed to model the creation of Type-1, Type-2, Type-3 and Type-4 clones.

Code obfuscation, discussed in [19], prevents reverse engineering of software applications by protecting the key algorithms and data structures of an application from unauthorized access. Malicious users use the same approach to hide malware inside programs or software. The authors proposed an analysis system that detects lexical and string obfuscation in Java malware. Obfuscated and non-obfuscated malware are distinguished by identifying a set of eleven features that characterize obfuscated code, which are then used to train a machine learning classifier. A static analyzer that examines bytecode extracts the features, and chi-squared statistics are used to evaluate the robustness of each feature.

In [20], the authors proposed a code clone search technique that correlates previously analyzed malware with new malware so that the similarities between them can be identified, which avoids reanalyzing code fragments that were encountered earlier. They also developed a tool named BinClone for the identification of code clone fragments.

In [21], almost-perfect clones were designated as "near-miss" clones, and a simple technique for finding them was proposed. Standard lexical comparison tools were coupled with language-specific extractors to locate potential clones. The technique merges compiler-based and lexical methods for finding clones. The extraction of potential clones is the only stage in the system that depends on a particular language. The authors also described a novel clone detection technique based on the identification and reduction of potential clones using dynamic clustering and code normalization.

In [22], the authors developed the Back Propagation Neural Network algorithm, a widely known approach in information technology.

In [23], a new attack called the "shadow attack" was introduced, which evades current behavior-based malware detectors by partitioning one piece of malware into multiple "shadow processes".

A novel technique for Android security called SCSdroid (System Call Sequence Droid) was proposed in [24]. It uses the thread-grained system call sequences initiated by systems and applications. SCSdroid extracts the common subsequences from the system call sequences of malicious repackaged applications (MRAs) belonging to the same family. The extracted common subsequences are then used to identify evaluated applications, without requiring the original application.

In [25], the authors proposed a technique that discovers triggering relations among network requests and relies on this structural information to investigate theft or to identify malicious activities that have no legitimate cause. A triggering relation defines the temporal and causal relationship between two events. To infer triggering relations, they designed and compared rule-based and learning-based methods. They introduced a user-intention-based security policy for pinpointing malicious activities based on a triggering relation graph. They used the DARPA dataset and 7 GB of real-world network traffic to evaluate the solution. The results showed that the analysis can successfully detect various malware activities, including spyware.
In [26], the authors proposed a new technique that detects clones in source code in order to enhance the security of software systems. They developed a mining algorithm that considers program structure and defined similarity measures over sequential structured text for retrieving similar fragments in source code.

In [27], a technique was proposed for detection of browser extension types by employing the Hidden Markov Model (HMM). The authors trained multiple HMMs on a variety of compilers and malware generators, scored the malware samples against these models, and then separated the samples into clusters based on the scores. The clustered results made it possible to categorize malware samples into their appropriate families with good accuracy. Since none of the malware families in the test set were used to generate the HMMs, the research shows that the approach can effectively classify previously unknown malware in some scenarios. This clustering strategy could therefore serve as a useful tool in malware analysis and classification.

In [28], a classification architecture named Malware Evaluator was presented. It turns malware encyclopedias such as those of Trend Micro and Symantec into an automated classifier that clusters species according to taxonomic features and helps in the detection and classification of zero-day attacks through its learning and generalization potential, in comparison with other existing approaches. The framework treats malware categorization as a supervised learning task, builds learning models for taxonomic features with gradient boosting decision trees (GBDTs) as well as support vector machines (SVMs), and visualizes malware categorizations with self-organizing maps. Word stemming and stopword removal techniques were also deployed for feature space reduction, along with the tokenization process that generates attributes of malware strains.
Malware Evaluator revealed that Trojans, infectors, backdoors and worms contribute significantly to the malware community and pose serious threats to the Internet ecosystem. It helps in defending against risks and recognizing zero-day attacks in real-world scenarios.

In [29], the authors evaluated the ability of static code analysis tools to detect security vulnerabilities. Static analysis of source code is a way of finding software risks and errors, and software analyzers were used to discover security vulnerabilities. They implemented scenarios based on the Juliet benchmark test suite, which made it possible to automatically evaluate a large number of test cases covering the entire range of C/C++ and Java vulnerabilities, to determine the tools' performance both per CWE and across all CWEs, and to manage the analysis and assessment of the outcomes. The experimental approach was used to assess the ability of static code analyzers to detect security vulnerabilities.

A novel approach [30] was introduced for retrieving information by employing speech recognition and a neural network. A Markovian method was used for updating the sampling weights produced from the input speech. Then a Kalman filtering technique was used for feature vector extraction, and classification was done on the basis of rationalizing energy.

In [31], soft computing approaches were extended for evaluating the performance of procedures and their optimization. The authors proposed a hybrid technique for controlling and monitoring the arm of malleable robots by employing fuzzy logic algorithms, neural networks and swarm-based optimization algorithms.
In32, a technique for malware processing was introduced that combines a Bayesian approach with the Nymble algorithm. The Bayesian technique is used to build a secure DTN, and the Nymble algorithm is used to remove the malware.

In33, a reverse engineering approach was proposed that blocks access to resources and grants permission to access them on a requirement basis.

In34, the authors addressed mobile security, a major and rapidly growing domain of mobile technology. To counter malware threats in particular, numerous malware analysis methods have been developed to identify, classify and defend against malicious code and mobile threats. The authors proposed a feature-rich hybrid anti-malware system named Andro-Dumpsys, which leverages volatile memory acquisition for precise malware detection and categorization and is based on similarity matching. Andro-Dumpsys can distinguish malicious applications and group malware into families with similar behavior based on their characteristic behavior. The results showed that the system was reliable and performed well in detecting malware, and that it classified malware families with low false positive and false negative rates. Furthermore, Andro-Dumpsys makes it possible to find zero-day malware.

2. Proposed Methodology

In this research, Java source code projects and applications are decomposed into functions across their classes in order to determine the code clones in the source files. The code clones are identified using the String Pattern Back Propagation Neural Network Algorithm, whose development is the goal of this research. These code clones are then used as signatures and matched against the malware signatures maintained in the repository in order to detect malware.

Figure 6. Flow chart of methodology.

2.1 Methodology

The precision, recall and accuracy obtained from the results are used for comparison with the existing research. The steps followed to determine these measures, and so to evaluate the performance of the developed system, are shown in Figure 6. The methodology is divided into the following steps:

Step 1: First, a user panel is designed using Java Swing. The user has two options for uploading a project: a malware project, which adds malware patterns, and an application project.
a) A malware project is uploaded to store its code patterns in the database.
b) An application project is uploaded to check it for vulnerabilities. Its code clones are found by applying the implemented String Pattern Back Propagation NN Algorithm and are then matched against the malware signatures.

Step 2: This phase implements the String Pattern Back Propagation Neural Network Algorithm to find clones in the project.

Step 3: In this phase, every module of the project is passed to the algorithm to find malware patterns in the project's code.

Step 4: Finally, the results are validated and compared with other algorithms by computing the performance factors.

2.2 Outline of Algorithm

The following algorithm is implemented in this research in order to find clones from the repository.

Algorithm: String Pattern Back Propagation Neural Network Algorithm
Dataset: a project containing Java source files. The user first uploads the project to the system.
1. begin
2. set x = {r1, r2, ..., rn}, where x is an array of files
3. set i = 0, D = length of array
4. for each i < D
5. read a line from the file, split the line and store the parts in array list b
6. set k = 0, length = b.length
7. for each k < length, test value[k] == malware[k]
8. if the condition is true, set flag = true and store the line number in array t
9. if flag = false, return to step 2
10. if i > D, return false

In the first step of the algorithm, the project files are uploaded to the system. An array of files is then created, and a loop reads each file line by line (step 5). Each line is split and its parts are stored in an array, named array list b. An inner loop then runs over the length of the line and searches for malware. If malware is detected, the algorithm stays in the inner loop rather than returning to the outer loop, and continues detecting malware in the same file.

3. Results and Discussion

The objective of the research was to evaluate the proposed clone detection approach in terms of precision, recall and F-measure. Precision is the fraction of retrieved documents that are relevant to the query. Recall, in information retrieval, is the fraction of the documents relevant to the query that are successfully retrieved.

Experiments were conducted using two different sets of Java files, that is, on two projects: one of 20 files and the other of 40 files. The number of files is referred to as the window size. Each project was analyzed using the String Pattern Back Propagation algorithm, which was developed in the form of a system. After the algorithm was applied, the code clones it determined were matched against the malware signatures present in the repository.
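The matching loop outlined in Section 2.2 can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: lines are assumed to be split on whitespace, a line is assumed to match a signature only when every token agrees position by position (the `value[k] == malware[k]` test), and matched line numbers are collected 1-based in a list that plays the role of array t. The class and method names are hypothetical, and the sketch does not attempt to reproduce any neural-network component, which the outline does not specify.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; SignatureScanner and its methods are hypothetical names. */
final class SignatureScanner {

    /** Splits a source line into whitespace-separated tokens (assumed tokenization). */
    static String[] tokenize(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Position-by-position comparison, mirroring value[k] == malware[k] in the outline. */
    static boolean matches(String[] value, String[] malware) {
        if (value.length == 0 || value.length != malware.length) {
            return false;
        }
        for (int k = 0; k < value.length; k++) {
            if (!value[k].equals(malware[k])) {
                return false;
            }
        }
        return true;
    }

    /** Returns the 1-based numbers of the lines that match any signature (array t). */
    static List<Integer> scanFile(List<String> lines, List<String> signatures) {
        List<Integer> t = new ArrayList<>();
        for (int n = 0; n < lines.size(); n++) {
            String[] b = tokenize(lines.get(n));
            for (String signature : signatures) {
                if (matches(b, tokenize(signature))) {
                    t.add(n + 1);
                    break; // one matching signature is enough to flag the line
                }
            }
        }
        return t;
    }
}
```

Calling `SignatureScanner.scanFile` once per file in the uploaded project corresponds to the outer loop over the array of files. A real system would presumably need a more tolerant comparison, since exact token equality misses clones with renamed variables.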
We then computed precision, recall and F-measure using the following formulas:

\[ \text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \]

\[ \text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \]

The F-measure combines precision and recall as their harmonic mean. The conventional F-measure is calculated as follows:

\[ \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

All these parameters are used to assess the performance of the implemented system and the proposed algorithm. Table 1 lists the values obtained by applying the proposed algorithm to the Java source files for the detection of malware patterns. The results in the table indicate the high accuracy of the implemented approach compared with the existing work, which supports the effectiveness of the developed system for malware detection in source code files.

The data set of two projects was run through the clone detection approach, and the results are depicted in Figure 7, which shows that precision increased with the newly developed algorithm. A window size of 20 means that the input project consists of 20 source files. For this window size, precision rose from 0.81 to 0.9725 (81% to 97.25%) when malware signatures were detected using the String Pattern Back Propagation NN Algorithm.

Table 1. Comparison of the proposed algorithm with the existing ScalClone approach

Approach                                                    Window Size   Precision   Recall   F-measure (%)
String Pattern Back Propagation Neural Network Algorithm    20            0.9725      0.9670   96.97
String Pattern Back Propagation Neural Network Algorithm    40            0.9706      0.9677   96.92
ScalClone System                                            20            0.81        100      88
ScalClone System                                            40            0.83        100      91

Figure 7. Precision.

Figure 8. Recall.

Figure 9. F-measure.
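As a check on the formulas above, the following Java sketch computes precision and recall from sets of relevant and retrieved items and the F-measure from a precision/recall pair. The class and method names are hypothetical and chosen only for illustration. The convention of returning 0 for an empty denominator is also an assumption, not something stated in the paper.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative helper; RetrievalMetrics and its methods are hypothetical names. */
final class RetrievalMetrics {

    /** |relevant ∩ retrieved| / |retrieved|; 0 when nothing was retrieved (assumed convention). */
    static <T> double precision(Set<T> relevant, Set<T> retrieved) {
        return retrieved.isEmpty() ? 0.0 : (double) overlap(relevant, retrieved) / retrieved.size();
    }

    /** |relevant ∩ retrieved| / |relevant|; 0 when nothing is relevant (assumed convention). */
    static <T> double recall(Set<T> relevant, Set<T> retrieved) {
        return relevant.isEmpty() ? 0.0 : (double) overlap(relevant, retrieved) / relevant.size();
    }

    /** Harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    /** Size of the intersection, computed on a copy so the inputs are not modified. */
    private static <T> int overlap(Set<T> relevant, Set<T> retrieved) {
        Set<T> common = new HashSet<>(relevant);
        common.retainAll(retrieved);
        return common.size();
    }

    public static void main(String[] args) {
        // Precision and recall for window size 20, as reported in Table 1.
        double f = fMeasure(0.9725, 0.9670);
        System.out.printf("F-measure = %.2f%%%n", 100.0 * f); // prints "F-measure = 96.97%"
    }
}
```

Recomputing from the rounded Table 1 values for window size 20 reproduces the reported F-measure of 96.97%.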
The first data set is a collection of 20 Java files, for which computing similar code fragments is easier than for the project with 40 files. To obtain the precision of the proposed algorithm, the code clones are identified and the malware signatures are detected. Figure 8 shows the recall values for similar code clones, and Figure 9 shows the F-measure values of the proposed clone detection system. The F-measure is consistently above 95% for both data sets and is higher than in the existing research: it rose from 88 to 96.97 for window size 20, and from 91 to 96.92 for the second data set of 40 source files (Table 1). Finally, the results of the proposed system are compared with ScalClone, the existing approach, in terms of precision, recall and F-measure. The comparison indicates that the developed system is effective. The results of the proposed work, in terms of precision, recall and accuracy after applying the implemented algorithm, are depicted graphically in Figure 10.

Figure 10. Algorithmic results of proposed work.

4. Conclusion and Future Scope

Malware poses a serious threat to computer security and hampers the normal functioning of a system by modifying source code or appending malicious information to it. Hence, malware detection using code clones is our major concern. Our main aim is to detect malware in source code files or in other applications that threaten computer security or disrupt computer functioning. Malware is increasing rapidly and is used by unauthorized users to damage software or computer systems.
In this paper we have developed an algorithm, named the String Pattern Back Propagation Neural Network Algorithm, to determine the code clones in Java projects and match them against the maintained repository of malware signatures. The proposed clone detection algorithm is used to detect malware and can therefore help protect a system from malicious attacks. The algorithm is simple and easy to understand. We have used it to detect malware, and the results are presented in this paper. One limitation of the current work is that it identifies malware in one language only. The work can therefore be extended to detect malware in projects or applications that use multiple languages.

5. References

1. M. J. Rekoff. On reverse engineering. IEEE Trans. Systems, Man, and Cybernetics, pages 244-252, March-April 1985.
2. C.K. Roy and J.R. Cordy. A survey on software clone detection research. Queen's School of Computing TR, 541: 115, 2007.
3. Bellon, Stefan, Rainer Koschke, Giuliano Antoniol, Jens Krinke, and Ettore Merlo. “Comparison and evaluation of clone detection tools.” Software Engineering, IEEE Transactions on 33, no. 9 (2007): 577-591.
4. Vinod, P., R. Jaipur, V. Laxmi, and M. Gaur. “Survey on malware detection methods.” In Proceedings of the 3rd Hackers’ Workshop on Computer and Internet Security (IITKHACK’09), pp. 74-79. 2009.
5. Singhal, Priyank, and Nataasha Raul. “Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks.” arXiv preprint arXiv:1205.3062 (2012).
6. Fred Cohen. Computer Viruses. PhD thesis, University of Southern California, 1985.
7. Williamson, David. “Deconstructing malware: what it is and how to stop it.” Information Security Technical Report 9, no. 2 (2004): 27-34.
8. Sridhara, Sudarshan Madenur, and Mark Stamp.
“Metamorphic worm that carries its own morphing engine.” Journal of Computer Virology and Hacking Techniques 9, no. 2 (2013): 49-58.
9. Ravula, Ravindar Reddy. “Classification of Malware using Reverse Engineering and Data Mining Techniques.” PhD diss., University of Akron, 2011.
10. Annachhatre, Chinmayee, Thomas H. Austin, and Mark Stamp. “Hidden Markov models for malware classification.” Journal of Computer Virology and Hacking Techniques 11, no. 2 (2015): 59-73.
11. Mathur, Kirti, and Saroj Hiranwal. “A survey on techniques in detection and analyzing malware executables.” International Journal of Advanced Research in Computer Science and Software Engineering 3, no. 4 (2013): 422-428.
12. Zhou, Yajin, and Xuxian Jiang. “Dissecting android malware: Characterization and evolution.” In Security and Privacy (SP), 2012 IEEE Symposium on, pp. 95-109. IEEE, 2012.
13. Landage, Jyoti, and M. P. Wankhade. “Malware and Malware Detection Techniques: A Survey.” In International Journal of Engineering Research and Technology, vol. 2, no. 12 (December-2013). ESRSA Publications, 2013.
14. Shahzad, Khurram, and Steve Woodhead. “A Pseudo-Worm Daemon (PWD) for empirical analysis of zero-day network worms and countermeasure testing.” In Computing, Communication and Networking Technologies (ICCCNT), 2014 International Conference on, pp. 1-6. IEEE, 2014.
15. Pandey, Sudhir Kumar, and B. M. Mehtre. “A Lifecycle Based Approach for Malware Analysis.” In Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, pp. 767-771. IEEE, 2014.
16. Saxe, Joshua, and Konstantin Berlin. “Deep neural network based malware detection using two dimensional binary program features.” In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pp. 11-20. IEEE, 2015.
17. Jackson, D. and Waingold, A., 2001. Lightweight extraction of object models from bytecode.
IEEE Transactions on Software Engineering, 27(2), pp. 156-169.
18. Roy, Chanchal K., James R. Cordy, and Rainer Koschke. “Comparison and evaluation of code clone detection techniques and tools: A qualitative approach.” Science of Computer Programming 74, no. 7 (2009): 470-495.
19. Kumar, Renuka, and Anand Raj Essar Vaishakh. “Detection of Obfuscation in Java Malware.” Procedia Computer Science 78 (2016): 521-529.
20. Farhadi, Mohammad Reza, Benjamin Fung, Philippe Charland, and Mourad Debbabi. “BinClone: detecting code clones in malware.” In Software Security and Reliability (SERE), 2014 Eighth International Conference on, pp. 78-87. IEEE, 2014.
21. Cordy, James R., Thomas R. Dean, and Nikita Synytskyy. “Practical language-independent detection of near-miss clones.” In Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, pp. 1-12. IBM Press, 2004.
22. Cilimkovic, Mirza. “Neural networks and back propagation algorithm.” Institute of Technology Blanchardstown, Blanchardstown Road North, Dublin 15 (2015).
23. Ma, Weiqin, Pu Duan, Sanmin Liu, Guofei Gu, and Jyh-Charn Liu. “Shadow attacks: automatically evading system-call-behavior based malware detection.” Journal in Computer Virology 8, no. 1-2 (2012): 1-13.
24. Lin, Ying-Dar, Yuan-Cheng Lai, Chien-Hung Chen, and Hao-Chuan Tsai. “Identifying android malicious repackaged applications by thread-grained system call sequences.” Computers & Security 39 (2013): 340-350.
25. Zhang, Hao, Danfeng Daphne Yao, Naren Ramakrishnan, and Zhibin Zhang. “Causality reasoning about network events for detecting stealthy malware activities.” Computers & Security 58 (2016): 180-198.
26. Yoshihisa Udagawa. “A Novel Technique for Retrieving Source Code Duplication.” ICONS 2014: The Ninth International Conference on Systems, pp. 172-177.
27.
Annachhatre, Chinmayee, Thomas H. Austin, and Mark Stamp. “Hidden Markov models for malware classification.” Journal of Computer Virology and Hacking Techniques 11, no. 2 (2015): 59-73.
28. Chen, Zhongqiang, Mema Roussopoulos, Zhanyan Liang, Yuan Zhang, Zhongrong Chen, and Alex Delis. “Malware characteristics and threats on the internet ecosystem.” Journal of Systems and Software 85, no. 7 (2012): 1650-1672.
29. Goseva-Popstojanova, Katerina, and Andrei Perhinschi. “On the capability of static code analysis to detect security vulnerabilities.” Information and Software Technology 68 (2015): 18-33.
30. Sajeer, K., and Paul Rodrigues. “Novel Approach of Implementing Speech Recognition using Neural Networks for Information Retrieval.” Indian Journal of Science and Technology 8, no. 33 (2015).
31. Khoobjo, E. “New Hybrid Approach to Control the Arm of Flexible Robots by using Neural Networks, Fuzzy Algorithms and Particle Swarm Optimization Algorithm.” Indian Journal of Science and Technology 8, no. 35 (2015).
32. Jeyaseelan, WR Salem, and S. Hariharan. “Malware Detection and Elimination using Bayesian Technique and Nymble Algorithm.” Indian Journal of Science and Technology 8, no. 34 (2015).
33. Ahmad, Dar Muneer, and Parvez Javed. “Security Comparison of Android and IOS and Implementation of User Approved Security (UAS) for Android.” Indian Journal of Science and Technology 9, no. 14 (2016).
34. Jang, Jae-wook, Hyunjae Kang, Jiyoung Woo, Aziz Mohaisen, and Huy Kang Kim. “Andro-dumpsys: antimalware system based on the similarity of malware creator and malware centric information.” Computers & Security (2016).
35. Sathish, Vidhya, and P. Sheik Abdul Khader. “Deployment of proposed botnet monitoring platform using online malware analysis for distributed environment.” Indian Journal of Science and Technology 7, no. 8 (2014): 1087.
36. Malik, Sapna, and Kiran Khatter.
“System Call Analysis of Android Malware Families.” Indian Journal of Science and Technology 9, no. 21 (2016).