CHAPTER 5

URL ANALYSIS

5.1 INTRODUCTION

The Web has become a platform for supporting a wide range of criminal enterprises, such as spam-advertised commerce and financial fraud, and a vector for propagating malware. The precise commercial motivations behind these schemes may differ, but the common thread among them is the requirement that unsuspecting users visit their sites. These visits can be driven by email, web search results or links from other Web pages, but all require the user to take some action, such as clicking, that specifies the desired Uniform Resource Locator (URL) and allows the site to obtain sensitive information.

To counter this problem, the security community has developed blacklisting services, encapsulated in toolbars, appliances and search engines, that provide an alert or warning as feedback. However, many malicious sites are not blacklisted, either because they are too new, were never evaluated, or were evaluated incorrectly. To address this gap, some client-side systems analyze the content or behavior of a Web site as it is visited, which incurs runtime overhead and exposes the user to browser-based vulnerabilities.

Phishing attacks are described in terms of a Lure, a Hook and a Catch (Jacobsson and Myers 2007). Emails addressed to the victims seem to come from legitimate company email addresses, but in fact they are spoofed. These spoofed emails are called the ‘Lure’. Usually, the emails contain URLs that refer to the actual phishing sites, which are clones of legitimate websites and lure the users into entering sensitive information. The actual phishing websites are the ‘Hook’, which obtains the private information from the user. To ensure that the innocent user does not suspect the email to be fraudulent, the text of the email must appear legitimate. The attacker creates a plausible pretext in the message, such as an account suspension, a failed transaction or even an upgrade of the user’s account to a newly installed security feature.
Once the user clicks the link in the email, the user is automatically taken to the fake phishing site. This is referred to as the ‘Catch’. The browser may or may not indicate the legitimacy of the website, depending on the heuristics it uses to detect phishing. In some cases, the user also overrides the browser’s decision.

5.1.1 Need for URL Analysis

The Websense report 2012 states that email spam has been increasing at a regular pace and that hackers adopt new techniques every day. The shift to blended threats using email as a lure and web links remains strong, as 92% of email spam contains a URL. According to the spam statistics in the Websense report 2011, 12% of spam messages refer to shopping, 84% of all email messages are spam, 89.9% of unwanted emails link to spam sites or malicious websites, 85% of malicious emails contain a web link and 9% of data-stealing attacks occur over email. Email spam correlated to phishing came in at 1.62%, while virus-related email spam was 0.4%.

5.2 EXISTING SOLUTIONS

Blacklists are often used by email filters and browsers to block users from malicious content (e.g., email messages and websites). PhishNet (Pawan et al 2010) enhances existing blacklists by discovering related malicious URLs. One major problem with blacklists is that they fail to identify phishing URLs in the early hours of a phishing attack because their update process is insufficiently fast. Phishing campaigns have an average life of less than two hours (Sheng et al 2009), and by the time a phishing website is positively identified and blacklisted, the campaign has most probably ended and a new one has started. Detecting phishing websites referenced in email is a helpful phishing countermeasure, and researchers have attempted to detect phishing websites using features extracted from the URL.
Examples of URL-based features include, but are not limited to, the number of dots in the URL, the length of the machine names, the number of special characters, the presence of hexadecimal characters or IP addresses instead of a machine name, and the length of the URL (Garera et al 2007, Justin Ma et al 2009). Garera et al (2007) extracted 18 host- and URL-based features from potential phishing URLs and classified them using Logistic Regression. On a data set of 2,508 URLs, of which 1,245 were phish, the classifier provided a 95.8% true positive rate and a 1.2% false positive rate. Colin Whittaker et al (2010) discussed a scalable machine learning algorithm that automatically classifies phishing pages by training the classifier on a noisy dataset. The classifier is used to maintain Google’s phishing blacklist automatically by analyzing millions of pages a day, examining the URL and the contents of a page, and it maintains a false positive rate below 0.1%.

Justin Ma et al (2009) discuss a method to detect malicious websites by analyzing features indicative of suspicious URLs. They used the passive-aggressive algorithm and explored online learning approaches for detecting malicious websites using lexical and host-based features of the associated URLs. Further improvement can be obtained by analyzing the features of page content and page rank. Zhang et al (2007) proposed a content-based method using a simple linear classifier on top of eight features, achieving 89% TP and 1% FP on 100 phishing URLs and 100 legitimate URLs. CANTINA+ (2010) classifies phishing URLs with a more exhaustive feature set and obtained a classification accuracy of 92.3%.

Various related studies and case studies have analyzed which features are required, in order to reduce the exhaustiveness of the feature set and the time consumed. A usability study evaluating the accuracy and precision of various phishing website features was conducted by Maher Abburrous et al (2010).
The set of phishing attacks and tricks used to measure their effectiveness and influence was collected from the APWG’s archive (2011) and the Phishtank archive (2012). The purpose is to find the most common and essential phishing clues that appear in the scenarios, to determine what aspects of a website effectively convey authenticity, and to identify which malicious strategies and attack techniques are successful at deceiving general users, and why. Some observations on the high-impact features for phishing attacks are listed in Table 5.1.

Table 5.1 High impact features on URL phishing instances

S.No.  Phishing feature                          No. of appearances  Appearance %  Selected
1      Using the IP address                      14                  46.66         ✓
2      Abnormal request URL                      30                  100           ✓
3      Abnormal URL of anchor                    7                   23.33         ✓
4      Abnormal DNS record                       2                   06.66
5      Abnormal URL                              5                   16.66
6      Using SSL certificate                     17                  56.66
7      Certification authority                   4                   13.33
8      Abnormal cookie                           2                   06.66
9      Distinguished Names Certificate (DN)      4                   13.33
10     Redirect pages                            3                   10.00
11     Straddling attack                         2                   06.66
12     Pharming attack                           4                   13.33
13     Using onMouseOver to hide the link        6                   20.00
14     Server Form Handler (SFH)                 2                   06.66
15     Spelling errors                           24                  80.00         ✓
16     Copying website                           5                   16.66
17     Using forms with ‘‘Submit’’ button        6                   20.00         ✓
18     Using Pop-Ups windows                     8                   26.66         ✓
19     Disabling right click                     2                   06.66
20     Long URL address                          22                  73.33         ✓
21     Replacing similar characters for URL      16                  53.33         ✓
22     Adding prefix or suffix                   9                   30.00         ✓
23     Using the @ symbol to confuse             6                   20.00         ✓
24     Using hexadecimal character codes         8                   26.66         ✓
25     Much emphasis on security and response    5                   16.66
26     Buying time to access accounts            3                   10.00

The features selected above (✓) display high impact in the various studies mentioned in the literature, and hence the feature set comprises features whose impact is greater than 20%. This feature set involves host-based features, lexical features, page rank and suspicious keywords in the mail for better performance.
5.3 URL ANALYZER

Phishing URLs can be analyzed based on the lexical features and host-based features of the URL; the structure is shown in Figure 5.1. The lexical features describe the format of the URL. A URL consists of two parts, the hostname and the path. For example, in the URL ‘www.annauniv.edu/emmrc/emmrc.html’, the hostname portion is www.annauniv.edu and the path portion is emmrc/emmrc.html.

Figure 5.1 Feature analyzer

The proposed methodology analyses host-based features such as page rank and age of domain, and various lexical features such as URL encoding and the presence of suspicious characters, hexadecimal characters or IP addresses used to hide the destination. It also analyses word probabilities to find whether the email contains any suspicious links, so that end users avoid falling for phishing attacks, as illustrated in Figure 5.2. This method is useful because illegitimate senders spoof their identities and may therefore pass authentication tests, and they may also escape content analysis by avoiding spam keywords. Some emails contain no message in the body except malicious links, urging the users to click them and leading to fraudulent websites.

Figure 5.2 URL feature extraction

5.3.1 Lexical Features (F1)

Lexical features are the textual properties of the URL itself (not the content of the page it refers to). These properties include the length of the hostname, the length of the entire URL and the number of dots in the URL, as well as a binary feature for each token in the hostname (delimited by ‘.’) and in the path of the URL (strings delimited by ‘/’, ‘?’, ‘.’, ‘=’, ‘-’ and ‘_’). This token representation is also known as a “bag-of-words.”

5.3.1.1 IP address

Phishing URLs often contain IP addresses to hide the actual URL and domain of the website.
For instance, a website URL may be extremely long and look suspicious, such as “http://www.freewebhostingcompany.com/markswebsite/todaysphishingpage.html”, but a URL that contains an IP address is typically shorter and looks more standard, such as “http://66.135.200.145”. URL detection methods look for an IP address in the URL and add to a phishing score if one is found. However, legitimate websites also sometimes use IP addresses, especially for internal private devices that are not accessible to the public. Network devices such as routers, servers and networked printers are often accessed using an IP address.

5.3.1.2 Hexadecimal characters

Each character of a URL, like every character on the keyboard, is represented by a numeric value that the computer understands. This decimal value can easily be converted into hexadecimal. Web browsers understand hexadecimal values, and they can be used in URLs by preceding the hexadecimal value with a ‘%’ symbol. For instance, the value %20 is the hexadecimal equivalent of the space character on the keyboard.

5.3.1.3 Suspicious character

Spoofguard (Neil Chou et al 2004) identified two characters common in phishing URLs, the ‘@’ and ‘-’ characters. The username precedes the ‘@’ symbol and the destination URL follows it. An ‘@’ symbol in a URL causes the string to its left to be disregarded as a destination, with the string on the right treated as the actual URL for retrieving the page, which is a phishing site. For example, the URL “http://www.bankofamerica.com@phishingsite.com” will navigate to the destination “phishingsite.com” and will attempt to log in using “www.bankofamerica.com” as the username. Hence, the actual URL of the website is disguised, and when combined with an IP address this can hide the phishing site effectively while the URL appears to be legitimate.

5.3.1.4 Number of dots in URL

This feature counts the number of dots in the URL. Phishing pages tend to use more dots in their URLs than legitimate sites.
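The lexical checks of Section 5.3.1 can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation used in this work: the function name lexical_features, the dictionary keys and the IPv4-only pattern are assumptions made for the example, and only the standard library is used.

```python
import re
from urllib.parse import urlsplit

# Dotted-quad host such as 66.135.200.145 (IPv4 only; an assumption of this sketch).
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
# Percent-encoded byte such as %20.
PERCENT_HEX = re.compile(r"%[0-9A-Fa-f]{2}")
# Path delimiters listed in Section 5.3.1: / ? . = - _
PATH_DELIMS = re.compile(r"[/?.=\-_]")


def lexical_features(url: str) -> dict:
    """Return lexical (F1) indicators for a single URL (hypothetical helper)."""
    raw = url.strip()
    # urlsplit only recognises the host when a scheme is present.
    parsed = urlsplit(raw if "://" in raw else "http://" + raw)
    host = parsed.hostname or ""
    path_and_query = parsed.path + "?" + parsed.query
    return {
        "url_length": len(raw),
        "host_length": len(host),
        "num_dots": raw.count("."),
        "has_ip_address": bool(IPV4_HOST.match(host)),
        "has_hex_encoding": bool(PERCENT_HEX.search(raw)),
        # Text before '@' in the authority part is treated as a username.
        "has_at_symbol": "@" in parsed.netloc,
        "has_hyphen": "-" in host,
        # Bag-of-words tokens for the hostname and the path.
        "host_tokens": [t for t in host.split(".") if t],
        "path_tokens": [t for t in PATH_DELIMS.split(path_and_query) if t],
    }
```

Applied to the examples in the text, the sketch reports an IP address for “http://66.135.200.145” and, for the ‘@’ example, a hostname of phishingsite.com with the at-symbol indicator set. How these indicators are weighted into a phishing score is a separate step and is not fixed by this sketch.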
All the lexical features are denoted as a single feature set F1. After examining the dataset of 1000 phishing mails and 1000 legitimate mails, the occurrence of the lexical features is as shown in Table 5.2.

Table 5.2 Number of occurrences of lexical features in training samples

                 IP address  More dots  Encoded symbol  Suspicious characters
Phishing mail    40          60         10              10
Legitimate mail  10          10         0               0

5.3.2 Host Based Features

Host-based features describe “where” malicious sites are hosted, “who” owns them and “how” they are managed. The following are properties of the hosts (there could be multiple) that are identified by the hostname part of the URL.

5.3.2.1 Age of domain (F2)

This feature checks the age of the webpage’s domain name. Many phishing sites are hosted on recently registered domains, and as such have a relatively young age. To exploit that property, this feature measures the number of months since the domain name was first registered. A WHOIS⁶ lookup on the WHOIS server is used to retrieve the domain registration date; if the domain registration entry is not found on the WHOIS server, this feature simply returns -1, deeming the domain suspicious. The occurrence of the feature in the training samples is shown in Table 5.3.

⁶ WHOIS: an Internet service that finds information about a domain name or IP address.

Table 5.3 Number of occurrences of ‘Age of Domain’ feature in training samples

Dataset          Age of the domain
Phishing mail    750
Legitimate mail  350

5.3.2.2 Page rank (F3)

Page rank represents the relative importance of a page within a set of web pages. The higher the page rank, the more important the page. Phishing web pages are short lived and thus either have a very low page rank or have no page rank at all. Page rank is a link analysis algorithm first used by Google, in which each document on the web is assigned a numerical weight from 0 to 10, with 0 indicating least popular and 10 meaning most popular.
A score value of -1 is assigned when the page rank value for a particular webpage is not available. The occurrence of the page rank feature in the training samples is shown in Table 5.4 and Figure 5.3.

Table 5.4 Number of occurrences of ‘Page rank’ feature in training samples

Page rank        -1   0    1    2    3    4    5    6    7    8    9
Phishing mail    710  70   60   40   50   20   20   0    0    20   10
Legitimate mail  20   40   40   150  320  190  140  120  40   10   20

Figure 5.3 ‘Page rank’ feature in training samples (bar chart; x-axis: page rank, y-axis: mails; series: phishing mail and legitimate mail)

Among the training samples, the percentages of emails matching the lexical and host-based features are listed in Table 5.5.

Table 5.5 Percentage of emails matching the lexical and host based features

Feature                                 Non-phishing matched  Phishing matched
Has IP address                          0%                    0.04%
Has “hexadecimal” character             0%                    0.01%
Has suspicious character ‘@’ symbol     0%                    0.01%
More no. of dots                        0.01%                 0.06%
Suspicious age of domain                35%                   75%
Page rank < 3                           1.2%                  88%

5.3.3 Number of Sensitive Words in URL

5.3.3.1 Individual occurrences of suspicious phishing keywords (F4)

Abu-Nimeh et al (2007) adopted the bag-of-words strategy and simply used a list of the 43 most frequent words as features in a machine learning approach. Garera et al (2007) summarized a set of eight sensitive words, namely secure, account, update, login, sign-in, banking, confirm and verify, that frequently appear in phishing URLs. The system is trained with 1000 phishing emails to give weights to the suspicious words found in the phishing emails. The counts of the most frequently occurring words in the phishing mails are given in Table 5.6 and Figure 5.4; these words can therefore be treated as suspicious words by which phishing mails can be identified.

Table 5.6 Number of occurrences of suspicious phishing keywords

Keyword        No. of occurrences
Secure         570
Account        750
Update         240
Login          150
Signin         60
Banking        220
Confirm        320
Verify         330
Notify         130
Click          340
Inconvenient   250
Password       580

Figure 5.4 Number of occurrences of suspicious phishing keywords (bar chart; x-axis: suspicious keywords, y-axis: number of occurrences)

5.3.3.2 Co-occurrences of suspicious keywords (F5)

Table 5.7 shows the co-occurrence counts of the prominent words in 1000 phishing mails. The correlation between the words, obtained by counting the number of times they occur together, is used as a score to classify the emails.

Table 5.7 Number of co-occurrences of suspicious phishing keywords

              Secure  Account  Update  Log In  Sign In  Banking  Confirm  Verify  Notify  Click  Inconvenient
Secure        --
Account       550     --
Update        150     240      --
Log In        110     130      50      --
Sign In       50      50       30      20      --
Banking       180     210      50      20      40       --
Confirm       260     290      60      70      20       40       --
Verify        210     320      80      50      50       110      120      --
Notify        60      110      10      20      20       30       40       30      --
Click         160     260      70      70      10       70       110      120     50      --
Inconvenient  190     190      10      30      0        10       70       90      0       20     --

The email content is parsed; at that point the content is checked for the presence of any embedded forms, and then the words in the email are checked for suspicious words. If any suspicious words are found, the score for each word and the correlation score of those words are calculated and added to the total score.

5.3.4 LOGIN FORM DETECTION (F6)

Almost all phishing attacks try to trick people into sharing their information through a fake login form. A login form is characterized by FORM tags, INPUT tags and login keywords such as password, PIN, etc. INPUT fields are usually used to hold user input. Usually, form tags, input tags and login keywords appear in the DOM. Login keywords are searched for in the text nodes as well as in the alt and title attributes of the element nodes of the subtree rooted at the form node. Consider the case in which form and input tags are found, but the login keywords lie outside the subtree rooted at the form node f.
First, examine whether the form f is a search form by searching for the keyword “search”. If f is not a search form, traverse up the DOM tree for K levels +1 to the ancestor node n, and search for login keywords under the subtree rooted at n.

5.4 APPROACH

5.4.1 Training Set – Bayes Classifier

Commonly used in spam filters, the Bayes model assumes that, for a given label, the individual features of URLs are distributed independently of the values of other features. Bayes’ theorem provides a way to calculate the probability of a hypothesis, the event B, given the observed training data, represented as A:

\[ P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{5.1} \]

This simple formula has enormous practical importance in many applications. It is often easier to calculate the probabilities P(A|B), P(A) and P(B) than to calculate the required probability P(B|A) directly. Extrapolating Bayes’ rule, and assuming that malicious and legitimate Web sites occur with equal probability, the posterior probability that the observed feature vector A belongs to a malicious URL is computed as

\[ P(B = 1 \mid A) = \frac{P(A \mid B = 1)}{P(A \mid B = 1) + P(A \mid B = 0)} \tag{5.2} \]

\[ P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} \tag{5.3} \]

\[ P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B')\,P(B')} \tag{5.4} \]

where
P(A) = probability of feature F in the phishing and legitimate datasets,
P(B') = legitimate dataset,
P(B) = phishing dataset,
P(B (Phishing)) = P(B' (Legitimate)) = 0.5.

The classifier has a training dataset of malicious phishing URLs and legitimate URLs. The probability of occurrence of each feature in the dataset is calculated and the respective scores are obtained, i.e., the occurrences of the features in the dataset are counted and the cumulative score is calculated. If the cumulative score > threshold, the URL is considered a phishing URL; otherwise it is considered legitimate, as illustrated in Figure 5.5.

Figure 5.5 Phishing URL classifications

a) How many times does feature F (F1, F2, F3, F4, F5, F6) appear in the phishing dataset?
b) How many times does feature F (F1, F2, F3, F4, F5, F6) appear in the legitimate dataset?
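The two counts asked for in a) and b) are all that equation (5.4) needs. The following Python sketch shows one way the per-feature posterior and the threshold decision could be computed. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the choice of summing the posteriors of the features present in a URL to form the cumulative score, the fallback to the prior when a feature never occurs, and the threshold value are all assumptions, since the text does not fix them.

```python
def feature_posterior(phish_count: int, legit_count: int,
                      n_phish: int = 1000, n_legit: int = 1000,
                      prior_phish: float = 0.5) -> float:
    """P(B | A) for one feature, following equation (5.4)."""
    p_a_given_b = phish_count / n_phish          # P(A | B): rate in phishing set
    p_a_given_not_b = legit_count / n_legit      # P(A | B'): rate in legitimate set
    numerator = p_a_given_b * prior_phish
    denominator = numerator + p_a_given_not_b * (1.0 - prior_phish)
    # Assumption: a feature never seen in either set falls back to the prior.
    return numerator / denominator if denominator else prior_phish


def classify_url(present_features: set, posteriors: dict, threshold: float) -> str:
    """Sum the posteriors of the features found in a URL and apply the threshold.

    Summation as the cumulative score is an assumption of this sketch.
    """
    score = sum(posteriors[name] for name in present_features if name in posteriors)
    return "phishing" if score > threshold else "legitimate"
```

With the counts of Table 5.3 (750 phishing mails and 350 legitimate mails), feature_posterior(750, 350) evaluates 750/(750 + 350), roughly 0.68, because the equal priors and equal dataset sizes cancel. A threshold for classify_url would have to be chosen empirically on the training data.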
Let
F1 = Lexical features
F2 = Age of the domain factor of URLs
F3 = Occurrence of page rank < 3 in phishing and legitimate dataset
F4 = Individual occurrence of suspicious keywords
F5 = Co-occurrences of suspicious keywords
F6 = Login form detection

5.4.1.1 Calculating Probability

To calculate the probability of a specific feature in the phishing dataset, consider 2000 URLs: 1000 phishing URLs and 1000 legitimate URLs.

Feature F1 (Lexical features)

Feature F1 covers the occurrence of lexical features, which appeared in 120 phishing URLs and 20 legitimate URLs. Hence its probability is calculated as follows.

\[ P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{120/1000}{120/1000 + 20/1000} = 0.86, \]

since P(B (Phishing)) = P(B' (Legitimate)) = 0.5.

Feature F2 (Age of Domain)

Feature F2 (Age of domain) appeared in 750 phishing URLs and 350 legitimate URLs. Hence its probability is calculated as follows.

\[ P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{750/1000}{750/1000 + 350/1000} = 0.68. \]

Feature F3 (Page rank)

Feature F3 (Page rank) appeared in 880 phishing URLs and 120 legitimate URLs. Hence its probability is calculated as follows.

\[ P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{880/1000}{880/1000 + 120/1000} = 0.88. \]

5.5 DATA SETS

The datasets are obtained from two sources, viz. the DMOZ Open Directory Project and Phishtank (2012). Phishtank is a blacklist of phishing URLs consisting of manually verified user contributions. Phishtank focuses on phishing URLs advertised in email spam (phishing, pharmaceuticals, software, etc.). Both sources include URLs crafted to evade automated filters, while phishing URLs in particular are also designed to trick users visually. Phishtank is a large community-based anti-phishing service with 35849 active accounts and 489397 verified phishes.

5.6 RESULTS

5.6.1 Test Cases

An email server, named SSE Mail Server, has been configured with hMail⁷ for testing purposes.
The system is tested against phishing URLs present in the emails, and the features found in each URL are noted. This is repeated for 1000 phishing URLs, from which the weights for each feature have been calculated.

⁷ hMailServer: a free e-mail server for Microsoft Windows, used by Internet service providers and companies, that supports the common e-mail protocols (IMAP, SMTP and POP3) and can easily be integrated with many existing web mail systems.

Table 5.8 Performance analysis with the existing systems

Technique                                    Number of features (n)  TPR (%)  FPR (%)  Time complexity
Cantina (Existing) (with n1 features)        n1 (20)                 89       1        O(n1)
Cantina+ (Existing) (with n2 features)       n2 (27)                 92.54    0.407    O(n2) (n1 < n2)
URL Classifier (Proposed) (with m features)  m (14)                  92.8     0.4      O(m) (m < n2)

The false positive rate corresponds to the proportion of legitimate emails classified as phishing emails, and the false negative rate corresponds to the proportion of phishing emails classified as legitimate. Table 5.8 shows the results obtained on 1000 phishing mails with malicious URLs when identifying the various lexical and host-based features. Sample screenshots of malicious URLs embedded in emails, such as encoded URLs and embedded forms, are shown in Figure 5.6 and Figure 5.7 respectively.

Figure 5.6 A snapshot showing encoded URL in E-Mail

Figure 5.7 A snapshot showing embedded forms in E-Mail

5.7 CONCLUSION

Hackers bypass anti-spam filtering techniques by embedding malicious URLs in the content of the messages. Hence, the URL analyzer method identifies the malicious URLs in emails with the help of a minimized phishing feature set.