CHAPTER 5
URL ANALYSIS
5.1 INTRODUCTION
The Web has become a platform for a wide range of criminal enterprises,
such as spam-advertised commerce and financial fraud, and a vector for
propagating malware. The precise commercial motivations behind these schemes
may differ, but the common thread among them is the requirement that
unsuspecting users visit their sites. These visits can be driven by email, web
search results or links from other Web pages, but all require the user to take
some action, such as clicking, that specifies the desired Uniform Resource
Locator (URL); the site reached in this way can then obtain sensitive
information. To counter this problem, the security community has developed
blacklisting services, encapsulated in toolbars, appliances and search engines,
that provide an alert or warning as feedback. Many malicious sites are
nevertheless not blacklisted, either because they are too new, were never
evaluated, or were evaluated incorrectly. To address this gap, some client-side
systems analyze the content or behavior of a Web site as it is visited, which
adds runtime overhead and exposes the client to browser-based vulnerabilities.
Phishing attacks are described in terms of a Lure, a Hook and a Catch
(Jakobsson and Myers 2007). E-mails addressed to the victims seem to come from
legitimate company email addresses, but in fact they are spoofed. These spoofed
e-mails are called the ‘Lure’. Usually, the emails contain URLs that refer to
the actual phishing sites, which are clones of legitimate websites and lure the
users into entering sensitive information. The actual phishing website is the
‘Hook’, which obtains the private information from the user. To make sure that
the innocent user does not suspect the email of being fraudulent, the text of
the email must appear legitimate. The attacker creates a plausible pretext in
the message, such as an account suspension, a failed transaction or even an
upgrade of the user’s account to a newly installed security feature. Once the
user clicks the link in the email, the user is automatically taken to the fake
phishing site. This is referred to as the ‘Catch’. The browser may or may not
report on the legitimacy of the website, depending on the heuristics it uses to
detect phishing. In some cases, the user also overrides the browser’s decision.
5.1.1 Need for URL Analysis
The Websense report (2012) states that email spam has been increasing
steadily and that hackers adopt new techniques every day. The shift to blended
threats that use email as a lure together with web links remains strong, as
92% of email spam contains a URL. According to the spam statistics in the
Websense report (2011), 12% of spam messages refer to shopping, 84% of all
email messages are spam, 89.9% of unwanted emails link to spam sites or
malicious websites, 85% of malicious emails contain a web link and 9% of
data-stealing attacks occur over email. Email spam correlated to phishing
accounted for 1.62%, while virus-related email spam accounted for 0.4%.
5.2 EXISTING SOLUTIONS
Blacklists are often used by email filters and browsers to block
users from malicious content (e.g., email messages and websites).
PhishNet (Pawan et al 2010) enhances existing blacklists by discovering
related malicious URLs. One major problem with blacklists is that they fail to
identify phishing URLs in the early hours of a phishing attack because their
update process is not fast enough. Phishing campaigns have an average life
of less than two hours (Sheng et al 2009), so by the time a phishing website
is positively identified and blacklisted, the campaign has most probably ended
and a new one has started. Detecting phishing websites from the links carried
in emails is a helpful phishing countermeasure, and researchers have attempted
to detect phishing websites using features extracted from the URL. Examples of
URL-based features include, but are not limited to, the number of dots in the
URL, the length of the machine names, the number of special characters, the
presence of hexadecimal characters or of IP addresses instead of machine names,
and the length of the URL (Garera et al 2007, Justin Ma et al 2009). Garera et
al (2007) extracted 18 host- and URL-based features from potential phishing
URLs and classified them using Logistic Regression. On a data set of 2,508
URLs, of which 1,245 were phish, the classifier achieved a 95.8% true positive
rate and a 1.2% false positive rate.
Colin Whittaker et al (2010) discussed a scalable machine learning
algorithm that automatically classifies phishing pages by training the
classifier on a noisy dataset. The classifier is used to maintain Google’s
phishing blacklist automatically; it analyzes millions of pages a day,
examining the URL and the contents of each page, and maintains a false positive
rate below 0.1%. Justin Ma et al (2009) discussed a method to detect malicious
websites by analyzing features indicative of suspicious URLs. They explored
online learning approaches, including the passive-aggressive algorithm, for
detecting malicious websites using lexical and host-based features of the
associated URLs. Further improvement can be obtained by analyzing the features
of page content and page rank. Zhang et al (2007) proposed a content-based
method using a simple linear classifier on top of eight features, achieving 89%
TP and 1% FP on 100 phishing URLs and 100 legitimate URLs.
CANTINA+ (2010) classifies phishing URLs using a more exhaustive
feature set and obtained a classification accuracy of 92.3%. Various related
research works and case studies have analyzed the feature set required, with
the aim of reducing its exhaustiveness and the time consumed. Maher Aburrous
et al (2010) conducted a usability study to evaluate the accuracy and precision
of various phishing website features. The set of phishing attacks and tricks
used to measure their effectiveness and influence was collected from the
APWG’s archive (2011) and the Phishtank archive (2012). The purpose is to find
the most common and essential phishing clues that appear in the scenarios, to
determine what aspects of a website effectively convey authenticity, and to
identify which malicious strategies and attack techniques are successful at
deceiving general users and why. Some observations on the high-impact features
of phishing attacks are listed in Table 5.1.
Table 5.1 High impact features on URL phishing instances

| S.No. | Phishing Features | No. of appearances | Appearance % |
|---|---|---|---|
| 1 | Using the IP address | 14 | 46.66 |
| 2 | Abnormal request URL | 30 | 100 |
| 3 | Abnormal URL of anchor | 7 | 23.33 |
| 4 | Abnormal DNS record | 2 | 6.66 |
| 5 | Abnormal URL | 5 | 16.66 |
| 6 | Using SSL certificate | 17 | 56.66 |
| 7 | Certification authority | 4 | 13.33 |
| 8 | Abnormal cookie | 2 | 6.66 |
| 9 | Distinguished Names Certificate (DN) | 4 | 13.33 |
| 10 | Redirect pages | 3 | 10.00 |
| 11 | Straddling attack | 2 | 6.66 |
| 12 | Pharming attack | 4 | 13.33 |
| 13 | Using onMouseOver to hide the link | 6 | 20.00 |
| 14 | Server Form Handler (SFH) | 2 | 6.66 |
| 15 | Spelling errors | 24 | 80.00 |
| 16 | Copying website | 5 | 16.66 |
| 17 | Using forms with “Submit” button | 6 | 20.00 |
| 18 | Using Pop-Ups windows | 8 | 26.66 |
| 19 | Disabling right click | 2 | 6.66 |
| 20 | Long URL address | 22 | 73.33 |
| 21 | Replacing similar characters for URL | 16 | 53.33 |
| 22 | Adding prefix or suffix | 9 | 30.00 |
| 23 | Using the @ symbol to confuse | 6 | 20.00 |
| 24 | Using hexadecimal character codes | 8 | 26.66 |
| 25 | Much emphasis on security and response | 5 | 16.66 |
| 26 | Buying time to access accounts | 3 | 10.00 |
The features selected above (9) display high impact in the various
studies mentioned in the literature; hence the feature set comprises
features whose impact is greater than 20%. For better performance, the feature
set involves host-based features, lexical features, page rank and suspicious
keywords in the mail.
5.3 URL ANALYZER
Phishing URLs can be analyzed based on the lexical features and
host-based features of the URL; the structure is shown in Figure 5.1.
The lexical analysis examines the format of the URL. A URL consists of two
parts, the hostname and the path. For example, in the URL
‘www.annauniv.edu/emmrc/emmrc.html’, the hostname portion is
www.annauniv.edu and the path portion is emmrc/emmrc.html.
Figure 5.1 Feature analyzer
The proposed methodology analyses host-based features such as
Pagerank and age of domain, and lexical features such as URL encoding and the
presence of suspicious characters, hexadecimal characters or IP addresses used
to hide the real destination. It also analyses word probabilities to find
whether the email contains any suspicious links, so that end users avoid
falling prey to phishing attacks, as illustrated in Figure 5.2. This method is
useful because illegitimate senders spoof their identities: such mail may pass
authentication tests and may also escape content analysis by avoiding spam
keywords. Some emails contain no message in the body other than malicious
links, urging the users to click them and leading to fraudulent websites.
Figure 5.2 URL feature extraction
5.3.1 Lexical Features (F1)
Lexical features are the textual properties of the URL itself (not the
content of the page it refers to). These properties include the length of the
hostname, the length of the entire URL, the number of dots in the URL, and a
binary feature for each token in the hostname (delimited by ‘.’) and in the
path of the URL (strings delimited by ‘/’, ‘?’, ‘.’, ‘=’, ‘-’ and ‘_’). The set
of token features is also known as a “bag-of-words”.
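The lexical properties listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `lexical_features` and the dictionary keys are hypothetical, and a scheme is prepended when the URL has none because `urllib.parse.urlsplit` recognises the hostname only when a scheme is present.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Return simple lexical properties of a URL (illustrative sketch)."""
    original = url
    # Assumption: URLs written without a scheme are treated as http URLs.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    rest = parts.path.lstrip("/")
    if parts.query:
        rest += "?" + parts.query
    return {
        "url_length": len(original),
        "host_length": len(host),
        "dot_count": original.count("."),
        # Bag-of-words tokens: hostname split on '.', path split on / ? . = - _
        "host_tokens": [t for t in host.split(".") if t],
        "path_tokens": [t for t in re.split(r"[/?.=\-_]", rest) if t],
    }
```

For the example URL of Section 5.3, the hostname tokens are www, annauniv and edu, and the path tokens are emmrc, emmrc and html.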
5.3.1.1 IP address
Phishing URLs often contain IP addresses to hide the actual URL
and domain of the website. For instance, a website URL may be extremely
long and look suspicious, such as
“http://www.freewebhostingcompany.com/markswebsite/todaysphishingpage.html”,
whereas a URL that contains the IP address is typically shorter and looks more
standard, such as “http://66.135.200.145”. URL detection methods look for an IP
address in the URL and add to the phishing score if one is found. However,
legitimate websites also sometimes use IP addresses, especially for internal
private devices that are not accessible to the public. Network devices such as
routers, servers and networked printers are often accessed using an IP address.
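A minimal sketch of this check, using only the Python standard library, is given below. The function names are hypothetical, and the decision to ignore private addresses is an assumption drawn from the remark above that internal devices are legitimately reached by IP address; it is not stated as part of the proposed method.

```python
import ipaddress
from urllib.parse import urlsplit


def ip_host(url: str):
    """Return the host as an ipaddress object if it is an IP literal, else None."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None  # the host is a name, not an IP address


def ip_feature(url: str) -> int:
    """1 if the URL uses a public IP address as its host, else 0.

    Assumption: private addresses (routers, printers, ...) are not counted.
    """
    ip = ip_host(url)
    return int(ip is not None and not ip.is_private)
```

With this sketch, “http://66.135.200.145” raises the feature, while a router address such as 192.168.1.1 or a named host does not.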
5.3.1.2 Hexadecimal characters
Each character in a URL can be represented by a numeric value that
the computer understands. This decimal value can easily be converted into
hexadecimal. Web browsers understand hexadecimal values, and they can be used
in URLs by preceding the hexadecimal value with a ‘%’ symbol. For instance, the
value %20 is the hexadecimal equivalent of the space character on the keyboard.
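As an illustration, the following sketch counts the ‘%’-encoded characters in a URL and decodes them with `urllib.parse.unquote`. The helper names and the `example.com` host are assumptions chosen for the example only.

```python
import re
from urllib.parse import unquote

# A '%' followed by exactly two hexadecimal digits, e.g. %20 or %7E.
HEX_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def hex_escape_count(url: str) -> int:
    """Number of hexadecimal-encoded characters found in the URL."""
    return len(HEX_ESCAPE.findall(url))


def decode_url(url: str) -> str:
    """Replace every %XX escape with the character it stands for."""
    return unquote(url)
```

Decoding before any keyword matching is one possible design choice, since a word such as “pay” can otherwise be hidden as %70%61%79.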
5.3.1.3 Suspicious character
Spoofguard (Neil Chou et al 2004) identified two characters
common in phishing URLs, the ‘@’ and ‘-’ characters. In a URL, the username
precedes the ‘@’ symbol and the destination follows it. An ‘@’ symbol in a URL
therefore causes the string to its left to be disregarded as an address, with
the string on its right treated as the actual URL for retrieving the page,
which is a phishing site. For example, the URL
“http://www.bankofamerica.com@phishingsite.com” will navigate to the
destination “phishingsite.com” and will attempt to log in using
“www.bankofamerica.com” as the username. Hence, the actual URL of the website
is disguised, and when combined with an IP address this can hide the phishing
site while the URL appears to be legitimate.
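The behaviour described above can be observed directly with the standard URL parser. The sketch below is illustrative; the function name `userinfo_trick` is hypothetical.

```python
from urllib.parse import urlsplit


def userinfo_trick(url: str):
    """Return (username, real_host) when the URL carries an '@' user part.

    Returns None when no username is present in the URL.
    """
    parts = urlsplit(url)
    if parts.username is None:
        return None
    return parts.username, parts.hostname
```

For the example URL, the parser reports “www.bankofamerica.com” as the username and “phishingsite.com” as the host that is actually contacted.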
5.3.1.4 Number of dots in URL
This feature counts the number of dots in the URL. Phishing pages
tend to use more dots in their URLs than legitimate sites do.
All the lexical features are denoted as a single feature set F1. After
examining the dataset of 1000 phishing mails and 1000 legitimate mails, the
occurrence of the lexical features is as shown in Table 5.2.
Table 5.2 Number of occurrences of lexical features in training samples

| Dataset | IP address | More Dots | Encoded Symbol | Suspicious characters |
|---|---|---|---|---|
| Phishing mail | 40 | 60 | 10 | 10 |
| Legitimate mail | 10 | 10 | 0 | 0 |

5.3.2 Host Based Features
Host-based features describe “where” malicious sites are
hosted, “who” owns them, and “how” they are managed. The following are
properties of the hosts (there could be multiple) that are identified by the
hostname part of the URL.
5.3.2.1 Age of domain (F2)
This feature checks the age of the webpage domain name. Many
phishing sites are hosted on recently registered domains, and as such have a
relatively young age. To exploit that property, this feature measures
the number of months since the domain name was first registered. A lookup on
the WHOIS server (WHOIS is an Internet service that finds information about a
domain name or IP address) is used to retrieve the domain registration date;
if the domain registration entry is not found on the WHOIS server, this
feature simply returns -1, deeming the domain suspicious. The occurrence of the
feature in the training samples is given in Table 5.3.
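The month count can be sketched as follows. The WHOIS query itself is left out: the registration date is passed in as an argument, with `None` standing for “no entry found”. The function name and the whole-month rounding rule are assumptions made for this illustration.

```python
from datetime import date
from typing import Optional


def domain_age_months(registered: Optional[date], today: date) -> int:
    """Whole months since registration, or -1 when no WHOIS entry exists."""
    if registered is None:
        return -1  # missing entry: treated as suspicious
    months = (today.year - registered.year) * 12 + (today.month - registered.month)
    if today.day < registered.day:
        months -= 1  # the current month is not yet complete
    return max(months, 0)
```

A cut-off on the returned value (not specified here) would then decide whether a domain counts as suspiciously young.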
Table 5.3 Number of occurrences of ‘Age of Domain’ feature in training samples

| Dataset | Age of the domain |
|---|---|
| Phishing mail | 750 |
| Legitimate mail | 350 |

5.3.2.2 Page rank (F3)
Page rank represents the relative importance of a page within a set
of web pages. The higher the page rank, the more important the page.
Phishing web pages are short-lived and thus either have a very low page rank
or have no page rank at all. Page rank is a link analysis algorithm first
used by Google, in which each document on the web is assigned a numerical
weight from 0 to 10, with 0 indicating least popular and 10 most popular. A
score of -1 is assigned when the page rank of a particular webpage is not
available. The occurrence of the page rank feature in the training samples is
given in Table 5.4 and Figure 5.3.
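Table 5.4 Number of occurrences of ‘Page rank’ feature in training samples

| Page rank | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Phishing mail | 710 | 70 | 60 | 40 | 50 | 20 | 20 | 0 | 0 | 20 | 10 |
| Legitimate mail | 20 | 40 | 40 | 150 | 320 | 190 | 140 | 120 | 40 | 10 | 20 |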
Figure 5.3 ‘Page rank’ feature in training samples (x-axis: page rank, -1 to 9; y-axis: mails; series: phishing mail and legitimate mail)
Among the training samples, the percentages of emails matching the
lexical and host-based features are listed in Table 5.5.
Table 5.5 Percentage of emails matching the lexical and host based features

| Feature | Non-phishing Matched | Phishing Matched |
|---|---|---|
| Has IP Address | 0% | 0.04% |
| Has “Hexadecimal” Character | 0% | 0.01% |
| Has suspicious character ‘@’ symbol | 0% | 0.01% |
| More No. of Dots | 0.01% | 0.06% |
| Suspicious Age of Domain | 35% | 75% |
| Page rank < 3 feature | 1.2% | 88% |
5.3.3 Number of Sensitive Words in URL
5.3.3.1 Individual occurrences of suspicious phishing keywords (F4)
Abu-Nimeh et al (2007) adopted the bag-of-words strategy and
simply used a list of the 43 most frequent words as features in a machine
learning approach. Garera et al (2007) summarized a set of eight sensitive
words that frequently appear in phishing URLs: secure, account, update, login,
sign-in, banking, confirm and verify. The system is trained with 1000 phishing
emails to give weights to the suspicious words found in the phishing e-mails.
The counts of the most frequently occurring words in the phishing mails are
given in Table 5.6 and Figure 5.4; these words are therefore treated as
suspicious words by which phishing mails can be identified.
Table 5.6 Number of occurrences of suspicious phishing keywords

| Keywords | No. of Occurrences |
|---|---|
| Secure | 570 |
| Account | 750 |
| Update | 240 |
| Login | 150 |
| Signin | 60 |
| Banking | 220 |
| Confirm | 320 |
| Verify | 330 |
| Notify | 130 |
| Click | 340 |
| Inconvenient | 250 |
| Password | 580 |
Figure 5.4 Number of occurrences of suspicious phishing keywords (x-axis: suspicious keywords; y-axis: no. of occurrences)
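A count of this kind can be sketched as below. The keyword list is taken from Table 5.6; the function name is hypothetical, and counting each keyword at most once per mail is an assumption, since the text does not state whether repeated uses within one mail are counted separately.

```python
import re
from collections import Counter

# Keywords of Table 5.6, lower-cased.
KEYWORDS = ["secure", "account", "update", "login", "signin", "banking",
            "confirm", "verify", "notify", "click", "inconvenient", "password"]


def keyword_mail_counts(emails):
    """Count, for each keyword, the number of mails that contain it."""
    counts = Counter()
    for text in emails:
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(k for k in KEYWORDS if k in words)
    return counts
```

Dividing each count by the number of training mails gives a per-keyword weight that can be used in the scoring step.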
5.3.3.2 Co-occurrences of suspicious keywords (F5)
Table 5.7 shows the co-occurrence counts of the prominent words in 1000
phishing mails. The correlation between the words, measured by the number of
times they occur together, is used as a score to classify the e-mails.
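Table 5.7 Number of co-occurrences of suspicious phishing keywords

| | Secure | Account | Update | Log in | Sign in | Banking | Confirm | Verify | Notify | Click | Inconvenient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Secure | -- | | | | | | | | | | |
| Account | 550 | -- | | | | | | | | | |
| Update | 150 | 240 | -- | | | | | | | | |
| Log In | 110 | 130 | 50 | -- | | | | | | | |
| Sign In | 50 | 50 | 30 | 20 | -- | | | | | | |
| Banking | 180 | 210 | 50 | 20 | 40 | -- | | | | | |
| Confirm | 260 | 290 | 60 | 70 | 20 | 40 | -- | | | | |
| Verify | 210 | 320 | 80 | 50 | 50 | 110 | 120 | -- | | | |
| Notify | 60 | 110 | 10 | 20 | 20 | 30 | 40 | 30 | -- | | |
| Click | 160 | 260 | 70 | 70 | 10 | 70 | 110 | 120 | 50 | -- | |
| Inconvenient | 190 | 190 | 10 | 30 | 0 | 10 | 70 | 90 | 0 | 20 | -- |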
The email content is parsed and checked for the presence of any
embedded forms, and the words in the email are then checked against the list
of suspicious words. If any suspicious words are found, the score for each
word and the correlation score of the words that occur together are calculated
and added to the total score.
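One way to read this scoring step is sketched below. The weighting rule (count divided by the 1000 training mails, summed over the words present and over each pair of words present) is an assumption made for illustration; the text does not give the exact formula. Only three keywords are included, with counts copied from Tables 5.6 and 5.7.

```python
from itertools import combinations

# Counts out of 1000 phishing mails (subset of Tables 5.6 and 5.7).
WORD_COUNT = {"secure": 570, "account": 750, "verify": 330}
PAIR_COUNT = {
    frozenset(("secure", "account")): 550,
    frozenset(("secure", "verify")): 210,
    frozenset(("account", "verify")): 320,
}
TOTAL_MAILS = 1000


def keyword_score(words_in_email) -> float:
    """Sum of individual weights plus pairwise co-occurrence weights."""
    present = sorted({w for w in words_in_email if w in WORD_COUNT})
    score = sum(WORD_COUNT[w] / TOTAL_MAILS for w in present)
    for a, b in combinations(present, 2):
        score += PAIR_COUNT.get(frozenset((a, b)), 0) / TOTAL_MAILS
    return score
```

Under these assumptions, a mail containing “account” and “verify” scores 0.75 + 0.33 for the individual words plus 0.32 for the pair.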
5.3.4 Login Form Detection (F6)
Almost all phishing attacks try to trick people into sharing their
information through a fake login form. A login form is characterized by
FORM tags, INPUT tags, and LOGIN keywords such as password, PIN, etc.
INPUT fields are used to hold user input. Usually, the form tags, input
tags and login keywords all appear in the DOM. Login keywords are searched for
in the text nodes, as well as in the alt and title attributes of element nodes,
of the subtree rooted at the form node. Consider the case where form and input
tags are found but the login keywords lie outside the subtree rooted at the
form node f. First, examine whether the form f is a search form by searching
for the keyword “search”. If f is not a search form, traverse up the DOM tree
for K + 1 levels to an ancestor node n, and search for login keywords under the
subtree rooted at n.
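A simplified sketch of the first part of this procedure, using Python's standard `html.parser`, is shown below. It covers only keywords found inside the form subtree and the search-form exclusion; the upward traversal to an ancestor node is not implemented. The class and function names are hypothetical, and “login” is added to the keyword list purely for illustration.

```python
import re
from html.parser import HTMLParser

# "password" and "PIN" come from the text; "login" is an illustrative addition.
LOGIN_RE = re.compile(r"\b(password|pin|login)\b", re.IGNORECASE)
SEARCH_RE = re.compile(r"\bsearch\b", re.IGNORECASE)


class LoginFormDetector(HTMLParser):
    """Flags a page when a non-search FORM holds INPUT tags and login keywords."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_input = False
        self.texts = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form, self.has_input, self.texts = True, False, []
        elif self.in_form:
            if tag == "input":
                self.has_input = True
            # alt and title attributes are searched as well as text nodes
            self.texts += [v for k, v in attrs if k in ("alt", "title") and v]

    def handle_data(self, data):
        if self.in_form:
            self.texts.append(data)

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            text = " ".join(self.texts)
            if self.has_input and LOGIN_RE.search(text) and not SEARCH_RE.search(text):
                self.found = True
            self.in_form = False


def has_login_form(html: str) -> bool:
    detector = LoginFormDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

A full implementation would build a DOM tree so that the ancestor search described above can be carried out when the keywords lie outside the form.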
5.4 APPROACH
5.4.1 Training Set – Bayes Classifier
Commonly used in spam filters, the Bayes model assumes that, for a
given label, the individual features of URLs are distributed independently of
the values of other features. Bayes’ theorem provides a way to calculate the
probability of a hypothesis B given the observed training data A:
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{5.1}$$
This simple formula has enormous practical importance in many
applications. It is often easier to calculate the probabilities P(A|B), P(A)
and P(B) than the probability P(B|A) that is required. Applying Bayes’ rule and
assuming that malicious and legitimate Web sites occur with equal
probability, the posterior probability that the feature vector x belongs
to a malicious URL is computed as
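$$P(B = 1 \mid A) = \frac{P(A \mid B = 1)}{P(A \mid B = 1) + P(A \mid B = 0)} \tag{5.2}$$

$$P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} \tag{5.3}$$

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B')\,P(B')} \tag{5.4}$$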
where,
P(A) = Probability of feature F in the phishing and legitimate datasets.
P(B') = Prior probability of the legitimate dataset.
P(B) = Prior probability of the phishing dataset.
P(B (Phishing)) = P(B' (Legitimate)) = 0.5
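Equation 5.4 translates directly into a small function. The sketch below is illustrative, with hypothetical names; the default prior of 0.5 reflects the equal-probability assumption stated above, under which the expression reduces to Equation 5.3.

```python
def posterior(p_a_given_b: float, p_a_given_not_b: float, prior_b: float = 0.5) -> float:
    """P(B|A) from Equation 5.4, where B is 'phishing' and B' is 'legitimate'."""
    numerator = p_a_given_b * prior_b
    denominator = numerator + p_a_given_not_b * (1.0 - prior_b)
    if denominator == 0:
        raise ValueError("feature never observed in either dataset")
    return numerator / denominator
```

For example, a feature seen in 120 of 1000 phishing URLs and in 20 of 1000 legitimate URLs gives `posterior(0.12, 0.02)`, which is 0.12 / 0.14, approximately 0.86.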
The classifier has a training dataset of malicious phishing URLs
and legitimate URLs. The probability of occurrence of each feature in the
dataset is calculated and the respective scores are obtained; that is, the
occurrences of the features in the dataset are counted and the cumulative
score is calculated. If the cumulative score exceeds the threshold, the URL is
considered a phishing URL; otherwise it is considered a legitimate URL, as
illustrated in Figure 5.5.
Figure 5.5 Phishing URL classifications
a) How many times does feature F (F1, F2, F3, F4, F5, F6) appear in the
phishing dataset?
b) How many times does feature F (F1, F2, F3, F4, F5, F6) appear in the
legitimate dataset?
Let
F1 = Lexical features
F2 = Age of the domain factor of URLs
F3 = Occurrence of Pagerank < 3 in phishing and legitimate dataset
F4 = Individual occurrence of suspicious keywords
F5 = Co-occurrences of suspicious keywords
F6 = Login form detection
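The threshold decision can be sketched as follows. The scores for F1 to F3 are the probabilities worked out in Section 5.4.1.1; the summation rule and the threshold value of 1.0 are assumptions made for this illustration, since neither is specified in the text, and F4 to F6 are omitted because no scores are given for them.

```python
# P(phishing | feature) for F1-F3, as calculated in Section 5.4.1.1.
FEATURE_SCORE = {"F1": 0.86, "F2": 0.68, "F3": 0.88}
THRESHOLD = 1.0  # hypothetical value, not taken from the text


def classify(present_features, scores=FEATURE_SCORE, threshold=THRESHOLD) -> str:
    """Add up the scores of the features found in a URL and apply the threshold."""
    cumulative = sum(scores.get(f, 0.0) for f in present_features)
    return "phishing" if cumulative > threshold else "legitimate"
```

With these illustrative values, a URL matching F2 and F3 is labelled phishing (0.68 + 0.88 > 1.0), while a URL matching F2 alone is labelled legitimate.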
5.4.1.1 Calculating Probability
In order to calculate the probability of a specific feature in the
phishing dataset, consider 2000 URLs: 1000 phishing URLs and 1000
legitimate URLs.
Feature F1 (Lexical features)
The lexical features of F1 appeared in 120 phishing URLs and 20 legitimate
URLs. Hence its probability is calculated as follows.

$$P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{120/1000}{(120/1000) + (20/1000)} = 0.86,$$

since P(B (Phishing)) = P(B' (Legitimate)) = 0.5.
Feature F2 (Age of Domain)
The feature F2 (Age of domain) appeared in 750 phishing URLs and
350 legitimate URLs. Hence its probability is calculated as follows.

$$P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{750/1000}{(750/1000) + (350/1000)} = 0.68.$$
Feature F3 (Page rank)
The feature F3 (Page rank) appeared in 880 phishing URLs and 120
legitimate URLs. Hence, its probability is calculated as follows.

$$P(B \mid A) = \frac{P(A \mid B)}{P(A \mid B) + P(A \mid B')} = \frac{880/1000}{(880/1000) + (120/1000)} = 0.88.$$

5.5 DATA SETS
The datasets are obtained from two sources, namely the DMOZ Open
Directory Project and Phishtank (2012). Phishtank is a blacklist of phishing
URLs consisting of manually verified user contributions. Phishtank focuses
on phishing URLs advertised in email spam (phishing, pharmaceuticals,
software, etc.). Both sources include URLs crafted to evade automated filters,
while phishing URLs in particular visually trick the users as well. Phishtank
is a large community-based anti-phishing service with 35,849 active accounts
and 489,397 verified phishes.
5.6 RESULTS
5.6.1 Test Cases
An e-mail server, named SSE Mail Server, has been configured with hMail
for testing purposes. (hMailServer is a free e-mail server for Microsoft
Windows, used by Internet service providers and companies; it supports the
common e-mail protocols IMAP, SMTP and POP3 and can easily be integrated with
many existing web mail systems.) The system is tested against phishing URLs
present in the e-mails, and the features found in each URL are noted. This is
repeated for 1000 phishing URLs, from which the weights for each feature have
been calculated.
Table 5.8 Performance analysis with the existing systems

| Technique | Number of features (n) | TPR (%) | FPR (%) | Time Complexity |
|---|---|---|---|---|
| Cantina (Existing) (with n1 features) | n1 (20) | 89 | 1 | O(n1) |
| Cantina+ (Existing) (with n2 features) | n2 (27) | 92.54 | 0.407 | O(n2) (n1 < n2) |
| URL Classifier (Proposed) (with m features) | m (14) | 92.8 | 0.4 | O(m) (m < n2) |
The false positive rate is the proportion of legitimate
emails classified as phishing emails, and the false negative rate is the
proportion of phishing emails classified as legitimate. Table 5.8 shows the
results obtained on 1000 phishing mails with malicious URLs when identifying
the various lexical and host-based features. Sample screenshots of malicious
URLs embedded in emails, namely encoded URLs and embedded forms, are shown in
Figure 5.6 and Figure 5.7 respectively.
Figure 5.6 A snapshot showing encoded URL in E-Mail
Figure 5.7 A snapshot showing embedded forms in E-Mail
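The rates reported in Table 5.8 follow the usual definitions and can be computed from the four outcome counts as sketched below. The function name is hypothetical, and the counts used in the usage note are illustrative values chosen only to reproduce the reported 92.8% and 0.4%; they are not measured figures from this work.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """True positive, false negative and false positive rates from raw counts.

    tp/fn: phishing mails classified as phishing / as legitimate.
    fp/tn: legitimate mails classified as phishing / as legitimate.
    """
    return {
        "tpr": tp / (tp + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (fp + tn),
    }
```

For instance, `rates(tp=928, fn=72, fp=4, tn=996)` corresponds to a TPR of 92.8% and an FPR of 0.4% on 1000 mails of each class.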
5.7 CONCLUSION
Hackers bypass anti-spam filtering techniques by embedding
malicious URLs in the content of messages. The URL analyzer method, with the
help of a minimized phishing feature set, identifies the malicious URLs in
emails.