Data Mining Approach for Deceptive Phishing Detection System

International Journal of Scientific Research Engineering & Technology (IJSRET)
Volume 2 Issue 6 pp 337-344 September 2013
www.ijsret.org
ISSN 2278 – 0882
Mohd. Sirajuddin1, Mr. N. Pavan Kumar2, Ms. R. Divya3, M.A.Rasheed4
1 M.Tech. (CSE) Student @ Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA.
2 Asst.Prof. , Department of CSE, Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA
3 Asst.Prof.,Department of CSE, Al Habeeb College of Engineering & Technology, Chevella, Andhra Pradesh, INDIA
4 Asst.Prof., Dept. of IT, Muffakham Jah College of Engg.&Tech., Hyderabad, INDIA.
Abstract—Deceptive phishing is a major problem in Instant Messengers (IMs): much sensitive and personal information is disclosed through socially engineered text messages, for which a solution has already been proposed [2]. Detection of phishing carried out through voice chat in IMs has not yet been addressed, which is the motivation for this work. We propose a solution to this privacy problem that uses Association Rule Mining (ARM), a data mining approach, integrated with a speech recognition system. Words are recognized from speech with the help of FFT spectrum analysis and LPC coefficients. Online criminals nowadays use voice chat, text messages, or both together in IMs to draw out personal information, which threatens privacy. To preserve privacy, we developed and experimented with an Anti-Phishing Detection system (APD) for IMs that detects deceptive phishing in text and audio collaboratively.
Keywords- Data Mining; Instant Messenger; Deceptive Phishing; Association Rule Mining (ARM); Anti-Phishing Detection (APD); Speech Recognition System; Fast Fourier Transform (FFT); Linear Predictive Coding (LPC)
I. INTRODUCTION
Phishing is a fraudulent trick for stealing a victim's personal information by sending spoofed, socially engineered messages, here through Instant Messengers. Over the past decades, online identity fraud has grown from small-scale attacks into a widespread, syndicated crime, as observed in e-mail. Concrete work exists to detect deceptive phishing in text messages exchanged in Instant Messengers [2], but it does not cover voice chat, a fast means of communication that e-criminals have now adopted [3].
Data mining techniques emerged to address the problem of understanding ever-growing volumes of structured and unstructured data; the Association Rule Mining technique, for example, finds frequent patterns within large data sets [4].
In an Instant Messenger [2], a phisher tries to find out passwords and security-related information by posing as a trustworthy chatmate and asking questions through voice chat, through text messages, or through both at different points in time. In IMs, deceptive phishing has to be tackled dynamically, and no robust techniques have yet been developed to do this, because the existing anti-phishing techniques are equipped to deal only with static phishing [5][6].
In static anti-phishing techniques, a blacklist of suspected mail IDs is maintained on centralized blacklist servers [5], which disseminate the vetted blacklist to end users for enforcement. These techniques are ineffective for detecting phishing in Instant Messengers. Two categories of deceptive phishing attacks are popularly employed in IMs: password phishing scenarios and security-question phishing scenarios.
In the second scenario, the phisher tries to draw out personal information by acting as a trustworthy chatmate and thereby gains access to confidential data.
To our knowledge, there is no robust technique to deal with such attacks in IMs [6]. This is the first attempt to apply the Association Rule Mining technique to the tables/log files extracted from a transaction database (TDB), using an Information Retrieval system [19], when text or audio messages are exchanged between chatmates in an Instant Messenger, as shown in Table 1.
Our contribution is the integration of an Instant Messaging system with a phishing detection system that uses the data mining technique of association rules [6] and an Information Retrieval technique to detect dynamic phishing in both voice and text instant messages. In the remainder of the paper, the term "messages" covers both voice and text messages, and the term "phishing" means deceptive phishing. The proposed system, named the Anti-Phishing Detection system (APD), detects phishing in Instant Messengers.
In this paper we propose an APD that dynamically traces potential phishing attacks when messages are exchanged between chatmates of an Instant Messaging system. Current Instant Messaging systems do not have any means to deal with phishing.
The remainder of this paper is organized as follows. This section provides an overview of Instant Messaging systems and their deficiencies. Section II explains the problem statement and the work done to date. Section III explains the detailed architecture of the proposed APD-IM system, the steps for integrating speech detection with text messages in IMs, and the general process followed in the proposed system. Detection of phishing messages in text is possible [2]; detecting phishing words in audio messages, with or without accompanying text messages, is explained in this paper. Section IV shows experimental results, with the patterns generated for threshold support and confidence during a traditional phishing scenario for different transactions. Section V concludes the paper with an outlook on future research: IMs must be enhanced to detect video phishing collaboratively with audio and text messages, integrated efficiently with 3G and 4G mobile technologies at high processing speeds.
Table 1. Chat between the two chatmates. Words marked in blue indicate audio speech, whereas black indicates normal text messages; xxxx and yyyy represent place names.

(a) First transaction for first day
Chatmate-1: Hello do u hav any pets?
Chatmate-2: s I hav 2
Chatmate-1: Whats ur fav food
Chatmate-2: My fav food is pizza
Chatmate-1: Who was ur fav teacher
Chatmate-2: I have many
Chatmate-1: What is ur fav past time
Chatmate-2: I play number games
Chatmate-1: What is ur lucky no
Chatmate-2: My lucky no is 9

(b) Second transaction for second day
Chatmate-1: Where do u stay or asl please
Chatmate-2: I stay at xxxxxxxxx
Chatmate-1: In which school did u study
Chatmate-2: I did my schooling at yyyy
Chatmate-1: What was ur fav subject
Chatmate-2: Xxxxxxxxxx
Chatmate-1: What is ur age
Chatmate-2: 25, and what about urs
Chatmate-1: 25 years 2 months 2 days 24 hours old
Chatmate-2: Oh interesting
Chatmate-1: What is your dob
Chatmate-2: 20-10-1979
Chatmate-1: You are 5 months elder than me,
Chatmate-2: May be not sure
Chatmate-1: Can I call you my big brother, if don’t mind
Chatmate-2: Hey its ok.

(c) Third transaction for third day
Chatmate-1: I was tired standing at my bank today
Chatmate-2: Where do u have account?
Chatmate-1: I have at xxxx place where do u have?
Chatmate-2: I hav at xxxx.
Chatmate-1: Where is the location of ur bank?
Chatmate-2: near to xxxx place
Chatmate-1: Do u hav online account?
Chatmate-2: Yes do u?
Chatmate-1: I have to create one. Do u have any idea about the username
Chatmate-2: We can keep ids or names in capital letters
Chatmate-1: ok, thanks for giving advice
Chatmate-2: Its all right

(d) Fourth transaction for fourth day
Chatmate-1: Just a minute, what passwords do you suggest for my account to be highly secure
Chatmate-2: Keep your Employeeid,
Chatmate-1: Hmm... Its not so secure, as everyone knows it.
Chatmate-2: keep your name and use special characters at beginning or end
Chatmate-1: Its fine, what are special characters
Chatmate-2: @,~!@#$%, or Shift+number
Chatmate-1: that many its too complex to remember
Chatmate-2: hey don’t worry its easy to remember
Chatmate-1: oh is it….
Chatmate-2: remember the numbers eg DOB:20-10-1979, press shiftkey+number

(e) Fifth transaction for fifth day
Chatmate-1: Is the procedure for creating online account same as normal account?
Chatmate-2: First u need to go to bank and show all ur proofs.
Chatmate-1: Can I use the same technique of creating pswds?
Chatmate-2: Yes,u can use special Characters as I told earlier.
Chatmate-1: Is it safe to use special characters as passwords?
Chatmate-2: Obviously. Its difficult to trace.

(f) Sixth transaction for sixth day
………
Nth transaction for Nth day
II. PROBLEM STATEMENT IN INSTANT MESSENGERS
AND RELATED WORK
The APWG analyzed as many as 98,256 phishing attacks in 2011 [3], and phishers are constantly experimenting and adapting. In a typical e-mail phishing scenario, the phisher sets up a fake website, tricks people into logging in to it, and collects confidential and personal information; the e-banking sector is a particular target. Instant Messengers have become a useful everyday tool in most countries because of their quick response [8]. Studies of IM text messaging and file-transfer frequency have led to discussion of worms in IMs, their analysis, and countermeasures [9].
Popular systems such as AOL Instant Messenger, MSN
Messenger, ICQ, Yahoo Messenger, Google Talk, Skype and
Internet Relay Chat (IRC) have changed the way we
communicate with friends, acquaintances, and business
colleagues. Once limited to desktops, popular Instant
Messaging systems are finding their way onto handheld
devices and cell phones, allowing users to chat from virtually
anywhere. The number of corporate instant messaging users was expected to grow to over 500 million by 2012, with an additional 800 million home computer users having IM systems. Unfortunately, while IM systems have the ability to fundamentally change the way we communicate and do business [7], many of today’s implementations pose security challenges. Most IM systems presently in use were designed with scalability rather than security in mind, particularly with respect to deceptive phishing attacks. Some freeware IM programs lack encryption capabilities, and most have features that bypass traditional corporate firewalls, making it difficult to control instant messaging usage. Some of these systems have insecure password management and are vulnerable to account spoofing and denial-of-service (DoS) attacks. Even worse, no firewall on the market today can scan instant messages for deceptive phishing. While instant messaging may seem like a new technology, it is actually decades old.
The IRC system, developed in 1988 by Jarkko Oikarinen, is still in use. It allows users to form ad hoc discussion groups, chat peer-to-peer with one another, and exchange files. The same basic service is provided today by many different messengers, none of which detect deceptive phishing messages.
The basic Instant Messaging architecture provides chat, news alerts, and conferences. Instant Messaging resources include a web server and a Lightweight Directory Access Protocol (LDAP) server [10]. In this scenario, first, the LDAP server provides user entries for authentication and lookup; second, chatmates download the Instant Messaging resources from the web server or the System Application Server; third, chatmates stay connected to the Instant Messaging server through an Instant Messaging multiplexor that supports text, audio, and video chatting dynamically.
A comparative study of the AOL, Yahoo, and MSN Instant Messengers discusses a taxonomy of their features and functions, along with the protocols used for passing instant messages [11]. The ability of IM to collect and analyze information in e-learning environments [14] gave users a flexible, easy learning methodology; coupled with presence and availability management services, IM is emerging as a killer application in wireless and wireline networks [12]. Filtering and spam detection gave new life to IMs [9]. Integration of IMs in mobile collaborative learning helped mobile users [13], but the ability to detect and filter deceptive phishing in IMs remains incomplete for audio and text messages.
A phishing detection tool [14] and security and identification indicators for browsers against spoofing and phishing attacks [15], [16] are known, but detecting and identifying phishing websites in real time is a difficult task, as it depends on many factors such as URL and domain identity and security and encryption [17]. Other work identifies the vulnerabilities that allow phishing sites to be created and suggests methods for identifying common attacks, which has helped webmasters and their hosting companies defend their servers [18]; the legal risks for phishing researchers have also been discussed [6]. Nowadays, attackers carry out social phishing in IMs via text and audio messages. Phishing messages in IMs can be detected if text messages alone are sent [2], but if text and audio messages are used together, or if audio alone is used, phishing attacks are difficult to detect.
In this paper we propose the APD-IM system for detecting phishing messages, whether they are text messages, audio messages, or both used collaboratively. Most of the work proposed in this paper concerns finding word parameters from speech and detecting phishing from voice, after filtering out unnecessary voices based on word parameters obtained from FFT analysis and LPC coefficients [23], [25]. The detection of phishing in text messages was proposed in previous work [2].
This section describes significant vulnerabilities that are present in common Instant Messaging systems and the types of attacks that can exploit the users, leading to phishing attacks.
III. PROPOSED SYSTEM ARCHITECTURE OF APD-IM
In this paper we present an Association Rule Mining technique (the Apriori algorithm) [21] to detect deceptive phishing in suspicious messages (audio, text, or both) sent using an Instant Messenger between two or more chatmates.
The messages are stored in a Transaction Database (TDB). Before the messages are stored in the TDB, unnecessary words are filtered out by searching the Ignore Words Database (IGWDB) using Information Retrieval system techniques (stemming, the N-gram technique, ignore words) [19]. Frequently recurring words are extracted from the TDB dynamically using the Association Rule Mining technique [20] and stored in the Transaction Pattern Database (TPDB); Table 4 illustrates a few of the extracted words, with the unique IDs allocated to them. Rules are then framed dynamically for the words in the TPDB that satisfy the user-defined minimum threshold support and confidence [21]. If the condition is true, the phish words are pushed to the Phishing Database (PDB), and an alert message is triggered from the PDB to the chatmates. The system is developed specifically to detect phishing in unusual and deceptive communication in IMs for text and voice messages. The parameters for voice detection are found using spectrum analysis with the help of FFT [23] and LPC coefficient parameters [25], simulated in MATLAB [1]; the parameters are used to differentiate one voice (word) from other voices (words). The proposed method is implemented in Java and integrated with the IM. The implementation has six (6) major functional parts:
1. Voice and Text detection Modified Architecture for IM.
2. Integration of Voice and Text messages in TDB.
3. Voice recognition using spectrum analysis (FFT and LPC coefficient methodologies).
4. Differentiate words based on parameters using MATLAB and using SpectraPlus.
5. Rules extraction using Association Rule Mining technique.
6. General algorithmic approach for Voice and Text detection in IMs.
A. Voice and Text detection Modified Architecture for IM
A modified architecture of the voice and text recognition system for IMs is shown in Fig. 1. Audio and text messages are passed by chatmates in the IM either together or individually, and detecting phishing in such cases is a challenging task. Detection of deceptive phishing in text messages in IMs is possible [2]; detecting phishing words in audio messages, with or without accompanying text messages, is explained in this paper. The text and voice messages need to be filtered by removing unnecessary words; for this, the text messages and voice messages are stored separately in the database. The text and audio messages are later integrated by merging them dynamically, as explained in Section III.B.
Figure 1. APD-IM architecture of the phishing detection system for voice and text messages in IM.
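The ignore-word filtering and ID allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the ignore-word list, the tokenizer, and the function names `filter_message` and `assign_ids` are hypothetical, and stemming and N-gram handling are left out.

```python
import re

# Hypothetical ignore-word list standing in for the IGWDB.
IGNORE_WORDS = {"do", "u", "any", "i", "is", "my", "what", "ur", "the", "a"}

def filter_message(message, ignore_words=IGNORE_WORDS):
    """Lower-case and tokenize one chat message, then drop ignore words."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [t for t in tokens if t not in ignore_words]

def assign_ids(transactions):
    """Give each distinct kept word an ID (I1, I2, ...) in first-seen order."""
    ids = {}
    encoded = []
    for words in transactions:
        row = []
        for word in words:
            if word not in ids:
                ids[word] = f"I{len(ids) + 1}"
            row.append(ids[word])
        encoded.append(row)
    return ids, encoded

# Two messages taken from Table 1(a), treated here as two TDB rows.
messages = ["What is ur lucky no", "My lucky no is 9"]
filtered = [filter_message(m) for m in messages]
word_ids, tdb_rows = assign_ids(filtered)
```

The rows in `tdb_rows` are what a later Apriori pass would consume; in the paper the same IDs are shared by words recognized from voice and words typed as text.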
Voice recognition starts by breaking a long audio track down into smaller clips, as shown in Fig. 2; each of these steps is self-explanatory. The audio track may contain break-ups of voices during the chat session; this is noise that has to be identified and removed using a Hidden Markov Model (HMM) [24]. Training of voices is not discussed in detail in this paper; we consider an ideal situation of sample voices. The working of the Voice processor in the IM is shown in Fig. 2. The Voice processor converts audio clips from *.amr to *.wav format, removes noise [22] from the long audio track, and classifies the track into independent short clips [1]. Each independent clip is sent to the voice database (VDB) for storage, where unique IDs are allocated; the clips act as input to the TDB after unnecessary words are filtered out using the IGWDB with the help of the Information Retrieval system technique [19].
Figure 2. A long audio track is broken into short clips (voices) by the Clip Classifier (Voice processor), converted into .WAV format, and sent to the database for storage, where unnecessary clips are filtered out and unique IDs are allocated to each clip; the result acts as input for the TDB in IM.
The VDB stores the word parameters of each clip. The word parameters discussed in Section III.D are extracted dynamically by the Voice processor with the help of FFT and LPC coefficient spectrum analysis using MATLAB [1], as shown in Fig. 2. The word parameters stored in the VDB for every clip are checked against the IGWDB, which holds ignore-word parameters for voice, to filter out unnecessary word parameters; the result is sent to the VWDB, as shown in Fig. 1. Unique IDs are then allocated based on the set of significant word parameters identified, and the words are sent to the TDB for later processing. There, the ARM technique is applied to find frequent occurrences of words, which are sent to the Transaction Pattern Database (TPDB). The ARM technique is applied again to the TPDB to find phish words that satisfy the user-defined minimum threshold support and confidence. Finally, the phish words identified from the TPDB are sent to the PDB, which sends an alert message to the chatmates in the IM when phishing words are detected.
B. Integration of voice and text messages in TDB
The steps involved in integrating voice and text messages in the transaction database (TDB) are as follows (refer to Fig. 1).
1. If a voice message alone is detected, it is handled dynamically by the speech recognition system, which finds the parameters of the voice (peak, frequency, amplitude, THD, etc.) explained in Section III.D. The frequent occurrences of these parameters are captured using the ARM technique [20] and stored in the voice database (VDB). The VDB is immediately compared with the IGWDB, which consists of unnecessary words such as prepositions and articles. Finally, the filtered words are chosen [19], allocated unique IDs, and sent for storage in the TDB.
2. If a text message alone is detected, it is stored in the WDB, and unnecessary words are filtered out by comparing with the IGWDB dynamically using Information Retrieval techniques. Finally, the selected words are sent for storage in the TDB with unique IDs, as discussed in [2] and for the IRS technique [19].
3. If voice and text messages are detected collaboratively, the two databases VDB and WDB are merged as one transaction and stored in the Voice Word Database (VWDB), which is then compared with the IGWDB to filter out unnecessary words, as explained in points 1 and 2. Finally, the selected words are allocated unique IDs and sent for storage in the TDB.
4. Voice may also contain numbers spoken as "2002" or as "two zero zero two", or other such words. Handling these is still a challenging task and is out of the scope of this paper, in which we consider an ideal situation of voice [1].
C. Voice recognition using Spectrum analysis (FFT and LPC coefficient methodologies)
Speech should initially be transformed and compressed in order to simplify subsequent processing. Many signal analysis techniques are available that can extract useful features. Six major spectral analysis algorithms are available, as shown in Fig. 3. Among them, the most popular methods are the Fast Fourier Transform (FFT) and Linear Prediction Coefficients (LPC). Speech signals are converted into a spectrum using the FFT [23], which operates on complex values. Similarly, the LPC spectrum program yields a different spectrum from the original one, and analysis of this spectrum is used to find other parameters. The structure of a standard speech recognition approach is illustrated in Fig. 4. Word recognition has various applications, such as mobile communication and on-line and off-line communication. We use it to detect words from voice in Instant Messengers (IMs) in order to detect phishing words.
Figure 3. Different types of spectral analysis algorithms.
Figure 4. General process of word detection from a speech signal.
Word recognition from a speech signal using spectrum analysis involves feature extraction, preprocessing, pattern matching, and decision making. Word parameters are chosen from the spectrum of the speech signal using statistical methods, which give a range of values for each word as output. These parameters help us differentiate words from each other: every word has a bounded range of values that characterizes it [1].
The word parameters calculated by analyzing the spectrum of the speech signal are mean, median, standard deviation (STD), root mean square (RMS), maximum peak, minimum peak, width of maximum peak, signal-to-noise ratio (SNR), peak frequency, peak amplitude, total power, total harmonic distortion (THD), THD+Noise, and intermodulation distortion (IMD). These parameters can be obtained using MATLAB and SpectraPlus. Each parameter is bounded within a range of values, and on the basis of these ranges we can differentiate one word from another.
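A few of the listed parameters can be computed from a magnitude spectrum as sketched below. This is an illustrative sketch only: the paper obtains the values with MATLAB and SpectraPlus and does not give formulas, so the definitions used here (for example, total power as the sum of squared magnitudes) and the names `word_parameters` and `in_range` are assumptions.

```python
import math
import statistics

def word_parameters(magnitudes, bin_width_hz):
    """Summarize a magnitude spectrum given as one value per frequency bin."""
    peak_index = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    power = sum(m * m for m in magnitudes)
    return {
        "mean": statistics.fmean(magnitudes),
        "median": statistics.median(magnitudes),
        "std": statistics.pstdev(magnitudes),
        "rms": math.sqrt(power / len(magnitudes)),
        "max_peak": magnitudes[peak_index],
        "min_peak": min(magnitudes),
        "peak_frequency_hz": peak_index * bin_width_hz,
        "total_power": power,  # assumed definition: sum of squared magnitudes
    }

def in_range(params, ranges):
    """True if every listed parameter lies inside its (low, high) range."""
    return all(low <= params[name] <= high for name, (low, high) in ranges.items())

# Toy spectrum with 100 Hz bins; the values are made up for illustration.
params = word_parameters([1.0, 3.0, 2.0, 6.0, 3.0], bin_width_hz=100.0)
# Hypothetical per-word ranges, as would be learned from several samples.
matches = in_range(params, {"mean": (2.5, 3.5), "max_peak": (5.0, 7.0)})
```

Matching a clip against the stored range of each parameter is one simple way to realize the "bounded values" comparison described above.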
D. Differentiate words based on parameters using MATLAB and using SpectraPlus
To recognize a spoken word dynamically, we recorded the word, converted it into .wav format, and stored it in a MATLAB dictionary. Digital signal processing is used to convert the clip into a series of samples that we can interpret from the “.wav” file; we retrieved these samples using “wavread” in MATLAB. To represent the signal in the frequency domain we used the Discrete Fourier Transform (DFT), where f denotes the frequency in hertz and N denotes the window, that is, the duration in samples, computed using the FFT command in MATLAB. The window is chosen in this way because the length of the signal passed to the FFT must be a power of two. The real and imaginary components of the FFT of the signal are stored in a vector x, where x is read from the file name; the algorithm is shown in Fig. 5.
Figure 5. Algorithm that accepts a .WAV file and produces the spectrum of the signal, from which word parameters are derived.
The spectrum of the signal after the algorithm is applied is shown in Fig. 6, the time vs. frequency graph plotted in MATLAB. The graph of the significant and insignificant parameters of a word is plotted; the insignificant parameters are neglected, and the significant parameters, which differentiate the words from each other, are chosen for identifying the word. Only the significant parameters are sent from the VDB to the TDB for storage. As an example, Table 2 shows the parameters selected by FFT spectrum analysis for 5 different samples of the single word 'MURDER'. Among these parameters, some significant ones are selected, whereas the insignificant ones are neglected because they may not be efficient for differentiating the word in the TDB.
Table 2. Word parameters selected by FFT spectrum analysis for 5 different samples of the word 'MURDER'.
Some word parameters are the same for two different words. In such cases Linear Predictive Coding (LPC) coefficients are efficient: the word parameters are recalculated from the spectrum of the speech signal, which helps us differentiate the words from each other using the LDA technique [24]. For example, the voice words KILL and BILL have the same word parameters; they are differentiated using µ1 and µ2, the means of the parameters, and σ1 and σ2, the standard deviations, for the words KILL and BILL [1].
Similarly, the significant parameters selected by LPC spectrum analysis for 5 different samples of the word 'MURDER' are shown in Table 3. Finally, correct word recognition is done with the help of the word parameters. We used MATLAB for reading .wav files and finding the spectrum of the speech signal; SpectraPlus is also sometimes used for analysis of .wav files, depending on the requirement.
Table 3. Significant word parameters selected by LPC spectrum analysis for 5 different samples of the word 'MURDER'.
Figure 6. Spectrum of the signal from which word parameters are derived.
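The spectrum computation can be illustrated with a small radix-2 FFT that zero-pads the clip to a power-of-two length. This is a sketch in plain Python, not the MATLAB code of Fig. 5; zero-padding is an assumption made here to satisfy the power-of-two requirement, and the function names are hypothetical.

```python
import cmath
import math

def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def magnitude_spectrum(samples, sample_rate_hz):
    """Zero-pad, transform, and return (frequency_hz, magnitude) pairs
    for the non-negative frequency bins."""
    n = next_power_of_two(len(samples))
    padded = list(samples) + [0.0] * (n - len(samples))
    spectrum = fft(padded)
    return [(k * sample_rate_hz / n, abs(spectrum[k])) for k in range(n // 2 + 1)]

# A 2 Hz sine sampled at 8 Hz for one second (8 samples).
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(8)]
peak_hz, peak_mag = max(magnitude_spectrum(tone, 8), key=lambda pair: pair[1])
```

For the test tone the largest bin falls at 2 Hz, which is the kind of peak-frequency parameter the section extracts for each word.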
E. Rules extraction using Association Rule Mining technique
Significant word parameters that differentiate voice words from each other are chosen and stored in the VDB, compared with the IGWDB to filter out unnecessary words using the IRS technique, and sent to the TDB. There, frequent occurrences of voice words are identified using the ARM technique and sent to the TPDB. The ARM technique is then applied again to the TPDB, which checks the user-defined support and confidence for the voice words; the resulting phishing words are sent to the PDB, and the detection is finally reported to the chatmate in the IM by checking the PDB.
The transactions are stored as explained earlier, and unnecessary words are filtered out of the TDB using the IRS techniques discussed in Section II for text messages and Section III for audio messages, based on the transactions listed in Table 1. Table 1 consists of 5 transactions between two chatmates, out of which 16 keywords are picked with unique IDs from ITEM1 to ITEM16, represented as I1...I16, as shown in Table 4.
Table 4. List of a few words chosen based on frequent occurrences captured using the ARM technique from the TDB, discussed in Sections II and III.
Let us assume that the itemsets that satisfy a support count of 2 out of the 5 transactions (40%) are [{I1,I2,I12}, {I1,I2,I14}, {I1,I2,I15}, {I1,I2,I16}]; these are considered the frequent occurrences obtained from the TDB. The rules that satisfy confidence = 100% are [{I1∧I12 ⇒ I2, I2∧I12 ⇒ I1, I12 ⇒ I1∧I2, I1∧I14 ⇒ I2, I2∧I14 ⇒ I1, I14 ⇒ I1∧I2, I1∧I15 ⇒ I2, I2∧I15 ⇒ I1, I15 ⇒ I1∧I2, I1∧I16 ⇒ I2, I2∧I16 ⇒ I1, I16 ⇒ I1∧I2}]. Once these ARM rules are framed, the items are sent to the PDB as phishing words on the basis of the rules, and the ARM technique is applied again to confirm the phishing words. Applying the ARM technique twice improves the accuracy of identifying phishing words. The support threshold is kept very low because in IMs private information is exposed almost immediately, within a small number of transactions. While messages are being sent, some words may appear to be phishing words even though they are not, but this has to be tolerated by chatmates during chatting in IMs.
F. General Algorithmic steps for Text and Speech recognition system in IM
1. The chatmate, using an IM-enabled client, establishes a connection with the Instant Messaging server, which authenticates the chatmate through the Directory Server. If the chatmate is authenticated, he or she can start sending messages.
2. The Instant Messenger server forwards messages, which include audio (VDB), text (WDB), or both, to the transaction database (TDB). The TDB stores the messages exchanged between two or more chatmates after unnecessary words have been filtered out by checking the Ignore Words Database (IGWDB) using the IRS technique.
3. The Apriori algorithm is applied to the TDB, and the patterns detected are stored in the Transaction Pattern Database (TPDB).
4. Apriori is applied again to the TPDB to check for phishing words; if any are detected, they are sent to the Phishing Database (PDB).
5. If phishing words are detected, a YES is forwarded to the Instant Messenger server; otherwise a NO is forwarded.
6. If the result is YES, the Instant Messenger sends an alert message to the victim chatmate about the possible phishing attack; if the result is NO, the Instant Messenger server proceeds further.
Messages exchanged in the IM, whether text, audio, or both, are checked by the Anti-Phishing Detection system (APD). If phishing words are found, APD-IM sends an alert message to the chatmate at one or both ends, depending on where the APD-IM system is installed; its architecture is shown in Fig. 1. Text words are detected and stored in the WDB, whereas audio words are stored in the VDB, as already explained above in Sections III.B to III.E. The overall working steps of the APD-IM system are shown in Fig. 7.
Figure 7. General flow of the APD-IM system for detecting phishing words.
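The confidence check on rules such as I1∧I12 ⇒ I2 can be sketched as follows. The five transactions are hypothetical (Table 4 and the actual item assignments are not reproduced here); they are chosen only so that the three rules built on I12 reach 100% confidence. The function names are illustrative, not the authors' code.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions, min_confidence):
    """Return (antecedent, consequent, confidence) for each rule derived
    from the itemset whose confidence meets min_confidence."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            base = support_count(antecedent, transactions)
            if base == 0:
                continue
            confidence = whole / base
            if confidence >= min_confidence:
                rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical TDB rows (assumption, for illustration only).
transactions = [
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I12"}),
    frozenset({"I1", "I2", "I14"}),
    frozenset({"I1", "I3"}),
    frozenset({"I2", "I5"}),
]
rules = rules_from_itemset({"I1", "I2", "I12"}, transactions, min_confidence=1.0)
```

With these rows, confidence is support(I1, I2, I12) divided by support(antecedent), so only antecedents that contain I12 survive the 100% threshold; I1 ⇒ I2∧I12, for instance, reaches only 2/4.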
The working of the APD-IM algorithm is shown in Fig. 8.
Input: Instant messages in the Transaction Database (TDB) (day to day)
Output: Alert phishing message to IM chatmate if detected
Do
    // Apply IRS filtering against IGWDB, pick words, and push them to TDB.
    // Text and audio words (WDB and VDB) are merged and stored in VWDB,
    // as discussed in Section III.
    Do
        // Apply Apriori to find relevant patterns in TDB
        Call Apriori algorithm and scan TDB
        Push patterns to TPDB   // Transaction Pattern Database
    While TDB != NULL
    // Apply Apriori with user-defined min_support and confidence on TPDB
    Do
        Re-call Apriori algorithm and scan TPDB
        Derive association rules dynamically for freq_words
        Calculate confidence   // user-defined
        Check the rules satisfying threshold   // user-defined
        If (confidence satisfies threshold value)
            // Pick relevant words and push phishing words to PDB permanently
            Scan TPDB and push words to PDB
    While TPDB != NULL   // satisfy min threshold support and confidence
    If (a word in TDB is found in PDB)   // check phish words in TDB
        Report to Instant Messenger chatmate as phishing word
    Else
        Return to IM   // do nothing
While TDB != NULL
Figure 9. Shows Databases tables (TDB, IGWDB, TPDB, PDB, and Chat
backupDB, VDB, WDB, VWDB) which is used by DataProcess program.
in Fig. 10. The detected phish words are updated to PDB
database.
Figure 8. Shows Algorithm of APD-IM for storing transactions and
reporting to IM chatmate regarding Phishing detection in IM.
APD-IM is implemented using Apache Tomcat 6.0 as the web
server, which creates a separate session for each chatmate with
browser support (Internet Explorer 6.5 or higher), SQL Server 2005
for the database, and Java 6.0 for the Apriori algorithm, which
finds frequent patterns in the database using the Information
Retrieval System (IRS) technique; ODBC/JDBC drivers provide connectivity.
Software simulation tools such as MATLAB and SPECTRAPLUS
are also used for spectrum analysis of the speech
signal, calculating word parameters dynamically using FFT and LPC
coefficients.
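To make the two speech parameters concrete, the following sketch computes an FFT magnitude peak and LPC coefficients in plain Python. The authors used MATLAB and SPECTRAPLUS; this is only an illustration under stated assumptions: the function names, the 8 kHz sampling rate and the synthetic 1 kHz tone are hypothetical, and the LPC routine uses the standard autocorrelation method with the Levinson-Durbin recursion, which the paper does not confirm as its method.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the largest non-DC peak in the magnitude spectrum."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[:half]]
    peak_bin = max(range(1, half), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / len(samples)

def lpc(samples, order):
    """LPC predictor coefficients a[1..order] such that
    x[n] is approximated by sum(a[j] * x[n - j]); autocorrelation method."""
    n = len(samples)
    r = [sum(samples[i] * samples[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order (Levinson-Durbin step).
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz (256 samples).
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(256)]
print(dominant_frequency(tone, RATE))
print(lpc(tone, 2))
```

In a word recognizer, parameters like these would be computed per spoken word and compared against stored templates; the matching scheme itself is outside this sketch.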
The sequence of steps is shown in Fig. 2. When messages
are sent between the chatmates, a number of databases are
used dynamically: chatData/TDB (stores messages between the
current chatmates), chatData_bkp (stores historical chat messages),
Ignorewords/IGWDB (stores ignore words, prepositions, etc.,
which are to be neglected; used by the IRS), phishwords/PDB (stores
phishing words detected dynamically), Transpatters/TPDB
(stores frequent patterns detected), voicewords/VDB,
Textwords/WDB, and voicetextwords/VWDB. Some of
them are shown in Fig. 9.
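The role of the ignore-word table in the IRS step can be sketched as follows. This is a hypothetical illustration: the in-memory set `IGWDB`, the list `tdb`, the function `irs_filter` and the sample messages stand in for the SQL Server tables and are not taken from the authors' code.

```python
import re

# Hypothetical contents of Ignorewords/IGWDB (prepositions, articles, etc.).
IGWDB = {"the", "a", "an", "is", "to", "of", "in", "on",
         "me", "your", "please", "and"}

def irs_filter(message, ignore_words=IGWDB):
    """Split a chat line into lowercase words and drop the ignore words."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    return [w for w in words if w not in ignore_words]

# chatData/TDB stand-in: one list of retained words per chat line.
tdb = []
for line in ["Please send me your password", "The weather is nice"]:
    tdb.append(irs_filter(line))

print(tdb)
```

Only the retained words reach the TDB, so the later Apriori scan works on a smaller vocabulary.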
The DataProcess program detects phishing words using the
TPDB database, which consists of patterns generated between
the chatmates from the TDB database. It combines the
Information Retrieval System technique and the Apriori
algorithm, and must always be running in an active state so
that it identifies frequent patterns from the messages and
detects phishing words, as shown in Fig. 10.
Figure 10. DataProcess program identifies frequent occurrences of patterns
using ARM technique (Apriori is used) with min support and min confidence.
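The final check performed by DataProcess, comparing the words of a transaction against the PDB and alerting the chatmate, might look like the sketch below. The function names, the alert text and the sample PDB contents are assumptions made for illustration; the paper does not specify the alert format.

```python
def detect_phishing(transaction_words, pdb):
    """Return the words of one TDB transaction that are already in PDB."""
    return [w for w in transaction_words if w in pdb]

def report(transaction_words, pdb):
    """Build an alert for the IM chatmate, or None when nothing is detected."""
    hits = detect_phishing(transaction_words, pdb)
    if hits:
        return "Phishing alert: " + ", ".join(hits)
    return None  # do nothing; control returns to the IM

# Hypothetical phishwords/PDB contents.
pdb = {"password", "account"}
print(report(["send", "password", "account"], pdb))
print(report(["weather", "nice"], pdb))
```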
The DataProcess program checks that the number of lines between
the chatmates is < 25 (user-defined limit). The APD-IM
system is tested on the number of transactions (lines) between the
chatmates, with user-defined minimum support and minimum
confidence, versus the number of phishing words detected from
the transaction patterns database (TPDB), as shown in Fig. 11(a) and
Fig. 11(b).
It is observed that as the number of transactions (145 lines)
between the chatmates increases, the transaction patterns and
phishing words follow a constant straight line, as seen on the
X-Y axes of Fig. 11(b); the system may then fail to detect phishing
words as predicted, so frequent deletion of transactions is required.
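One way to keep the number of stored lines below the user-defined limit is a fixed-size window that discards the oldest line first. The paper does not say how transactions are deleted, so the oldest-first policy, the class name `TransactionWindow` and the idea of moving evicted lines to a backup list (standing in for chatData_bkp) are assumptions of this sketch.

```python
from collections import deque

MAX_LINES = 25  # user-defined limit mentioned in the text

class TransactionWindow:
    """Holds the most recent chat lines, always fewer than max_lines."""

    def __init__(self, max_lines=MAX_LINES):
        # maxlen = max_lines - 1 keeps the line count strictly below the limit.
        self.lines = deque(maxlen=max_lines - 1)
        self.backup = []  # stand-in for chatData_bkp (historical messages)

    def add(self, line):
        if len(self.lines) == self.lines.maxlen:
            # The oldest line is about to be evicted; keep it in the backup.
            self.backup.append(self.lines[0])
        self.lines.append(line)

window = TransactionWindow()
for i in range(30):
    window.add("line %d" % i)
print(len(window.lines), len(window.backup))
```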
IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS
[Figure 11(a): columnar graph; series: Transaction Patterns, Phishing words]
Figure 11. (a) Shows Columnar Graph of Transaction patterns vs Phishing
Words detected from Transactions for min-skewed-support (2,3,4,5,6) & min-conf 60%.
[Figure 11(b): X-Y graph; x-axis: total number of transactions between users; y-axis: transaction patterns and phishing words detected; series: Transaction Patterns, Phishing words]
Figure 11. (b) Shows Transaction Patterns vs Phishing Words
detected vs min-skewed support and min-conf 60%
V. CHALLENGES AND FUTURE WORK
The APD-IM is designed to detect deceptive phishing for
messages in text and audio format. We have shown the
experimental results for text messages and acoustic voice
messages (converted into words). The APD-IM system is quite
complex to design for a video instant messaging system,
because it requires the integration of one more sub-component,
Image Processing, in the Multiplexer to capture images
from run-time video; this will be discussed later.
The other issues yet to be done are:
• Short forms are to be expanded and stored in the table,
with unique identifiers.
• When voice consists of numbers, their conversion to
character words, such as the numeral ‘0’ and the word
‘Zero’, is still a challenging task; similarly, dates and
fractional numbers (5/2) in speech require conversion.
• A number may be said as "double two" (22), and similarly for
Roman numbers; "kg" can be kilogram or something else.
• Instant Messengers must be enhanced to detect video
phishing collaboratively with audio and text messages.
The future looks promising, as the APD-IM can be enhanced to
meet the requirements of wireless and mobile Instant
Messengers for 3G and 4G technologies. The APD-IM
can be successfully integrated into Instant Messengers if
distributors of IM wish to share the data and avoid deceptive
phishing attacks; we have tested it by creating our own Instant
Messenger test bed.