Bayesian spam filtering

JOHN HOSSLER
Bayesian Analysis – Final Project
BAYESIAN SPAM FILTERING
1 Background
∙ Definition: Bayesian spam filtering (BSF) is a statistical technique for e-mail filtering
that uses a naive Bayes classifier to identify spam e-mail
∙ BSF began in 1996 to sort e-mail into folders
∙ 1998: first publication regarding BSF
∙ spam: illegitimate e-mail
∙ ham: legitimate e-mail (bacn)
∙ IDEA: particular words have associated probabilities of occurring in spam or ham
− Example: "Viagra," etc. appears frequently in spam but seldom in ham
− the filter does not know these probabilities a priori, so it must be built/trained
− manually mark an e-mail as spam or ham
− for all words in the e-mail, the filter adjusts the word probabilities (a minimal counting sketch in R follows this list)
⊳ high spam probabilities for "Viagra", "refinance", etc.
⊳ low spam probabilities for names, personal references, etc.
− after training, use word probabilities, i.e. likelihood functions, to compute the
probability an email is spam
− can use all words in an e-mail or only "interesting" ones
− each word contributes to this probability (each word's contribution is its posterior probability P(S∣W))
− if the total probability exceeds a threshold (e.g., 95%), the filter marks the e-mail as spam → move to junk folder
(some filters quarantine instead → user judges/reviews)
− the initial training is refined as false positives and false negatives are identified by the user
− some filters use other pre-defined rules, often exchanging adaptability for filter
accuracy
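A minimal sketch of this counting/training step in R, assuming a tiny hand-labeled corpus (the e-mails and words below are made up for illustration); the relative word frequencies in each class play the role of the learned probabilities P(W∣S) and P(W∣H):
# hypothetical hand-labeled training e-mails (made up for illustration)
spam_train = c("buy viagra now", "free viagra offer")
ham_train  = c("lunch meeting tomorrow", "free for a meeting tomorrow")
spam_words = unlist(strsplit(spam_train, " "))
ham_words  = unlist(strsplit(ham_train, " "))
# learned word probabilities: relative frequency of each word in spam vs. ham
table(spam_words) / length(spam_words)
table(ham_words)  / length(ham_words)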
2 Filter
∙ Mathematics: for any given word 𝑤 and message 𝑚,
− 𝑃 (𝑆∣𝑊 ) = probability 𝑚 is spam given 𝑤 ∈ 𝑚;
− 𝑃 (𝑆) = probability any given message is spam;
− 𝑃 (𝑊 ∣𝑆) = probability 𝑤 is in a spam message;
− 𝑃 (𝐻) = probability any given message is ham [1 − 𝑃 (𝑆)];
− 𝑃 (𝑊 ∣𝐻) = probability 𝑤 is in a ham message;
− then by Bayes’ Theorem,
P(S∣W) = P(W∣S) P(S) / [ P(W∣S) P(S) + P(W∣H) P(H) ].
⊳ the spamicity (spaminess) of a word 𝑤 is 𝑃 (𝑆∣𝑊 ) for 𝑤
− often filters let 𝑃 (𝑆) = 𝑃 (𝐻) = 0.5
⇒ P(S∣W) = P(W∣S) / [ P(W∣S) + P(W∣H) ]
− some studies estimate P(S) ≈ 0.8, i.e., most e-mail traffic is spam
− success depends on a "large" number of training messages (ideally ≈ 50% ∈ S, 50% ∈ H)
− this is "naive" since it assumes words occur independently (not so in English)
⊳ Example: the occurrence of an adjective is not independent of the occurrence of a noun
− however, with this assumption we have
p = ∏_{i=1}^{N} p_i / [ ∏_{i=1}^{N} p_i + ∏_{i=1}^{N} (1 − p_i) ],
where
𝑝 = probability a suspected message is spam
𝑝𝑖 = 𝑃 (𝑆∣𝑊𝑖 ) = probability it is spam given it has
some word 𝑤𝑖 , 𝑖 = 1, . . . , 𝑁
− thus we have a naive (assumption) Bayes (formula) classifier (filter action); a tiny worked example in R follows
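A tiny worked example of the single-word spamicities and their naive combination, using made-up likelihoods (the words and values are purely illustrative):
# made-up likelihoods for two words found in a message
WgS = c(viagra = 0.32, meeting = 0.01)    # P(W|S)
WgH = c(viagra = 0.001, meeting = 0.10)   # P(W|H)
# per-word spamicity, assuming P(S) = P(H) = 0.5
p_i = WgS / (WgS + WgH)
# naive Bayes combination over the words
p = prod(p_i) / (prod(p_i) + prod(1 - p_i))
p   # about 0.97: the "viagra" evidence outweighs "meeting"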
∙ Problems:
− words not encountered during learning often omitted (since 𝑃 (𝑆∣𝑊 ) = 0/0)
− words with low frequencies during learning are not as trustworthy
− some filters use strength (𝑠) component
⊳ P′(S∣W) = [ s·P(S) + n·P(S∣W) ] / ( s + n )
⊳ P′ is the corrected probability
⊳ s = strength of background info
⊳ n = # occurrences of w during learning
⊳ P(S) often taken to be 0.5
⊳ can let n = 0, as above, to get P′(S∣W) = P(S)
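A minimal R sketch of this correction (the function name corrected_spamicity, the strength s = 3, and the counts are assumptions for illustration):
# corrected spamicity: blend the prior P(S) with the observed P(S|W),
# weighted by the background strength s and the training count n
corrected_spamicity = function(PSgW, n, s = 3, PS = 0.5) {
  (s*PS + n*PSgW) / (s + n)
}
corrected_spamicity(PSgW = 0.95, n = 1)     # rarely seen word: pulled toward 0.5
corrected_spamicity(PSgW = 0.95, n = 100)   # frequently seen word: stays close to 0.95
corrected_spamicity(PSgW = 0.95, n = 0)     # unseen word: exactly P(S) = 0.5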
∙ Heuristics:
− neutral words (e.g., "the", "a", "some") are often ignored
− spamicity ≈ 0.5 carries little information
− spamicity near 0 or 1 is the most informative
⊳ some methods keep only the k words with greatest ∣0.5 − p_i∣, i = 1, . . . , k (see the R sketch after this list)
− words appearing multiple times may be counted once or once per occurrence, depending on the filter
− patterns (sequence of words) → context window
⊳ Example: ". . . but Viagra is good . . . "
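A minimal R sketch of the "keep the k most interesting words" heuristic, using made-up per-word spamicities:
# made-up spamicities p_i for the words found in one message
p_i = c(the = 0.50, free = 0.95, meeting = 0.10, viagra = 0.99, some = 0.52)
k = 3
# keep the k words whose spamicity is farthest from the neutral value 0.5
interesting = p_i[order(abs(p_i - 0.5), decreasing = TRUE)][1:k]
interesting
# combine only the interesting words, as in the formula above
prod(interesting) / (prod(interesting) + prod(1 - interesting))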
3 R Example
∙ see the R code below for a very small-scale simulation
∙ see http://spambayes.sourceforge.net/ to attach a BSF to your e-mail account
4 Thanks
∙ wikipedia.com
∙ many other websites
∙ John Bardsley
5 MERRY CHRISTMAS !!!!!!!
# FINAL EXAM R-CODE -- BAYESIAN ANALYSIS 514
# -----------------------------------------
# vocabulary of tracked words
W = c("free","sex","girl","viagra","need","big","i","special","drunk","sexual","bonus","action","pleasure","it")
# word frequencies in the spam training data, used as P(W|S)
WgS = c(90,1,5,38,1,3,10,5,1,1,1,3,2,1)/118
# word frequencies in the ham training data, used as P(W|H);
# abs(jitter(0)) gives a tiny positive value so no probability is exactly zero
WgH = c(100,3,10,abs(jitter(0)),310,61,410,82,abs(jitter(0)),1,57,14,abs(jitter(0)),813)/3270
# word spamicity, assuming P(S) = P(H) = 0.5
S = 0.5
H = 1-S
SgW = WgS/(WgS+WgH)
SgW
# 0.13 is an interesting cutoff
email1 = c("i want to eat")
email2 = c("get free stuff")
email3 = c("i need to go to the store")
email4 = c("i need to go to the store for viagra")
email5 = c("i have big news")
email6 = c("meet at the store at 5")
email7 = c("call now for a night of big pleasure")
email8 = c("it is my pleasure to inform you that you win the award")
email9 = c("i am pleased to to inform you that you win the award")
email10 = c("enter to win big")
email11 = c("enter for free to win big")
emailMat = rbind(email1,email2,email3,email4,email5,email6,email7,email8,email9,email10,email11)
alpha = 0.95                    # spam threshold
n = dim(emailMat)[1]            # number of mock e-mails
spamicity = rep(0,n)
spam = rep(0,n)
for (i in 1:n){
  # split the e-mail into words and keep only the tracked vocabulary words
  x = unlist(strsplit(emailMat[i,1]," "))
  y = x[na.omit(match(W,x))]
  # per-word spamicities for the words present, then the naive Bayes combination
  SgWi = WgS[W%in%y]/(WgS[W%in%y]+WgH[W%in%y])
  spamicity[i] = round(prod(SgWi)/(prod(SgWi)+prod(1-SgWi)),3)
}
spam[spamicity>=alpha]="SPAM"
spam[spamicity<alpha]="ham"
# mock e-mails, e-mail spamicity, spam indication
cbind(emailMat,spamicity,spam)