Discrete Probability Theory 3: Estimation Lecture Notes

CM2104: Computational Mathematics
Discrete Probability Theory 3: Estimation
Prof. David Marshall
School of Computer Science & Informatics
Estimators
Unbiased Estimation
Maximum likelihood
Bayesian inference
Language models
Estimators
So far, we have mainly been concerned with the question: given a probability
distribution with known characteristics, what can we say about the probability
of certain events?
What is the probability of getting heads 4 times in 5 trials: given that
X ∼ B(5, 0.5), what is P(X ≥ 4)?
What is the probability of seeing at least 1 car at the intersection in a 5
minute window, if there are on average 5 per window: given X ∼ Pois(5), what is
P(X ≥ 1)?
In practice, often we only know to which class of distributions a random
variable belongs, but not the actual parameters of that distribution. These
parameters then need to be estimated from observations:
Suppose in 100 coin tosses, we see heads 57 times, what can we say
about the characteristics of the coin that was used, i.e. assuming
X ∼ B(100, p), what should be the value of p?
Suppose the number of cars at the intersection in a series of four 5
minute windows was 6, 4, 2, 9. If X ∼ Pois(λ), what should be the
value of λ?
Consider a random variable X with an unknown distribution.
A sample is a vector a = (a1 , ..., an ) with ai ∈ SX , where the values ai are
interpreted as the outcomes of different realisations of the corresponding
experiment.
Assume that pX is known up to the value of some parameter θ ∈ R. To
emphasise the fact that pX depends on θ, we write pX(a) = p(a; θ) for all
a ∈ SX.
The unknown parameter θ is called the estimand or true value.
An estimator is a real-valued function θ̂ which maps samples to estimations
of θ, i.e.

θ̂ : SX × ... × SX → R
    (a1, ..., an) ↦ θ̂(a1, ..., an)
Note that θ̂ is itself a random variable.
Example
Consider the problem of estimating the probability that a given coin shows
heads.
We write X for the outcome of the experiment of flipping the coin, with X = 1
for heads and X = 0 for tails, then
pX (0) = 1 − θ
pX (1) = θ
By observing a number of flips, we obtain the following sample:
(0, 0, 1, 1, 1, 0, 1, 1)
We can use the following estimator

θ̂ : {0, 1} × ... × {0, 1} → R
    (a1, ..., an) ↦ θ̂(a1, ..., an) = (a1 + ... + an) / n

which gives us

θ̂ = 5/8
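As a quick check, this estimator is one line of MATLAB; a minimal sketch using the sample above:

a = [0 0 1 1 1 0 1 1];           % the observed flips (1 = heads, 0 = tails)
theta_hat = sum(a) / length(a)   % = mean(a) = 0.625 = 5/8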
Properties
To ensure that an estimator yields reasonable results, a number of
properties are often demanded of estimators:
The bias of an estimator is the expected value of the error it makes
(considering that estimators are random variables):
Bias(θ̂) = E [θ̂ − θ] = E [θ̂] − θ
Ideally, an estimator should be unbiased, i.e. have zero bias, in which case
E [θ̂] = θ.
The variance of an estimator is defined as usual:

Var[θ̂] = E[(θ̂ − E[θ̂])²]

(The closely related mean squared error, E[(θ̂ − θ)²], equals the variance
whenever the estimator is unbiased.)
Ideally, an estimator should have minimal variance.
Let θ̂n be an estimator, where n refers to the sample size. Then θ̂ is called
consistent if for all ε > 0:
lim_{n→∞} P(|θ̂n − θ| > ε) = 0
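For example, the sample-mean estimator of a coin's θ is consistent. A minimal simulation sketch (the true value θ = 0.3 is an assumption made purely for illustration) shows the estimates settling around θ as n grows:

theta = 0.3;                                  % assumed true parameter, for illustration only
for n = [10 100 10000 1000000]
    a = rand(1,n) < theta;                    % n simulated coin flips
    fprintf('n = %8d   theta_hat = %.4f\n', n, mean(a));
end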
Unbiased Estimation
We can use the above to evaluate estimators. The most efficient (best)
estimator is unbiased and has the smallest variance.
Example: If X1, X2, X3 is a random sample taken from a population
with mean µ and variance σ², which of the following estimators is
the best (i.e. unbiased with the smallest variance)?

T1 = (X1 + X2 + X3) / 3
T2 = (X1 + 2X2) / 3
T3 = (X1 + 2X2 + 3X3) / 3
Bias in T1
Note E (Xi ) = µ for i = 1 . . . 3
So
E(T1) = E[(X1 + X2 + X3)/3]
      = (1/3)(E(X1) + E(X2) + E(X3))
      = (1/3)(3µ)
      = µ
Therefore, Bias(T1 ) = E (T1 ) − µ = 0. UNBIASED
Bias in T2 and T3
Similarly, we may show the bias in T2 and T3.

E(T2) = E[(X1 + 2X2)/3] = (1/3)(µ + 2µ) = µ

Therefore, Bias(T2) = E(T2) − µ = 0. UNBIASED

E(T3) = E[(X1 + 2X2 + 3X3)/3] = (1/3)(µ + 2µ + 3µ) = 2µ

Therefore, Bias(T3) = E(T3) − µ = 2µ − µ = µ ≠ 0. BIASED
Variance in T1 and T2
Note Var (Xi ) = σ 2 for i = 1 . . . 3
Var(T1) = Var[(X1 + X2 + X3)/3]
        = (1/9)(Var(X1) + Var(X2) + Var(X3))
        = (3/9)σ² = (1/3)σ²

Similarly,

Var(T2) = Var[(X1 + 2X2)/3] = (1/9)(σ² + 4σ²) = (5/9)σ²
So T1 has the lowest variance and is unbiased, so it is the best
estimator.
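This can also be checked empirically. The following minimal sketch assumes a normal population with µ = 10 and σ = 2 (any population with that mean and variance would do) and estimates the bias and variance of T1, T2 and T3 by simulation:

mu = 10; sigma = 2; N = 100000;            % assumed population parameters and repetitions
X = mu + sigma*randn(N,3);                 % each row is one sample (X1, X2, X3)
T1 = (X(:,1) + X(:,2) + X(:,3)) / 3;
T2 = (X(:,1) + 2*X(:,2)) / 3;
T3 = (X(:,1) + 2*X(:,2) + 3*X(:,3)) / 3;
bias = [mean(T1) mean(T2) mean(T3)] - mu   % approx. [0  0  mu]
vars = [var(T1) var(T2) var(T3)]           % approx. [sigma^2/3  5*sigma^2/9  14*sigma^2/9]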
Maximum likelihood estimation
The likelihood of parameter θ, given the sample a = (a1, ..., an), is defined as:

L(θ|a) = ∏_{i=1}^{n} pX(ai | θ)
In other words, the parameter θ is considered likely to the extent that the
corresponding probability distribution p(.|θ) makes the sample a probable
outcome.
The maximum likelihood estimator chooses the value for θ that maximises
the likelihood:
θ̂MLE(a) = arg max_θ L(θ|a)
Example: Consider again the problem of finding the probability θ that a coin
shows heads, and assume that in a sequence a of n trials, we have observed
heads k times.
The likelihood is given by (θ ∈ [0, 1]):

L(θ|a) = (n choose k) · θ^k · (1 − θ)^(n−k)

Since (n choose k) does not depend on θ, and since ln (the natural logarithm) is a
monotonic operator, it suffices to find the value of θ which maximises:

ln(θ^k · (1 − θ)^(n−k)) = k · ln(θ) + (n − k) · ln(1 − θ)
We have

d/dθ (k · ln(θ) + (n − k) · ln(1 − θ)) = k/θ − (n − k)/(1 − θ)

Hence the likelihood is maximal when:

k/θ − (n − k)/(1 − θ) = 0   iff   k · (1 − θ) = (n − k) · θ   iff   θ = k/n
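The same maximiser can also be found numerically. A minimal sketch, assuming the coin data used elsewhere in these notes (k = 57 heads in n = 100 tosses), minimises the negative log-likelihood over θ:

n = 100; k = 57;                              % assumed observations
negLogL = @(theta) -(k*log(theta) + (n-k)*log(1-theta));
theta_mle = fminbnd(negLogL, eps, 1-eps)      % approx. k/n = 0.57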
MATLAB Example: Binomial Distribution Estimators
binoestimator.m
% Suppose in 100 coin tosses, we see heads 57 times,
% what can we say about the characteristics of the
% coin that was used, i.e. assuming
% X ~ B(100, p), what should be the value of p?
[maxlikli p_c_lims] = binofit(57,100);
% p_c_lims contains a "confidence" interval in
% which the true probability lies

%%% ANOTHER EXAMPLE
%%% Generate a sample from B(100,0.6)
r = binornd(100,0.6);
% Fit samples to distribution.
[maxlikli p_c_lims] = binofit(r,100)
% Note 0.6 is within the limits of p_c_lims
binofit(X,N) — Returns estimates of the probability of success for the binomial distribution where X is
number of successes and N is the number of trials: see doc/help binofit.
binornd(N,P) — generates random numbers from the binomial distribution with parameters specified by
the number of trials, N, and probability of success for each trial, P: see doc/help binornd
See also: doc/help binocdf, binoinv, binopdf, binostat
MATLAB Example: Poisson Distribution Estimators
poissonestimator.m
% Suppose the number of cars at the intersection in a series of four 5 minute
% windows was 6, 4, 2, 9. If X ~ Pois(lambda), what should be the value of lambda?
[max_likeli lambda_c_lim] = poissfit([6,4,2,9])

poissfit(X) — Returns the estimate of the parameter of the Poisson
distribution given the data X: see doc/help poissfit
Also poissfit(X,ALPHA) variant.
See also: doc/help poisscdf, poissinv, poisspdf, poissrnd, poisstat
Bayesian inference
Using Bayes’ rule, we find that:
P(θ = θ0 | X = a) = P(X = a | θ = θ0) · P(θ = θ0) / P(X = a)
We can estimate θ0 as the value that maximises the right-hand side of the
above formula.
The probability P(θ = θ0 ) then encodes our prior belief about which values of
θ are plausible, while P(X = a|θ = θ0) encodes the likelihood of θ0 given the
sample a.
If the sample size n is small, the value of P(θ = θ0 |X = a) will mostly
be influenced by our prior beliefs P(θ = θ0 ).
The larger the sample size, the more P(θ = θ0 |X = a) will be influenced
by P(X = a|θ = θ0 ) = L(θ0 |a).
We call P(θ = θ0) the prior distribution and P(θ = θ0 | X = a) the posterior
distribution.
Example
Suppose we have two biscuit tins:
The first tin contains 10 chocolate and 30 plain biscuits,
The second contains 20 of each.
We choose one of the tins at random, then choose a biscuit at random from
that tin.
If a plain biscuit is observed, estimate which tin was chosen.
Let θ = 1 if the first tin was chosen, and θ = 2 if the second tin was chosen.
Before observing the biscuit, it is reasonable to suppose that each tin was
equally likely to have been selected, so we adopt the uniform prior distribution
for θ:
P(θ = 1) = 1/2        P(θ = 2) = 1/2
Now, let B be the event that a plain biscuit was observed.
The probability of this event (relative to the prior distribution) is:
P(B) = P(B|θ = 1) · P(θ = 1) + P(B|θ = 2) · P(θ = 2)
     = (3/4 · 1/2) + (1/2 · 1/2)
     = 5/8
The fact that a plain biscuit was observed allows us to update our belief
regarding the distribution of θ, by computing the posterior distribution:
P(θ = 1|B) = P(θ = 1) · P(B|θ = 1) / P(B) = (1/2 · 3/4) / (5/8) = 3/5

P(θ = 2|B) = P(θ = 2) · P(B|θ = 2) / P(B) = (1/2 · 1/2) / (5/8) = 2/5
We can then estimate the number of the tin as
θ̂ = arg max_i P(θ = i|B) = 1
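A minimal MATLAB sketch of this calculation, with the priors and likelihoods as given above:

prior      = [1/2 1/2];        % P(theta = 1), P(theta = 2)
likelihood = [3/4 1/2];        % P(B | theta = 1), P(B | theta = 2)
posterior  = prior .* likelihood / sum(prior .* likelihood)   % [3/5 2/5]
[~, theta_hat] = max(posterior)                               % theta_hat = 1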
Language models
We can think of text documents as the outcomes of a (random) experiment.
Let the random variable X = (W1 , ..., Wn ) represent a text document where
Wi corresponds to the i th word in the document.
The sample space corresponding to each Wi is a vocabulary of all terms that
are allowed to be used in the document (i.e. the set of all words that are used
in a given language)
The probability distribution of X is called a language model.
By estimating the language model underlying a document, or collection of
documents, we can make certain inferences about these documents explicit and
use this in many applications.
Language models: Applications
Language models are used in many areas of computer science:
Speech recognition systems use language models to improve their
accuracy
“It’s fun to recognise speech?”
“It’s fun to wreck a nice beach?”
Both sentences sound similar, so speech recognition systems need a
language model to find out that the first transcription is more likely
In information retrieval, a language model is estimated for each
document, and the probability of the query, given the language model of
a document is used to assess whether or not the document is relevant to
the query
To implement a spam filter, two language models are trained: one
corresponding to spam messages, and one corresponding to normal
messages. The probability that an email was generated using these
models is then used to assess whether it is likely to be spam.
Language models: Simplifying Assumptions
Different language models are based on different simplifying assumptions,
which allow us to estimate the required parameters from collections of text
documents
Simple language models ignore the word order, and treat the variables
W1 , ..., Wn as independent variables. In such a case, the language model
corresponds to a multinomial distribution.
Other language models make it more/less likely that a given word wi
succeeds a sequence of words wi−k wi−k+1 ...wi−1 .
Even more advanced language models take grammatical structure into
account
More advanced language models require more documents to obtain reliable
estimates (which may not be available), and they are more computationally
demanding.
In applications such as information retrieval and text classification, it is
therefore common to use multinomial distributions as language models
Bag of Words Model
Simple language models (i.e. ignoring word order) are called Bag of Words
models.
Basically counts the occurrence of words in a phrase, sentence, paragraph,
etc.
Text is broken into individual elements — Tokenisation.
Common words (e.g. so, and, or, the), so-called Stop
Words, are removed as these do not add meaningful statistics.
Punctuation is removed.
Often words are converted into common forms — stemming
or lemmatisation and synonymisation.
Some form of classifier is used to compare.
More on this later — also see lab class exercises; a rough sketch of these
pre-processing steps is given below.
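A very rough base-MATLAB sketch of these pre-processing steps (the sentence, stop-word list and choice of helper functions here are illustrative, not the lab-class code; stemming and lemmatisation are omitted):

sentence  = 'So the quick brown fox jumps over the lazy dog, and the quick dog barks!';
stopwords = {'so','and','or','the','a'};               % illustrative stop-word list
s = regexprep(lower(sentence), '[^a-z\s]', '');        % lower case, strip punctuation
words = strsplit(strtrim(s));                          % tokenisation
words = words(~ismember(words, stopwords));            % stop-word removal
[vocab, ~, idx] = unique(words);                       % bag of words:
counts = accumarray(idx(:), 1)'                        % count of each remaining word
vocab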
Spam filter
We will consider a spam filter, so we are interested in finding the probability
that an email message is spam.
To do this we assume that we have a collection of spam messages s1 , ..., sn
and non-spam messages d1 , ..., dm from which we want to estimate two
multinomial language models Θspam and Θnormal .
As Θspam and Θnormal represent multinomial distributions, they are completely
specified by encoding a probability p(w; Θ) for each word w in the vocabulary.
Using maximum likelihood estimation, we can find:

p(w; Θspam) = (number of times word w occurs in the documents s1, ..., sn) / (total number of word occurrences in s1, ..., sn)

p(w; Θnormal) = (number of times word w occurs in the documents d1, ..., dm) / (total number of word occurrences in d1, ..., dm)
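For instance, a minimal sketch with hypothetical token counts (over the five-word vocabulary used in the MATLAB example later in these notes):

vocab      = {'spam','Viagra','buy','jazz','rugby'};  % the toy vocabulary used later
spamCounts = [40 70 60 20 10];                        % hypothetical counts in s1,...,sn
hamCounts  = [20 10 50 70 50];                        % hypothetical counts in d1,...,dm
p_spam   = spamCounts / sum(spamCounts)               % maximum likelihood p(w; Theta_spam)
p_normal = hamCounts  / sum(hamCounts)                % maximum likelihood p(w; Theta_normal)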
Now consider a previously unseen email e for which we want to decide whether
or not it is spam.
Suppose that this email contains fi occurrences of word wi (1 ≤ i ≤ k) and f
word occurrences in total.
Let the random variable X take the values spam or normal, and let E be the
event of seeing message e. Then we find using Bayes’ rule:
P(X = spam|E) = P(E|X = spam) · P(X = spam) / P(E)

P(X = normal|E) = P(E|X = normal) · P(X = normal) / P(E)
Because we are only interested in the value of P(X = spam|E ) relative to the
value of P(X = normal|E ), we can ignore constant factors, i.e. we just
consider:
P(X = spam|E ) ∝ P(E |X = spam) · P(X = spam)
P(X = normal|E ) ∝ P(E |X = normal) · P(X = normal)
From the definition of the multinomial distribution, we find:
p(E|X = spam) = f! / (f1! · ... · fk!) · ∏_{i=1}^{k} p(wi; Θspam)^fi

p(E|X = normal) = f! / (f1! · ... · fk!) · ∏_{i=1}^{k} p(wi; Θnormal)^fi
and because we can ignore constant factors:
p(E|X = spam) ∝ ∏_{i=1}^{k} p(wi; Θspam)^fi

p(E|X = normal) ∝ ∏_{i=1}^{k} p(wi; Θnormal)^fi
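As an aside, the full multinomial probabilities (including the multinomial coefficient) can be evaluated with the Statistics Toolbox function mnpdf. A minimal sketch with hypothetical word counts and word probabilities (the same numbers as the toy data later in these notes):

f        = [3 5 2 0 0];                      % hypothetical word counts f1,...,fk in e
p_spam   = [0.20 0.35 0.30 0.10 0.05];       % hypothetical p(wi; Theta_spam)
p_normal = [0.10 0.05 0.25 0.35 0.25];       % hypothetical p(wi; Theta_normal)
P_E_spam   = mnpdf(f, p_spam)                % full multinomial probability, incl. f!/(f1!...fk!)
P_E_normal = mnpdf(f, p_normal)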
The prior probabilities P(X = spam) and P(X = normal) can either be chosen
to be uniform:
P(X = spam) = P(X = normal) = 1/2

or using maximum likelihood, in which case they are estimated as:

P(X = spam) = n / (n + m)
P(X = normal) = m / (n + m)
where n and m are the number of spam messages and normal messages,
respectively, in our collection
In summary, we end up with (when using maximum likelihood priors):
P(X = spam|E) ∝ ∏_{i=1}^{k} p(wi; Θspam)^fi · n/(n + m) = qspam

P(X = normal|E) ∝ ∏_{i=1}^{k} p(wi; Θnormal)^fi · m/(n + m) = qnormal
From which we can recover the probability that the message is spam:
P(X = spam|E) = qspam / (qspam + qnormal)
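A minimal sketch of this classification rule, with hypothetical word probabilities, word counts and collection sizes (none of these numbers come from real data):

p_spam   = [0.20 0.35 0.30 0.10 0.05];   % hypothetical p(wi; Theta_spam)
p_normal = [0.10 0.05 0.25 0.35 0.25];   % hypothetical p(wi; Theta_normal)
f = [3 5 2 0 0];                         % hypothetical word counts in the new email
n = 200; m = 300;                        % hypothetical numbers of spam / normal messages
q_spam   = prod(p_spam   .^ f) * n/(n+m);
q_normal = prod(p_normal .^ f) * m/(n+m);
P_spam_given_E = q_spam / (q_spam + q_normal)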
Note: In practice, the probability p(wi; Θspam) will be zero as soon as wi does
not occur in any of the known spam messages. In other words, as soon as a
spam message uses a single word which had previously not been used in a spam
message, it would not be recognised as such.
In practice, zero probabilities are avoided by applying a form of smoothing,
e.g.:
p(w; Θspam) ∝ ((number of times word w occurs in the documents s1, ..., sn) + 1) / ((total number of word occurrences in s1, ..., sn) + k)

p(w; Θnormal) ∝ ((number of times word w occurs in the documents d1, ..., dm) + 1) / ((total number of word occurrences in d1, ..., dm) + k)
There are other smoothing methods that may be used, and the performance of
the spam classifier may crucially depend on choosing the right method
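A minimal sketch of this add-one (Laplace-style) smoothing, with hypothetical counts over a k-word vocabulary:

spamCounts = [40 70 60 0 10];                            % hypothetical counts; note the zero
k = length(spamCounts);                                  % vocabulary size
p_unsmoothed = spamCounts / sum(spamCounts)              % contains a zero probability
p_smoothed   = (spamCounts + 1) / (sum(spamCounts) + k)  % add-one smoothing: no zeros remain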
A MATLAB Multinomial Spam Filter Example
This example illustrates classification using naive Bayes and multinomial
predictors.
This example reads text from a text file for both training and testing the
naive Bayes classifier
Sentences are assumed to be simple with words separated by a single
space char and with no other punctuation present.
‘Toy’ Training and Test Data has been created:
Spam Data — contains high frequencies of the words:
’spam’, ’Viagra’,’buy’
Ham Data — contains high frequencies of the words:
’jazz’, ’rugby’
We generate Spam and Ham sentences with only these 5 words,
using specified token probabilities:
tokens = {'spam', 'Viagra', 'buy', 'jazz', 'rugby'};
% Token relative frequencies
tokenProbs = [0.2 0.35 0.3 0.1 0.05;...
              0.1 0.05 0.25 0.35 0.25];
tokensPerEmail = 20;             % Fixed for convenience
n = 1000;                        % Sample size
........
Y = randsample([-1 1],n,true);   % Random labels
X = zeros(n,5);
X(Y == 1,:)  = mnrnd(tokensPerEmail,tokenProbs(1,:),sum(Y == 1));  % SPAM
X(Y == -1,:) = mnrnd(tokensPerEmail,tokenProbs(2,:),sum(Y == -1)); % HAM
and draw samples from a multinomial distribution:
mnrnd(N,P) — returns a random vector chosen from the multinomial
distribution with parameters N and P: see doc/help mnrnd.
See spam_filter_intdata.m for making the training data, and make_corpus.m.
Example SPAM Training/Test sentences (spam_train.txt, spam_test.txt):
Viagra spam Viagra spam Viagra buy rugby Viagra Viagra buy Viagra Viagra Viagra buy Viagra Viagra
buy Viagra Viagra jazz
Viagra Viagra Viagra Viagra buy Viagra spam rugby Viagra spam spam buy buy jazz spam spam buy
jazz spam jazz
spam spam Viagra Viagra spam buy buy buy jazz buy spam Viagra Viagra buy spam Viagra Viagra
spam buy Viagra
Example HAM Training/Test sentences (ham_train.txt, ham_test.txt):
buy rugby jazz rugby jazz spam Viagra jazz buy spam Viagra buy jazz buy jazz jazz rugby jazz
jazz rugby
jazz jazz buy buy jazz Viagra buy buy jazz rugby jazz rugby Viagra rugby jazz buy jazz buy
jazz jazz
Viagra jazz jazz jazz rugby buy jazz rugby jazz rugby rugby jazz rugby jazz spam buy jazz buy
jazz rugby
MATLAB Code: Multinomial Spam Filter Example Outline
Specify tokens: 'spam', 'Viagra', 'buy', 'jazz', 'rugby'
Read and parse training sentences for Spam and Ham.
Count occurrences of the specified tokens
See parse_sentence.m
Train a multinomial classifier
Use fitcnb(X_train,Y_train,'Distribution','mn')
specifying the multinomial, 'mn', distribution.
X_train are the token counts for each sentence:
SPAM and HAM (concatenated)
Y_train are the class labels for each sentence:
1 = SPAM, -1 = HAM.
Read and parse test sentences
Classify test sentences as SPAM or HAM
MATLAB Code: Multinomial Spam Filter Example (1)
spam_filter.m∗
% Expt Set up
tokens = {'spam', 'Viagra', 'buy', 'jazz', 'rugby'};
% Data has been output from spam_filter_intdata.m as follows:
%   tokenProbs = [0.2 0.35 0.3 0.1 0.05;...
%                 0.1 0.05 0.25 0.35 0.25];  % Token relative frequencies
%   tokensPerEmail = 20;                     % Fixed for convenience
%   n = 1000;                                % Sample size
% so 'spam', 'Viagra', 'buy' statistically occur more frequently in SPAM
%    'jazz', 'rugby' statistically occur more frequently in HAM
%%%% Read in training data
spam = Read_TextFile(’spam_train.txt’);
ham = Read_TextFile(’ham_train.txt’);
% Get Bag of Words --- Frequencies of each token word
spam_train = parse_sentence(tokens,spam);
ham_train = parse_sentence(tokens,ham);
%%% Make Classifier Training Data
X_train = [spam_train; ham_train]; % concatenate Frequencies
% Make labels for data 1 = spam, -1 = ham
Y_train = ones(1,length(X_train));
Y_train(length(spam_train)+1:length(X_train)) = -1*ones(1,length(ham_train));
∗ For supporting functions and related files download the zip file: Spam Filter.zip
MATLAB Code: Multinomial Spam Filter Example (2)
spam_filter.m cont.
% Train the Classifier
% Train a naive Bayes classifier. Specify that the predictors are multinomial.
Mdl = fitcnb(X_train,Y_train,'Distribution','mn');
% Mdl is a trained ClassificationNaiveBayes classifier.
% Assess the in-sample performance of Mdl by estimating the misclassification error.
isGenRate = resubLoss(Mdl,'LossFun','classiferror') % The in-sample misclassification rate
% Read in test data see spam_filter_intdata.m
spam_new = Read_TextFile(’spam_test.txt’);
ham_new = Read_TextFile(’ham_test.txt’);
% Get Bag of Words --- Frequencies of each token word
spam_test = parse_sentence(tokens,spam_new);
ham_test = parse_sentence(tokens,ham_new);
%%% Make Classifier Test Data
X_test = [spam_test; ham_test]; % concatenate Frequencies
% Make labels for data 1 = spam, -1 = ham
Y_test = ones(1,length(X_test));
Y_test(length(spam_test)+1:length(X_test)) = -1*ones(1,length(ham_test));
% Assess Classifier Performance
% Classify the new emails using the trained naive Bayes classifier Mdl,
% and determine whether the algorithm generalises well. Low number is good.
%
oosGenRate = loss(Mdl,X_test,Y_test)
Full code (incl. supporting files) is at: (Folder) Spam Filter or Spam Filter.zip (all files zipped).
MATLAB Naive Bayes Classifier
For more information on the MATLAB naive Bayes Classifier see
doc ClassificationNaiveBayes
M = fitcnb(X,Y) – returns a naive Bayes model, M, for
predictors X and class labels Y with two or more classes.
Additional parameters can specify distribution (‘mn’ here)
and other properties: see doc/help fitcnb
resubLoss(M) — returns resubstitution (training data)
classification cost for model M.
loss(M,X,Y) — returns classification cost for model M
computed using matrix of predictors X and true class labels Y
More advanced Spam Filtering
More complex spam filters may use:
Stop word removal: remove common words such as ‘the’, ‘a’, ‘and’, etc.
Stemming and Lemmatisation:
Stemming describes the process of transforming a word into
its root form and may produce non-words, e.g.
A swimmer likes swimming, thus he swims.
↓
A swimmer like swim , thu he swim .
Lemmatisation aims to obtain a common base form of the
words: e.g. am, are, is ⇒ be ; car, cars, car’s, cars’ ⇒ car
A swimmer likes swimming, thus he swims.
↓
A swimmer like swimming , thus he swim .
Synonymisation — map words of similar meaning to a common word, e.g.
v1agra to viagra.
Removal of punctuation — obvious.
More advanced Spam Filtering: N-grams
N-gram — a contiguous sequence of n items from a given sequence
(sentence). An n-gram of size 1 is referred to as a unigram;
size 2 is a bigram (or a digram); size 3 is a trigram. Larger
sizes are sometimes referred to by the value of n, e.g.,
four-gram, five-gram.
A unigram:
A
swimmer
likes
swimming
,
thus
he
swims
.
A bigram:
A swimmer
swimmer likes
likes swimming
swimming thus
...
A trigram:
A swimmer likes
swimmer likes swimming
likes swimming thus
...
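A rough MATLAB sketch of extracting n-grams from an already tokenised sentence (this helper loop is illustrative, not the lab-class code; punctuation tokens are dropped, as in the examples above):

words = {'A','swimmer','likes','swimming','thus','he','swims'};  % tokenised sentence
n = 2;                                         % 2 = bigrams, 3 = trigrams, ...
ngrams = cell(1, numel(words)-n+1);
for i = 1:numel(words)-n+1
    ngrams{i} = strjoin(words(i:i+n-1), ' ');  % join n consecutive tokens
end
ngrams                                         % {'A swimmer','swimmer likes','likes swimming',...}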
See Lab Class Exercises