Logistic regression and Naive Bayes

[Title slide, shown over an excerpt of the course notes on naive Bayes (Figure 6.3: the naive Bayes graphical model, with label y and features x1, x2, x3 independent given the label). The same excerpt reappears, more completely, on the Naive Bayes slide later in the deck.]
Reminders/Comments

• Assignment 2 due next week
• Clarifications for notes and assignment
  • The number of epochs is the number of times we iterate through the dataset in stochastic gradient descent
  • Matching pursuit algorithm: simple idea of computing correlation to residual errors. Provided a paper with nice pseudocode, if confused
Thought question

• “Linear regression is very specific in the sense that Y is some kind of special function of X, now if we just throw too many independent variables into the mix, which may not be very smart, it may be difficult to find hidden patterns as the algorithm only look for linear, additive patterns among them. How do we avoid adding unnecessary variables in our algorithm? I think for linear regression to work we already should have a basic understanding of the data and thus decide our variables carefully, but is this the only way.”
• This is largely the purpose of regularization
• In particular, l1 regularization subselects features
Logistic regression is a linear classifier

[Figure 6.2: Sigmoid function f(t) = 1/(1 + e^{-t}) on the interval [-5, 5].]

• Hyperplane separates the two classes
• P(y=1 | x, w) >= 0.5 only when w^T x >= 0
• P(y=0 | x, w) > 0.5 only when P(y=1 | x, w) < 0.5, which happens when w^T x < 0

[Figure: positive (+) and negative example points, such as (x1, y1) and (x2, y2), plotted in the (x1, x2) plane and separated by the line w0 + w1 x1 + w2 x2 = 0.]

[Excerpt from the course notes shown on the slide:] “... as negative (ŷ = 0). Thus, the predictor maps a (d + 1)-dimensional vector x = (x0 = 1, x1, ..., xd) into a zero or one. Note that P(Y = 1 | x, w) >= 0.5 only when w^T x >= 0. The expression w^T x = 0 represents the equation of a hyperplane that separates positive and negative examples. Thus, the logistic regression model is a linear classifier.

6.1.2 Maximum conditional likelihood estimation

To frame the learning problem as parameter estimation, we will assume that the data set D = {(xi, yi)}_{i=1}^n is an i.i.d. sample from a fixed but unknown probability distribution p(x, y). Even more specifically, we will assume that the data generating process randomly draws a data point x, a realization of the random vector (X0 = 1, X1, ..., Xd), according to p(x), and then sets its class label Y according to the Bernoulli distribution ...” [remainder cut off on the slide]
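As a quick sketch of the threshold rule above (a minimal illustration, not the course's own code): thresholding the sigmoid output at 0.5 is the same as thresholding w^T x at 0.

import numpy as np

def sigmoid(t):
    # f(t) = 1 / (1 + e^{-t}), the sigmoid from Figure 6.2
    return 1.0 / (1.0 + np.exp(-t))

def predict(w, x):
    # x is assumed to already include the bias entry x0 = 1
    p1 = sigmoid(w @ x)            # P(Y = 1 | x, w)
    return 1 if p1 >= 0.5 else 0   # equivalently: 1 if w @ x >= 0 else 0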
Logistic regression versus linear regression

• Why might one be better than the other? They both use a linear approach.
Discriminative versus generative

• So far, have learned p(y | x) directly
• Could learn other probability models, and use probability rules to obtain p(y | x)
• In generative learning,
  • learn p(x | y) p(y) (which gives the joint p(x, y) = p(x|y) p(y))
  • compute p(x | y) p(y), which is proportional to p(y | x)

  p(y|x) = p(x|y) p(y) / p(x)

• Question: how do we use p(x|y) p(y) for prediction?
How to use generative model?

• Our decision rule for using these probabilities is the same as with logistic regression: pick class with the highest probability
• Assume you’ve learned p(x | y) and p(y) from a dataset
• Compute

  f(x) = arg max_{y in Y} p(y|x)
       = arg max_{y in Y} p(x|y) p(y) / p(x)
       = arg max_{y in Y} p(x|y) p(y)
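A minimal sketch of this arg max rule (the dictionary-of-functions interface is hypothetical, purely for illustration):

def generative_predict(x, prior, conditional):
    # prior[c] = p(y = c); conditional[c](x) = p(x | y = c)
    # returns arg max_y p(x | y) p(y), which is the same as arg max_y p(y | x)
    scores = {c: conditional[c](x) * prior[c] for c in prior}
    return max(scores, key=scores.get)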
Pros and Cons

• Discriminative
  • focus learning on the value we care about: p(y | x)
  • can be much simpler, particularly if features x are complex, making p(x | y) difficult to model without strong assumptions
• Generative
  • can be easier for an expert to encode prior beliefs, e.g., for classifying trees as evergreen or deciduous, the structure/distribution over the features (height, location) can be more clearly specified by p(x | y), whereas p(y | x) does not allow this information to be encoded
  • can sample from the generative model
  • naive Bayes can perform better in the small sample setting
    • see “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes”, Ng and Jordan, 2002
February 27

• Assignment 2 due this week
• Comment about optimization
  • can use first-order gradient descent
  • I have taught you about second-order methods, but you will not be tested on them, nor need to use them in your assignments
• Standard error is the sample standard deviation, divided by the square root of the number of samples. For A2 the number of samples is the number of runs (see the sketch below)
  • each run gives an estimate of expected error for a learned weight vector on a training subsample
  • we want to know the variance in error across these learned w
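For concreteness, a small sketch of the standard-error computation (the per-run error values are hypothetical placeholders):

import numpy as np

run_errors = np.array([0.12, 0.15, 0.11, 0.14, 0.13])        # hypothetical test error from each run
std_err = run_errors.std(ddof=1) / np.sqrt(len(run_errors))  # sample std dev / sqrt(number of runs)
print(run_errors.mean(), std_err)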
Thought questions

• About overfitting:
  • Why would one want high bias, low variance?
  • Shouldn't providing a more accurate function of a higher degree of polynomial (that goes through more points) be better for future predictions? The example you gave doesn't make sense to me as it would appear that the model (a linear function) of the prior data was just a line of best fit, but none of the data points were on the line of the actual model.
  • Is overfitting due to a mismatch between train and test data? i.e., when training distribution is different from test
High-bias, low-variance

We can plot four different cases representing combinations of both high and low bias and variance.

[Fig. 1: Graphical illustration of bias and variance — a 2x2 grid of bulls-eye diagrams for low/high bias crossed with low/high variance. The excerpt’s “1.3 Mathematical Definition” subsection (after Hastie et al., 2009) is cut off on the slide.]
High-degree polynomial

• Red line is the true model
• Should the polynomial be better for future predictions?
• Why are none of the points on the true line (red line)?

[Figure 4.4: Example of a linear vs. polynomial fit on a data set shown in Figure 4.1. The linear fit, f1(x), is shown as a solid green line, whereas the cubic polynomial fit, f3(x), is shown as a solid blue line. The dotted red line indicates the target linear concept.]

[Excerpt from the course notes shown on the slide:] “... as on a large discrete set of values x in {0, 0.1, 0.2, ..., 10} where the target values will be generated using the true function 1 + x^2. Using a polynomial fit with degrees p = 2 and p = 3 results in w2* = ...” [remainder cut off]
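To make the slide's question concrete, here is a small illustration (hypothetical data standing in for the figure; the linear target and the noise level are assumptions, not taken from the notes): the cubic fit tracks the training noise more closely, and its prediction at a new x can land much further from the true model than the linear fit's.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 10)
y_true = 1 + 0.5 * x                        # assumed linear "true model", for illustration only
y = y_true + rng.normal(0, 0.7, size=x.shape)

w1 = np.polyfit(x, y, deg=1)                # linear fit f1(x)
w3 = np.polyfit(x, y, deg=3)                # cubic fit f3(x)

x_new = 6.0                                 # a future point outside the training range
print(np.polyval(w1, x_new), np.polyval(w3, x_new), 1 + 0.5 * x_new)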
Mismatch between train and test

• We do not usually refer to overfitting because of a mismatch between training and test data
• Usually assume train and test data is identically distributed
• Rather, if we completely fit to any subsample of data, even if it is representative, we could get overfitting
  • e.g., imagine memorizing training points (xi, yi), and just reporting yi if you see xi, and otherwise reporting 0 (see the sketch below)
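A tiny sketch of that memorizer (purely illustrative): it has zero training error but has learned nothing that generalizes.

def memorizer(train_pairs):
    # memorize the training pairs exactly
    lookup = {tuple(x): y for x, y in train_pairs}
    def predict(x):
        # report the stored y_i if x was seen during training, otherwise report 0
        return lookup.get(tuple(x), 0)
    return predict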
Why does regularization prevent overfitting?

• Imagine expanding the features into a high-dimensional polynomial
• Can fit points exactly by using these degrees of freedom to add and subtract different polynomial terms
  • e.g., yhat = 10*x1 - 10*x2; yhat not large, but coefficients might be
• Discussed that X^T X might not be full rank, or it might be nearly singular
  • i.e., have singular values close to zero
• When those singular values are inverted, it produces really large values, so coefficients are likely to be high magnitude
• ||w||_2 regularization prefers values near zero, and adds lambda to the singular values of X^T X (see the sketch below)
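A minimal sketch of how the l2 penalty enters the least-squares solution (not the course's own code): the penalty shifts X^T X by lambda * I, so near-zero singular values no longer blow up when inverted.

import numpy as np

def ridge_weights(X, y, lam):
    # regularized least squares: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)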
How can we determine the order of the polynomial?

• If too high-order (e.g., 9th degree polynomial), might overfit
  • includes terms like x^9
• If too low-order (e.g., 2nd degree polynomial), might underfit
  • includes terms like x^2
• Do we need an expert to decide?
• Alternative: over-parametrize (high-degree), and use l1 regularization to sub-select features
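One way to sketch this idea (using scikit-learn purely as an illustration; the course assignments implement their own solvers, and the data and alpha here are made up): expand to a degree-9 polynomial, then let the l1 penalty drive many of the coefficients to exactly zero.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 40).reshape(-1, 1)
y = 1 + 0.5 * x.ravel() + rng.normal(0, 0.1, 40)     # hypothetical data with a simple target

X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)
model = Lasso(alpha=0.01, max_iter=50000).fit(X_poly, y)
print(model.coef_)   # many of the 9 polynomial coefficients are driven exactly to zero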
Thought question

• Is it possible to provide a prediction and an estimate of how confident we are in that prediction?
  • “Let's say the data you are modeling is inherently high-variance and it may not be appropriate to give exact estimates for new data. Is it possible to generate a y-hat that is something of a bar, e.g. a line, linear or non-linear, but that defines a distribution range around it implying that data will likely fall within some N(mu, sigma^2) of the line?”
• When we learn p(y | x) with mu = <x, w>, we can also learn sigma to model how much y varies for a given x
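A minimal sketch of that idea (assuming a Gaussian p(y | x) with a single learned sigma; the names are illustrative):

import numpy as np

def predict_with_range(x, w, sigma):
    # p(y | x) assumed N(mu, sigma^2) with mu = <x, w> and a learned sigma
    mu = x @ w
    return mu, (mu - 2 * sigma, mu + 2 * sigma)   # roughly a 95% range around the line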
Naive Bayes

• How do we realistically learn p(x | y)?
• One option is to make a (strong) conditional independence assumption: the features are independent given the label
• Example: given a patient does not have the disease (y=0), attributes about the patient are uncorrelated (e.g., age & smokes)
  • even within a class, age and smokes could be correlated
• Surprisingly, despite the fact that this seems unrealistic, in practice this can work well
  • one hypothesis is we are running these algorithms on “easy” data
  • another is that dependencies skew the distribution equally across all classes, so no one class gets an increased probability

[Background: excerpt from the course notes, Section 6.2, including Figure 6.3 (the naive Bayes graphical model, with features x1, x2, x3 independent given the label y). Under the naive Bayes assumption, the features are independent given the label, so p(x|y) = prod_{j=1}^d p(xj | y). For binary features each factor is Bernoulli, p(xj | y = c) = pj,c^{xj} (1 - pj,c)^{1 - xj} with pj,c = p(xj = 1 | y = c), i.e., one parameter pj,c for each class value c and each feature. As with logistic regression, even though we learn a generative model, the decision rule for labeling a point as class 1 (i.e., y = 1) is p(y = 1 | x) >= p(y = 0 | x), where now p(y = 1 | x) is proportional to p(x | y = 1) p(y = 1). Section 6.2.1 (Binary features and linear classification) notes that naive Bayes is a linear classifier for binary features, but not necessarily in general.]
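A minimal sketch of naive Bayes with binary features (maximum likelihood estimates; the names are illustrative, and in practice you would add Laplace smoothing to avoid zero probabilities):

import numpy as np

def fit_bernoulli_nb(X, y, n_classes=2):
    # X: (n, d) array of binary features; y: (n,) labels in {0, ..., n_classes - 1}
    priors = np.array([(y == c).mean() for c in range(n_classes)])    # p(y = c)
    p = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])  # p_{j,c} = p(x_j = 1 | y = c)
    return priors, p

def predict_bernoulli_nb(x, priors, p):
    # arg max_c p(y = c) * prod_j p_{j,c}^{x_j} (1 - p_{j,c})^{1 - x_j}, computed in log space
    log_scores = np.log(priors) + (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    return int(np.argmax(log_scores))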
Types of features

• Before, we had to choose the distribution over y given x
  • e.g. p(y | x) is Gaussian for continuous y
  • e.g. p(y | x) is Poisson for y in {1, 2, 3, …}
  • e.g. p(y | x) is Bernoulli for y in {0,1}
• Parameters to p(y | x) had E[y | x] = f(xw)
• Now we need to choose the distribution over x given y; how do we pick p(x | y)? What are the parameters?
How do we pick distributions?

• For p(x), picked Gaussian, Bernoulli, Poisson, Gamma
  • depending on the values x could take, or looking at a plot
• How do we pick conditional distributions, like p(y | x)?
• General answer: based on the properties of y. Once the given features x are fixed, we are just learning a distribution over y
Exercise: binary features

• For binary features x in {0,1}, binary y in {0,1}, what is p(x | y)?
• How do we learn p(x | y)?
• How do we learn p(y)?
Exercise: continuous features

• For continuous features x, binary y in {0,1}, what could we choose for p(x | y)?
• What are the parameters and how do we learn them?
Continuous naive Bayes

* Image from Yaroslav Bulatov
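A common choice for continuous features is one Gaussian per feature and class; a minimal sketch (illustrative names), using the mu_{j,c} and sigma_{j,c} parameters that also appear on the later "incorporating new data" slide:

import numpy as np

def fit_gaussian_nb(X, y, n_classes=2):
    # one Gaussian per (feature j, class c): estimate mu_{j,c} and sigma_{j,c}
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    mus = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    sigmas = np.array([np.maximum(X[y == c].std(axis=0), 1e-6) for c in range(n_classes)])  # small floor avoids divide-by-zero (my addition)
    return priors, mus, sigmas

def predict_gaussian_nb(x, priors, mus, sigmas):
    # arg max_c log p(y = c) + sum_j log N(x_j; mu_{j,c}, sigma_{j,c}^2)
    log_dens = -0.5 * np.log(2 * np.pi * sigmas ** 2) - (x - mus) ** 2 / (2 * sigmas ** 2)
    return int(np.argmax(np.log(priors) + log_dens.sum(axis=1)))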
Exercise

• Imagine someone gives you p(x | y) and p(y)
• They give you a test instance, with features x
  • e.g., x is an image, of pixel values (values between 0 and 255)
• Your goal is to predict the output y
  • e.g., if the image contains a cat or not
• How do you use p(x | y) and p(y) to make a prediction?
Whiteboard

• Multiclass classification
• Naive Bayes
March 1, 2017

• Deadline change for Assignment 2 to March 2, 11:59 p.m.
  • remove #import plotfcns from your file, as plotfcns is not given to you
  • pseudocode for all questions in notes, except for MPRegression
• Thank you for the feedback
  • concerns about pace of course
  • more real-world examples
• Update to reading assignment
  • no longer have to read about factorization approaches
  • added a bit of reading on running experiments
Difficulty of course/assignments

• Assignments are meant to challenge you
• Quiz/Final will have simpler questions that can be answered in the allotted time
  • Final designed to take 1 hour, you are given 2 hours
• You will be allowed a 4 page cheat sheet
• Lectures are supposed to complement assignments
  • lectures are not really there for assignments, assignments are for you
• Additional reading materials, such as Bishop book or Barber book
A few clarifications

• Polynomial regression is not a type of GLM (generalized linear model)
  • The term “generalized” refers to generalizing the output distribution
  • Logistic regression and linear regression are examples of GLMs, with Bernoulli and Gaussian exponential family distributions respectively
• OLS and maximum likelihood with a Gaussian for p(y | x) are equivalent
• Bias unit (an additional feature added to the feature vector that is always one) versus bias from the regularizer: these are different
  • the term bias unit should really be called intercept unit
Thought question

• In the linear classifiers section, it mentions that x_0 = 1 is used. Why is it better to use this bias as opposed to using an activation function which takes outputs of 0 into account?
• Additional question: what happens when we add an intercept unit (bias unit) to the features for logistic regression?
Adding a column of ones to X for logistic regression (and GLMs)

• This provides the same outcome as for linear regression
• g(E[y | x]) = x w —> the bias unit in x with coefficient w0 shifts the function left or right (where g = inverse of sigmoid)

*Figure from http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks
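A one-line sketch of adding the intercept column (the feature matrix here is a hypothetical example):

import numpy as np

X = np.random.randn(100, 3)                        # hypothetical feature matrix
X_bias = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1; its weight w0 is the intercept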
Feedback quiz

• Difference between logistic regression and linear regression
  • assume different distribution on y, given x: logistic uses Bernoulli for p(y | x) and linear uses Gaussian for p(y | x)
• Question about datasets
  • targets {-3.0, 2.2, -5.3, -1.0, 4.3} —> linear regression
  • targets {High, Low, High, High, Low} —> logistic regression, naive Bayes
  • targets {1, 2, 3, 2, 1} —> naive Bayes, since y is always in these three categories (or multinomial logistic regression)
Whiteboard

• Complete naive Bayes
Practical considerations: incorporating new data

• How do we incorporate new data into naive Bayes?
• Learned mu_{j,c} and sigma_{j,c} for each feature j, each class c
• Keep a running average of mu_{j,c} and sigma_{j,c}, with the number of samples n_{j,c} seen so far for feature j, class c
• For a new sample (x, y), increment n_{j,y} and update the parameters for class y using (sketch below):
  • mu_{j,y} = mu_{j,y} + (x_j - mu_{j,y}) / n_{j,y}
  • sndmoment_{j,y} = sndmoment_{j,y} + (x_j^2 - sndmoment_{j,y}) / n_{j,y}
  • sigma_{j,y} = sqrt(sndmoment_{j,y} - mu_{j,y}^2)
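A minimal sketch of these running updates (the data structure is illustrative; each class is assumed to start with n = 0 and zero arrays, and the small floor on the variance is my addition, not from the slide):

import numpy as np

def update_class_stats(stats, x, y):
    # stats[y] holds running 'n', 'mu', 'sndmoment', 'sigma' arrays (one entry per feature j)
    s = stats[y]
    s['n'] += 1
    s['mu'] += (x - s['mu']) / s['n']
    s['sndmoment'] += (x ** 2 - s['sndmoment']) / s['n']
    s['sigma'] = np.sqrt(np.maximum(s['sndmoment'] - s['mu'] ** 2, 1e-12))
    return stats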
Practical considerations: incorporating new data

• Imagine you have a model (say weights w) and now want to incorporate new data
• How can you do this with regression approaches?
• Option 1: add data to your batch and recompute entire solution
• Option 2: start from previous w (warm start), add data to your batch and do a few gradient descent steps
• Option 3: use stochastic gradient descent update and step in the direction of the gradient for those samples
  • for a constant stream of data, will only ever use one sample and then throw it away
Stochastic gradient descent

Algorithm 2: Stochastic Gradient Descent(Err, X, y)
1: w ← random vector in R^d
2: for i = 1, ..., number of epochs do
3:   for t = 1, ..., n do
4:     // For many settings, we need step-size eta_t to decrease with time
5:     // For example, a common choice is eta_t = eta_0 t^{-1} or
6:     // eta_t = eta_0 t^{-1/2} for some initial eta_0, such as eta_0 = 1.0.
7:     w ← w - eta_t ∇Err_t(w) = w - eta_t (x_t^T w - y_t) x_t
8: return w

For logistic regression: ∇Err_t(w) = x_t (sigma(x_t^T w) - y_t)

[Also visible on the slide: the initialization lines of Algorithm 3, Batch gradient descent for l1-regularized linear regression(X, y, alpha): w ← 0 ∈ R^d, err ← ∞, tolerance ← 10e-4, XX ← (1/n) X^T X, Xy ← (1/n) X^T y, and a step-size eta computed from XX; the rest of the algorithm is cut off.]
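A minimal Python sketch of Algorithm 2 (names and defaults are illustrative; set logistic=True to use the logistic-regression gradient shown above, and note the global step counter for the decreasing step-size is one choice among several):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd(X, y, n_epochs=10, eta0=1.0, logistic=False):
    n, d = X.shape
    w = np.random.randn(d) * 0.01              # "random vector in R^d"
    step = 0
    for epoch in range(n_epochs):
        for t in range(n):                     # one pass over the data = one epoch (shuffling each epoch is also common)
            step += 1
            eta_t = eta0 / np.sqrt(step)       # decreasing step-size, eta_t = eta0 * step^{-1/2}
            pred = sigmoid(X[t] @ w) if logistic else X[t] @ w
            w = w - eta_t * (pred - y[t]) * X[t]   # w <- w - eta_t * grad Err_t(w)
    return w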
Stochastic gradient descent

• How do we decide when to stop? How do we select step-size?
• See Bottou’s tricks for SGD: http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf
  • this is not a polished document, but useful anyway
• Convergence proof if step-sizes decreasing to zero, at a reasonable rate (that is not too quickly)
• In practice, additional approaches to improve convergence
  • monitor convergence by checking full training error periodically
  • many variance-reduced approaches now (SVRG)
Batch GD versus SGD

• What are some advantages of batch gradient descent?
  • less noisy gradient
  • more clear strategies for selecting stepsize
• What are some advantages of SGD?
  • more computationally efficient: does not waste an entire epoch to provide an update to the weights
  • convenient for updating current solution with new data
• Does batch always give better solutions than SGD?