Wes, Delaram, and Emily
MA751
Exercise 4.5
Consider a two-class logistic regression problem with x ∈ R. Characterize the maximum-likelihood estimates of the slope and intercept
parameter if the sample xi for the two classes are separated by a point x0 ∈ R. Generalize this result to (a) x ∈ Rp and (b) more than two
classes.
Solution: Without loss of generality, suppose that x0 = 0 and that the coding is y = 1 for xi > 0 and y = 0 for xi < 0. Now, suppose that
\[
p(x; \beta) = \frac{\exp\{\beta x + \beta_0\}}{1 + \exp\{\beta x + \beta_0\}}
\]
so that
\[
1 - p(x; \beta) = \frac{1}{1 + \exp\{\beta x + \beta_0\}}.
\]
Since x0 = 0 is the boundary, p(x0) = 1 − p(x0), which forces β0 = 0. Therefore,
\[
p(x; \beta) = \frac{\exp\{\beta x\}}{1 + \exp\{\beta x\}}
\]
so that
\[
1 - p(x; \beta) = \frac{1}{1 + \exp\{\beta x\}}.
\]
Therefore, the likelihood function
\[
L(\beta; y, x) = \prod_{i=1}^{N} p(x_i; \beta)^{y_i}\,[1 - p(x_i; \beta)]^{1 - y_i}
= \prod_{i=1}^{N} \left[\frac{p(x_i; \beta)}{1 - p(x_i; \beta)}\right]^{y_i} [1 - p(x_i; \beta)]
= \prod_{i=1}^{N} \frac{[\exp\{\beta x_i\}]^{y_i}}{1 + \exp\{\beta x_i\}}
\]
so that the log-likelihood function
\[
\ell(\beta; y, x) = \sum_{i=1}^{N} \Big( y_i\,\beta x_i - \log\big[1 + \exp\{\beta x_i\}\big] \Big).
\]
Taking the derivative with respect to β and substituting in the proper coding of yi gives
\begin{align*}
\frac{d\ell(\beta; y, x)}{d\beta}
&= \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}} \right)\\
&= \sum_{x_i > 0} x_i \left( 1 - \frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}} \right)
 - \sum_{x_i < 0} x_i\,\frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}}\\
&= \sum_{x_i > 0} x_i - \sum_{i=1}^{N} x_i\,\frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}}.
\end{align*}
Setting the above equal to zero gives
\[
\sum_{x_i > 0} x_i = \sum_{i=1}^{N} x_i\,\frac{\exp\{\beta x_i\}}{1 + \exp\{\beta x_i\}}.
\]
Clearly, for any data set {x_i}_{i=1}^{N}, this equality can hold only in the limit β → ∞: as β → ∞ the weight exp{βx_i}/(1 + exp{βx_i}) tends to 1 for x_i > 0 and to 0 for x_i < 0, so the right-hand side approaches the left-hand side but never reaches it for finite β. Hence the maximum-likelihood estimate of the slope diverges.
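The divergence can be checked numerically. The following is a minimal sketch (assuming NumPy is available; the sample values and the step size are chosen only for illustration) that runs gradient ascent on the log-likelihood above for a sample separated at x0 = 0: the slope keeps growing while the score stays positive, only tending to zero.

```python
import numpy as np

# Separated 1-D sample: negatives below x0 = 0, positives above (illustrative values).
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.0, 2.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

beta = 0.0   # slope; the intercept is 0 because x0 = 0
lr = 0.1     # gradient-ascent step size (illustrative choice)

for it in range(1, 20001):
    p = np.exp(beta * x) / (1.0 + np.exp(beta * x))   # p(x_i; beta)
    score = np.sum(x * (y - p))                       # d l / d beta from above
    beta += lr * score
    if it % 5000 == 0:
        print(f"iteration {it:6d}   beta = {beta:7.2f}   score = {score:.2e}")
# beta increases without bound; the score never reaches zero for any finite beta.
```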
(b) Now, suppose that there are K classes such that x1 separates classes one and two, x2 separates classes two and three, and so on up to xK−1, which separates classes K − 1 and K, with −∞ = x0 < x1 < x2 < . . . < xK−1 < xK = ∞. Now, define the probabilities
\begin{align*}
p_1(x; \beta) &= \frac{\exp\{\beta_1 x + \beta_{01}\}}{1 + \sum_{j=1}^{K-1} \exp\{\beta_j x + \beta_{0j}\}}\\
p_2(x; \beta) &= \frac{\exp\{\beta_2 x + \beta_{02}\}}{1 + \sum_{j=1}^{K-1} \exp\{\beta_j x + \beta_{0j}\}}\\
&\;\;\vdots\\
p_K(x; \beta) &= \frac{1}{1 + \sum_{j=1}^{K-1} \exp\{\beta_j x + \beta_{0j}\}}.
\end{align*}
Now, suppose that the coding is yi = 1 if xj−1 < xi < xj and yi = 0 otherwise for observation i = 1, . . . , N and class j = 1, . . . , K. Therefore,
the likelihood function
\[
L(\beta; y, x) = \prod_{j=1}^{K} \prod_{i=1}^{N_j} [p_j(x_i; \beta)]^{y_i},
\]
where Nj is the number of observations in class j, so that the log-likelihood function
\begin{align*}
\ell(\beta; y, x)
&= \sum_{j=1}^{K-1} \sum_{i=1}^{N_j} y_i \log\!\left[\frac{\exp\{\beta_j x_i + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}}\right]
 + \sum_{i=1}^{N_K} y_i \log\!\left[\frac{1}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}}\right]\\
&= \sum_{j=1}^{K-1} \sum_{i=1}^{N_j} y_i\,[\beta_j x_i + \beta_{0j}]
 - \sum_{j=1}^{K} \sum_{i=1}^{N_j} y_i \log\!\left[1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}\right].
\end{align*}
Now, we determine the values of β0j . First, note that β0j is a function of βj , xj−1 , and xj . So that the expression p(x; β) maintains proper
form, for xj−1 < x < xj we define
\[
p(x; \beta_j)
= \frac{\exp\{\beta_j (x - x_{j-1})\} - \exp\{\beta_j (x - x_j)\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}
= \frac{\exp\{\beta_j x\}\,\big[\exp\{-\beta_j x_{j-1}\} - \exp\{-\beta_j x_j\}\big]}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}}
= \frac{\exp\{\beta_j x + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x + \beta_{0l}\}},
\]
where β_{0j} = log[exp{−β_j x_{j−1}} − exp{−β_j x_j}], which is well defined for β_j > 0 since x_{j−1} < x_j. The reason for the beginning step of the formulation above is that, for example, when x ∈ (x1, x2), so that x classifies to class two, the probability function appears as in the figure below, where it was assumed that x1 = 1 and x2 = 2.
[Figure: p(x; β_j) for the class-two interval, plotted for x from 0.5 to 3.0 (vertical axis 0.0 to 1.2), with x1 = 1 and x2 = 2.]
Now, taking the derivative with respect to β = (β1 , . . . , βK−1 ) and substituting in the proper coding of yi gives
\[
\frac{d\ell(\beta; y, x)}{d\beta_j}
= \sum_{i=1}^{N_j} \left( x_i + \frac{d\beta_{0j}}{d\beta_j} \right)
 - \sum_{i=1}^{N} \left( x_i + \frac{d\beta_{0j}}{d\beta_j} \right)
   \frac{\exp\{\beta_j x_i + \beta_{0j}\}}{1 + \sum_{l=1}^{K-1} \exp\{\beta_l x_i + \beta_{0l}\}},
\qquad
\frac{d\beta_{0j}}{d\beta_j}
= \frac{x_j \exp\{-\beta_j x_j\} - x_{j-1} \exp\{-\beta_j x_{j-1}\}}{\exp\{-\beta_j x_{j-1}\} - \exp\{-\beta_j x_j\}}.
\]
Note that the $d\beta_{0j}/d\beta_j$ term above is a constant in the sums over i. Therefore, setting the above equal to zero for each j = 1, . . . , K − 1 and solving for β_j shows, in a similar fashion to the two-class case, that the maximum-likelihood estimates satisfy β_j → ∞.
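As a rough numerical check of the multi-class case, the sketch below (assuming scikit-learn and NumPy are available; it fits a generic multinomial logistic regression and does not impose the intercept constraints used above) relaxes the ridge penalty on three classes separated at x1 = 1 and x2 = 2, as in the figure example, and shows the fitted slopes growing without bound.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three 1-D classes separated at x1 = 1 and x2 = 2 (illustrative sample).
x = np.array([0.2, 0.5, 0.8, 1.2, 1.5, 1.8, 2.2, 2.5, 2.8]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Larger C means a weaker penalty, standing in for the unpenalized MLE.
for C in [1e0, 1e2, 1e4]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(x, y)
    print(f"C = {C:8.0e}   largest |slope| = {np.abs(clf.coef_).max():8.2f}")
# The slopes grow as the penalty vanishes, mirroring beta_j -> infinity.
```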
(a) Now, suppose that there are two classes and that x ∈ R^p. Suppose that $\vec{x}_1$ and $\vec{x}_2$ are two vectors that lie in the separating hyperplane. Then $\vec{\beta}^{\,T}(\vec{x}_1 - \vec{x}_2) = 0$. Now that we are back in the two-class case,
\[
p(\vec{x}; \vec{\beta}) = \frac{\exp\{\vec{\beta}^{\,T}\vec{x} + \beta_0\}}{1 + \exp\{\vec{\beta}^{\,T}\vec{x} + \beta_0\}}
\]
so that
\[
1 - p(\vec{x}; \vec{\beta}) = \frac{1}{1 + \exp\{\vec{\beta}^{\,T}\vec{x} + \beta_0\}}.
\]
Note that if $\vec{x}_0$ lies in the separating hyperplane, then $-\beta_0 = \vec{\beta}^{\,T}\vec{x}_0$. Therefore,
\[
p(\vec{x}; \vec{\beta}) = \frac{\exp\{\vec{\beta}^{\,T}(\vec{x} - \vec{x}_0)\}}{1 + \exp\{\vec{\beta}^{\,T}(\vec{x} - \vec{x}_0)\}}
\]
so that
\[
1 - p(\vec{x}; \vec{\beta}) = \frac{1}{1 + \exp\{\vec{\beta}^{\,T}(\vec{x} - \vec{x}_0)\}}.
\]
Finally, note that at this point the situation is analogous to the univariate case: once the derivatives of the log-likelihood function with respect to $\vec{\beta} = (\beta_1, \ldots, \beta_p)$ are taken and set equal to zero, the score functions reduce in the same manner, and it follows that the maximum-likelihood estimator satisfies $\|\vec{\beta}\| \to \infty$.
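The same behaviour is easy to see numerically in R^p. Below is a minimal sketch (NumPy assumed; the separating direction and data are invented for illustration) running gradient ascent on the multivariate log-likelihood for linearly separated data in R^2; the norm of the coefficient vector grows without bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes in R^2 separated by the hyperplane x1 + x2 = 0 through the origin,
# keeping only points with a clear margin (illustrative construction).
X = rng.normal(size=(200, 2))
m = X[:, 0] + X[:, 1]
X, y = X[np.abs(m) > 0.5], (m[np.abs(m) > 0.5] > 0).astype(float)

beta = np.zeros(2)   # no intercept needed: the origin lies on the boundary
lr = 0.1

for it in range(1, 30001):
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x_i; beta)
    beta += lr * (X.T @ (y - p))          # gradient of the log-likelihood
    if it % 10000 == 0:
        print(f"iteration {it:6d}   ||beta|| = {np.linalg.norm(beta):7.2f}")
# ||beta|| keeps increasing; the score is never exactly zero for finite beta.
```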
Problem 4.6
February 8, 2010
Part A
Given
\[
f(x) = \beta_1^T x + \beta_0 = 0, \qquad f(x) = \beta^T x^*,
\]
where $x^* = [x^T \; 1]^T$ and $\beta = [\beta_1^T \; \beta_0]^T$.
Because the data are separable,
\[
y_i \beta_{sep}^T z_i \ge M \quad \text{for all } i,
\]
where $z_i = x_i^*/\|x_i^*\|$ and $M > 0$. We can then choose the rescaled separating hyperplane
\[
\beta^* = \frac{\beta_{sep}}{M},
\]
and therefore
\[
y_i \beta^{*T} z_i \ge 1 \quad \forall i.
\]
Part B
Given
\[
\beta_{new} = \beta_{old} + y_i z_i,
\]
then
\begin{align*}
\|\beta_{old} - \beta^*\|^2 &= \|\beta_{new} - \beta^* - y_i z_i\|^2\\
&= \|\beta_{new} - \beta^*\|^2 - 2\,y_i z_i^T \beta_{new} + 2\,y_i z_i^T \beta^* + \|y_i z_i\|^2\\
&= \|\beta_{new} - \beta^*\|^2 - 2\,y_i z_i^T \beta_{new} + 2\,y_i z_i^T \beta^* + 1,
\end{align*}
since $\|y_i z_i\| = \|z_i\| = 1$. From Part A,
\[
y_i \beta^{*T} z_i \ge 1,
\]
and because the update is only applied to a point that $\beta_{old}$ misclassifies (so that $y_i \beta_{old}^T z_i \le 0$),
\[
y_i \beta_{new}^T z_i = y_i \beta_{old}^T z_i + y_i^2 z_i^T z_i \le 1.
\]
Therefore
\[
y_i \beta^{*T} z_i - y_i \beta_{new}^T z_i \ge 0
\]
must be true, so
\[
\|\beta_{old} - \beta^*\|^2 \ge \|\beta_{new} - \beta^*\|^2 + 1.
\]
The squared distance from $\beta$ to $\beta^*$ therefore decreases by at least 1 at every update, so the perceptron algorithm must converge in at most $\|\beta_{start} - \beta^*\|^2$ steps, where $\beta_{start}$ is the starting value of $\beta$.
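A small simulation can be used to sanity-check this bound. The sketch below (NumPy assumed; the data set and the true separating direction are invented for illustration) runs the update rule above on normalized, augmented, linearly separable points and counts how many updates are made before every point is classified correctly; the count is always finite, consistent with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data in R^2 with labels in {-1, +1} and a clear margin.
X = rng.normal(size=(200, 2))
m = X[:, 0] + 0.5 * X[:, 1]
X, y = X[np.abs(m) > 0.3], np.where(m[np.abs(m) > 0.3] > 0, 1, -1)

# Augment with a constant 1 (x* = [x^T 1]^T) and normalize: z_i = x_i*/||x_i*||.
Xs = np.hstack([X, np.ones((len(X), 1))])
Z = Xs / np.linalg.norm(Xs, axis=1, keepdims=True)

beta = np.zeros(3)      # beta_start = 0
updates = 0
changed = True
while changed:          # sweep until no point is misclassified
    changed = False
    for zi, yi in zip(Z, y):
        if yi * (beta @ zi) <= 0:   # misclassified: apply beta_new = beta_old + y_i z_i
            beta += yi * zi
            updates += 1
            changed = True
print("separating hyperplane found after", updates, "updates")
```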