Regression for Proportion Data

Julian Center
Creative Research Corp., Andover, MA, USA
MaxEnt2007, July 10, 2007
Overview

- Introduction
  - What is proportion data?
  - What do we mean by regression?
  - Examples
  - Why should you care?
- Coordinate Transformation to Facilitate Regression
- Measurement Models
  - Multinomial
  - Laplace Approximation to Multinomial
  - Log-Normal
- Regression Models
  - Kernel Regression (Nadaraya-Watson Model)
  - Gaussian Process Regression
    - With Log-Normal Measurements
    - With Multinomial Measurements – Expectation Propagation
- Conclusion
What is Proportion Data?

- Proportion data = compositional data ⊂ categorical data.
- Proportion data = a (d+1)-dimensional vector r of relative proportions of items assigned to one of d+1 categories. Similar to a discrete probability distribution.
- In mathematical terms, r is confined to the d-simplex,

  S_d = \{ r \in \mathbb{R}_+^{d+1} : \mathbf{1}_{(d+1)}^T r = 1 \}

  Here \mathbf{1}_{(d+1)} is the (d+1)-dimensional vector of all ones, i.e. [\mathbf{1}_{(d+1)}]_i = 1 for all i.
What is Regression?

- Regression = Smoothing + Calibration + Interpolation.
- Relates data gathered under one set of conditions to data gathered under similar, but different, conditions.
- Accounts for measurement "noise".
- Determines p(r|x).
Examples

- Geostatistics: composition of rock samples at different locations.
- Medicine: response to different levels of treatment.
- Political Science: opinion polls across different demographic groups.
- Climate Research:
  - Infer climate history from fossil pollen samples.
  - Calibrate the model using present-day samples from known climates.
  - Typically, examine 400 pollen grains and sort them into 14 categories.
Why Should You Care?

- Either you have proportion data to analyze,
- or you want to do pattern classification,
- or you want to use a similar approach to your own problem:
  - Transform constrained variables so that a Laplace approximation makes sense.
  - Two different regression techniques.
  - Expectation Propagation for improving model fit.
Coordinate Transformation

- Well-known regression methods can't deal with the pesky constraints of the simplex.
- We need a one-to-one mapping between the d-simplex and d-dimensional real vectors.
- Then we can model probability distributions on real vectors and relate them to distributions on the simplex.
Coordinate Transformation

We can establish a one-to-one mapping between S_d and \mathbb{R}^d by the
symmetric softmax activation function and the centered log-ratio link function:

  sm : \mathbb{R}^d \to S_d, \qquad sm(f) = \left[ \mathbf{1}_{(d+1)}^T \exp(T^T f) \right]^{-1} \exp(T^T f)

  clr : S_d \to \mathbb{R}^d, \qquad clr(y) = T \ln(y)

where T is a d \times (d+1) matrix that satisfies

  T T^T = I_d, \qquad T \mathbf{1}_{(d+1)} = 0, \qquad T^T T + \frac{1}{d+1} \mathbf{1}_{(d+1)} \mathbf{1}_{(d+1)}^T = I_{(d+1)}

The rows of T span the orthogonal complement of \mathbf{1}_{(d+1)}. We can always find T by the Gram-Schmidt process.
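As a concrete illustration, here is a minimal sketch of the transformation pair in Python/NumPy (the function names are mine, not from the talk). T is built by QR orthogonalization, which is equivalent to the Gram-Schmidt process named above:

```python
import numpy as np

def make_T(d):
    """Build a d x (d+1) matrix T whose rows form an orthonormal basis
    of the orthogonal complement of the all-ones vector, via QR
    (equivalent to Gram-Schmidt)."""
    ones = np.ones((d + 1, 1))
    # Orthonormalize [1 | e_1 ... e_d]; the first column spans direction 1.
    Q, _ = np.linalg.qr(np.hstack([ones, np.eye(d + 1)[:, :d]]))
    return Q[:, 1:].T  # remaining columns span the complement of 1

def sm(f, T):
    """Symmetric softmax sm : R^d -> S_d."""
    e = np.exp(T.T @ f)
    return e / e.sum()

def clr(y, T):
    """Centered log-ratio link clr : S_d -> R^d."""
    return T @ np.log(y)
```

Because T annihilates the all-ones direction, clr(sm(f)) = f exactly, which is the one-to-one property the slide relies on.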
Coordinate Transformation

[Figure: the simplex in (y_1, y_2) coordinates and its image under ln; the softmax is insensitive to displacements of f along the all-ones direction.]
Measurement Models

- Multinomial
- Log-Normal
Measurement Model – Multinomial

Assume that the proportion vector r_i comes from S_i independent samples
from the discrete probability distribution represented by the vector y_i:

  p(r_i | y_i) = M_{S_i}(r_i | y_i) \triangleq \frac{S_i!}{\prod_k (S_i [r_i]_k)!} \prod_k ([y_i]_k)^{S_i [r_i]_k}

To get the likelihood function for f_i = clr(y_i), we take into account
the Jacobian of the transformation, \prod_k [y_i]_k.

The log-likelihood function corresponding to f_i is

  \ell_i(f_i) = (S_i + d + 1)\, \bar{r}_i^T \ln(y_i), \qquad \bar{r}_i = \frac{S_i\, r_i + \mathbf{1}_{(d+1)}}{S_i + d + 1}
Multinomial Measurement Model

[Figure: binomial likelihood functions in f for S = 400 and r_1 = 0, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5.]
Measurement Model – Laplace Approximation

- Some regression methods assume a Gaussian measurement model.
- Therefore, we are tempted to approximate each multinomial measurement with a Gaussian measurement.
- Let's try a Laplace approximation to each measurement.
- Laplace approximation:
  - Find the peak of the log-likelihood function.
  - Pick a Gaussian centered at the peak, with inverse covariance matrix equal to the negative second derivative of the log-likelihood function at the peak.
  - Pick an amplitude factor to match the height of the peak.
Measurement Model – Laplace Approximation

The value of f_i that maximizes the log-likelihood is

  m_i = T \ln(\bar{r}_i)

The Laplace approximation to a single measurement is

  \hat{\ell}_i(f_i) = \alpha_i\, N(f_i | m_i, V_i)

  N(f | m, V) = |2\pi V|^{-1/2} \exp\left[ -\tfrac{1}{2} (f - m)^T V^{-1} (f - m) \right]

where

  \alpha_i = \frac{S_i!}{\prod_k (S_i [r_i]_k)!}\, \exp[\ell_i(m_i)]\, |2\pi V_i|^{1/2}

  V_i^{-1} = (S_i + d + 1)\, T \left[ \mathrm{Diag}(\bar{r}_i) - \bar{r}_i \bar{r}_i^T \right] T^T
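The mode and covariance above can be computed directly from raw counts. A minimal sketch in Python/NumPy (the function name and interface are my illustrative choices), assuming T satisfies the conditions given earlier:

```python
import numpy as np

def laplace_approx(counts, T):
    """Laplace approximation (m_i, V_i) to one multinomial measurement
    in the clr coordinates f = T ln(y).

    counts : length-(d+1) vector of category counts, i.e. S_i * r_i.
    T      : d x (d+1) matrix with T T^T = I and T 1 = 0.
    """
    counts = np.asarray(counts, dtype=float)
    S = counts.sum()
    d = T.shape[0]
    # Smoothed proportions r_bar = (S r + 1) / (S + d + 1)
    r_bar = (counts + 1.0) / (S + d + 1.0)
    m = T @ np.log(r_bar)  # mode of the log-likelihood
    V_inv = (S + d + 1.0) * T @ (np.diag(r_bar) - np.outer(r_bar, r_bar)) @ T.T
    return m, np.linalg.inv(V_inv)
```

Note that the +1 smoothing in r̄ keeps the mode finite even when a category has zero counts, which is exactly the r_1 = 0/400 case shown in the figures that follow.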
Laplace Approximation to Multinomial

[Figures: p(f) for the Laplace approximation vs. the exact multinomial likelihood, for r_1 = 0/400, 1/400, 2/400, 4/400, 80/400, and 120/400.]
Measurement Model – Log-Normal

- General log-normal model form: \hat{\ell}_i(f_i) = \alpha_i\, N(f_i | m_i, V_i)
- Can match the Laplace approximation to the multinomial.
- Can do much more, e.g. over-dispersion or under-dispersion.
- Basis for regression methods.
Regression Models

- A way of relating data taken under different conditions.
- Intuition: similar conditions should produce similar data.
- The best method to use depends on the problem.
- Two methods considered here:
  - Nadaraya-Watson model.
  - Gaussian Process model.
Nadaraya-Watson Model

Based on applying Parzen density estimation to the joint distribution of f and x.

General form:

  p(f, x) = \frac{1}{n} \sum_{i=1}^{n} p(f, x | i)

Simplified model:

  p(f, x | i) = N\!\left( f \,\middle|\, \hat{f}_i, B_i \right) N(x | x_i, D_i)
All Data Points

[Figure: training data points in the (x, f) plane.]
Nadaraya-Watson Model

[Figure: Gaussian mixture components centered on the data points in the (x, f) plane.]
Nadaraya-Watson Model

This model implies that

  p(x) = \frac{1}{n} \sum_{i=1}^{n} p(x | i), \qquad p(x | i) = N(x | x_i, D_i)

  p(f | x) = \frac{p(f, x)}{p(x)} = \sum_{i=1}^{n} w_i(x)\, N\!\left( f \,\middle|\, \hat{f}_i, B_i \right)

where

  w_i(x) = \frac{p(x | i)}{n\, p(x)}
Nadaraya-Watson Model

To determine the distribution for a new measurement, we compute

  p(r | x) = \int p(r | f)\, p(f | x)\, df = \sum_{i=1}^{n} w_i(x) \int p(r | f)\, N\!\left( f \,\middle|\, \hat{f}_i, B_i \right) df

If we use the Laplace approximation to the multinomial, we can solve the
integrals analytically to get

  p(r | x) = \alpha \sum_{i=1}^{n} w_i(x)\, N\!\left( m \,\middle|\, \hat{f}_i, B_i + V \right)

where m and V are computed from r as described above. Otherwise, we can
use stochastic integration to compute the integrals.
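A minimal numerical sketch of this mixture predictive in Python/NumPy (all names are illustrative assumptions; scalar x and f for brevity, and the overall amplitude factor α dropped):

```python
import numpy as np

def normal_pdf(z, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def nw_predict(x, m, V, xs, Ds, f_hats, Bs):
    """Nadaraya-Watson predictive density (up to the amplitude factor):
    p(r|x) ~ sum_i w_i(x) N(m | f_hat_i, B_i + V),
    with weights w_i(x) proportional to N(x | x_i, D_i)."""
    w = normal_pdf(x, np.asarray(xs), np.asarray(Ds))
    w = w / w.sum()  # normalized kernel weights w_i(x)
    return np.sum(w * normal_pdf(m, np.asarray(f_hats), np.asarray(Bs) + V))
```

With the full d-dimensional model these become multivariate Gaussian densities, but the mixture structure is identical.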
Nadaraya-Watson Model

- Problem: we must compare a new point to every training point.
- Solution:
  - Choose a sparse set of "knots", and center density components only on the knots.
  - Adjust weights and covariances by "diagnostic training".
  - Mixture-model training tools apply.
Sparse Nadaraya-Watson Model

[Figure: mixture components centered only on the knots in the (x, f) plane.]
Gaussian Process Model

- A probability distribution on functions.
- Specified by a mean function m(x) and covariance kernel k(x1, x2).
- For any finite collection of points, the corresponding function values are jointly Gaussian.
Gaussian Process Model

[Figure: sample functions f(x) drawn from a Gaussian process.]
Applying Gaussian Process Regression to Proportion Data

- Prior: model each component of f(x) as a zero-mean Gaussian process with covariance kernel k(x1, x2). Assume that the components of f are independent of each other.
- Posterior: use the Laplace approximations to the measurements and apply Kalman filter methods.
- Use Expectation Propagation to improve the fit.
Sparse Gaussian Process Model

Choose a subset of K training points to act as knots. Arrange the latent
function values at the knots in one large vector g:

  [g]_{(k-1)K + j} \triangleq [f(x_j)]_k, \qquad j \in \{1, 2, \ldots, K\}, \quad k \in \{1, 2, \ldots, d\}

i.e., g stacks the rows of the d \times K array

  \begin{bmatrix}
  [f(x_1)]_1 & [f(x_2)]_1 & \cdots & [f(x_K)]_1 \\
  [f(x_1)]_2 & [f(x_2)]_2 & \cdots & [f(x_K)]_2 \\
  \vdots     & \vdots     &        & \vdots     \\
  [f(x_1)]_d & [f(x_2)]_d & \cdots & [f(x_K)]_d
  \end{bmatrix}
Sparse Gaussian Process Model

Under our assumptions, the prior is p(g) = N(g | 0, G), where

  G \triangleq I_d \otimes C =
  \begin{bmatrix}
  C      & 0      & \cdots & 0      \\
  0      & C      & \cdots & 0      \\
  \vdots &        & \ddots & \vdots \\
  0      & 0      & \cdots & C
  \end{bmatrix}

  [C]_{jl} \triangleq k(x_j, x_l), \qquad j, l \in \{1, 2, \ldots, K\}
Sparse Gaussian Process Model
( f ( x ) j g) = N [f ( x ) j H ( x ) g ( x ) I ]
where
h
H (x ) ,
 (x ) ,
[ k ( x )] ,
I -
k
i

¡
1
(x) C

( x x ) ¡ k ( x )  C ¡ 1k ( x )

( x x 
) 2 f 12¢¢¢ g
We can express this by the equation
f (x ) = H (x) g + u (x)
where u ( x ) » N [0 ( x ) I ] and u ( x ) is independent of g.
MaxEnt2007
Julian Center
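A minimal sketch of H(x) and σ²(x) in Python/NumPy (the squared-exponential kernel and all names are my illustrative choices), using the component-major ordering of g assumed above:

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential covariance kernel (an illustrative choice)."""
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)

def sparse_gp_projection(x, knots, d, kern=rbf):
    """Return H(x) = I_d kron [k(x)^T C^{-1}] and
    sigma2(x) = k(x,x) - k(x)^T C^{-1} k(x)."""
    knots = np.asarray(knots, dtype=float)
    C = kern(knots[:, None], knots[None, :])  # [C]_jl = k(x_j, x_l)
    kx = kern(x, knots)                       # [k(x)]_j = k(x, x_j)
    w = np.linalg.solve(C, kx)                # the weight row k(x)^T C^{-1}
    H = np.kron(np.eye(d), w[None, :])        # d x (d*K) observation matrix
    sigma2 = kern(x, x) - kx @ w
    return H, sigma2
```

When x coincides with a knot, σ²(x) vanishes and H(x) reduces to a selector of the corresponding elements of g, matching the remark on the next slide.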
Sparse Gaussian Process Model

In particular, the values of the latent function at the training points
can be expressed as

  f_i = H_i\, g + u_i

where H_i = H(x_i) and u_i = u(x_i). To simplify computations, we assume
that u_i is independent of u_j for i \neq j.

Note that if x_i is one of the knots, i.e., i \leq K, then u_i = 0 and
H_i is a d \times dK sparse matrix that simply selects the appropriate
elements of g.
GP – Log-Normal Model

Using the log-normal measurement model,

  p(r_i | g) = \int \alpha_i\, N(f_i | m_i, V_i)\, N(f_i | H_i g, \sigma_i^2 I)\, df_i = \alpha_i\, N(m_i | H_i g, R_i)

where R_i = V_i + \sigma_i^2 I. Thus everything is Gaussian, and therefore

  p(g | \mathcal{T}) = N(g | \hat{g}, P)
GP – Log-Normal Model

We can determine \hat{g} and P by the Kalman filter algorithm:

(1) Start with

  \hat{g} \leftarrow 0, \qquad P \leftarrow G

(2) For i = 1 to n, iterate:

  K_i \leftarrow P H_i^T \left( H_i P H_i^T + R_i \right)^{-1}

  \hat{g} \leftarrow \hat{g} + K_i \left( m_i - H_i \hat{g} \right)

  P \leftarrow P - K_i H_i P

If we believe that the log-normal measurement model is correct, then we
are finished after one pass through all the training data.
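The recursion above can be sketched as follows in Python/NumPy (names are mine). With H_i = I, a single measurement reduces to the familiar Gaussian posterior, which gives a quick sanity check:

```python
import numpy as np

def kalman_pass(G, measurements):
    """One pass of the measurement-update recursion: start at
    (g_hat, P) = (0, G), then fold in each (H_i, m_i, R_i) in turn."""
    g = np.zeros(G.shape[0])
    P = G.copy()
    for H, m, R in measurements:
        S = H @ P @ H.T + R             # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
        g = g + K @ (m - H @ g)         # update the estimate of g
        P = P - K @ H @ P               # shrink the error covariance
    return g, P
```

For a scalar prior N(0, G) and one observation m with noise R and H = I, this returns mean G m / (G + R) and variance G R / (G + R).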
GP – Log-Normal Model

We can compute the evidence by

  E = p(\mathcal{T}) = \left[ \prod_i \alpha_i\, N(0 | m_i, R_i) \right] N(0 | 0, G) \left[ N(0 | \hat{g}, P) \right]^{-1}

We can determine the probability distribution of seeing a new
measurement r at x by

  p(r | x, \mathcal{T}) = \alpha\, N\!\left[ m \,\middle|\, H(x)\, \hat{g},\; V + \sigma^2(x) I + H(x)\, P\, H^T(x) \right]
GP – Multinomial Model

If we believe that the measurement model is really multinomial, we can
get a more accurate approximation using the Expectation Propagation (EP)
algorithm.

As before, we approximate the joint distribution p(r_1, r_2, \ldots, r_n, g)
by the form

  q(g) = \left[ \prod_i \alpha_i\, N(H_i g | m_i, R_i) \right] N(g | 0, G)

Now our aim is to adjust the \alpha_i's, m_i's, and R_i's to minimize
the Kullback-Leibler divergence

  KL(p \,\|\, q) = -\int p(g) \ln\!\left( \frac{q(g)}{p(g)} \right) dg
Expectation Propagation Method

To minimize KL(p \| q), we iteratively choose a measurement i and
minimize KL(p^* \| q^*), where

  p^*(g) = \frac{p(r_i | g)}{\alpha_i\, N(H_i g | m_i, R_i)}\, q(g)

  q^*(g) = \frac{\alpha_i^*\, N(H_i g | m_i^*, R_i^*)}{\alpha_i\, N(H_i g | m_i, R_i)}\, q(g)

We can accomplish this by choosing \alpha_i^*, m_i^*, and R_i^* so that
the moments of q^*(g) match those of p^*(g).
Expectation Propagation Method

To approximate the moments, we compute the importance-sampling estimates

  \alpha^* \approx \frac{1}{J} \sum_{j=1}^{J} \frac{p\!\left( r_i \,\middle|\, h^{(j)} \right)}{N\!\left( h^{(j)} \,\middle|\, m_i, R_i \right)}

  \hat{h} \approx \frac{1}{\alpha^* J} \sum_{j=1}^{J} \frac{p\!\left( r_i \,\middle|\, h^{(j)} \right)}{N\!\left( h^{(j)} \,\middle|\, m_i, R_i \right)}\, h^{(j)}

  W \approx \frac{1}{\alpha^* J} \sum_{j=1}^{J} \frac{p\!\left( r_i \,\middle|\, h^{(j)} \right)}{N\!\left( h^{(j)} \,\middle|\, m_i, R_i \right)}\, h^{(j)} h^{(j)T} - \hat{h} \hat{h}^T

where the samples are drawn as

  h^{(j)} \sim N\!\left( H_i \hat{g},\; H_i P H_i^T \right)
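A minimal univariate sketch of these importance-sampling estimates in Python/NumPy (names and the scalar setting are my simplifications of the slide's vector formulas):

```python
import numpy as np

def ep_moments(lik, m, R, mean_h, var_h, J=20000, seed=0):
    """Importance-sampling estimates of (alpha*, h_hat, W) for one EP
    site: draw h^(j) ~ N(mean_h, var_h), weight by lik(h) / N(h | m, R).
    Scalar version for clarity."""
    rng = np.random.default_rng(seed)
    h = mean_h + np.sqrt(var_h) * rng.standard_normal(J)
    q = np.exp(-0.5 * (h - m) ** 2 / R) / np.sqrt(2 * np.pi * R)
    w = lik(h) / q                         # importance weights
    alpha = w.mean()                       # amplitude estimate
    h_hat = (w * h).sum() / w.sum()        # weighted mean
    W = (w * h ** 2).sum() / w.sum() - h_hat ** 2  # weighted variance
    return alpha, h_hat, W
```

As a sanity check, if lik is itself N(h | m_i, R_i), every weight is 1, so alpha* = 1 and (h_hat, W) recover the sampling distribution's own moments.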
Expectation Propagation Method

To get q^* to have the same moments as p^*, we choose

  R_i^{*-1} = R_i^{-1} + W^{-1} - \left( H_i P H_i^T \right)^{-1}

  m_i^* = R_i^* \left[ R_i^{-1} m_i + W^{-1} \hat{h} - \left( H_i P H_i^T \right)^{-1} H_i \hat{g} \right]
Expectation Propagation Method

If x_i is one of the knots,

  p\!\left( r_i \,\middle|\, h^{(j)} \right) = M_{S_i}\!\left( r_i \,\middle|\, h^{(j)} \right)

where M_{S_i}(r_i | h) denotes the multinomial likelihood evaluated at y = sm(h).
Otherwise, we approximate it by

  p\!\left( r_i \,\middle|\, h^{(j)} \right) = \int M_{S_i}\!\left( r_i \,\middle|\, h^{(j)} + u \right) N\!\left( u \,\middle|\, 0, \sigma_i^2 I \right) du
  \approx \frac{1}{L} \sum_{l=1}^{L} M_{S_i}\!\left( r_i \,\middle|\, h^{(j)} + u^{(l)} \right), \qquad u^{(l)} \sim N\!\left( 0, \sigma_i^2 I \right)
Expectation Propagation Method

Now we can update the smoother parameters. If R_i^{*-1} = R_i^{-1}, then
the error covariance P does not change, and we update the estimate of g by

  \hat{g} \leftarrow \hat{g} + P H_i^T R_i^{*-1} \left( m_i^* - m_i \right)

Otherwise, we use

  R_\Delta \leftarrow \left( R_i^{*-1} - R_i^{-1} \right)^{-1}

  K \leftarrow P H_i^T \left( H_i P H_i^T + R_\Delta \right)^{-1}

  P \leftarrow P - K H_i P

  \hat{g} \leftarrow \hat{g} + K \left[ R_\Delta \left( R_i^{*-1} m_i^* - R_i^{-1} m_i \right) - H_i \hat{g} \right]
Expectation Propagation Method

Finally, we replace the parameters for measurement i,

  \alpha_i \leftarrow \alpha_i^*, \qquad m_i \leftarrow m_i^*, \qquad R_i \leftarrow R_i^*

and go to the next iteration.
Choosing the Regression Model

If you have two samplings taken under the same conditions, do you want to
treat them as coming from a bimodal distribution (NW model) or combine
them into one big sampling (GP model)?

[Figure: "Combining Samples" – p(f) for r_1 = 4/400, r_1 = 40/400, and the combined sampling r_1 = 44/800.]
Conclusion

- A coordinate transformation makes it possible to analyze proportion data with known regression methods.
- The multinomial distribution can be well approximated by a Gaussian on the transformed variable.
- The choice of regression model depends on the effect that you want – multimodal vs. unimodal fit.