Regression for Proportion Data
Julian Center, Creative Research Corp., Andover, MA, USA
MaxEnt 2007, July 10, 2007

Overview
- Introduction: What is proportion data? What do we mean by regression? Examples. Why should you care?
- Coordinate transformation to facilitate regression
- Measurement models: multinomial; Laplace approximation to the multinomial; log-normal
- Regression models: kernel regression (Nadaraya-Watson model); Gaussian process regression with log-normal measurements; with multinomial measurements (Expectation Propagation)
- Conclusion

What is Proportion Data?
- Proportion data = compositional data, not categorical data.
- Proportion data: a (d+1)-dimensional vector r of the relative proportions of items assigned to each of d+1 categories, similar to a discrete probability distribution.
- In mathematical terms, r is confined to the d-simplex

    S = { r ∈ R_+^{d+1} : 1_{d+1}^T r = 1 },

  where 1_{d+1} is the (d+1)-dimensional vector of all ones, i.e. [1_{d+1}]_i = 1.

What is Regression?
- Regression = smoothing + calibration + interpolation.
- Relates data gathered under one set of conditions to data gathered under similar, but different, conditions.
- Accounts for measurement "noise".
- Determines p(r|x).

Examples
- Geostatistics: composition of rock samples at different locations.
- Medicine: response to different levels of treatment.
- Political science: opinion polls across different demographic groups.
- Climate research: infer climate history from fossil pollen samples; calibrate the model using present-day samples from known climates; typically, examine 400 pollen grains per sample and sort them into 14 categories.

Why Should You Care?
- Either you have proportion data to analyze,
- or you want to do pattern classification,
- or you want to apply a similar approach to your own problem:
  - Transform constrained variables so that a Laplace approximation makes sense.
  - Two different regression techniques.
  - Expectation Propagation for improving model fit.

Coordinate Transformation
- Well-known regression methods can't deal with the pesky constraints of the simplex.
- We need a one-to-one mapping between the d-simplex and d-dimensional real vectors.
- Then we can model probability distributions on real vectors and relate them to distributions on the simplex.

We can establish a one-to-one mapping between S and R^d by the pair

  Symmetric softmax activation function, sm : R^d → S:
    sm(f) = exp(T^T f) / (1_{d+1}^T exp(T^T f)),

  Centered log-ratio link function, clr : S → R^d:
    clr(y) = T ln(y),

where T is a d × (d+1) matrix that satisfies

  T T^T = I_d,    T 1_{d+1} = 0,    T^T T + (1/(d+1)) 1_{d+1} 1_{d+1}^T = I_{d+1}.

The rows of T span the orthogonal complement of 1_{d+1}. We can always find such a T by the Gram-Schmidt process.

[Figure: the 1-simplex in (y_1, y_2) coordinates and its image under ln; the softmax is insensitive to displacements of f along the all-ones direction.]

Measurement Models
- Multinomial
- Log-normal

Measurement Model - Multinomial
Assume that the proportion vector r comes from S independent samples from the discrete probability distribution represented by the vector y:

  p(r | y) = M(r | y, S),
  M(r | y, S) = S! / ∏_i (S [r]_i)! · ∏_i ([y]_i)^{S [r]_i}.

To get the likelihood function for f = clr(y), we take into account the Jacobian of the transformation, ∏_i [y]_i. The log-likelihood function corresponding to f is

  l(f) = (S + d + 1) r̄^T ln(y(f)),    where  r̄ = (S r + 1_{d+1}) / (S + d + 1).

[Figure: binomial likelihood functions on f for S = 400 and r_1 = 0, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5.]

Measurement Model - Laplace Approximation
Some regression methods assume a Gaussian measurement model.
Therefore, we are tempted to approximate each multinomial measurement with a Gaussian measurement. Let's try a Laplace approximation to each measurement.

Laplace approximation:
- Find the peak of the log-likelihood function.
- Pick a Gaussian centered at the peak, with a covariance matrix that matches the negative second derivative of the log-likelihood function at the peak.
- Pick an amplitude factor to match the height of the peak.

The value of f that maximizes the log-likelihood is

  m = T ln(r̄).

The Laplace approximation to a single measurement is

  q(f) = a N(f | m, V) = a |2πV|^{-1/2} exp[ -(1/2) (f - m)^T V^{-1} (f - m) ],

where

  a = S! / ∏_i (S [r]_i)! · exp[l(m)] · |2πV|^{1/2},
  V^{-1} = (S + d + 1) T [ Diag(r̄) - r̄ r̄^T ] T^T.

[Figures: Laplace approximation vs. the exact multinomial likelihood of f for r_1 = 0/400, 1/400, 2/400, 4/400, 80/400, and 120/400.]

Measurement Model - Log-Normal
- General log-normal model form: q(f) = a N(f | m, V).
- Can match the Laplace approximation to the multinomial.
- Can do much more, e.g. over-dispersion or under-dispersion.
- Basis for regression methods.
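The transformation and the Laplace approximation above can be sketched in a few lines of code. This is a minimal illustration assuming NumPy; the function names (`make_T`, `laplace_multinomial`) are mine, not from the talk, and `make_T` uses a QR factorization as a convenient stand-in for the Gram-Schmidt process.

```python
import numpy as np

def make_T(d):
    """Rows of T form an orthonormal basis for the orthogonal complement
    of the all-ones vector in R^(d+1) (Gram-Schmidt, done here via QR)."""
    ones = np.ones((d + 1, 1)) / np.sqrt(d + 1)
    # Complete the ones direction to an orthonormal basis, then drop it.
    q, _ = np.linalg.qr(np.hstack([ones, np.eye(d + 1)[:, :d]]))
    return q[:, 1:].T  # shape (d, d+1); T @ T.T = I, T @ 1 = 0

def clr(y, T):
    """Centered log-ratio link: simplex -> R^d."""
    return T @ np.log(y)

def sm(f, T):
    """Symmetric softmax: R^d -> simplex (inverse of clr)."""
    e = np.exp(T.T @ f)
    return e / e.sum()

def laplace_multinomial(counts, T):
    """Laplace approximation N(f | m, V) to one multinomial measurement:
    with r_bar = (S r + 1) / (S + d + 1), the mode is m = T ln(r_bar) and
    V^{-1} = (S + d + 1) T (Diag(r_bar) - r_bar r_bar^T) T^T."""
    counts = np.asarray(counts, dtype=float)  # S * r, the raw counts
    d = len(counts) - 1
    S = counts.sum()
    r_bar = (counts + 1.0) / (S + d + 1.0)
    m = T @ np.log(r_bar)
    Vinv = (S + d + 1.0) * T @ (np.diag(r_bar) - np.outer(r_bar, r_bar)) @ T.T
    return m, np.linalg.inv(Vinv)
```

Because T^T T = I - (1/(d+1)) 1 1^T merely centers the log-vector, sm(clr(y)) recovers y exactly for any y on the simplex.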
Regression Models
- A way of relating data taken under different conditions.
- Intuition: similar conditions should produce similar data.
- The best method to use depends on the problem.
- Two methods considered here: the Nadaraya-Watson model and the Gaussian process model.

Nadaraya-Watson Model
Based on applying Parzen density estimation to the joint distribution of f and x.

General form:

  p(f, x) = (1/N) ∑_{i=1}^N p(f, x | i).

Simplified model:

  p(f, x | i) = N(f | f̂_i, B_i) N(x | x_i, D_i).

[Figures: all data points plotted as f vs. x; the Nadaraya-Watson mixture components centered on the data points.]

This model implies that

  p(x) = (1/N) ∑_{i=1}^N N(x | x_i, D_i),
  p(f | x) = p(f, x) / p(x) = ∑_{i=1}^N w_i(x) N(f | f̂_i, B_i),

where

  w_i(x) = N(x | x_i, D_i) / ∑_{j=1}^N N(x | x_j, D_j).

To determine the distribution for a new measurement, we compute

  p(r | x) = ∫ p(r | f) p(f | x) df = ∑_{i=1}^N w_i(x) ∫ p(r | f) N(f | f̂_i, B_i) df.

If we use the Laplace approximation to the multinomial, we can solve the integrals analytically to get

  p(r | x) = ∑_{i=1}^N w_i(x) a N(m | f̂_i, B_i + V),

where a, m, and V are computed from r as described above. Otherwise, we can use stochastic integration to compute the integrals.

Problem: we must compare a new point to every training point.
Solution: choose a sparse set of "knots", and center density components only on the knots. Adjust the weights and covariances by "diagnostic training". Mixture-model training tools apply.

[Figure: sparse Nadaraya-Watson model with components centered on the knots.]

Gaussian Process Model
- A probability distribution on functions.
- Specified by a mean function m(x) and a covariance kernel k(x1, x2).
- For any finite collection of points, the corresponding function values are jointly Gaussian.
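The Nadaraya-Watson machinery above can be sketched concretely. This is a minimal sketch assuming NumPy, scalar inputs x, and a single shared bandwidth D; the talk's full model allows per-component covariances D_i and B_i, and the function names here are mine.

```python
import numpy as np

def nw_weights(x, xs, D):
    """Responsibilities w_i(x) = N(x | x_i, D) / sum_j N(x | x_j, D).
    With a shared bandwidth D the Gaussian normalizers cancel in the ratio."""
    log_k = -0.5 * (x - xs) ** 2 / D
    k = np.exp(log_k - log_k.max())  # subtract max for numerical stability
    return k / k.sum()

def nw_predictive_mean(x, xs, fs, D):
    """Mean of the mixture p(f | x) = sum_i w_i(x) N(f | f_i, B_i),
    i.e. the weight-averaged component means."""
    return nw_weights(x, xs, D) @ fs
```

Note that the predictive p(f | x) itself stays a mixture, so it can be multimodal; only its mean is returned here.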
[Figure: sample functions f(x) drawn from a Gaussian process.]

Applying Gaussian Process Regression to Proportion Data
- Prior: model each component of f(x) as a zero-mean Gaussian process with covariance kernel k(x1, x2). Assume that the components of f are independent of each other.
- Posterior: use the Laplace approximations to the measurements and apply Kalman-filter methods.
- Use Expectation Propagation to improve the fit.

Sparse Gaussian Process Model
Choose a subset of K training points to act as knots. Arrange the latent function values at the knots in one large vector g, stacking each component's values at all the knots:

  [g]_{(j-1)K + k} = [f(x_k)]_j,    j ∈ {1, 2, ..., d},  k ∈ {1, 2, ..., K}.

Under our assumptions, the prior is

  p(g) = N(g | 0, G),

where G is block-diagonal with one kernel matrix per latent component,

  G = I_d ⊗ C,    [C]_{k,l} = k(x_k, x_l),    k, l ∈ {1, 2, ..., K}.

For a general input x,

  p(f(x) | g) = N(f(x) | H(x) g, v(x) I_d),

where

  H(x) = I_d ⊗ (k(x)^T C^{-1}),
  v(x) = k(x, x) - k(x)^T C^{-1} k(x),
  [k(x)]_k = k(x, x_k),    k ∈ {1, 2, ..., K}.

We can express this by the equation

  f(x) = H(x) g + u(x),

where u(x) ~ N(0, v(x) I_d) and u(x) is independent of g.

In particular, the values of the latent function at the training points can be expressed as

  f_n = H_n g + u_n,

where H_n = H(x_n) and u_n = u(x_n). To simplify computations, we assume that u_n is independent of u_m for n ≠ m. Note that if x_n is one of the knots, i.e. n ≤ K, then u_n = 0 and H_n is a d × Kd sparse matrix that simply selects the appropriate elements of g.

GP - Log-Normal Model
Using the log-normal measurement model,

  p(r_n | g) = ∫ a_n N(f | m_n, V_n) N(f | H_n g, v_n I_d) df = a_n N(m_n | H_n g, R_n),

where R_n = V_n + v_n I_d. Thus everything is Gaussian, and therefore (writing T for the set of training measurements)

  p(g | T) = N(g | ĝ, P).

We can determine ĝ and P by the Kalman filter algorithm:

  (1) Start with ĝ ← 0, P ← G.
  (2) For n = 1 to N, iterate
        K ← P H_n^T (H_n P H_n^T + R_n)^{-1},
        ĝ ← ĝ + K (m_n - H_n ĝ),
        P ← P - K H_n P.

If we believe that the log-normal measurement model is correct, then we are finished after one pass through all the training data.

We can compute the evidence by

  E = p(T) = [ ∏_n a_n N(0 | m_n, R_n) ] N(0 | 0, G) [ N(0 | ĝ, P) ]^{-1}.

We can determine the probability distribution of seeing a new measurement r at x by

  p(r | x, T) = a N( m | H(x) ĝ, V + v(x) I_d + H(x) P H(x)^T ).

GP - Multinomial Model
If we believe that the measurement model is really multinomial, we can get a more accurate approximation using the Expectation Propagation (EP) algorithm. As before, we approximate the joint distribution p(r_1, r_2, ..., r_N, g) by the form

  q(g) = ∏_n a_n N(H_n g | m_n, R_n) N(g | 0, G).

Now our aim is to adjust the a_n's, m_n's, and R_n's to minimize the Kullback-Leibler divergence

  KL(p || q) = -∫ p(g) ln( q(g) / p(g) ) dg.

Expectation Propagation Method
To minimize KL(p || q), we iteratively choose a measurement n and minimize KL(p* || q*), where

  p*(g) = p(r_n | g) q(g) / [ a_n N(H_n g | m_n, R_n) ],
  q*(g) = a* N(H_n g | m*, R*) q(g) / [ a_n N(H_n g | m_n, R_n) ].

We can accomplish this by choosing a*, m*, and R* so that the moments of q*(g) match those of p*(g).
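The Kalman-filter sweep used for the GP log-normal model can be sketched as follows. This is a minimal NumPy version under the stated assumptions (measurements m_n = H_n g + noise with noise ~ N(0, R_n), prior g ~ N(0, G)); the function name is mine, and explicit matrix inversion is used for clarity rather than efficiency.

```python
import numpy as np

def gp_kalman_pass(G, H_list, m_list, R_list):
    """One Kalman-filter pass over Gaussianized measurements.

    Prior: g ~ N(0, G).  Each measurement: m_n = H_n g + e_n, e_n ~ N(0, R_n).
    Returns the posterior mean g_hat and covariance P of p(g | all m_n).
    """
    g_hat = np.zeros(G.shape[0])
    P = G.copy()
    for H, m, R in zip(H_list, m_list, R_list):
        S = H @ P @ H.T + R                  # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        g_hat = g_hat + K @ (m - H @ g_hat)  # measurement update of the mean
        P = P - K @ H @ P                    # covariance update
    return g_hat, P
```

As the talk notes, a single pass suffices if the log-normal (Gaussian) measurement model is taken at face value; the EP iterations below refine the per-measurement parameters (a_n, m_n, R_n) when the true model is multinomial.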
To approximate the moments, we compute, with samples h^(j) ~ N(H_n ĝ, H_n P H_n^T),

  a* ≈ (1/J) ∑_{j=1}^J p(r_n | h^(j)) / N(h^(j) | m_n, R_n),
  ĥ ≈ (1/(J a*)) ∑_{j=1}^J h^(j) p(r_n | h^(j)) / N(h^(j) | m_n, R_n),
  W ≈ (1/(J a*)) ∑_{j=1}^J h^(j) h^(j)T p(r_n | h^(j)) / N(h^(j) | m_n, R_n) - ĥ ĥ^T.

To get q* to have the same moments as p*, we choose

  R*^{-1} = R_n^{-1} + W^{-1} - (H_n P H_n^T)^{-1},
  m* = R* [ R_n^{-1} m_n + W^{-1} ĥ - (H_n P H_n^T)^{-1} H_n ĝ ].

If n is one of the knots,

  p(r_n | h) = M(r_n | h).

Otherwise, we approximate it by

  p(r_n | h) = ∫ M(r_n | h + u) N(u | 0, v_n I) du
             ≈ (1/L) ∑_{l=1}^L M(r_n | h + u^(l)),    u^(l) ~ N(0, v_n I).

Now we can update the smoother parameters. If R*^{-1} = R_n^{-1}, then the error covariance P does not change, and we update the estimate of g by

  ĝ ← ĝ + P H_n^T R_n^{-1} (m* - m_n).

Otherwise, we use

  R_Δ ← (R*^{-1} - R_n^{-1})^{-1},
  K ← P H_n^T (H_n P H_n^T + R_Δ)^{-1},
  P ← P - K H_n P,
  ĝ ← ĝ + K [ R_Δ (R*^{-1} m* - R_n^{-1} m_n) - H_n ĝ ].

Finally, we replace the parameters for measurement n,

  a_n ← a*,    m_n ← m*,    R_n ← R*,

and go to the next iteration.

Choosing the Regression Model
If you have two samplings taken under the same conditions, do you want to treat them as coming from a bimodal distribution (NW model) or combine them into one big sampling (GP model)?

[Figure: p(f) for r_1 = 4/400 and r_1 = 40/400 treated separately, versus the two combined as r_1 = 44/800.]

Conclusion
- A coordinate transformation makes it possible to analyze proportion data with known regression methods.
- The multinomial distribution can be well approximated by a Gaussian on the transformed variable.
- The choice of regression model depends on the effect that you want: a multimodal versus a unimodal fit.