Combining Averages and Single Measurements in a Lognormal Model

Dr. Nagaraj K. Neerchal and Justin Newcomer
Department of Mathematics and Statistics
University of Maryland, Baltimore County
1000 Hilltop Circle, Baltimore, MD 21250

Motivating Example

Goal: to develop a protocol (methodology) for obtaining confidence bounds for the "mean emissions" for each welding process and rod type combination, incorporating all of the data.
- Three welding processes
- Three rod types
- Multiple sources of data: some report individual measurements; some report only averages, without the original observations.

The Data

Welding Process   Rod Type   NTESTS   Chromium
SMAW              E308       3        0.384
SMAW              E309       1        0.8193
SMAW              E309       1        0.8604
SMAW              E309       1        0.1995
SMAW              E309       3        0.484
SMAW              E308       1        0.9316
SMAW              E308       1        1.011
SMAW              E308       3        0.494
SMAW              E308       3        0.584
GMAW              E309       1        4.76
GMAW              E309       1        6.51
GMAW              E308       3        0.324
GMAW              E309       1        0.7285
GMAW              E308       1        0.898
GMAW              E308       1        1.3
GMAW              E308       3        0.532
FCAW              E309       1        2.42
FCAW              E309       1        2.82
FCAW              E309       1        2.86
FCAW              E308       1        1.86
FCAW              E308       1        3.04
FCAW              E308       3        0.265
FCAW              E308       3        1.205

Traditional Approaches

Assume normality?
- Sample sizes are very small for certain combinations.
- Here the bounds obtained assuming normality give meaningless results (e.g., negative bounds).

Transform the data to normality?
- In environmental studies, particularly with concentration measurements, the data most often tend to be skewed, so there is a temptation to use the lognormal model.
- It is hard to transform the confidence bounds back to the original scale (the mean of the log is not the same as the log of the mean!).

Weighted regression?
- The estimates have good properties in general (they are best linear unbiased estimates).
- But the confidence bounds are sensitive to the normality assumption, especially when the sample sizes are small, as in our case.

Nonparametric approaches?
- Nonparametric approaches usually use ranks. When only averages are reported, we completely lose the information regarding ranks.
Therefore, means cannot be incorporated into nonparametric approaches.

The Data: In General

Group 0: the individual data points $X_{01}, X_{02}, \ldots, X_{0n_0}$ are available, along with the sample mean $\bar X_0$ and sample variance $S_0^2$.
Groups 1 through $k$: the individual data points $X_{j1}, \ldots, X_{jn_j}$ are NOT available; only the sample mean $\bar X_j$ (with its sample size $n_j$) is reported.

The Setup

Our goal is to estimate the mean and variance of a lognormal population under the following setup. Consider $k+1$ groups of observations $X_{1j}, X_{2j}, \ldots, X_{n_j j}$, $j = 0, 1, \ldots, k$, where

$$f_{X_{ij}}(x_{ij}) = \frac{1}{x_{ij}\,\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\ln x_{ij} - \mu)^2}{2\sigma^2}\right\}, \qquad x_{ij} > 0.$$

The observations for the first group ($j = 0$) are available, but for the remaining $k$ groups only the averages of the observations (i.e., $\bar X_1, \bar X_2, \ldots, \bar X_k$) are available.

Normality Approach: Large Sample

In practice it is common to assume normality when the sample sizes are large. In this case the sample means and sample variances are sufficient statistics, and therefore the individual observations are not needed.

Assume the $n_j$ are large. Then

$$\bar X_j \sim \text{Normal}\left(\theta, \frac{\eta^2}{n_j}\right),$$

where $\theta = e^{\mu + \sigma^2/2}$ and $\eta^2 = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}$ are the mean and variance of the lognormal distribution. The likelihood then reduces to

$$L(\theta, \eta^2) = \frac{1}{(2\pi)^{n_0/2}(\eta^2)^{n_0/2}}\exp\left\{-\frac{1}{2\eta^2}\sum_{i=1}^{n_0}(x_{i0} - \theta)^2\right\} \times \frac{\prod_{j=1}^{k} n_j^{1/2}}{(2\pi)^{k/2}(\eta^2)^{k/2}}\exp\left\{-\frac{1}{2\eta^2}\sum_{j=1}^{k} n_j(\bar x_j - \theta)^2\right\}.$$

This gives us the following normal equations:

$$\frac{\partial \ln L}{\partial \theta} = \frac{1}{\eta^2}\sum_{i=1}^{n_0}(x_{i0} - \theta) + \frac{1}{\eta^2}\sum_{j=1}^{k} n_j(\bar x_j - \theta) = 0,$$

$$\frac{\partial \ln L}{\partial \eta} = -\frac{n_0 + k}{\eta} + \frac{1}{\eta^3}\sum_{i=1}^{n_0}(x_{i0} - \theta)^2 + \frac{1}{\eta^3}\sum_{j=1}^{k} n_j(\bar x_j - \theta)^2 = 0,$$

which give us the following MLEs:

$$\hat\theta_{mle} = \frac{\sum_{j=0}^{k} n_j \bar x_j}{\sum_{j=0}^{k} n_j}, \qquad \hat\eta^2_{mle} = \frac{1}{n_0 + k}\left[\sum_{i=1}^{n_0}(x_{i0} - \hat\theta)^2 + \sum_{j=1}^{k} n_j(\bar x_j - \hat\theta)^2\right].$$

Remarks
- Although this method works well for large samples, in practice it is common for sample means to be based on a small number of observations, such as n = 2, 3, 4.
- In this case, when the original data follow a lognormal distribution, the sample mean does not follow a normal distribution.
- Our goal then becomes finding the distribution of the sample mean from a random sample of lognormal random variables.

Assume Lognormal: The Naive Approach

In practice a common naive approach is to assume that the sample means are themselves lognormal random variables. Since $\ln X \sim \text{Normal}(\mu, \sigma^2)$, this would imply

$$\ln(\bar X_j) \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n_j}\right).$$

However, this does not hold: a sum of lognormal random variables is not lognormal, so the log of an average of lognormals is not normal.

Direct Approach

The exact approach to this problem is to derive the distribution of $\bar X_j$ by convolving $X_{1j}, X_{2j}, \ldots, X_{n_j j}$, $j = 1, \ldots, k$. Hence, we can write the likelihood function as

$$L(\mu, \sigma^2) = \prod_{i=1}^{n_0} f(x_{i0} \mid \mu, \sigma^2) \times \prod_{j=1}^{k} f_j(\bar x_j \mid \mu, \sigma^2),$$

where $f_j$ is the probability distribution of $\bar X_j$, $j = 1, 2, \ldots, k$. The problem is that the distribution of a sum of lognormal random variables does not have a closed form, and therefore $L(\mu, \sigma^2)$ does not have a closed form.

Numerical Approximation

We can approximate the convolution numerically by replacing the integral

$$f_W(w) = \int f_{X_1}(x_1)\, f_{X_2}(w - x_1)\, dx_1$$

by the sum

$$\sum_{x_1} f_{X_1}(x_1)\, f_{X_2}(w - x_1)\,\Delta x_1, \qquad w = x_1 + x_2.$$

For small samples (n = 2, 3, 4) it can be seen that the plot of $\bar X_n = (X_1 + X_2 + \cdots + X_n)/n$ appears to be approximated better by a lognormal distribution with an adjusted mean and variance than by an approximate normal random variable.

Remarks
- Here a separate approximation must be performed for each sample mean.
- Therefore this approach can become computationally intensive, since the numerical approximations must be recomputed at each iteration.
- The simulations show that a lognormal model with an adjusted mean and variance is a good fit when the sample sizes are small.

Adjusted Lognormal Distribution

Here we assume that $\bar X_j$, $j = 1, \ldots, k$, approximately follows a lognormal distribution with parameters $\mu^* = g_1(\mu, \sigma^2)$ and $\sigma^{*2} = g_2(\mu, \sigma^2)$. We then have

$$E[\bar X_j] = e^{\mu^* + \sigma^{*2}/2}, \qquad \text{Var}[\bar X_j] = e^{2\mu^* + 2\sigma^{*2}} - e^{2\mu^* + \sigma^{*2}}.$$

Also, since the original sample comes from a lognormal distribution, we have

$$E[\bar X_j] = e^{\mu + \sigma^2/2}, \qquad \text{Var}[\bar X_j] = \frac{e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}}{n_j}.$$

Equating the expected values and variances gives us

$$\mu^* = g_1(\mu, \sigma^2) = 2\left(\mu + \frac{\sigma^2}{2}\right) - \frac{1}{2}\ln\left[\frac{e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}}{n_j} + e^{2\mu + \sigma^2}\right],$$

$$\sigma^{*2} = g_2(\mu, \sigma^2) = \ln\left[\frac{e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}}{n_j} + e^{2\mu + \sigma^2}\right] - 2\left(\mu + \frac{\sigma^2}{2}\right).$$

Therefore we have

$$\ln(X_{i0}) \sim \text{Normal}(\mu, \sigma^2), \quad i = 1, 2, \ldots, n_0,$$
$$\ln(\bar X_j) \sim \text{Normal}(\mu^*, \sigma^{*2}), \quad j = 1, 2, \ldots, k$$

(note that $\mu^*$ and $\sigma^{*2}$ depend on $n_j$), which gives us the following likelihood function:

$$L = \frac{1}{(2\pi)^{n_0/2}(\sigma^2)^{n_0/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n_0}(\ln x_{i0} - \mu)^2\right\} \times \frac{1}{(2\pi)^{k/2}(\sigma^{*2})^{k/2}}\exp\left\{-\frac{1}{2\sigma^{*2}}\sum_{j=1}^{k}(\ln \bar x_j - \mu^*)^2\right\}.$$

This gives normal equations in which $\mu^*$ and $\sigma^{*2}$ enter through the chain rule; for example,

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n_0}(\ln x_{i0} - \mu) + \sum_{j=1}^{k}\left[\frac{\ln \bar x_j - \mu^*}{\sigma^{*2}}\,\frac{\partial \mu^*}{\partial \mu} + \left(\frac{(\ln \bar x_j - \mu^*)^2}{2\sigma^{*4}} - \frac{1}{2\sigma^{*2}}\right)\frac{\partial \sigma^{*2}}{\partial \mu}\right] = 0,$$

and similarly for $\partial \ln L / \partial \sigma^2$. Since $\mu^*$ and $\sigma^{*2}$ are functions of both $\mu$ and $\sigma^2$, the numerical solutions of these equations will give the MLEs of $\mu$ and $\sigma^2$, and hence the MLEs of $\theta$ and $\eta^2$ (by the invariance property).

Remarks
- This method works well when dealing with small sample sizes, n = 2, 3, 4.
- The likelihood becomes quite complicated, and therefore numerical methods must
be employed to obtain the MLEs of the parameters.
- There is an advantage over the convolution approach, since the approximations do not need to be made at each iteration.

Conclusions
- The distribution of the mean of lognormal observations does not yield a useful closed-form expression.
- Approximations, by the normal when the sample size is large or by the lognormal (with appropriately chosen parameters) when the sample size is small, can be used for obtaining estimates of the population parameters.

Future Work
- Implementation of these methods within standard software packages, such as PROC NLIN in SAS.
- Simulation studies, such as Monte Carlo experiments, to explore the efficiency of these methods.
- Exploring other numerical methods, such as the EM algorithm, for obtaining the MLEs.
- Generalizing these methods to other standard power transformations.

Bootstrapping

What is bootstrapping?
- Resampling of the observed data: a simulation-type method in which the observed data (not a mathematical model) are repeatedly sampled to generate representative data sets.
- The only indispensable assumption is that the observations are a random sample from a single population.
- There are some fixes available for when the single-population assumption is violated, as in our case.
- It can be implemented in quite a few software packages, e.g., S-PLUS and SAS; Millard and Neerchal (2000) give S-PLUS code.

Bootstrapping: The Details

Data: X = (X1, X2, X3, ..., Xn); statistic: T = T(X).
rep #1: X*1 = (X*1, X*2, X*3, ..., X*n), T*1 = T(X*1)
rep #2: X*2 = (X*1, X*2, X*3, ..., X*n), T*2 = T(X*2)
...
rep #B: X*B = (X*1, X*2, X*3, ..., X*n), T*B = T(X*B)

Bootstrap inference is based on the distribution of the replicated values of the statistic: T*1, T*2, ..., T*B. For example, the bootstrap 95% upper confidence bound based on T is given by the 95th percentile of the distribution of the T* values.

Bootstrapping the Combined Data

Group the data points according to the number of tests used in reporting the average, within each welding process and rod type combination.
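As an aside, the replicate-and-percentile recipe from "Bootstrapping: The Details" above can be sketched in a few lines. This is a minimal illustration, not the authors' S-PLUS code: the sample used is the three SMAW/E309 single-test chromium measurements from the data table, and B = 1000 replicates is an arbitrary choice.

```python
import random
import statistics

# Observed sample X and statistic T (here the sample mean): the three
# SMAW / E309 single-test chromium measurements from the data table.
x = [0.8193, 0.8604, 0.1995]
T = statistics.mean

random.seed(1)  # for reproducibility of the illustration
B = 1000        # number of bootstrap replicates (arbitrary choice)

# rep #b: resample x with replacement, then replicate the statistic.
t_star = sorted(T(random.choices(x, k=len(x))) for _ in range(B))

# Bootstrap 95% upper confidence bound: 95th percentile of T*1, ..., T*B.
ucb95 = t_star[int(0.95 * B) - 1]
print(f"T(X) = {T(x):.4f},  bootstrap 95% UCB = {ucb95:.4f}")
```

Since each replicate is built only from observed values, the bound necessarily lies between the smallest and largest observation; with only three data points it is crude, which is one reason the slides group and combine data before resampling.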
Then bootstrap within each such group; for example, for GMAW and E316:

Source of Data   Welding Process   Rod Type   NTESTS   Chromium
P 0587           GMAW              E316       1        0.898
P 0587           GMAW              E316       1        1.3
P 0587           GMAW              E316       1        0.899
B                GMAW              E316       3        0.532
B                GMAW              E316       3        0.253
...              GMAW              E316       4        ...

Note: in the original slide, each color represented a separate group; here the groups are determined by NTESTS.

[Figure: bar chart of chromium (g) by group; not recoverable from the extraction.]
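Putting the grouping and the percentile bound together, the grouped bootstrap might look as follows. This is a sketch under stated assumptions, not the authors' code: the data are the GMAW/E308 rows of the first data table (the E316 table above is incomplete in this extraction), and the NTESTS-weighted mean statistic, the helper names, and B = 2000 are illustrative choices.

```python
import random

# One welding-process / rod-type combination, grouped by NTESTS
# (the number of tests behind each reported chromium value).
# Values here are the GMAW / E308 rows of the data table.
groups = {
    1: [0.898, 1.3],    # individual (single-test) measurements
    3: [0.324, 0.532],  # reported 3-test averages
}

def combined_statistic(grps):
    """Overall mean estimate, weighting each value by its NTESTS."""
    total = sum(n * x for n, xs in grps.items() for x in xs)
    count = sum(n * len(xs) for n, xs in grps.items())
    return total / count

random.seed(42)
B = 2000  # number of bootstrap replicates (arbitrary choice)
replicates = []
for _ in range(B):
    # Resample within each NTESTS group separately, preserving
    # group sizes, then recompute the combined statistic.
    boot = {n: random.choices(xs, k=len(xs)) for n, xs in groups.items()}
    replicates.append(combined_statistic(boot))

replicates.sort()
ucb95 = replicates[int(0.95 * B) - 1]  # bootstrap 95% upper bound
print(f"estimate = {combined_statistic(groups):.4f}, 95% UCB = {ucb95:.4f}")
```

Resampling within groups is one of the "fixes" mentioned above for the single-population assumption: single measurements and reported averages come from different sampling distributions, so they are never mixed within a resample.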