Combining Averages and Single Measurements in a Lognormal Model

Dr. Nagaraj K. Neerchal and Justin Newcomer
Department of Mathematics and Statistics
University of Maryland, Baltimore County
1000 Hilltop Circle, Baltimore, MD 21250
Motivating Example

Goal: To develop a protocol (methodology) for obtaining confidence bounds for the "Mean Emissions" for each welding process and rod type combination, incorporating all of the data.

- Three welding processes
- Three rod types
- Multiple sources of data
  - Some report individual measurements
  - Some report only averages, without the original observations
The Data

Welding Process   RODTYPE   NTESTS   Chromium
SMAW              E308      3        0.384
SMAW              E309      1        0.8193
SMAW              E309      1        0.8604
SMAW              E309      1        0.1995
SMAW              E309      3        0.484
SMAW              E308      1        0.9316
SMAW              E308      1        1.011
SMAW              E308      3        0.494
SMAW              E308      3        0.584
GMAW              E309      1        4.76
GMAW              E309      1        6.51
GMAW              E308      3        0.324
GMAW              E309      1        0.7285
GMAW              E308      1        0.898
GMAW              E308      1        1.3
GMAW              E308      3        0.532
FCAW              E309      1        2.42
FCAW              E309      1        2.82
FCAW              E309      1        2.86
FCAW              E308      1        1.86
FCAW              E308      1        3.04
FCAW              E308      3        0.265
FCAW              E308      3        1.205
Traditional Approaches

- Assume Normality?
  - Sample sizes are very small for certain combinations
  - Here the bounds obtained by assuming normality give meaningless results (e.g., negative bounds)
- Transform the data to normality?
  - In environmental studies, particularly with concentration measurements, the data most often tend to be skewed; hence there is a temptation to use the lognormal model
  - It is hard to transform the confidence bounds back to the original scale (the mean of the log is not the same as the log of the mean!)
Traditional Approaches

- Weighted regression?
  - Estimates have good properties in general, such as being Best Linear Unbiased Estimates
  - But the confidence bounds are sensitive to the normality assumption, especially when the sample sizes are small, as in our case
- Nonparametric approaches?
  - Nonparametric approaches usually use ranks. When only averages are reported we completely lose the information regarding ranks; therefore, the reported means cannot be incorporated into nonparametric approaches
The Data – In General
Not
Available
Individual Data Points
Sample
Mean
Sample
Variance
X01, X02, … , X0n0
X0
S 02
X11, X12, … , X1n1
X1
S12
X21, X22, … , X2n2
X2
S 22
.
.
.
.
.
.
.
.
.
Xk1, Xk2, … , Xknk
Xk
S 2k
The Setup

- Our goal is to estimate the mean and variance of a population of lognormal random variables under the following setup.
- Consider k + 1 groups of observations X_{1j}, X_{2j}, \ldots, X_{n_j j}, \; j = 0, 1, \ldots, k, where

  f_{X_{ij}}(x_{ij}) = \frac{1}{\sqrt{2\pi}\,\sigma\,x_{ij}} \exp\left\{ -\frac{1}{2}\left( \frac{\ln(x_{ij}) - \mu}{\sigma} \right)^2 \right\}, \quad x_{ij} > 0

- The observations for the first group (j = 0) are available, but for the remaining k groups only the averages of the observations (i.e., \bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k) are available.
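This setup can be sketched with simulated data. The parameter values and group sizes below are illustrative assumptions, not taken from the welding data:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 0.5         # assumed log-scale parameters (illustrative)
k = 3                        # number of groups reported only as averages
sizes = [6, 3, 3, 4]         # group sizes n_0, n_1, ..., n_k (illustrative)

# Group 0 (j = 0): the individual observations are available
x0 = rng.lognormal(mean=mu, sigma=sigma, size=sizes[0])

# Groups 1..k: only the sample averages X̄_1, ..., X̄_k are reported
xbar = [rng.lognormal(mean=mu, sigma=sigma, size=sizes[j]).mean()
        for j in range(1, k + 1)]

print(x0)      # full data for group 0
print(xbar)    # averages only for groups 1..k
```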
Normality Approach – Large Sample

- In practice it is common to assume normality when the sample sizes are large.
- In this case the sample means and sample variances are sufficient statistics, and therefore the individual observations are not needed.
Normality Approach – Large Sample

- Assume the n_j's are large. Then

  \bar{X}_j \sim \text{Normal}\left( \theta, \frac{\tau^2}{n_j} \right), \quad \text{where } \theta = e^{(\mu + \sigma^2/2)} \text{ and } \tau^2 = e^{(2\mu + 2\sigma^2)} - e^{(2\mu + \sigma^2)}

- The likelihood then reduces to

  L(\theta, \tau^2) = \frac{1}{(2\pi)^{n_0/2}(\tau^2)^{n_0/2}} \exp\left\{ -\frac{1}{2\tau^2} \sum_{i=1}^{n_0} (x_{i0} - \theta)^2 \right\} \times \frac{\left(\prod_{j=1}^{k} n_j\right)^{1/2}}{(2\pi)^{k/2}(\tau^2)^{k/2}} \exp\left\{ -\frac{1}{2\tau^2} \sum_{j=1}^{k} n_j (\bar{x}_j - \theta)^2 \right\}
Normality Approach – Large Sample

- This gives us the following normal equations:

  \frac{\partial \ln L}{\partial \theta} = 0 \;\Rightarrow\; \sum_{j=0}^{k} \frac{n_j(\bar{x}_j - \theta)}{\tau^2} = 0

  \frac{\partial \ln L}{\partial \tau} = 0 \;\Rightarrow\; -\frac{n_0 + k}{\tau} + \frac{\sum_{i=1}^{n_0} (x_{i0} - \theta)^2}{\tau^3} + \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \theta)^2}{\tau^3} = 0

- Which give us the following MLEs:

  \hat{\theta}_{mle} = \frac{\sum_{j=0}^{k} n_j \bar{x}_j}{\sum_{j=0}^{k} n_j}, \qquad \hat{\tau}^2_{mle} = \frac{1}{n_0 + k} \left[ \sum_{i=1}^{n_0} (x_{i0} - \hat{\theta})^2 + \sum_{j=1}^{k} n_j (\bar{x}_j - \hat{\theta})^2 \right]
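The closed-form MLEs under the large-sample normal approximation are straightforward to compute. A minimal sketch (the function name and the toy inputs are my own, for illustration):

```python
import numpy as np

def normal_approx_mle(x0, xbar, n):
    """MLEs of theta and tau^2 under the large-sample approximation
    X̄_j ~ Normal(theta, tau^2 / n_j).
    x0   : individual observations of group 0
    xbar : reported sample means of groups 1..k
    n    : sample sizes n_1..n_k behind those means
    """
    x0, xbar, n = np.asarray(x0), np.asarray(xbar), np.asarray(n)
    n0, k = len(x0), len(xbar)
    # weighted mean over all k+1 groups (group 0 contributes its sum)
    theta = (x0.sum() + (n * xbar).sum()) / (n0 + n.sum())
    # variance estimate divides by n0 + k (one term per reported mean)
    tau2 = (((x0 - theta) ** 2).sum()
            + (n * (xbar - theta) ** 2).sum()) / (n0 + k)
    return theta, tau2

theta, tau2 = normal_approx_mle([1.2, 0.8, 1.0], [0.9, 1.1], [3, 4])
print(theta, tau2)
```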
Normality Approach – Large Sample

- Remarks
  - Although this method works well for large samples, in practice it is common for sample means to be based on a small number of observations, such as n = 2, 3, 4.
  - In this case, when the original data follow a lognormal distribution, the sample mean does not follow a normal distribution.
  - Our goal then becomes finding the distribution of the sample mean of a random sample of lognormal random variables.
Assume Lognormal – Naïve

- In practice a common naïve approach is to assume that the sample means are themselves lognormal random variables.
- This would imply that

  \ln(X) \sim \text{Normal}(\mu, \sigma^2) \;\Rightarrow\; \ln(\bar{X}_i) \sim \text{Normal}\left( \mu, \frac{\sigma^2}{n_i} \right)

- However, this does not hold… Why? (A sum of lognormal random variables is not itself lognormal.)
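A quick simulation makes the contradiction concrete: if X̄ really were lognormal with parameters (μ, σ²/n), its expectation would be e^{μ + σ²/(2n)}, but the true expectation of the mean equals E[X] = e^{μ + σ²/2}. The parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n = 0.0, 0.5, 3   # illustrative lognormal parameters

# Simulate many sample means of n lognormal observations
xbar = rng.lognormal(mu, sigma, size=(200_000, n)).mean(axis=1)

true_mean = np.exp(mu + sigma ** 2 / 2)           # E[X̄] = E[X]
naive_mean = np.exp(mu + sigma ** 2 / (2 * n))    # implied by lognormal(mu, sigma²/n)
print(xbar.mean(), true_mean, naive_mean)         # simulation agrees with true_mean
```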
Direct Approach

- The exact approach to this problem is to derive the distribution of \bar{X}_j by convolving X_{1j}, X_{2j}, \ldots, X_{n_j j}, \; j = 1, \ldots, k.
- Hence, we can write the likelihood function as

  L(\mu, \sigma^2) = \left[ \prod_{i=1}^{n_0} f(x_{i0} \mid \mu, \sigma^2) \right] \times \left[ \prod_{j=1}^{k} f_j(\bar{x}_j \mid \mu, \sigma^2) \right]

  where f_j is the probability distribution of \bar{X}_j, \; j = 1, 2, \ldots, k.
- The problem is that the distribution of a sum of lognormal random variables does not have a closed form, and therefore L(\mu, \sigma^2) does not have a closed form.
Numerical Approximation

- We can approximate the convolution numerically by replacing the integral

  f_W(w) = \int_{x_1} f_{X_1}(x_1) f_{X_2}(w - x_1)\, dx_1

  with the sum

  \sum_{x_1} f_{X_1}(x_1) f_{X_2}(w - x_1), \quad w = x_1 + x_2

- For small samples (n = 2, 3, 4), it can be seen that the plot of the density of \bar{X}_n = (X_1 + X_2 + \cdots + X_n)/n appears to be better approximated by a lognormal distribution with an adjusted mean and variance than by an approximate normal random variable.
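The discrete convolution above can be sketched on a grid; this is a minimal illustration for n = 2, with assumed parameter values, checking that the approximated density of W = X1 + X2 has total mass near 1 and the correct mean:

```python
import numpy as np

# Lognormal density with assumed log-scale parameters (illustrative)
def lognorm_pdf(x, mu=0.0, sigma=0.5):
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
        x * sigma * np.sqrt(2 * np.pi))

dx = 0.01
x = np.arange(dx, 20, dx)            # positive grid; x = 0 excluded
f = lognorm_pdf(x)

# Discrete convolution replaces f_W(w) = ∫ f_X1(x1) f_X2(w - x1) dx1
fW = np.convolve(f, f) * dx
w = dx * np.arange(2, 2 * len(x) + 1)   # grid of attainable sums x1 + x2

mass = fW.sum() * dx                  # should be close to 1
mean_W = (w * fW).sum() * dx          # should be close to 2*exp(mu + sigma^2/2)
print(mass, mean_W)

# The density of the mean (X1 + X2)/2 then follows by the change of
# variables f_mean(m) = 2 * f_W(2m).
```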
Numerical Approximation

- Remarks
  - Here a separate approximation must be performed for each sample mean.
  - Therefore this approach can become computationally intensive, since the numerical approximations must be computed at each iteration.
  - The simulations show that a lognormal model, with an adjusted mean and variance, is a good fit when the sample sizes are small.
Adjusted Lognormal Distribution

- Here we assume that \bar{X}_j, \; j = 1, \ldots, k approximately follows a lognormal distribution with parameters

  \mu^* = g_1(\mu, \sigma^2), \qquad \sigma^{*2} = g_2(\mu, \sigma^2)

- We then have

  E[\bar{X}_j] = e^{(\mu^* + \sigma^{*2}/2)}, \qquad \text{Var}[\bar{X}_j] = e^{(2\mu^* + 2\sigma^{*2})} - e^{(2\mu^* + \sigma^{*2})}
Adjusted Lognormal Distribution

- Also, since the original sample comes from a lognormal distribution, we have

  E[\bar{X}_j] = \theta = e^{(\mu + \sigma^2/2)}, \qquad \text{Var}[\bar{X}_j] = \frac{\tau^2}{n_j} = \frac{e^{(2\mu + 2\sigma^2)} - e^{(2\mu + \sigma^2)}}{n_j}

- Equating the expected values and variances gives us

  \mu^* = g_1(\mu, \sigma^2) = 2\left(\mu + \frac{\sigma^2}{2}\right) - \frac{1}{2} \ln\left[ \frac{e^{(2\mu + 2\sigma^2)} - e^{(2\mu + \sigma^2)}}{n_j} + e^{(2\mu + \sigma^2)} \right]

  \sigma^{*2} = g_2(\mu, \sigma^2) = \ln\left[ \frac{e^{(2\mu + 2\sigma^2)} - e^{(2\mu + \sigma^2)}}{n_j} + e^{(2\mu + \sigma^2)} \right] - 2\left(\mu + \frac{\sigma^2}{2}\right)
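The moment-matching formulas g1 and g2 can be checked numerically: the adjusted lognormal should reproduce the target mean and variance exactly. A short sketch (the function name and parameter values are mine):

```python
import numpy as np

# g1 and g2: lognormal parameters for the mean of n_j lognormal(mu, sigma^2)
# observations, obtained by matching the first two moments.
def adjusted_params(mu, sigma2, nj):
    m = np.exp(mu + sigma2 / 2)                                       # E[X̄_j]
    v = (np.exp(2 * mu + 2 * sigma2) - np.exp(2 * mu + sigma2)) / nj  # Var[X̄_j]
    mu_star = 2 * (mu + sigma2 / 2) - 0.5 * np.log(v + m ** 2)        # g1
    sigma2_star = np.log(v + m ** 2) - 2 * (mu + sigma2 / 2)          # g2
    return mu_star, sigma2_star

# Sanity check: the adjusted lognormal reproduces the target moments
mu, sigma2, nj = 0.2, 0.25, 3
mu_s, s2_s = adjusted_params(mu, sigma2, nj)
mean_adj = np.exp(mu_s + s2_s / 2)
var_adj = np.exp(2 * mu_s + 2 * s2_s) - np.exp(2 * mu_s + s2_s)
print(mean_adj, var_adj)
```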
Adjusted Lognormal Distribution

- Therefore we have

  \ln(X_{i0}) \sim \text{Normal}(\mu, \sigma^2), \quad i = 1, 2, \ldots, n_0
  \ln(\bar{X}_j) \sim \text{Normal}(\mu^* = g_1(\mu, \sigma^2),\; \sigma^{*2} = g_2(\mu, \sigma^2)), \quad j = 1, 2, \ldots, k

- Which gives us the following likelihood function:

  L = \frac{1}{(2\pi)^{n_0/2}(\sigma^2)^{n_0/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n_0} (\ln(x_{i0}) - \mu)^2 \right\} \times \frac{1}{(2\pi)^{k/2}(\sigma^{*2})^{k/2}} \exp\left\{ -\frac{1}{2\sigma^{*2}} \sum_{j=1}^{k} (\ln(\bar{x}_j) - \mu^*)^2 \right\}
Adjusted Lognormal Distribution

- This gives us the following normal equations:

  \frac{\partial \ln L}{\partial \mu} = -\frac{\partial}{\partial \mu}\left[ n_0 \ln \sigma + \frac{1}{2\sigma^2} \sum_{i=1}^{n_0} (\ln x_{i0} - \mu)^2 \right] - \frac{\partial}{\partial \mu}\left[ k \ln \sigma^* + \frac{1}{2\sigma^{*2}} \sum_{j=1}^{k} (\ln \bar{x}_j - \mu^*)^2 \right] = 0

  \frac{\partial \ln L}{\partial \sigma^2} = -\frac{\partial}{\partial \sigma^2}\left[ n_0 \ln \sigma + \frac{1}{2\sigma^2} \sum_{i=1}^{n_0} (\ln x_{i0} - \mu)^2 \right] - \frac{\partial}{\partial \sigma^2}\left[ k \ln \sigma^* + \frac{1}{2\sigma^{*2}} \sum_{j=1}^{k} (\ln \bar{x}_j - \mu^*)^2 \right] = 0

- Since \mu^* and \sigma^{*2} are functions of both \mu and \sigma^2, the derivatives of the second bracket are obtained by the chain rule (through \partial \mu^*/\partial \mu, \partial \sigma^*/\partial \mu, etc.).
- The numerical solutions of these equations give the MLEs of \mu and \sigma^2, and hence the MLEs of \theta and \tau^2 (by the invariance property).
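The numerical maximization can be sketched directly on the log-likelihood, rather than through the normal equations. This is an illustrative implementation under assumed data and a Nelder-Mead optimizer (the slides only say "numerical methods"; the optimizer choice, function names, and simulated inputs are mine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, x0, xbar, n):
    """Negative log-likelihood of the adjusted lognormal model.
    x0: individual observations (group 0); xbar: reported averages;
    n: sample sizes behind those averages."""
    mu, log_s2 = params
    s2 = np.exp(log_s2)                       # parameterize sigma^2 > 0
    x0, xbar, n = map(np.asarray, (x0, xbar, n))
    # Group 0: ln(X_i0) ~ Normal(mu, sigma^2)
    ll = (-0.5 * np.log(2 * np.pi * s2)
          - (np.log(x0) - mu) ** 2 / (2 * s2)).sum()
    # Groups 1..k: ln(X̄_j) ~ Normal(mu*_j, sigma*^2_j), using the
    # moment-matched parameters g1, g2 (which depend on n_j)
    m = np.exp(mu + s2 / 2)
    v = (np.exp(2 * mu + 2 * s2) - np.exp(2 * mu + s2)) / n
    s2_star = np.log(v + m ** 2) - 2 * (mu + s2 / 2)
    mu_star = 2 * (mu + s2 / 2) - 0.5 * np.log(v + m ** 2)
    ll += (-0.5 * np.log(2 * np.pi * s2_star)
           - (np.log(xbar) - mu_star) ** 2 / (2 * s2_star)).sum()
    return -ll

rng = np.random.default_rng(0)
x0 = rng.lognormal(0.0, 0.5, size=30)             # full data, group 0
sizes = np.array([3, 3, 4])
xbar = np.array([rng.lognormal(0.0, 0.5, size=s).mean() for s in sizes])

res = minimize(neg_log_lik, [0.0, np.log(0.25)],
               args=(x0, xbar, sizes), method="Nelder-Mead")
mu_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, s2_hat)
```

The MLEs of θ and τ² then follow by plugging (mu_hat, s2_hat) into their defining expressions, by the invariance property.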
Adjusted Lognormal Distribution

- Remarks
  - This method works well when dealing with small sample sizes (n = 2, 3, 4).
  - The likelihood becomes quite complicated, and therefore numerical methods must be employed to obtain the MLEs of the parameters.
  - There is an advantage over the convolution approach, since the approximations do not need to be made at each iteration.
Conclusions

- The distribution of the mean of lognormal observations does not yield a useful closed-form expression.
- Approximations, either by a normal distribution when the sample size is large or by a lognormal distribution (with appropriately chosen parameters) when the sample size is small, can be used to obtain estimates of the population parameters.
Future Work

- Implementation of these methods within standard software packages, such as PROC NLIN in SAS
- Performing simulation studies, such as Monte Carlo, to explore the efficiency of these methods
- Exploring other numerical methods, such as the EM algorithm, for obtaining the MLEs
- Generalizing these methods to other standard power transformations
Bootstrapping

- What is bootstrapping?
  - Resampling the observed data
  - A simulation-type method in which the observed data (not a mathematical model) are repeatedly sampled to generate representative data sets
  - The only indispensable assumption is that "observations are a random sample from a single population"
  - There are some fixes available when the single-population assumption is violated, as in our case
  - Can be implemented in quite a few software packages, e.g., S-Plus and SAS
  - Millard and Neerchal (2000) give S-Plus code
Bootstrapping – The Details

Data: X = (X1, X2, X3, …, Xn)        Statistic: T = T(X)

rep #1:  X*1 = (X*1, X*2, X*3, …, X*n)    T*1 = T(X*1)
rep #2:  X*2 = (X*1, X*2, X*3, …, X*n)    T*2 = T(X*2)
  …
rep #B:  X*B = (X*1, X*2, X*3, …, X*n)    T*B = T(X*B)

Bootstrap inference is based on the distribution of the replicated values of the statistic, T*1, T*2, …, T*B. For example, the bootstrap 95% upper confidence bound based on T is given by the 95th percentile of the distribution of the T*'s.
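The scheme above can be sketched for the percentile upper bound; the data below are simulated stand-ins for the right-skewed chromium measurements (function name, B, and the data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_upper_bound(x, stat=np.mean, B=2000, level=0.95):
    """Percentile-bootstrap upper confidence bound for stat(X)."""
    x = np.asarray(x)
    # B resamples of the data, each the same size as x, with replacement
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    # the bound is the `level` quantile of the replicated statistics
    return np.quantile(reps, level)

# Illustrative right-skewed, positive data (lognormal stand-in)
x = rng.lognormal(mean=0.0, sigma=0.5, size=25)
ub = bootstrap_upper_bound(x)
print(ub)     # 95% upper confidence bound for the mean
```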
Bootstrapping the Combined Data

- Group the data points according to the number of tests used in reporting the average, within each welding process and rod type combination. Then bootstrap within each such group.
- i.e., for GMAW and E316:

Source of Data   Welding Process   RODTYPE   NTESTS   Chromium
P 0587           GMAW              E316      1        0.898
P 0587           GMAW              E316      1        1.3
P 0587           GMAW              E316      1        0.899
B                GMAW              E316      3        0.532
B                GMAW              E316      3        0.253
                 GMAW              E316      4

Note: Each color (in the original slide) represents a separate group, i.e., rows sharing an NTESTS value within the combination.