The BCa Interval

Nir Keret
Based on “An Introduction to the Bootstrap”
(Efron and Tibshirani)
Let ˆ be the plug-in estimate of a parameter  , and ˆi* be the bootstrap
replication of ˆ corresponding to the bootstrap sample x * .
i
a 1-2 percentile interval is defined to be:
The percentile interval is simple to implement and understand. In addition, it does not require estimating a standard error $\hat\sigma$ (as bootstrap-t does). But perhaps its biggest advantage is that it is transformation respecting: contrary to bootstrap-t intervals, we do not need to know in advance the proper transformation that normalizes the statistic. The percentile method does it automatically.
Explanation of the transformation-respecting property:
Suppose that there exists a monotonically increasing transformation $m$ with $m(\theta) = \phi$, $m(\hat\theta) = \hat\phi$, $m(\hat\theta^*) = \hat\phi^*$, such that
$$\hat\phi - \phi \sim N(0, \tau^2), \qquad \hat\phi^* - \hat\phi \sim N(0, \tau^2).$$
Then the upper endpoint of the interval for $\theta$ is $m^{-1}\!\big(\hat\phi + z_{1-\alpha}\tau\big)$.
However, $\hat\phi + z_{1-\alpha}\tau = F_{\hat\phi^*}^{-1}(1-\alpha)$.
Since $m$ is monotonically increasing, $F_{\hat\phi^*}^{-1}(1-\alpha) = m\!\left(F_{\hat\theta^*}^{-1}(1-\alpha)\right)$.
Plugging this back into the expression above yields $F_{\hat\theta^*}^{-1}(1-\alpha)$, which is exactly the upper endpoint of the percentile interval.
Looks good… so where does it go wrong?
Perhaps the assumption that such a transformation $m$ exists is too strong. We require $m$ to be both normalizing and variance-stabilizing, and such a transformation need not exist.
In addition, we would like to correct for potential bias in the plug-in estimate, which the percentile method does not address.
The name BCa stands for "bias-corrected and accelerated". The method relaxes the assumption that the transformation $m$ is variance-stabilizing, and in addition it corrects for potential bias in the estimate.
A motivational example:
26 neurologically impaired children took a spatial perception test. The results are labeled $A_i$, $i = 1, \dots, 26$.
Suppose that we wish to construct a 90% CI for $\theta = \mathrm{var}(A)$, and that we use the plug-in sample variance as the estimate:
$$\hat\theta = \frac{1}{n}\sum_{i=1}^{n}\big(A_i - \bar{A}\big)^2.$$
In order to evaluate the different CI methods, let us assume that the data are normal, since we can construct an exact CI in this case. We will build parametric bootstrap CIs and benchmark them against the exact CI. How do BCa and ABC perform compared with the percentile interval?
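The following sketch sets up that benchmark for the exact and percentile intervals (the original 26 test scores are not reproduced in these notes, so the data below are simulated stand-ins with arbitrary mean and variance; the BCa ingredients are sketched further on):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data: the actual 26 scores are not listed here, so we simulate
# 26 normal observations purely to illustrate the computation.
A = rng.normal(loc=48.0, scale=5.0, size=26)
n = len(A)
theta_hat = A.var()          # plug-in (biased) sample variance, divides by n
alpha = 0.05                 # 90% central interval

# Exact normal-theory CI: n * theta_hat / theta ~ chi^2_{n-1}
exact = (n * theta_hat / stats.chi2.ppf(1 - alpha, n - 1),
         n * theta_hat / stats.chi2.ppf(alpha, n - 1))

# Parametric bootstrap: resample from N(mean(A), theta_hat), recompute plug-in variance
B = 4000
theta_star = np.array([rng.normal(A.mean(), np.sqrt(theta_hat), n).var()
                       for _ in range(B)])
percentile = tuple(np.quantile(theta_star, [alpha, 1 - alpha]))
print("exact:", exact, "percentile:", percentile)
```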
The exact CI,
$$\left(\frac{n\hat\theta}{\chi^2_{n-1,\,1-\alpha}},\ \frac{n\hat\theta}{\chi^2_{n-1,\,\alpha}}\right),$$
with $\chi^2_{n-1,\,q}$ the $q$-quantile of the $\chi^2_{n-1}$ distribution, takes into account the fact that the mean had to be estimated, hence it uses $n-1$ degrees of freedom for the chi-squared distribution.
In contrast, the sample variance, which we used as the plug-in estimate in the bootstrap CIs, is biased. How come the BCa and ABC intervals do not seem biased, but the percentile interval does?
In addition, note that the shape of the percentile CI is nowhere near that of the exact interval: the right part of the CI should be about 2.5 times longer than the left part.
There is no gold standard for the non-parametric case, but we can see that there are big differences between the methods.
The BCa endpoints are also based on percentiles of the bootstrap distribution, but not necessarily the same ones as the regular percentile interval. The BCa percentiles depend on two numbers, $\hat a$ and $\hat z_0$, responsible for correcting non-stable variance ("acceleration") and bias, respectively:
$$\hat\theta_{BCa}[\alpha] = F_{\hat\theta^*}^{-1}\!\left(\Phi\!\left(\hat z_0 + \frac{\hat z_0 + z_{\alpha}}{1 - \hat a\,(\hat z_0 + z_{\alpha})}\right)\right).$$
When $\hat z_0 = 0$ and $\hat a = 0$, the BCa interval reduces back to the regular percentile interval.
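A minimal sketch of the endpoint calculation, assuming the bootstrap replications `theta_star` and the estimates `z0_hat` and `a_hat` are already available (their estimation is discussed below):

```python
import numpy as np
from scipy.stats import norm

def bca_endpoint(theta_star, z0_hat, a_hat, alpha):
    """BCa endpoint: a bias/acceleration-corrected percentile of the bootstrap distribution."""
    z_alpha = norm.ppf(alpha)
    adj = z0_hat + (z0_hat + z_alpha) / (1 - a_hat * (z0_hat + z_alpha))
    return np.quantile(theta_star, norm.cdf(adj))

def bca_interval(theta_star, z0_hat, a_hat, alpha=0.05):
    # with z0_hat = 0 and a_hat = 0 this reduces to the plain percentile interval
    return (bca_endpoint(theta_star, z0_hat, a_hat, alpha),
            bca_endpoint(theta_star, z0_hat, a_hat, 1 - alpha))
```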
Where does this formula come from?
Let us assume that
$$\frac{\hat\phi - \phi}{\sigma_\phi} \sim N(-z_0, 1), \qquad \sigma_\phi = 1 + a\phi.$$
In this way we can allow for bias and for a $\phi$-dependent variance.
$\hat\phi$ can be expressed as $\hat\phi = \phi + \sigma_\phi (Z - z_0)$ with $Z \sim N(0,1)$, and therefore
$$1 + a\hat\phi = (1 + a\phi)\,\big(1 + a(Z - z_0)\big).$$
Taking logarithms of both sides we get
$$\hat\zeta = \zeta + W, \qquad \hat\zeta = \log(1 + a\hat\phi),\quad \zeta = \log(1 + a\phi),\quad W = \log\!\big(1 + a(Z - z_0)\big).$$
Having observed $\hat\zeta$, a $1-2\alpha$ central confidence interval for $\zeta$ is
$$\zeta \in \left(\hat\zeta - w_{1-\alpha},\ \hat\zeta - w_{\alpha}\right),$$
where $w_\beta = \log\!\big(1 + a(z_\beta - z_0)\big)$ is the $\beta$-quantile of $W$.
Transforming back to the $\phi$ scale, we get
$$\phi \in \left(\hat\phi - \sigma_{\hat\phi}\,\frac{z_{1-\alpha} - z_0}{1 + a(z_{1-\alpha} - z_0)},\ \ \hat\phi - \sigma_{\hat\phi}\,\frac{z_{\alpha} - z_0}{1 + a(z_{\alpha} - z_0)}\right),$$
and with some additional algebraic manipulation the endpoints can be written as
$$\phi[\alpha] = \hat\phi + \sigma_{\hat\phi}\,\frac{z_0 + z_{\alpha}}{1 - a(z_0 + z_{\alpha})}, \qquad \sigma_{\hat\phi} = 1 + a\hat\phi.$$
Based on our assumption regarding $\hat\phi$,
$$F_{\hat\phi}(x) = \Phi\!\left(\frac{x - \phi}{\sigma_\phi} + z_0\right),$$
so the bootstrap CDF of $\hat\phi^*$ is
$$F_{\hat\phi^*}(x) = \Phi\!\left(\frac{x - \hat\phi}{\sigma_{\hat\phi}} + z_0\right),$$
which produces the inverse
$$F_{\hat\phi^*}^{-1}(\alpha) = \hat\phi + \sigma_{\hat\phi}\big(\Phi^{-1}(\alpha) - z_0\big).$$
Plugging in $\alpha = \Phi(z[\alpha])$, where
$$z[\alpha] = z_0 + \frac{z_0 + z_{\alpha}}{1 - a(z_0 + z_{\alpha})},$$
brings us back to $\phi[\alpha]$. It means that $F_{\hat\phi^*}^{-1}\!\big(\Phi(z[\alpha])\big)$ produces the correct lower endpoint of the CI for $\phi$.
Since $m(\theta) = \phi$ with $m$ a monotonically increasing transformation, we use the same percentile for the CI for $\theta$:
$$\hat\theta_{BCa}[\alpha] = F_{\hat\theta^*}^{-1}\!\big(\Phi(z[\alpha])\big).$$
Of course, $F_{\hat\theta^*}^{-1}$, $a$, and $z_0$ have to be estimated.
The rationale: in order to obtain $\hat\theta^*$ we apply to the bootstrap sample the exact same mechanism that we used on the original data in order to get $\hat\theta$. Hence, if we find that the $\hat\theta^*$ replications are far away from $\hat\theta$, it testifies that there is probably an inherent bias in the mechanism that produced $\hat\theta$, and we would like to correct for that.
The number $z_0$ is designated to capture the magnitude of this bias. It is estimated in the following way:
$$\hat z_0 = \Phi^{-1}\!\left(\frac{\#\{\hat\theta^*(b) < \hat\theta\}}{B}\right),$$
where $B$ is the number of bootstrap replications and $\hat\theta^*(b)$ is the $b$-th replication. That is, $\hat z_0$ is obtained directly from the proportion of bootstrap replications less than the original estimate $\hat\theta$. Roughly speaking, $\hat z_0$ measures the median bias of $\hat\theta^*$ relative to $\hat\theta$, in normal units. We obtain $\hat z_0 = 0$ if exactly half of the $\hat\theta^*$ replications are less than $\hat\theta$.
Rationale for that formula: under the normal model above, $F_{\hat\phi^*}(\hat\phi) = \Phi(z_0)$, so $z_0 = \Phi^{-1}\!\big(P(\hat\phi^* < \hat\phi)\big)$; and since $m$ is monotonically increasing, $P(\hat\phi^* < \hat\phi) = P(\hat\theta^* < \hat\theta)$.
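In code the estimate is just the normal quantile of that proportion (a sketch, assuming `theta_star` holds the bootstrap replications and `theta_hat` the original estimate):

```python
import numpy as np
from scipy.stats import norm

def z0_estimate(theta_star, theta_hat):
    # proportion of bootstrap replications below the original estimate, in normal units
    return norm.ppf(np.mean(theta_star < theta_hat))
```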
One-Parameter Parametric Bootstrap
Suppose that $\hat\theta \sim f_\theta$. It can be shown that in this scenario a good approximation for $a$ is
$$\hat a = \frac{1}{6}\,\mathrm{skew}_{\theta=\hat\theta}\big\{\dot\ell_\theta(X)\big\},$$
where $\dot\ell_\theta(x) = \frac{\partial}{\partial\theta}\log f_\theta(x)$ is the score function, evaluated at $\theta = \hat\theta$ with $X \sim f_{\hat\theta}$, and
$$\mathrm{skew}(X) = \frac{\mu_3}{\mu_2^{1.5}}, \qquad \mu_k(X) = E\{(X - E(X))^k\},$$
with $\mu_k$ the $k$-th central moment.
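As an illustration tied to the variance example above, the following sketch evaluates the formula by Monte Carlo. The Gamma form for the distribution of $\hat\theta$ is an assumption that holds only for normal data, and the function name is hypothetical:

```python
import numpy as np
from scipy import stats

def a_hat_parametric(theta_hat, n, B=200_000, rng=None):
    """One-parameter parametric acceleration for the plug-in normal variance.

    Under normality, n * theta_hat / theta ~ chi^2_{n-1}, i.e.
    theta_hat ~ Gamma(shape=(n-1)/2, scale=2*theta/n).
    The score with respect to theta, at theta = theta_hat, is
        score(t) = n * t / (2 * theta_hat**2) - (n - 1) / (2 * theta_hat),
    and a_hat is one sixth of its skewness under theta = theta_hat.
    """
    rng = np.random.default_rng(rng)
    t = rng.gamma(shape=(n - 1) / 2, scale=2 * theta_hat / n, size=B)
    score = n * t / (2 * theta_hat**2) - (n - 1) / (2 * theta_hat)
    return stats.skew(score) / 6

# For this family the skewness is available in closed form, 2 / sqrt((n-1)/2),
# so a_hat should be close to (1/3) * sqrt(2 / (n - 1)).
```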
Parametric bootstrap with nuisance parameters
In most cases we will have more than one unknown parameter, and the former formula will not apply. The solution is to reduce the problem to a one-parameter parametric family and apply the formula to it. How do we do that?
Least Favorable Family
Reminder: if $g$ is a continuously differentiable function and $\hat\theta$ is the MLE of $\theta$, then the asymptotic MSE of $g(\hat\theta)$ is
$$\frac{1}{n}\,\nabla g(\theta)^{\top} I^{-1}(\theta)\,\nabla g(\theta), \qquad I_{ij}(\theta) = E\!\left[\frac{\partial \log f_\theta(X)}{\partial\theta_i}\,\frac{\partial \log f_\theta(X)}{\partial\theta_j}\right].$$
Let us denote the vector $\mu = I^{-1}(\theta)\,\nabla g(\theta)$. This determines a one-dimensional subproblem which is asymptotically as difficult as the original multidimensional problem.
Consider the density $f_\lambda(x) = f_{\theta+\lambda\mu}(x)$. The information for this problem at $\lambda = 0$ reduces to
$$E_\theta\!\left[\left(\frac{\partial \log f_{\theta+\lambda\mu}(X)}{\partial\lambda}\bigg|_{\lambda=0}\right)^{\!2}\right]
= E_\theta\!\left[\left(\sum_{i=1}^{p}\mu_i\,\frac{\partial \log f_\theta(X)}{\partial\theta_i}\right)^{\!2}\right]
= \sum_{i,j}\mu_i\mu_j\,E_\theta\!\left[\frac{\partial \log f_\theta(X)}{\partial\theta_i}\,\frac{\partial \log f_\theta(X)}{\partial\theta_j}\right]
= \mu^{\top} I \mu = \nabla g^{\top} I^{-1} I I^{-1}\nabla g = \nabla g^{\top} I^{-1}\nabla g.$$
Also, the gradient of $g(\theta+\lambda\mu)$ as a function of $\lambda$ at $\lambda = 0$ reduces to
$$\frac{\partial g(\theta+\lambda\mu)}{\partial\lambda}\bigg|_{\lambda=0} = \sum_i \mu_i\,\frac{\partial g(\theta)}{\partial\theta_i} = \mu^{\top}\nabla g = \nabla g^{\top} I^{-1}\nabla g.$$
Plugging these results into the one-dimensional MSE expression for $g(\theta+\lambda\mu)$ we get
$$\frac{1}{n}\,\frac{\big(\nabla g^{\top} I^{-1}\nabla g\big)^2}{\nabla g^{\top} I^{-1}\nabla g} = \frac{1}{n}\,\nabla g(\theta)^{\top} I^{-1}\nabla g(\theta),$$
the same as in the multidimensional case.
Hence, if $Y \sim f_\lambda = f_{\hat\theta + \lambda\hat\mu}(y)$, the least favorable family through $\hat\theta$ with $\hat\mu = \hat I^{-1}\nabla g(\hat\theta)$, then our estimate of $a$ will be
$$\hat a = \frac{1}{6}\,\mathrm{SKEW}_{\lambda=0}\!\left\{\frac{\partial}{\partial\lambda}\log f_{\hat\theta+\lambda\hat\mu}(y)\right\}.$$
For the exponential family, the formula gives
$$\hat a = \frac{1}{6}\,\frac{\hat\psi^{(3)}(0)}{\big\{\hat\psi^{(2)}(0)\big\}^{3/2}},
\qquad \hat\psi^{(j)}(0) = \frac{\partial^{j}\psi(\hat\eta + \lambda\hat\mu)}{\partial\lambda^{j}}\bigg|_{\lambda=0},$$
where the exponential family is written in terms of its natural parameter $\eta$:
$$f_\eta(x) = h(x)\exp\!\Big(\sum_i \eta_i\,T_i(x) - \psi(\eta)\Big).$$
Nonparametric Bootstrap
Instead of the actual sample space, let us consider only distributions supported on the observed data set $\{x_1, \dots, x_n\}$. This is an $n$-category multinomial family. Let $\mathbf{w} = (w_1, w_2, \dots, w_n)$ denote the vector of probabilities for each $x_i$.
One can imagine setting a confidence interval for $\theta$ on the basis of a hypothetical sample $x_1^*, \dots, x_n^* \sim F(\mathbf{w})$. A sufficient statistic is the vector of proportions $P = (p_1, p_2, \dots, p_n)$.
Now, suppose that we have observed $P^0 = (1/n, \dots, 1/n)$, that is, the hypothetical sample $x_1^*, \dots, x_n^*$ equals the actual sample. Then, based on the formula for $\hat a$ in exponential families (which include the multinomial), we get
$$\hat a = \frac{1}{6}\,\frac{\sum_{i=1}^{n} U_i^3}{\big(\sum_{i=1}^{n} U_i^2\big)^{3/2}},$$
where $U_i$ is the $i$-th empirical influence component, also called an infinitesimal jackknife value.
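A direct sketch of that formula, assuming the influence components `U` have already been computed (approximations for them are discussed below):

```python
import numpy as np

def a_hat_nonparametric(U):
    """Acceleration from the empirical influence components U_1, ..., U_n."""
    U = np.asarray(U, dtype=float)
    return np.sum(U**3) / (6.0 * np.sum(U**2)**1.5)
```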
Gâteaux Derivative
The Gâteaux derivative of $T$ at $F$ in the direction $G$ is defined by
$$L_F(G) = \lim_{\epsilon\to 0}\frac{T\big((1-\epsilon)F + \epsilon G\big) - T(F)}{\epsilon}.$$
Mathematically speaking, the Gâteaux derivative is a generalization of the concept of a directional derivative. Statistically speaking, it measures the rate of change in a statistical functional upon a small amount of contamination by another distribution $G$.
The Influence Function
If $G = \delta_x$ is a point mass at $x$, then we write $L_F(\delta_x) = L_F(x)$ and we call $L_F(x)$ the influence function; thus
$$L_F(x) = \lim_{\epsilon\to 0}\frac{T\big((1-\epsilon)F + \epsilon\delta_x\big) - T(F)}{\epsilon}.$$
The empirical influence function is defined by $\hat L(x) = L_{\hat F}(x)$; thus
$$\hat L(x) = \lim_{\epsilon\to 0}\frac{T\big((1-\epsilon)\hat F + \epsilon\delta_x\big) - T(\hat F)}{\epsilon}.$$
Example: the mean. For $T(F) = E_F(X) = \mu_F$ we have $T\big((1-\epsilon)F + \epsilon\delta_x\big) = (1-\epsilon)\mu_F + \epsilon x$, so
$$L_F(x) = x - \mu_F.$$
Example: the variance. For $T(F) = E_F(X^2) - \big(E_F(X)\big)^2$, a direct calculation gives
$$L_F(x) = (x - \mu_F)^2 - \mathrm{var}_F(X).$$
We can take advantage of the chain rule: if $T = g(T_1, \dots, T_k)$, then $L_F^{T}(x) = \sum_j \frac{\partial g}{\partial T_j}\,L_F^{T_j}(x)$.
Example: the variance again, via the chain rule with $T_1(F) = E_F(X)$, $T_2(F) = E_F(X^2)$ and $g(t_1, t_2) = t_2 - t_1^2$:
$$L_F(x) = \big(x^2 - E_F(X^2)\big) - 2\mu_F(x - \mu_F) = (x - \mu_F)^2 - \mathrm{var}_F(X).$$
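The empirical influence components can also be approximated numerically, directly from the contamination definition with a small finite $\epsilon$. In this sketch `weighted_var` is a hypothetical helper that expresses the plug-in variance as a function of observation weights, so the result can be checked against the closed form above:

```python
import numpy as np

def empirical_influence(x, stat, eps=1e-4):
    """Finite-epsilon approximation of L_hat(x_i), the derivative of
    stat((1-eps) * F_hat + eps * delta_{x_i}) at eps = 0, via a weighted statistic."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    base = np.full(n, 1.0 / n)
    U = np.empty(n)
    for i in range(n):
        w = (1 - eps) * base
        w[i] += eps
        U[i] = (stat(x, w) - stat(x, base)) / eps
    return U

def weighted_var(x, w):
    # plug-in variance of the distribution putting weight w_i on x_i
    return np.sum(w * x**2) - np.sum(w * x)**2

x = np.array([1.0, 4.0, 2.0, 7.0, 3.0])          # toy data
U = empirical_influence(x, weighted_var)
U_exact = (x - x.mean())**2 - x.var()            # closed form from the text
```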
The influence components behave a lot like the score values. For $\hat\theta = T(\hat F)$ they have mean zero, $E_F\{L_F(X)\} = 0$, and their squared magnitudes drive the variance of $\hat\theta$, as in the nonparametric delta method below.
For a parametric model, the influence function of the MLE functional is $L_F(x) = I(\theta)^{-1}\,\dot\ell_\theta(x)$, i.e., a rescaled score.
In practice, in order to make the method general and not functional-specific, we can take the corresponding jackknife value as an approximation to the influence component. The jackknife practically replaces $\epsilon$ with $-\frac{1}{n-1}$: deleting the $i$-th point corresponds to $(1-\epsilon)\hat F + \epsilon\,\delta_{x_i}$ with $\epsilon = -1/(n-1)$.
The jackknife approximation yields
$$U_i \approx (n-1)\big(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\big), \qquad \hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat\theta_{(i)},$$
where $\hat\theta_{(i)}$ is the estimate computed with the $i$-th observation removed, and plugging these into the formula for $\hat a$ (the factor $n-1$ cancels in the ratio) gives
$$\hat a = \frac{\sum_{i=1}^{n}\big(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\big)^3}{6\Big\{\sum_{i=1}^{n}\big(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\big)^2\Big\}^{3/2}}.$$
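A sketch of the jackknife route, which needs nothing beyond the statistic evaluated on leave-one-out samples:

```python
import numpy as np

def a_hat_jackknife(x, stat):
    """Acceleration via jackknife influence values (n-1) * (theta_dot - theta_(i))."""
    x = np.asarray(x)
    n = len(x)
    theta_i = np.array([stat(np.delete(x, i)) for i in range(n)])  # leave-one-out estimates
    d = theta_i.mean() - theta_i                                   # theta_(.) - theta_(i)
    return np.sum(d**3) / (6.0 * np.sum(d**2)**1.5)
```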
ABC stands for "Approximate Bootstrap Confidence intervals". It tries to mimic the BCa intervals in an analytic fashion, without actually performing resampling at all. It requires second derivatives, hence it works for smoothly defined parameters in exponential families or smoothly defined nonparametric problems.
The ABC intervals depend on 5 quantities: $(\hat\theta, \hat\sigma, \hat a, \hat z_0, \hat c_q)$. Each of $(\hat a, \hat z_0, \hat c_q)$ corrects a deficiency.
$\hat a$ is estimated as in the BCa (no bootstrap resampling was needed for it in the first place), and $\hat\sigma$, the standard error of $\hat\theta$, is estimated with the nonparametric delta-method approximation
$$\hat\sigma = \frac{1}{n}\Big(\sum_{i=1}^{n} U_i^2\Big)^{1/2}.$$
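A sketch of the delta-method standard error, assuming the influence components `U` from before:

```python
import numpy as np

def se_delta_method(U):
    """Nonparametric delta-method standard error from influence components."""
    U = np.asarray(U, dtype=float)
    return np.sqrt(np.sum(U**2)) / len(U)
```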
The estimate of $\hat z_0$ involves two quantities. The first is the bias $b = E(\hat\theta) - \theta$. A quadratic Taylor series expansion of $\hat\theta = T(P^0)$ gives the approximate bias
$$\hat b = \frac{1}{2n^2}\sum_{i=1}^{n}\ddot T_i,$$
where $\ddot T_i$ is the $i$-th second-order (quadratic) influence component of $T$ at $P^0$.
An approximate confidence point $\hat\theta[\alpha]$ is called first-order accurate if
$$P\big(\theta \le \hat\theta[\alpha]\big) = \alpha + O(n^{-1/2}),$$
and second-order accurate if
$$P\big(\theta \le \hat\theta[\alpha]\big) = \alpha + O(n^{-1}).$$
An approximate confidence point $\hat\theta[\alpha]$ is called first-order correct if it agrees with the exact confidence point $\hat\theta_{\mathrm{ex}}[\alpha]$ up to
$$\hat\theta[\alpha] - \hat\theta_{\mathrm{ex}}[\alpha] = O_p(n^{-1}),$$
and second-order correct if
$$\hat\theta[\alpha] - \hat\theta_{\mathrm{ex}}[\alpha] = O_p(n^{-3/2}).$$
Correctness at a given level implies accuracy at the same level.
Comparison of the methods:
Bootstrap-t: second-order correct, but not transformation-respecting.
Percentile: transformation-respecting, but not second-order correct.
BCa: both second-order correct and transformation-respecting.
ABC: second-order correct, as an analytic approximation to BCa.