Why Xbar ± 3S is not a Universal Solution

Peer Reviewed: Statistics
Lynn Torbeck
The one fact that most people remember incorrectly
from that long-ago statistics class is the rule of thumb
Xbar ± 3S. Thus, we have the incorrect but common practice of using the sample average plus and
minus three times the sample standard deviation as a
solution for finding confidence intervals, setting alert
criteria, identifying outliers, testing statistical significance,
and answering other complex statistical questions. While applied
statistics needs to be pragmatic, it cannot be incorrect. Good science and current good manufacturing
practice (CGMP) regulations demand that correct tools
be used for a given problem. Using Xbar ± 3S for the
above purposes is statistically incorrect, particularly
for small sample sizes. Given the widespread use and
abuse of Xbar ± 3S, the topic needs to be
clarified. This paper addresses the misuse of Xbar ±
3S and compares it to confidence intervals, tolerance
intervals, control charts, and Cpk.
INTRODUCTION
Picture yourself sitting in the office of Robert, the
Vice President for production, with Roger, the VP for
quality assurance (QA), during an investigation for
a potential recall due to a near potency failure. The
conversation comes around to the specification criterion and how it was set. Tom, a staff member, explains
to the VPs that the specification criterion was set according to the GMPs, specifically:
21CFR211.110(b) “Valid in-process specifications
for such characteristics shall be consistent with drug
product final specifications and shall be derived
from previous acceptable process average and process variability (standard deviation) estimates where
possible and determined by the application of suitable statistical procedures [Emphasis added] when appropriate” (1).
Tom goes on to say that they had the average and
standard deviation from the 12 validation lots and
that they set the limits using the average plus and
minus three times the standard deviation.
At this point David, the QA statistician, joins the
conversation asking, “Why did you use that?”
Tom replies, “Well in the statistics course I took,
the professor said that the mean plus and minus
three standard deviations brackets 99.73% of the
values.”
David says, “Yes, that is true in theory, but only
if you know the true population mean, mu (μ), and
the true population standard deviation, sigma (σ),
which we almost never do. Here we must estimate
the population mean and population standard deviation from the small sample of 12 values using the
average, (Xbar), and the sample standard deviation,
(S). There will be variability in both of these estimates; what we get in one sample of 12 is just one estimate from many estimates that are possible. Other
samples of size 12 will give different estimates.
“The multiplier of three is correct only if we have
an infinite sample size, but here we have only 12
values. We must take into account the variability of
both Xbar and S due to the small sample size. The
multiplier must be different than three to accommodate the uncertainty in our two estimates.”
David goes on to explain that the limits set in this
case using the sample average plus and minus three
times the sample standard deviation are too narrow,
and thus the lot is on the verge of rejection.
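David's point can be checked with a small simulation. The sketch below is illustrative only and is not part of the original investigation: it assumes a normal population with a known mean of 0 and standard deviation of 1, draws many samples of 12, and asks what fraction of the population each sample's Xbar ± 3S limits actually bracket. The function names, the number of trials, and the seed are all assumptions made for the example.

```python
import random
import statistics
from statistics import NormalDist


def coverage_of_xbar_3s(n, rng):
    """Draw n values from N(0, 1) and return the true fraction of the
    population that lies inside Xbar ± 3S computed from that one sample."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    population = NormalDist(0.0, 1.0)
    return population.cdf(xbar + 3 * s) - population.cdf(xbar - 3 * s)


def simulate(n=12, trials=5000, seed=1):
    """Return (average coverage, fraction of samples whose limits
    bracket less than the nominal 99.73% of the population)."""
    rng = random.Random(seed)
    coverages = [coverage_of_xbar_3s(n, rng) for _ in range(trials)]
    fraction_short = sum(c < 0.9973 for c in coverages) / trials
    return statistics.fmean(coverages), fraction_short


if __name__ == "__main__":
    mean_cov, frac_short = simulate()
    print(f"average coverage of Xbar ± 3S, n=12: {mean_cov:.4f}")
    print(f"fraction of samples covering less than 99.73%: {frac_short:.2f}")
```

Under these assumptions, the limits fall short of the nominal 99.73% in well over half of the simulated samples, which is the practical meaning of "too narrow."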
Spring 2012 Volume 16 Number 2
WHY THE CONFUSION?
It is clear how this confusion occurs. The statistics
professor says μ ± 3σ brackets 99.73% of the area
under the curve or 99.73% of the population of values. This gets translated into the mean plus and minus three standard deviations, which then quickly
morphs into the average of the sample plus and minus three standard deviations of the sample or Xbar
± 3S. It is perfectly understandable but not acceptable and is a violation of CGMPs because it is not an
‘application of suitable statistical procedures.’
CORRECT CALCULATION
The correct way to estimate the natural limits for
the data set of 12 values is to calculate the statistical tolerance interval using Xbar ± KS, where K is a
function of the sample size, a given percentage of reportable values, and a stated confidence (2). The percentage of values can be whatever we wish it to be:
95%, 99%, or even 99.73%. We also need to specify
how confident we wish to be in our statement. This
can be 95%, 99%, or another value. The idea of being
incorrect 5% of the time is not appealing, so many
people will choose a 99%/99% tolerance interval.
This allows them to state that they are 99% confident that 99% of the future reportable values (if nothing changes and the future looks like the past) will
lie within the calculated interval. Note that these are the
natural limits of normally distributed data, but they
can serve as warning or alert limits. Or, as Hahn and
Meeker state, “Such an interval would be of particular interest in setting limits on the process capability
for product manufactured in large quantities” (3).
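To give a feel for how K depends on the sample size, the following sketch computes an approximate two-sided K factor. It is not the method behind the published tables cited above: it assumes Howe's approximation, K ≈ sqrt((n − 1)(1 + 1/n) z² / χ²), where z is the normal quantile for the chosen coverage and χ² is the lower-tail chi-square quantile for the chosen confidence with n − 1 degrees of freedom. The chi-square quantile is obtained numerically so that only the Python standard library is needed, and all function names are illustrative.

```python
import math
from statistics import NormalDist


def chi2_cdf(x, df):
    """Chi-square CDF, computed from the series expansion of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= h / (a + k)
        total += term
        k += 1
    return total * math.exp(-h + a * math.log(h) - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert the chi-square CDF by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def tolerance_k(n, coverage=0.99, confidence=0.99):
    """Approximate two-sided tolerance factor K (Howe's approximation,
    assumed here; published tables may differ slightly)."""
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2_low = chi2_quantile(1.0 - confidence, n - 1)
    return math.sqrt((n - 1) * (1.0 + 1.0 / n) * z * z / chi2_low)


if __name__ == "__main__":
    for n in (5, 12, 30, 100):
        print(f"n={n:4d}  K(99%/99%) ≈ {tolerance_k(n):.2f}")
```

For the 12 validation lots, this approximation gives a 99%/99% factor of roughly 5.1 rather than 3. For actual limit setting, the value should be confirmed against a published tolerance interval table.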
To set action or investigation limits, the limits
should be expanded to account for other sources of
variability. Accept or reject limits are wider and take
into account the known product stability profile.
At this point Robert, the VP for production, jumps
in to exclaim, “You mean to say that we almost recalled this batch because the limits were set using
a rule of thumb and not the most correct technique
available?”
David replies, “Well, in defense of Tom, the sample average plus and minus three times the sample
standard deviation is unfortunately widely used and
abused in the pharmaceutical industry. It has become a universal monkey wrench that people use for
a wide variety of statistical issues.
“This is, unfortunately, a common practice that
must be changed. This lot is only the tip of the iceberg of potential financial losses due to using a rule
of thumb in place of a best approach. I have seen this
used not only to set specification criteria but in place
of tolerance intervals, confidence intervals, significance tests, and as an outlier test. A European agency
has, reportedly, required US companies to set accept
and reject specification criteria using it. Not only is
this incorrect, it presents a high degree of risk for the
company and contradicts the CGMPs.”
CORRECT APPLICATION
There are only two applications where the multiplier
of three is accepted by the statistical community.
These are control charts and process capability
indices such as Cpk and Ppk. Control charts were
developed in 1924 by Dr. Walter Shewhart while
working at the Hawthorne Works of the Western
Electric Company in Cicero, IL. They were intended as
a pragmatic tool and never as an exact probability
statement. Thus, the control limits on a control chart
are by definition (and not theory) the average plus
and minus three times the standard deviation. The
ASTM book states clearly, “The choice of the factor 3
in these limits is an economic choice based on experience that covers a wide range of industrial applications of the control chart, rather than on any exact
value of probability” (4).
The same situation exists for Cpk. The three in the
denominator is only a rough rule of thumb. Cpk was
never intended to be anything other than a crude
indicator of possible process incapability. Given that,
control charts and Cpk should not be used to make
accept/reject decisions but are an inexact suggestion
for further investigation, data collection, and a rigorously correct statistical analysis.
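For reference, the two conventions in which the factor of three appears can be written out in a few lines. This is a minimal sketch, not a complete control chart procedure: it assumes the process center and standard deviation have already been estimated (in practice a control chart usually estimates sigma from within-subgroup or moving-range variation), and the specification limits of 95 and 105 are hypothetical values chosen only for illustration.

```python
def control_limits(center, sigma):
    """Shewhart-style control limits: center line plus and minus 3 sigma."""
    return center - 3.0 * sigma, center + 3.0 * sigma


def cpk(mean, sigma, lsl, usl):
    """Cpk: distance from the mean to the nearer specification limit,
    divided by 3 sigma."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)


if __name__ == "__main__":
    # Hypothetical process: mean 100, sigma 1, specification 95 to 105.
    lcl, ucl = control_limits(100.0, 1.0)
    print(f"control limits: {lcl} to {ucl}")
    print(f"Cpk: {cpk(100.0, 1.0, 95.0, 105.0):.2f}")
```

In both functions the 3 is a fixed convention, not a quantile derived from the sample size, which is why neither result should be read as an exact probability statement.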
Every introductory statistics class eventually addresses the several characteristics of the normal
population or distribution. See the smooth curve in
Figure 1. The mean of the distribution is defined as
mu, μ, and is 100.0 here. The standard deviation of
the distribution is defined as sigma, σ, and is 1.0
here. It is always pointed out that μ ± σ will bracket
68.3% of the area under the curve, μ ± 2σ (actually 1.96σ) will bracket 95.0% of the area under the
curve, and that μ ± 3σ will bracket 99.73% of the
area under the curve. The last is the one that is
most remembered.
Figure 1: Histogram of n=30.
Figure 2: Histogram of SD n=5.
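The textbook percentages for a normal population with known μ and σ can be reproduced directly. The short sketch below uses the standard normal distribution from the Python standard library; the function name is illustrative.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal population lying within mu ± k*sigma."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


if __name__ == "__main__":
    for k in (1.0, 1.96, 2.0, 3.0):
        print(f"mu ± {k} sigma brackets {100 * normal_coverage(k):.2f}%")
```

These figures hold only when μ and σ are the true population values. They say nothing about limits built from a sample average and a sample standard deviation.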
These are true statements, but as was noted, they
apply in theory not in practice where we almost never know the population mean and standard deviation. As can be seen in Figure 1 for the histogram
for 30 values, the average (i.e., the estimate of the
mean) is 99.78 and the sample standard deviation is
0.82. These are close to 100 and 1 but not exact. If
we were to take another sample, the estimated values
would be slightly different. Small samples give poor
estimates and large samples give better estimates.
In practice, we are nearly always working with
small samples such as 5, 10, 20, and 30. In this context,
even samples of 100 may not be large enough. As an
illustration of this, Figures 2-4 show histograms of
sample standard deviations, S, for sample sizes of
5, 30, and 100. Note that for n=5, the estimates of
S range from about 0.2 to 2.2 even when the true
value is 1.0. For n=30, they range from 0.6 to 1.4. For
n=100, they still range from 0.8 to 1.2. This is plus
and minus 20%. There is a lot of variability in our
estimates of variability. All of this comes together to
support the need for a multiplier different from three
for specific applications.
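The spread in S is easy to reproduce. The sketch below is an illustration under stated assumptions, not a reconstruction of the figures: it assumes the same normal population (mean 100, standard deviation 1), draws 1,000 samples at each sample size, and reports the smallest and largest sample standard deviation observed. The number of trials and the seed are arbitrary, so the exact ranges will differ somewhat from those read off the histograms.

```python
import random
import statistics


def sd_range(n, trials=1000, seed=0):
    """Return (min, max) of the sample standard deviation over many
    samples of size n drawn from N(100, 1)."""
    rng = random.Random(seed)
    sds = [
        statistics.stdev([rng.gauss(100.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return min(sds), max(sds)


if __name__ == "__main__":
    for n in (5, 30, 100):
        low, high = sd_range(n)
        print(f"n={n:3d}: S ranged from {low:.2f} to {high:.2f} (true value 1.0)")
```

The range narrows as n grows but does not vanish, even at n=100.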
Figure 3: Histogram of SD n=30.
Figure 4: Histogram of SD n=100.
Figure 5: Calculated 99% confidence interval vs. Xbar ± 3S.
Figure 6: 99%/99% tolerance interval vs. Xbar ± 3S.
Figure 6 shows a 99%/99% tolerance interval vs.
Xbar ± 3S. As can be seen, the tolerance limits become almost the same as Xbar ± 3S for sample
sizes greater than one hundred. For sample sizes below one
hundred, however, the Xbar ± 3S limits are far too narrow,
leading to incorrect estimates.
SUMMARY
A multiplier of three is only accepted for control
charts and Cpk, and even there, it is an inexact and
crude guide. Xbar ±3S is not appropriate for any
other applications and is a possible violation of the
CGMPs. In the pharmaceutical industry, inexact
rules of thumb are not an acceptable way to protect
the public health.
REFERENCES
1. FDA, 21CFR211.110(b).
2. A search of the Internet for “Tolerance Interval Tables” will
find the tables needed.
3. G. J. Hahn and W. Q. Meeker, Statistical Intervals, John
Wiley, p 34, 1991.
4. ASTM, ASTM Manual on Presentation of Data and Control
Chart Analysis, p 78, 1976.
5. The Student t tables are published in every statistics text.
CONFIDENCE INTERVALS
A statistical confidence interval on the mean is a
statement that we believe the true but unknown
mean will lie within an interval with a given confidence, say 99%. In other words, if we were to do
this many times, 99% of the time the interval would
contain the true value. This interval is found using
Xbar ± tS/√n, where t is found from a table as
a function of the sample size and the desired level
of confidence (5).
Figure 5 shows a calculated 99% confidence interval vs. Xbar ± 3S. As can be seen, Xbar ± 3S is
too narrow for small samples and too wide for large
samples. In fact, the confidence interval width goes
to zero for an infinite sample size because n is in the
denominator. This makes it totally incorrect for setting alert limits.
Journal of GXP Compliance
ABOUT THE AUTHOR
Lynn Torbeck is a consultant specializing in applied statistics
and designed experiments for quality assurance, quality control,
validation, and manufacturing under the CGMPs. He has been
in the pharmaceutical industry for all of his career and president
of Torbeck and Associates since 1988. He may be reached by email at [email protected].