AIAA 2010-1710
U.S. Air Force T&E Days 2010
2 - 4 February 2010, Nashville, Tennessee
The “Eglin IV”: Metrics of a Well-Designed Experiment
Gregory T. Hutto,¹ James R. Simpson,² and James M. Higdon³
46 Test Wing and 53d Wing, Eglin AFB, Florida, 32452
Abstract. Since 1997 in Air Combat Command’s 53d Wing, the US Air Force has been
promoting the use of design of experiments as the preferred method of constructing and
analyzing test programs. The purpose of test is to mitigate program risk by revealing early-on problems in system design or performance. An experiment (or campaign of experiments)
is a collection of test events meant to answer some question about the system under test. If
the outcome of the experiment will not impact any decision regarding an acquisition
program, then what is the purpose? Given that an experiment answers a question that
matters (i.e., will impact our opinion in a way that changes our approach to managing the
tested system), we can agree that it is important that the test is designed such that the risk of
coming to the wrong conclusion is not a byproduct or accident, but something we measure,
manage, and control. Since our focus on risk drives us to test, risk should be the focus of
experimentation. Our approach to navigating through experimental risk must be methodical
and exhaustive in that we leave no aspect of test to chance if it has reasonable likelihood of
corrupting the answer. Programmatic test risk is the responsibility of test
directors/managers/engineers. Threats to answering the test question may include budget,
schedule, manpower, equipment, materials, or other resources. An often overlooked
category of risk is that of ‘chance’; the magnitude of that risk is buried among the test
design, collection, and analysis methods. Randomness in outcomes, background factors that
corrupt the data, or unknowns that we cannot even measure all threaten to lead us to wrong
conclusions. The science of random variation is statistics and the science of test is the
statistical Design of Experiments (DOE). The principles, processes, methods, techniques, and
tools of DOE give us the means to minimize the chance we fail to give a correct answer due
to inefficient use of resources, not investigating the important conditions, having too little
fidelity, or confusing one effect with another. The four metrics in this paper address these
risks and lay out a case for ensuring they are all covered in every program. The scope of
this paper, then, is the characteristics of a well-designed experiment using the concepts of
DOE applied to Defense acquisition programs. There are other techniques and tools of
applied statistics that are useful (and employed throughout the Defense industry), but none
that provide the overarching processes to capture and address all the risks inherent to test.
We include other statistical techniques where appropriate, but under the umbrella of our
general, DOE-guided approach. There are other sciences (mathematics, probability, computer science) that are appropriate for some problems, even to the exclusion of statistics. In these cases, we can still apply them with an analogous strategy of identifying and managing risks with the right tools. This paper focuses on statistical problems, but the ideas can be applied to other problems as well, even software test.

¹ Wing Operations Analyst, 46 Test Wing, Room 211, Bldg 1.
² Group Operations Analyst, 53d Wing, Suite 608, Bldg 351.
³ Senior Analyst, Jacobs Engineering, 46 Test Squadron.

This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States.
Nomenclature

α = Statistical error Type I – false positive; declaring a difference that does not exist
β = Statistical error Type II – false negative; failing to detect a true difference that does exist
1-α = Statistical Confidence – probability of avoiding a Type I error
1-β = Statistical Power – probability of avoiding a Type II error
π = Probability of success in a Bernoulli trial or Binomial series of trials
p = Level of significance of a test statistic – probability of a value more extreme by random chance
Y = The test response variable(s) or matrix – often referred to as measure of performance
X = The matrix of test conditions under which the response is measured
βi = ith regression coefficient, or matrix of regression coefficients relating the Y response to the X conditions
ε = The matrix of all noise causes: instrumentation errors, daily variation, test point inaccuracy, etc.
Genesis.
On the lead author's desk are two examples of "designed experiments." Both are from a large, professional test
center, both are seriously offered as solutions to the test problems they address, and both are so poorly done that
they would earn a failing grade in any underclass course in experimental design. One is the test of a drone target for
aerial missile shots. The test calls for twelve one-hour sorties repeating a particular drone profile specified in Mach, altitude, maneuver, and countermeasures. The design is a replicated full factorial: one three-level variable and one two-level variable (six unique cases) with two replicates. Unfortunately, the two variables have little to do with
the physics driving the drone's presentation as a target, and the response variable (measure of performance, MOP)
is undefined – "we'll know good performance when we see it." A second experiment calls for using 19 weapons shots as follows: 14 demonstration shots and a five-shot, three-variable I-optimal design. Now both of these are, in
fact, designed experiments, but they are poorly designed. The first does not define success and wastes 12 flight
hours that could be more profitably spent examining hundreds of drone “presentations” that will reliably show its
worth as an aerial target. The second design is so small that it lacks statistical power to detect even the largest of
defects aside from total failure. So, clearly, simply calling for experiments to be designed experiments will not
solve the problems facing the T&E Enterprise. Additional impetus has been lent by recent DOD-level interest in
promoting DOE across both OT&E and DT&E in the Services (Gilmore, 2009).
We must have more objective and reliable metrics that separate well-designed experiments from those that are not. That problem is the genesis of
this paper.
Introduction
A well-designed test or experiment has a number of characteristics that distinguish it from those that are less well-designed. The experiment should be crafted by experts in the field of study, investigate an appropriate region of
operations and science, be as efficient as possible to conserve resources, and measure the performance quantities
involved rapidly and accurately. Data control should be rigorous, databases archived securely and rapidly, etc. Dr. Richard Kass of the US Army Operational Test Command sketches out a number of contextual factors that indicate the experiment is likely to yield meaningful conclusions in his text The Logic of Warfighting Experiments (Kass, 2006). This paper concerns hallmarks of well-designed experiments that are drawn from the field of Design of Experiments (DOE).
Since 1997 in Air Combat Command's 53d Wing, the US Air Force has been promoting the use of design of
experiments as the preferred method of constructing and analyzing test programs. The purpose of test is to mitigate
program risk by revealing early-on problems in system design or performance. An experiment (or campaign of
experiments) is a collection of test events meant to answer some question about the system under test. If the
outcome of the experiment will not impact any decision regarding an acquisition program, then what is the purpose?
Given that an experiment answers a question that matters (i.e., will impact our opinion in a way that changes our
approach to managing the tested system), we can agree that it is important that the test is designed such that the risk
of coming to the wrong conclusion is not a byproduct or accident, but something we measure, manage, and control.
Since our focus on risk drives us to test, risk should be the focus of experimentation. Our approach to navigating
through experimental risk must be methodical and exhaustive in that we leave no aspect of test to chance if it has
reasonable likelihood of corrupting the answer. Programmatic test risk is the responsibility of test
directors/managers/engineers. Threats to answering the test question may include budget, schedule, manpower,
equipment, materials, or other resources. An often overlooked category of risk is that of 'chance'; the magnitude of
that risk is buried among the test design, collection, and analysis methods. Randomness in outcomes, background
factors that corrupt the data, or unknowns that we cannot even measure all threaten to lead us to wrong conclusions.
The science of random variation is statistics and the science of test is the statistical Design of Experiments (DOE).
The principles, processes, methods, techniques, and tools of DOE give us the means to minimize the chance we fail
to give a correct answer due to inefficient use of resources, not investigating the important conditions, having too
little fidelity, or confusing one effect with another. The four metrics in this paper address these risks and lay out a
case for ensuring they are all covered in every program. The scope of this paper, then, is the characteristics of a
well-designed experiment using the concepts of DOE applied to Defense acquisition programs. There are other
techniques and tools of applied statistics that are useful (and employed throughout the Defense industry), but none
that provide the overarching processes to capture and address all the risks inherent to test. We include other
statistical techniques where appropriate, but under the umbrella of our general, DOE-guided approach. There are
other sciences (mathematics, probability, computer science) that are appropriate for some problems, even to the
exclusion of statistics. In these cases, we can still apply them with an analogous strategy of identifying and
managing risks with the right tools. This paper focuses on statistical problems, but the ideas can be applied to other
problems as well, even software test. We have identified four broad classes of metrics that can be measured for any
test program. We divide the metrics into general (suitable for executive review) and expert (suitable for review by
experts in experimental design). Both levels of metrics are important.
I A. Macrocampaign. Sequential Experimental Design Campaign – Simulation to Live
I B. Microcampaign. Salvo Test – Screening, Investigation, Confirmation
II A. Design Spans Battlespace: k & Levels
II B. Design Attains Power and Confidence | N
III. Execute with Randomization and Blocking
IV. Design Enables Assessing Relationship: Separability of Causal Factors
Eglin I. Sequential Experimental Design Campaign.
IA. Macrocampaign. The battlespace for most systems undergoing military T&E is vast – tens of possible variables resulting in many thousands or millions of possible unique test conditions. We seldom know which of the test conditions will matter most in affecting system performance (though we have suspicions). Military T&E shares
this situation with many other domains, including product design, research, and manufacturing/processing.
Consequently, an experimental best practice is to structure the overall test program in such a way as to test in stages,
with appropriate objectives, designs, and periods for analysis, understanding, and re-design. It is seldom wise to
devote more than 25% of total resources to one experiment (Box, 2004). The information revealed at each stage of
experimentation is invaluable in considering how to continue the investigation. We can use the output of early
stages to accelerate our learning about the process, validate (or improve) our various simulations, and refine the
active battlespace for later stages of test. At the outset of the test, we seldom know which factors are most
important, the appropriate range of factor values, the correct number of levels of each factor, repeatability or noise in
the process, and many other facts. We are seldom well equipped to answer these questions at the outset – but correct
answers are key to effective and efficient experimentation.
Hallmarks of staged experimentation documented in test strategies
(TS, TEMP, Test Concepts & Plans) should include a series of
objectives, factor and response tables, initial designs, and estimated
resources for each stage of experimentation, plus a management
reserve appropriate to the risk level of the process under test.

Excellence Checklist for: Sequential Experimentation
• Objectives for each stage with the proposed level of simulation (digital through live)
• Estimated resources for each stage (often ~25% or less each)
• Thorough list of candidate test conditions and measures
• Initial experimental design with design metrics (remainder of the Eglin Four)
• Discussion of likely re-design strategies following initial exploration
• Management reserve from ~10% to 30% depending on assessed risk and technology maturity

To illustrate this metric, consider a current USAF weapon program, whose staged strategy might be briefly described as follows: "Thorough factor screening with I-optimal designs spanning 20 multi-level variables using digital simulation (~5,000-10,000 runs, 10-15 designs). Objective: identify the 8-12 most important test conditions, levels, and expected performance. Based on Stage I results and predicted performance, conduct a campaign of ~2,000 HWIL runs followed by 1,000 captive carry events. Objective(s): validate strengths and weaknesses of the digital simulation, further screen factors to the 4-6 most important, and refine performance predictions. Based on Stages I-III, conduct a 2-level factorial or fractional factorial for 4-6 variables (with replicated points) in 16-24 runs in live releases. Objective(s): validate strengths and weaknesses of the digital, HWIL, and captive simulations, and estimate expected performance. Reserve 5 live events to demonstrate less-important features and carry a management reserve of 30-40% (6-8 events) for expected technical risk."

Staged experimentation is quite common in DOD programs and is, in fact, the way we acquire most systems through contractor R&D, contractor test, developmental test, and operational test. This metric is different in that it envisions designed experiments as the tool of choice to conduct the experimentation. With a common architecture, language, and toolbox for test, DOE uniquely enables integrated KT/DT/OT testing in a fashion that has long been called for, but is seldom done in practice.
I B. Microcampaign. Within each experimental environment, it is
appropriate to outline a sequence of experiments. Initially, the objective will often be to screen many factors to
identify the few that drive the process outcome. Identifying the single and combined factors active in changing the
response is one key outcome of early experiments. Commonly, just a few of many factors that can possibly affect
the measures of performance actually do so. Identifying these active and inactive factors is referred to as
“screening.” The testers may then wish to reduce the size of the battlespace explored in subsequent tests. Just as
importantly, we may discover unexpected facets of system performance such as nonlinear behavior, low (or high)
noise in the experimental region, isolated unusual runs that do not conform to similar cases, features of the
battlespace that are not well represented in this stage of experimentation (target backgrounds, natural environments,
human reactions, etc). In the case of each of these outcomes, we have learned valuable information about process
performance that should be used to redesign the next stage of testing. Many options for subsequent test stages are
possible, including (Montgomery, 2004):
• Perform one or more confirmation runs to verify predicted or unusual performance
• Change the range or levels of one or more factors to further explore performance trends
• Augment the existing design with additional runs to explicitly capture nonlinear performance
• Repeat some runs, either because of incorrectly run settings or to improve estimate precision
• Drop factors because they have negligible effects on performance
• Add one or more new factors suggested by engineering learning accompanying the test
• Make engineering changes to simulations because predicted performance was not validated in live test
A typical sequence of tests that should be outlined in each stage may include screening (or confirmation of previous
stage results), investigation, exploration and understanding, and finally prediction and confirmation. Each stage
should be appropriately budgeted and scheduled in the test strategy.
The first metric addresses a series of designs – almost always the way to proceed. The remaining three metrics can, and should, be assessed for each design in the campaign.

Eglin II. Design Spans Battlespace with Confidence and Power: Factors & Levels.

Excellence Checklist for: Spanning Battlespace
• Evidence of in-depth, integrated brainstorming by all parties: fishbone diagrams, process flow diagrams
• Thorough list of candidate test conditions. The table should include:
  - Name of the test condition
  - Unit of measurement (real physical values much preferred over labels)
  - Range of physical levels
  - Estimated priority of the condition in describing system performance
  - Expense/difficulty of controlling the level: easy, hard, very hard to change
  - Proposed strategy of experimental control for each condition: constant, matrix variable, or noise (if noise, specify covariate, randomized, or random effect)
  - Levels chosen for the experimental design
• Name of the chosen design strategy to assign variable levels to runs, and N, the number of total runs
IIA. Span Battlespace. In exploring the performance of multi-purpose
military systems, a huge battlespace is the norm. That is, the system is
intended to work across a number of platforms, in diverse environmental
and weather conditions, against many types of enemy systems, deceptions,
and countermeasures, and from a variety of geometric and kinematic
engagement conditions. In actual practice using experimental design,
more than 25 possible
test conditions are
commonly encountered.
The conceptual
embedded table drawn
from an AGM-65
Maverick missile test is
typical – many more than
12 conditions are possible, with multiple levels for each condition. To
determine how many unique combinations are possible, simply multiply
the number of contemplated levels together – there are nearly 140,000
unique combinations! Testing all combinations of these 12 variables is
clearly prohibitive in any real environment, and even running this many
digital simulations would be daunting. The problem, nonetheless, remains
– the battlespace is large and we are charged to explore it. The experimental design metric described here concerns the number of factors to be explored, the levels chosen for each factor, and the experimental design strategy used to spread them across the battlespace.
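To make the combinatorics concrete, the short Python sketch below simply multiplies the contemplated level counts together. The factor names and level counts are hypothetical placeholders for illustration, not the actual AGM-65 test conditions.

# A minimal sketch of the battlespace-size arithmetic described above: the number
# of unique combinations is the product of the level counts. Factor names and
# level counts below are invented placeholders.
from math import prod

candidate_factors = {
    "launch range": 4,         # number of levels contemplated for each condition
    "altitude": 3,
    "airspeed": 3,
    "target type": 5,
    "clutter background": 3,
    "time of day": 2,
    # ... additional conditions multiply the total further
}

unique_combinations = prod(candidate_factors.values())
print(f"{len(candidate_factors)} factors -> {unique_combinations} unique combinations")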
A best practice is for an integrated team of scientists, engineers, operators, designers, and maintenance technicians
from the various interested organizations to spend one or two days brainstorming the factors included in the
battlespace – how the system will be used, against what targets, in what environments, etc. An experienced
facilitator, often a seasoned experimentalist, can greatly assist
the team in this endeavor with simple tools like system
operations process flow diagrams and Ishikawa or “fishbone”
diagrams. The fruit of this brainstorming is a set of diagrams resulting in a list of possible test conditions and measures of performance that define the desired battlespace. Next, the
team prioritizes these lists using the best available
understanding since it is seldom possible to consider
everything. Finally, the team considers alternate design
strategies to span the desired space (usually in a sequential
campaign of experiments – see Eglin I.) Many experimental
design strategies are available, suited for a wide range of
possible objectives and circumstances: comparative two-level designs, both 2-level and general factorial designs,
fractional factorial designs, split-plot designs, response surface designs for nonlinear processes, several classes of so-called "optimal designs", and special purpose designs such as Plackett-Burman, space-filling, and many sorts of
orthogonal arrays, to name a few. The inset graphic is a JMP-constructed scatterplot matrix showing the assignment
of runs in an 83-run mixed level factorial design suggested for a recent decoy target test at Eglin. The team should
be advised by a trained and experienced experimentalist or statistician to guide their selection. Often there are
several good choices that can be made for a given budget, but serving different experimental objectives.
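As a minimal illustration of the fully-crossed (general factorial) strategy among those listed above, the sketch below simply enumerates every combination of a few hypothetical factors and levels; a fractional or optimal design would then select a subset of these candidate rows using DOE software such as JMP.

# A small sketch of enumerating a fully-crossed (general factorial) candidate set
# once factors and levels are chosen. The factors and levels here are illustrative
# assumptions only.
from itertools import product

factors = {
    "clutter": ["low-desert", "medium-forest"],
    "target motion": ["fixed", "moving"],
    "visibility": ["low", "high"],
}

full_factorial = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for run, condition in enumerate(full_factorial, start=1):
    print(run, condition)   # 2 x 2 x 2 = 8 candidate test conditions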
Eglin IIB. Design Attains Power and Confidence for Number of Trials. In the presence of noise (measurement error, run-to-run variation) in outcomes, we are faced with the inescapable risk of making a wrong inference about what is happening during a test. When we observe a shift in performance during one or more trials, is the shift indicative of a causal effect based on changing test conditions, or is the shift instead merely a product of the noise in the system? Statisticians call the mistake of declaring a change to be due to test conditions (when it is not) an error of type I, denoted by the Greek letter α (alpha). Conversely, the mistake of failing to detect such a shift when it exists in reality is called an error of type II, or β (beta). Making correct decisions in the face of noise is a hallmark of a well-designed experiment: 1-α is called Confidence and 1-β is called Power. In the world of testing to requirements and specifications, loosely speaking, we would like to be confident that the performance changes we associate with test conditions really are thus associated; we would also be loath to mistakenly miss important changes in performance because we lacked sufficient power in our designs. A few illustrations will help make these concepts more concrete. In what follows, we will fix confidence at some acceptable level (e.g., 95%) and consider effects on power. In a recent guided weapons test, the contractor portion of the test matrix included only 8 trials, all near the gravity-fall line of the weapon, requiring little in the way of course corrections in the guidance package. This test design suffered from two causes – too few trials to create much power, and a set of points that sat in the center of the battlespace, thereby failing to span the intended use envelope (Eglin metric II). As a consequence, a critical aerodynamic instability was not discovered until operational testing began. At the opposite extreme, a recent F-16 test to markedly reduce the magnitude of vibration in the ventral fins was conducted in over 320 test conditions, many of them repeated multiple times. In this case, the design was over-powered to detect the desired reduction, resulting in wasted resources. As a general rule, simulation and HWIL experiments often consist of too many trials, while expensive live tests suffer from too few trials and lack power.

Statistical power is influenced by many factors, but one of the most important is the total number of runs contemplated in a design. As a general rule, more trials (properly designed) lead to higher power.

Excellence Checklist for: Statistical Power
• Design reports power for one or more key performance measures (e.g., accuracy, P(success), task time). Best practice: power reported for a range of total events (N).
• Power report lists the other power analysis parameters:
  - Name of experimental design strategy (factorial, optimal, response surface, etc.)
  - Size of design – number of factors and levels planned
  - N – total number of trials
  - Chosen level of α error for these power values
  - Expected process noise and the basis for its estimate (sigma, σ)
  - Shift in process performance to be detected and the basis for its importance (delta, δ)
• Name of the chosen design strategy to assign variable levels to runs, and N, the number of total runs
In this embedded graphic, we see an illustration of two other factors that
figure in the computation of power – the amount of movement in the response considered important in the system,
and the amount of process noise (Greek letter sigma, σ). In the graphic, the shift is denoted by the Greek letter δ
(delta). Tests falling in the lower left quadrant (noisy-small shift) require the most trials, while tests in the upper
right (large shifts, low noise) require fewer. Consider two examples from the lower left and upper right quadrants to
illustrate these principles. From the lower left first – in B-1 Block D testing, one of the required capabilities was to
demonstrate offset aim points of 50 feet or more for unguided weapons such as the Mk-82 500 pound bomb. In
other words, the aircrew could designate a visible target as the offset aim point and command the fire control system
to place the unguided bombs at a designated distance and angle to the aim point, e.g. 50 feet due west of the aim
point. From 20,000 feet, unguided weapon delivery accuracy on the B-1 is measured in hundreds of feet, making
the process quite noisy. Conversely, the shift required to be detected was on the order of 50 feet. Tests of this sort
require many trials to attain statistical power. In the upper right example, consider IR countermeasures. IR countermeasure testing, such as that for the Large Aircraft IR Countermeasure (LAIRCM) system, compares the miss distances (or hit performance) of man-portable IR surface-to-air missile systems against aircraft such as the C-17.
Such tests are often conducted either in a hardware in the loop test facility such as the GEWF, or in open air testing
at Eglin and elsewhere. Other test variables under investigation often include different fielded versions of the
missiles, geometric variables like range, aspect and altitude of the engagement, different environmental conditions in
both the ultraviolet and infrared, and perhaps even different jamming techniques or aircraft engine settings. With no
IR countermeasures from these large targets, missile performance measured by miss distance is very small and
consistent. When the countermeasures are working as designed, miss distances are very large and quite variable.
Thus, in this case, the process noise without countermeasures is quite small and the shift desired to be detected is
much larger, requiring relatively few trials to achieve statistical power. As a general rule, power grows dramatically over the first few trials, and further gains in power become increasingly expensive as the number of trials increases. In other words, fewer additional trials are required to improve power from 30% to 40% than are required to move from 70% to 80% power.
This principle is depicted in the embedded graphic: computations of power are given for a simple design with 3 variables, each at 2 levels – comprising 8 total combinations for each replicate of the design. To tie this example in with our previous ones, consider the B-1 Block D unguided bomb test. Such a 2³ design (shorthand for 3 variables at 2 levels each) might consist of two altitudes (15K-20K ft), two airspeeds (500-600 kts), and two bomb types (Mk 82, Mk 84). The graphic⁴ was computed in Excel using output from Dr. Russell Lenth's Pie-Face Java applet for computing power (Lenth, 2006). Typically, minimum power is in excess of 80%, with a desired goal of 90-95% if achievable and affordable. To translate desired power in terms of this B-1 example, one can see that 80% power is achieved at about 32 events (4 replicates of 8 conditions), while 95% power might require 56 events, or 7 replicates of the 8-trial design. In terms of sorties, assuming 12 passes per sortie, we are considering 3-5 sorties – probably not unaffordable in this particular test case, but perhaps not a capability important enough to devote these resources to in the age of JDAM. One of the beauties of having this capability to compute power vs. required resources is that such resource-priority tradeoffs can now be made with objectivity, rigor, and a clear understanding of the competing risks for the total test program.

⁴ Menu choices: Balanced ANOVA > Three-Way ANOVA (additive model) > a=b=c=2; SD=1, delta=1, α=0.05, remainder defaults > Options > Graph > Power vs. Replicates > Show Data. Copy and paste to Excel and replace replicates with 8 × replicates to show total weapon delivery events. Note that "additive model" is a liberal power estimate – only the three main effects and the intercept are in the model.
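The sketch below reproduces the shape of this power-versus-trials curve with a simple normal approximation for detecting a main-effect shift delta against run-to-run noise sigma in a balanced two-level design. Lenth's applet and most DOE packages use the exact noncentral t/F distributions, so treat these numbers as approximate; the settings delta = sigma = 1 and alpha = 0.05 mirror the footnote above.

# A rough, normal-approximation sketch of power vs. total trials for detecting a
# shift `delta` between the two levels of a factor, with noise `sigma`, when the
# N runs are split evenly between the levels. Exact calculations (noncentral t/F)
# will differ slightly.
from math import sqrt
from scipy.stats import norm

def approx_power(n_total, delta=1.0, sigma=1.0, alpha=0.05):
    """Approximate power of a two-sided test for a main-effect shift `delta`."""
    se = 2.0 * sigma / sqrt(n_total)          # std error of the estimated shift
    z_crit = norm.ppf(1.0 - alpha / 2.0)      # two-sided critical value
    return norm.cdf(delta / se - z_crit)      # prob. the shift is declared real

for replicates in range(1, 8):                # replicates of the 8-run 2^3 design
    n = 8 * replicates
    print(f"N = {n:2d} events  ->  power ~ {approx_power(n):.2f}")

Under these assumptions the approximation lands near 80% power at 32 events and in the mid-90s at 56 events, consistent with the B-1 example above.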
Eglin III. Execute with Randomization and Blocking. An often-neglected aspect of planning for designed experiments is controlling the fashion in which the design will be executed. Unbeknownst to the experimenter, nuisance factors change while the test is being conducted. These factors can include lot-to-lot variations in raw material, changes in the design of the system, learning, warm-up, wear-out, fatigue, boredom, weather, etc. Within a homogeneous unit of experimentation, the solution to background changes is randomization. That is, the order of the trials is randomized to ensure that changes in the experimental conditions do not line up with, or become confounded with, the background changes. As with many principles of experimentation, this one has its roots in agriculture. Plant varieties were randomized in their location in fields to eliminate local changes in drainage, fertility, sun, and other factors from aligning with the changes in experimental conditions. Between fields, a comparable sequence of trials was run to isolate field-to-field differences. Randomization, then, was Fisher's expedient to "foil every device of the devil" (Fisher, 1935) in confusing the outcome of the experiment. Classical experimental designs break the total test design into sequences of trials that can be accommodated by the experimental set-up: the number of passes per mission, the number of mixtures that raw materials can accommodate, the number of troop evolutions that can be run during an exercise shift, etc.

Excellence Checklist for: Test Execution
• Name of the chosen design execution strategy to account for background change, e.g.:
  - Factorial in blocks
  - Split-plot design with easy- and hard-to-change factors
  - Analysis of covariance
  - Mixed design – fixed and random variates in the model
• Describe methods to control the background
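A minimal sketch of this execution principle follows, assuming a replicated 2³ factorial flown as one replicate per sortie (the factors, levels, and block size are illustrative only): randomize the run order within each block so that background drift cannot line up with the intentional factor changes.

# A minimal sketch of randomization within blocks. Here a "block" might be one
# sortie or one test day; the factors and levels are invented for illustration.
import random
from itertools import product

random.seed(2010)                       # fixed seed so the plan is reproducible

factors = {"altitude": ["15K", "20K"],
           "airspeed": ["500 kts", "600 kts"],
           "bomb type": ["Mk 82", "Mk 84"]}

one_replicate = [dict(zip(factors, c)) for c in product(*factors.values())]

n_blocks = 3                            # e.g., three sorties, one replicate each
for block in range(1, n_blocks + 1):
    runs = one_replicate.copy()
    random.shuffle(runs)                # random run order *within* the block
    for order, condition in enumerate(runs, start=1):
        print(f"block {block}, run {order}: {condition}")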
More complex patterns of experimental execution can be implemented as well. Factors that change in the background, that cannot be controlled but can be measured, can be treated as "covariates," and the analysis can be augmented to include analysis of covariance so that the effects of the nuisance factor can be isolated both from the effects of the intentional experimental factors and from the estimate of run-to-run variation (the error term). Hard-to-change factors that would otherwise argue against complete randomization can be changed according to an experimental design called a split-plot design. Again, the agricultural roots are apparent. Plots of land were divided into sections, with hard-to-change factors such as irrigation applied to the whole plot, while easy-to-change factors such as plant variety and weeding were randomized within the whole plots. Split-plot designs are not without drawbacks, since the lack of complete randomization weakens the cause-and-effect relationship between the hard-to-change factors and the response, but such designs can maintain a well-defined relationship among the easy-to-change factors as well as the "whole plot" (hard-to-change) and "split plot" (easy-to-change) factor interactions.
Eglin IV. Design Enables Assessing Relationship: Separability of Causal Factors. A unique feature of tests constructed with designed-experiments methods is the ability of the tester to link suspected cause(s) to the observed effects. That is, the test matrix is designed in such a way as to (mostly) unambiguously link changes in system performance to the changes in the test conditions that are associated or correlated with the performance change. It is too strong to claim "cause and effect" chains, since statistical sampling never establishes that strong a linkage, but association and correlation can certainly be detected. Cause-and-effect reasoning is the prerogative, and responsibility, of the subject matter expert: the physicist, engineer, operator, or scientist who can explain why the performance changed in the pattern observed in the test. Nonetheless, being able to link changes in performance to test conditions is a powerful feature of a well-designed experiment. We call this linkage "separability."⁵ To understand when separability exists, let us first consider a toy example when it does not.

⁵ Mathematically, this condition is referred to as orthogonality – that is, the design is constructed such that patterns of test conditions that measure single factors acting alone (main effects) and those that measure factors acting together to change the response (interactions) can be clearly distinguished by careful arrangement of test conditions. More specifically, the columns in the design matrix are mutually at right angles (orthogonal) to each other. As the number of factors and levels increases, we carefully depart from full orthogonality.

Excellence Checklist for: Separability of Test Conditions
• Name of the chosen design strategy to assign variable levels to runs. Also:
  - Size of design – number of factors and levels planned
  - N – total number of trials
• If fewer than all combinations (the full factorial) are run, report design resolution – e.g., Resolution III, IV, V. Also:
  - Describe the strategy to augment the fraction to resolve confounding: augmenting runs, foldover, predict-confirm, etc.
  - Report an estimate of N*, the number of augmenting runs required
• [Advanced] Other than for a full or classical 2-level fractional factorial, report:
  - Maximum and average variance inflation factor
  - Prediction error vs. fraction of the design space
  - Examination of the X'X matrix, reporting the predominant design resolution
  - The model supported: e.g., main effects only, main effects plus two-factor interactions, quadratic, etc.

In the example below, consider a test design where engineers examine target seeker performance in two cases – the best case with a moving target in a low-clutter environment and the worst case when the target is fixed in a moderate-clutter environment. Let us say that four trials were conducted in both the best and worst case for a total of eight trials. Suppose further that the seeker performs well (green-shaded) in the best case, but poorly (red-shaded) in the worst case.

Example of "Best-Worst" Case Design
                          Clutter
Tgt Motion         Low-Desert    Medium-Forest
Fixed 0 kts             –          Worst - 4
Moving 30 kts       Best - 4           –

Should the engineer improve the seeker clutter performance or the ability to detect targets with low relative motion? A moment's thought will show that, in this test design, the effects of clutter and motion are bound together and cannot be separated. Statisticians refer to this condition as "confounding" – the pattern of tests is such that the individual and combined effects of the conditions cannot be separated, no matter how many trials are conducted. The solution to confounding is to cross the test conditions – look at more combinations so that the clutter and motion effects can be estimated separately. Such a design is referred to as a "factorial" or fully-crossed design in that we test all combinations of the two variables, as shown below. Note that we still have 8 total trials, but now we examine more combinations of battlespace conditions.

Example of "Factorial" Design
                          Clutter
Tgt Motion         Low-Desert    Medium-Forest
Fixed 0 kts             2              2
Moving 30 kts           2              2

With this factorial, three patterns of performance will lead us unambiguously to the conditions associated with poor performance, and therefore provide physical clues to assist the engineer in explaining the observed pattern of performance. In the three cases illustrated in the accompanying 2×2 layouts ("Failure: Clutter," "Failure: Lack of Motion," and "Failure: Fixed Tgt in Clutter," each with 2 trials per cell), the red shading indicates seeker performance is poor in clutter, against fixed targets, and finally against fixed targets in clutter. Any of these three cases could account for the observed performance in the "best-worst" case scenario examined earlier. In the case of clutter or motion acting alone to degrade seeker performance, we refer to these as "main effects," while we refer to the case where both clutter and motion are required to explain poor seeker performance as an interaction – in this case, a two-factor interaction. This toy example illustrates the
mathematical principles involved – extending these principles to many factors at many levels is less straightforward.
The mathematics behind this fourth metric are truly elegant and powerful, invented in the 1920s by Sir R.A. Fisher
of Britain. He referred to this field of statistics as “The Design of Experiments” and laid out these principles in an
eponymous book in 1935. Many variations on this theme have been derived in the ensuing 90 years.
Let us first consider the 2-variable case again, this time in a slightly different arrangement. For convenience, we will code the levels of the variables in terms of ±1 and assign the letter A to clutter and B to motion. For simplicity, we will examine only the four unique conditions formed by the two test conditions (the first replicate of the design above). In this matrix arrangement, we see the test conditions in both coded and natural units.

Factorial Design in Coded (Geometric) View with Response
Case  A:Clutter  B:Motion  Mean    A    B   AB  DetectTime(s)
 1    low        fixed        1   -1   -1   +1        6
 2    med        fixed        1   +1   -1   -1       30
 3    low        moving       1   -1   +1   -1        6
 4    med        moving       1   +1   +1   +1       30
Estimates                    18   12    0    0
We see an invented column of responses – time for the seeker to detect the target. We also see a new row – entitled
“Estimates”. This row speaks to how much changing the factors alone (A and B columns) and combined (AB
column) affect detection time. The AB (interaction) column is constructed by multiplying A*B columns. The
“Mean” column contains the average of all four runs. A brief glance at the table shows that in low clutter (A=-1)
detection time is 6, and in high clutter (A=1), detection time is 30 seconds. In other words, when A changes two
coded units from -1 to +1, detection time increases by 24 seconds. How much does detection time change when A
changes one coded unit? One half of 24 seconds or 12 seconds. This change in the response due to one unit change
in the test condition is referred to as a “regression coefficient”. In one sense, factorial designs are highly efficient
means of estimating these regression coefficients, or an underlying, empirical input-output relationship (math
model) relating the test conditions to the responses. In the coded units, the columns are also the equations used to
make these estimates. That is, the estimate for the regression coefficient of A is simply half the difference between the average response at the high value of A and the average response at the low value of A. Similarly with columns B and AB. The equation
for the mean is the familiar summing up of all four response values and averaging them over the four runs:
(30+30+6+6)/4 = 18. In four trials, we solve uniquely for four unknowns – the mean, the effects of A and B
variables, and the AB interaction. All estimates are independent of the others, unique, and the equations are clearly separable.⁶ No other arrangement of test conditions consisting of four runs is as efficient as this one. Because the effect of every variable can be uniquely estimated, we would refer to this design as having "full resolution."

⁶ In mathematical terms, the four columns/vectors are mutually orthogonal – the product of any two columns is 0.
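Because the coded columns are mutually orthogonal, the "Estimates" row can be reproduced with a few lines of arithmetic: each estimate is the dot product of its column with the responses, divided by the number of runs (equivalently, half the high-minus-low average difference). The sketch below uses the four coded runs and the invented detection times from the table above.

# A short numpy sketch of the estimate arithmetic just described for the coded
# 2x2 factorial. Columns: Mean, A (clutter), B (motion), AB; rows: the four runs.
import numpy as np

X = np.array([[1, -1, -1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1,  1,  1,  1]])
y = np.array([6, 30, 6, 30])                            # detection time, seconds

estimates = X.T @ y / len(y)                            # orthogonal-column shortcut
for name, value in zip(["Mean", "A", "B", "AB"], estimates):
    print(f"{name:4s} = {value:g}")                     # Mean=18, A=12, B=0, AB=0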
To investigate a design with less than full resolution, consider adding another variable, say visibility, as factor C.

Factorial Design in Coded (Geometric) View with Response
Case  A:Clutter  B:Motion  Mean    A    B   AB  C:Visibility    C   AC   BC  ABC  DetectTime(s)
 1    low        fixed        1   -1   -1   +1  low            -1   +1   +1   -1        6
 2    med        fixed        1   +1   -1   -1  low            -1   -1   +1   +1       30
 3    low        moving       1   -1   +1   -1  low            -1   +1   -1   +1        6
 4    med        moving       1   +1   +1   +1  low            -1   -1   -1   -1       30
 5    low        fixed        1   -1   -1   +1  high           +1   -1   -1   +1        6
 6    med        fixed        1   +1   -1   -1  high           +1   +1   -1   -1       30
 7    low        moving       1   -1   +1   -1  high           +1   -1   +1   -1        6
 8    med        moving       1   +1   +1   +1  high           +1   +1   +1   +1       30
Estimates                    18   12    0    0                  0    0    0    0

One can see in the expanded matrix with 2×2×2 = 2³ = 8 runs that the new variable (C) is made to have no effect. The new columns AC, BC, and ABC account for the possibility that the variable C might work with (interact with) the existing variables and their interaction AB. We have added four runs (rows) and an additional four equations (columns), allowing us to solve for eight unknowns in eight runs. Again, all estimates are separable (mutually orthogonal). One final example – retain the three variables, but now reduce the matrix to four runs only.
One Half Fractional Factorial Design in Coded (Geometric) View with Response
Case  A:Clutter  B:Motion  Mean    A    B   AB  C:Visibility    C   AC   BC  ABC  DetectTime(s)
 1    low        fixed        1   -1   -1   +1  high           +1   -1   -1   +1        6
 2    med        fixed        1   +1   -1   -1  low            -1   -1   +1   +1       30
 3    low        moving       1   -1   +1   -1  low            -1   +1   -1   +1        6
 4    med        moving       1   +1   +1   +1  high           +1   +1   +1   +1       30
Estimates                    18   12    0    0                  0    0   12   18
In this case, we have retained the same model, but the estimate for BC is now the same as for A, and the estimate for ABC is now the same as the overall mean. If one examines the patterns of pluses and minuses in the columns, it can be noted that the pattern for C is identical with AB, A=BC, B=AC, and Mean=ABC. With only four runs, we can create only four unique equations, meaning we abandon our ability to uniquely estimate the eight possible terms in our math model. Since the main effects A, B, and C are all identical to (confounded with) two-factor interactions (BC, AC, AB), we say the design is of resolution III. Many different design resolutions are available: resolution IV designs have main effects aliased with three-factor interactions (3+1=4) and two-factor interactions aliased with each other (2+2=4). Resolution V designs have main effects aliased with four-factor interactions (4+1=5) and two-factor interactions aliased with three-factor interactions (3+2=5).
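The alias structure just described can be checked directly from the coded columns: in this four-run half fraction the column for C is generated as AB (defining relation I = ABC), so identical columns reveal the confounded pairs. A short numpy check, using the same coded settings as the table above, is sketched below.

# A brief numpy check of the alias structure of the 4-run half fraction described
# above. Identical columns are confounded (cannot be estimated separately).
import numpy as np

A = np.array([-1,  1, -1,  1])
B = np.array([-1, -1,  1,  1])
C = A * B                                  # the fraction sets C equal to AB
I = np.ones(4, dtype=int)

cols = {"Mean": I, "A": A, "B": B, "AB": A * B,
        "C": C, "AC": A * C, "BC": B * C, "ABC": A * B * C}

aliases = [(m, n) for m in cols for n in cols
           if m < n and np.array_equal(cols[m], cols[n])]
print(aliases)   # -> [('A', 'BC'), ('AB', 'C'), ('AC', 'B'), ('ABC', 'Mean')]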
To reinforce this point, let us return to our original "best-worst" case design with these new insights. Here are displayed the 4 replicates of the best and worst cases. Had we tried to estimate a model with this design,⁷ we would find that the effects of A and B are bound together, but in an opposite manner than above: A = -B. With the same data pattern as above, we might mistakenly think both A and B were involved, when only A was an active variable. Similarly, the interaction AB is bound to the opposite of the overall mean. This would be a design of Resolution II – two main effects are confounded with each other.

"Best-Worst" Design in Coded (Geometric) View with Response
Case   A:Clutter  B:Motion  Mean    A    B   AB  DetectTime(s)
Best   low        moving       1   -1   +1   -1       30
Best   low        moving       1   -1   +1   -1       30
Best   low        moving       1   -1   +1   -1       30
Best   low        moving       1   -1   +1   -1       30
Worst  med        fixed        1   +1   -1   -1        6
Worst  med        fixed        1   +1   -1   -1        6
Worst  med        fixed        1   +1   -1   -1        6
Worst  med        fixed        1   +1   -1   -1        6
Estimates                     18  -12  +12  -18

⁷ We would not think about model building if we lacked DOE, and would not do such a design if we did know DOE.
To summarize, a key measure of separability is design Resolution: full factorials (all combinations) are of full
resolution (no confounding), while fractional factorials (fewer than all combinations) are less than full resolution.
Higher resolution designs are generally easier to analyze and interpret, but any fractional factorial must be executed
such that additional augmenting runs can be added to resolve uncertainties or confirm our understanding of cause
and effect.
There are many more expert-level metrics to be considered in this analysis phase of the experimental design. But it
is worthwhile to understand that the analysis implications are implicit in the design and can be evaluated and
assessed prior to collecting the first data point. A few can be mentioned to wrap up this concept paper. A relatively
new development in DOE is the fraction of the design space (FDS) plot. Such a plot shows the variance of the
predicted value of the response as a function of the volume of the design space – the battlespace the design searches.
Surely one would like such a prediction variance to be uniformly low and well-behaved across the design region.
The FDS plot shows exactly this relationship and is particularly useful to compare competing designs to examine the
same battlespace. Some new design developments in DOE – space-filling designs and certain optimal designs – have serious drawbacks when examined under this glass. Such designs are easily constructed via computer algorithms to fit almost any experimental condition (see the original examples in this paper), but the FDS plot will reveal that small designs with many factors will predict poorly. Another way of saying this is that they will fail to correspond to reality quite often – surely an undesirable feature of a designed experiment.
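A bare-bones sketch of the FDS idea follows: sample the coded design region at random, compute the prediction variance x'(X'X)⁻¹x of the assumed model at each point, and sort; plotting the sorted values against the cumulative fraction of points gives the FDS curve that commercial DOE software produces. The 2³ main-effects design and model below are illustrative assumptions, not a specific design from this paper.

# A minimal Monte Carlo sketch of a fraction-of-design-space (FDS) computation
# for a 2^3 full factorial and a main-effects-plus-intercept model.
import numpy as np

rng = np.random.default_rng(1)

levels = np.array([[a, b, c] for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)])
X = np.column_stack([np.ones(len(levels)), levels])      # model matrix of the design
XtX_inv = np.linalg.inv(X.T @ X)

pts = rng.uniform(-1, 1, size=(5000, 3))                 # random points in the coded region
x_rows = np.column_stack([np.ones(len(pts)), pts])
pred_var = np.einsum("ij,jk,ik->i", x_rows, XtX_inv, x_rows)   # x'(X'X)^-1 x at each point

fds = np.sort(pred_var)                                  # sorted values trace the FDS curve
for q in (0.5, 0.9, 1.0):
    print(f"fraction of space {q:.0%}: prediction variance <= {fds[int(q * len(fds)) - 1]:.3f}")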
Alphabetic optimality criteria – the A-, D-, G-, and I-optimality criteria related to the design (X) matrix – and others all tell their tales to expert experimenters and
should be examined for any proposed design. Finally, the method of analysis proposed for a given design should be
stated and justified. Designed experiments can be analyzed by classical methods such as multiple regression and the
analysis of variance (ANOVA). They can also be examined with more complex methods such as Generalized
Linear Models, Kriging (a form of spline fitting), Generalized Regression Splines (GASP), and Multivariate Adaptive Regression Splines (MARS), as well as variations on these methods including CART, Bayesian modeling, and more. Each
design should have the proposed method of analysis stated so that the strengths and weaknesses as well as the fitness
of the design to the analysis method may be considered.
As a last caution, we must remember that we are fitting proposed relations from test conditions to responses in the
presence of noise. It may well be that we are mistaken in the proposed relationship. In this case, predict-confirm
runs are highly indicated. That is, if we have settled on a proposed relationship, we should be prepared to make
predictions of the performance we expect to see in the battlespace and actually run a series of trials (not many –
usually a few percent) to show that we can successfully predict performance, especially for points in the battlespace
we have not tested. DOE methods allow us to make specific predictions, as well as to bound our prediction
uncertainty during these trials. Executed prediction points should fall within our bounds or we have reason to
believe there is more to learn than what we have uncovered thus far. Predict-confirm points should be explicitly
budgeted and scheduled for.
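As a hedged illustration of this predict-confirm check, the sketch below fits a simple coded-factor model and asks whether a confirmation run falls inside the classical prediction interval ŷ ± t·s·√(1 + x'(X'X)⁻¹x). Every data value in it is invented for illustration.

# A sketch of a predict-confirm check: fit a model to a replicated 2^2 design in
# coded units and test whether a confirmation result falls inside the prediction
# interval. All responses and the confirmation point are invented.
import numpy as np
from scipy import stats

X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1], [1, 1, 1],
              [1, -1, -1], [1, 1, -1], [1, -1, 1], [1, 1, 1]], dtype=float)
y = np.array([6.2, 29.5, 5.8, 30.4, 6.9, 30.8, 5.5, 29.9])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = len(y) - X.shape[1]
s2 = resid @ resid / dof                         # estimate of run-to-run variance
XtX_inv = np.linalg.inv(X.T @ X)

def prediction_interval(x_new, alpha=0.05):
    y_hat = x_new @ beta
    half = stats.t.ppf(1 - alpha / 2, dof) * np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
    return y_hat - half, y_hat + half

x_confirm = np.array([1.0, 0.5, -0.5])           # an interior, previously untested point
observed = 24.8                                  # invented confirmation result
low, high = prediction_interval(x_confirm)
print(f"prediction interval ({low:.1f}, {high:.1f}); observed {observed} inside: {low <= observed <= high}")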
Summary. A concept paper, to be sure. This paper has attempted to outline the beginnings of a set of metrics for use by the Department of Defense in grading its test programs. These metrics are particularly important in the beginning stages of our conversion of "best efforts" to true designed experiments. Two cautionary tales at the commencement of this paper show that simply claiming that a series of test runs is a "designed experiment" does not indicate that it will be successful in that which we have set it to do: correctly explore a multi-dimensional battlespace and correctly distinguish the systems that work as desired from those that do not. And the problem is key: without a set of objective metrics, we risk sending systems into combat (as we sometimes do today) without a rigorous test program to show that they are fit for their intended uses. These metrics uncover whether the designs contemplated are a well-considered series (campaign) of investigations in increasingly expensive and credible simulations of reality; whether process decomposition has been thoroughly done – using a team approach to specify which test conditions should be considered and what measures constitute success; whether the design itself truly spans the battlespace and has enough trials to arrive at the correct answer with some degree of surety (confidence and power); and finally whether the method of analysis is suitable to the design class chosen and is likely to correctly link changes in the test conditions to associated changes in the response variable(s).
References
Box, G. E. P., Hunter, W. G., and Hunter, J. S., Statistics for Experimenters, 2nd ed., Wiley & Sons, 2004.
Fisher, R.A., The Design of Experiments, Oxford University Press, (re-issue), New York, 1990. Originally published by
Hafner in 1935.
Gilmore, J.M., DOT&E Memo, Nov 24, 2009, Subject: Test and Evaluation Initiatives,
https://acc.dau.mil/CommunityBrowser.aspx?id=312213&lang=en-US
Hill, R., and Chambal S., “A Framework for Improving Experimental Design and Statistical Methods for Test and
Evaluation”, AIAA T&E Days 2009, Albuquerque, NM.
Hutto, Gregory. “Design of Experiments: Meeting the Central Challenge of Flight Test.” American Institute of Aeronautics
and Astronautics 2005-7607, presented at US Air Force T&E Days, 6-8 December 2005, Nashville, Tennessee.
Kass, Richard, The Logic of Warfighting Experiments, CCRP Publication Series, January 2006, ISBN-10: 1893723194.
Kotter, J. Leading Change, Harvard Business School Press, 1996
Landman, D., Simpson, J., Mariani, R., Ortiz, F., and C. Britcher. “A High Performance Aircraft Wind Tunnel Test using
Response Surface Methodologies.” American Institute of Aeronautics and Astronautics 2005-7602, presented at the US Air
Force T&E Days 2005, Nashville TN.
Myers, R. and Montgomery, D., Response Surface Methodology, Wiley, 2002.
Montgomery, Douglas C. Design and Analysis of Experiments, 5th edition. New York: John Wiley & Sons, Inc., 2001.
Nair, Vijayan N., ed. "Taguchi's Parameter Design: A Panel Discussion." Technometrics 34, no. 2 (May 1992): 127-161.
Sanchez, S.M. and Sanchez, P.J. (2005), “Very Large Fractional Factorial and Central Composite Designs,” ACM
Transactions on Modeling and Computer Simulation, Vol. 15, No. 4, October 2005, Pages 362–377.
Simpson, James R. and Wisnowski, James W., “Streamlining Flight Test with the Design and Analysis of Experiments.”
Journal of Aircraft 38, no. 6, American Institute of Aeronautics and Astronautics 0021-8669 (November-December 2001): 1110-1116.
Tucker, A., Hutto, G. “Application of Design of Experiments to Flight Test: A Case Study”, AIAA T&E Days 2008, Los
Angeles, CA.
Whitlock, M.C., “Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach,”
Communication, retrieved from http://www3.interscience.wiley.com/journal/118663174/abstract?CRETRY=1&SRETRY=0,
January 2009.
Xu, H., Cheng, S.W. & Wu, C. F. J. “Optimal Projective Three-Level Designs for Factor Screening and Interaction
Detection." Technometrics, 46 (2004), 280-292.
Annotated Bibliography.
Lenth, R. V. (2006-9). Java Applets for Power and Sample Size [Computer software]. Retrieved Aug 15, 2006,
from http://www.stat.uiowa.edu/~rlenth/Power.
Dr. Russ Lenth has provided an exceptional service to the test community by creating and distributing this power
analysis tool. In his words, here are some very right things one can do with Pie-Face. Use power prospectively for
planning future studies. Software such as is provided on this website is useful for determining an appropriate
sample size, or for evaluating a planned study to see if it is likely to yield useful information. It is easy to get caught
up in statistical significance and such; but studies should be designed to meet scientific goals, and you need to keep
those in sight at all times (in planning and analysis). The appropriate inputs to power/sample-size calculations are
effect sizes that are deemed clinically important, based on careful considerations of the underlying scientific (not
statistical) goals of the study. Statistical considerations are used to identify a plan that is effective in meeting
scientific goals -- not the other way around. Do pilot studies. Investigators tend to try to answer all the world's
questions with one study. However, you usually cannot do a definitive study in one step. It is far better to work
incrementally. A pilot study helps you establish procedures, understand and protect against things that can go
wrong, and obtain variance estimates needed in determining sample size. A pilot study with 20-30 degrees of
freedom for error is generally quite adequate for obtaining reasonably reliable sample-size estimates.
Montgomery, Douglas C., Design and Analysis Of Experiments, (2004), 6th edition.
Includes a review of elementary statistical methods, designs for experiments with a single factor, incomplete block
designs, factorial designs, multifactor designs and nested arrangements, randomization restrictions, regression
analysis as a methodology for analysis of unplanned experiments, and response surface methodology. Probably the
most widely used graduate and undergraduate text, Montgomery's work includes numerous worked actual
engineering examples, a wealth of web-supplementary material, and a satisfying depth of mathematical detail to
show the underlying theory. We have used this book exclusively for more than 19 years to teach DOE and it has
few peers.
Myers, Raymond H. and Montgomery, Douglas C., Response Surface Methodology: Process and Product
Optimization Using Designed Experiments, (2009), 3rd Ed. hardcover with 700 pp.; $59.95
The process of identifying and fitting an appropriate response surface model from experimental data. The book deals
with the exploration and optimization of response surfaces and integrates the three topics: Statistical Experimental
Design Fundamentals, Regression Modeling Techniques, and Elementary Optimization Methods. Throughout, the
text uses the elegant matrix algebra theory underlying these methods so this book is recommended for those
comfortable in matrix theory. The third edition is complete with updates that capture the important advances in the
field of experimental design including the generalized linear model, optimal design and design augmentation, space
filling designs, and fraction of the design space prediction variance metrics allowing different designs to be
compared and contrasted. With the first third devoted to classical design of experiments, this single volume can be
used to teach both basic DOE as well as its extension into modeling nonlinear surfaces and system optimization.
Box, George E. P., Hunter, William G. and Hunter, J. Stuart, Statistics for Experimenters, (2004), 2nd Ed.
hardcover with 638 pp.; ISBN 0-471-09315-7; $69.95
Every experimenter should have this classic text in his library. The authors are well known across the world and
their presentation of design types is extremely thorough! Box, in particular, began as a practicing industrial chemist
and was drawn to DOE to solve the many vexing problems of achieving manufacturing mixtures with multiple,
conflicting qualities like cost, yield, time, profit, effectiveness, etc… Rewritten and updated, this new edition adopts
the same approach as the initial landmark edition through worked examples, understandable graphics, and
appropriate use of the leading statistics packages. Box begins each section with a problem to be solved and then
develops both the theory and practice of its solution, building on what has come before. Updated with recent
advances in the field of experimental design, Box discusses computer algorithmic approaches to design
augmentation, split plot designs for robust product optimization, multi-response optimization both graphically and
numerically, Bayesian approaches to model selection, and projection properties of complex Plackett Burman
screening designs that make them desirable in sparse processes dominated by a few major causal factors. Replete
with physical, biological, social, chemical, and engineering examples, this book is an invaluable reference for all
practicing experimentalists.