AIAA 2010-1710
U.S. Air Force T&E Days 2010, 2-4 February 2010, Nashville, Tennessee

The "Eglin IV": Metrics of a Well-Designed Experiment

Gregory T. Hutto1, James R. Simpson2, James M. Higdon3
46 Test Wing and 53d Wing, Eglin AFB, Florida, 32452

Abstract. Since 1997 in Air Combat Command's 53d Wing, the US Air Force has been promoting the use of design of experiments as the preferred method of constructing and analyzing test programs. The purpose of test is to mitigate program risk by revealing early on problems in system design or performance. An experiment (or campaign of experiments) is a collection of test events meant to answer some question about the system under test. If the outcome of the experiment will not impact any decision regarding an acquisition program, then what is its purpose? Given that an experiment answers a question that matters (i.e., one that will impact our opinion in a way that changes our approach to managing the tested system), we can agree that the test must be designed such that the risk of coming to the wrong conclusion is not a byproduct or accident, but something we measure, manage, and control. Since our focus on risk drives us to test, risk should be the focus of experimentation. Our approach to navigating experimental risk must be methodical and exhaustive in that we leave no aspect of test to chance if it has a reasonable likelihood of corrupting the answer. Programmatic test risk is the responsibility of test directors, managers, and engineers. Threats to answering the test question may include budget, schedule, manpower, equipment, materials, or other resources. An often overlooked category of risk is that of chance; the magnitude of that risk is buried among the test design, collection, and analysis methods. Randomness in outcomes, background factors that corrupt the data, or unknowns that we cannot even measure all threaten to lead us to wrong conclusions. The science of random variation is statistics, and the science of test is the statistical Design of Experiments (DOE). The principles, processes, methods, techniques, and tools of DOE give us the means to minimize the chance that we fail to give a correct answer due to inefficient use of resources, not investigating the important conditions, having too little fidelity, or confusing one effect with another. The four metrics in this paper address these risks and lay out a case for ensuring they are all covered in every program. The scope of this paper, then, is the characteristics of a well-designed experiment using the concepts of DOE applied to Defense acquisition programs. There are other techniques and tools of applied statistics that are useful (and employed throughout the Defense industry), but none that provide the overarching processes to capture and address all the risks inherent to test. We include other statistical techniques where appropriate, but under the umbrella of our general, DOE-guided approach. There are other sciences (mathematics, probability, computer science) that are appropriate for some problems, even to the exclusion of statistics. In these cases, we can still apply them with an analogous strategy of identifying and managing risks with the right tools. This paper focuses on statistical problems, but the ideas can be applied to other problems as well, even software test.

1 Wing Operations Analyst, 46 Test Wing, Room 211, Bldg 1.
2 Group Operations Analyst, 53d Wing, Suite 608, Bldg 351.
3 Senior Analyst, Jacobs Engineering, 46 Test Squadron.

This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States.
Nomenclature

α = Statistical error Type I – false positive; declaring a difference that does not exist
β = Statistical error Type II – false negative; failing to detect a true difference that does exist
1-α = Statistical Confidence – probability of avoiding a Type I error
1-β = Statistical Power – probability of avoiding a Type II error
π = Probability of success in a Bernoulli trial or binomial series of trials
p = Level of significance of a test statistic – probability of a value more extreme by random chance
Y = The test response variable(s) or matrix – often referred to as the measure of performance
X = The matrix of test conditions under which the response is measured
βi = ith regression coefficient, or the matrix of regression coefficients relating the Y response to the X conditions
ε = The matrix of all noise causes: instrumentation errors, daily variation, test point inaccuracy, etc.

Genesis. On the lead author's desk are two examples of "designed experiments." Both are from a large, professional test center, both are seriously offered as solutions to the test problems they address, and both are so poorly done that they would earn a failing grade in any underclass course in experimental design. One is the test of a drone target for aerial missile shots. The test calls for twelve one-hour sorties repeating a particular drone profile specified in Mach, altitude, maneuver, and countermeasures. The design is a replicated full factorial: one three-level variable and one two-level variable (six unique cases) with two replicates. Unfortunately, the two variables have little to do with the physics driving the drone's presentation as a target, and the response variable (measure of performance, MOP) is undefined – "we'll know good performance when we see it." A second experiment calls for 19 weapons shots: 14 demonstration shots and a five-shot, three-variable I-optimal design. Now both of these are, in fact, designed experiments, but they are poorly designed. The first does not define success and wastes 12 flight hours that could be more profitably spent examining hundreds of drone "presentations" that would reliably show its worth as an aerial target. The second design is so small that it lacks the statistical power to detect even the largest of defects, aside from total failure. So, clearly, simply calling for experiments to be designed experiments will not solve the problems facing the T&E Enterprise. Additional impetus has been lent by recent DOD-level interest in promoting DOE across both OT&E and DT&E in the Services (Gilmore, 2009). We must have more objective and reliable metrics that separate well-designed experiments from those that are not. That problem is the genesis of this paper.

Introduction

A well-designed test or experiment has a number of characteristics that distinguish it from those that are less well-designed. The experiment should be crafted by experts in the field of study, investigate an appropriate region of operations and science, be as efficient as possible to conserve resources, and measure the performance quantities involved rapidly and accurately. Data control should be rigorous, databases archived securely and rapidly, etc. Dr. Richard Kass of the US Army Operational Test Command sketches out a number of contextual factors that indicate an experiment is likely to yield meaningful conclusions in his text The Logic of Warfighting Experimentation (Kass, 2004).
This paper concerns hallmarks of well-designed experiments that are drawn from the field of Design of Experiments (DOE). We have identified four broad classes of metrics that can be measured for any test program. We divide the metrics into general (suitable for executive review) and expert (suitable for review by experts in experimental design). Both levels of metrics are important. The four metrics are:
I A. Macrocampaign: Sequential Experimental Design Campaign – Simulation to Live
I B. Microcampaign: Salvo Test – Screening, Investigation, Confirmation
II A. Design Spans the Battlespace: Factors (k) and Levels
II B. Design Attains Power and Confidence for N Trials
III. Execute with Randomization and Blocking
IV. Design Enables Assessing Relationship: Separability of Causal Factors

Eglin I. Sequential Experimental Design Campaign.

I A. Macrocampaign. The battlespace for most systems undergoing military T&E is vast – tens of possible variables resulting in many thousands or millions of possible unique test conditions. We seldom know which of the test conditions will matter most in affecting system performance (though we have suspicions). Military T&E shares this situation with many other domains, including product design, research, and manufacturing/processing. Consequently, an experimental best practice is to structure the overall test program in such a way as to test in stages, with appropriate objectives, designs, and periods for analysis, understanding, and re-design. It is seldom wise to devote more than 25% of total resources to one experiment (Box, 2004). The information revealed at each stage of experimentation is invaluable in considering how to continue the investigation. We can use the output of early stages to accelerate our learning about the process, validate (or improve) our various simulations, and refine the active battlespace for later stages of test. At the outset of the test, we seldom know which factors are most important, the appropriate range of factor values, the correct number of levels for each factor, the repeatability or noise in the process, and many other facts. We are seldom well equipped to answer these questions at the outset – but correct answers are key to effective and efficient experimentation. Hallmarks of staged experimentation documented in test strategies (TS, TEMP, Test Concepts & Plans) should include a series of objectives, factor and response tables, initial designs, and estimated resources for each stage of experimentation, plus a management reserve appropriate to the risk level of the process under test.

Excellence Checklist for Sequential Experimentation:
- Objectives for each stage with proposed level of simulation (digital through live)
- Estimated resources for each stage (often ~25% or less each)
- Thorough list of candidate test conditions and measures
- Initial experimental design with design metrics (the remainder of the Eglin Four)
- Discussion of likely re-design strategies following initial exploration
- Management reserve from ~10% to 30% depending on assessed risk and technology maturity

To illustrate this metric, consider a current USAF weapon program, whose staged strategy might be briefly described as follows: "Thorough factor screening with I-optimal designs spanning 20 multi-level variables using digital simulation (~5,000-10,000 runs, 10-15 designs). Objective: identify the 8-12 most important test conditions, levels, and expected performance. Based on Stage I results and predicted performance, conduct a campaign of ~2000 HWIL runs followed by 1000 captive carry events. Objective(s): validate strengths and weaknesses of the digital simulation, further factor screening to the 4-6 most important factors, and performance predictions. Based on Stages I-III, conduct a 2-level factorial or fractional factorial for 4-6 variables (with replicated points) in 16-24 runs in live releases. Objective(s): validate strengths and weaknesses of the digital, HWIL, and captive simulations; estimate expected performance.
Reserve 5 live events to demonstrate less-important features and carry a management reserve of 30-40% (6-8 events) for expected technical risk."

Staged experimentation is quite common in DOD programs and is, in fact, the way we acquire most systems through contractor R&D, contractor test, developmental test, and operational test. This metric is different in that it envisions designed experiments as the tool of choice to conduct the experimentation. With a common architecture, language, and toolbox for test, DOE uniquely enables integrated KT/DT/OT testing in a fashion that has long been called for, but is seldom done in practice.

I B. Microcampaign. Within each experimental environment, it is appropriate to outline a sequence of experiments. Initially, the objective will often be to screen many factors to identify the few that drive the process outcome. Identifying the single and combined factors active in changing the response is one key outcome of early experiments. Commonly, just a few of the many factors that could possibly affect the measures of performance actually do so. Identifying these active and inactive factors is referred to as "screening." The testers may then wish to reduce the size of the battlespace explored in subsequent tests. Just as importantly, we may discover unexpected facets of system performance such as nonlinear behavior, low (or high) noise in the experimental region, isolated unusual runs that do not conform to similar cases, or features of the battlespace that are not well represented in this stage of experimentation (target backgrounds, natural environments, human reactions, etc.). In each of these cases, we have learned valuable information about process performance that should be used to redesign the next stage of testing. Many options for subsequent test stages are possible, including (Montgomery, 2004):
- Perform one or more confirmation runs to verify predicted or unusual performance
- Change the range or levels of one or more factors to further explore performance trends
- Augment the existing design with additional runs to explicitly capture nonlinear performance
- Repeat some runs, either because of incorrectly run settings or to improve estimate precision
- Drop factors because they have negligible effects on performance
- Add one or more new factors suggested by engineering learning accompanying the test
- Make engineering changes to simulations because predicted performance was not validated in live test

A typical sequence of tests that should be outlined in each stage may include screening (or confirmation of previous stage results), investigation, exploration and understanding, and finally prediction and confirmation. Each stage should be appropriately budgeted and scheduled in the test strategy. The first metric addresses a series of designs – almost always the way to proceed. The remaining three metrics can, and should, be assessed for each design in the campaign.
Eglin II. Design Spans Battlespace with Confidence and Power: Factors & Levels.

II A. Span the Battlespace. In exploring the performance of multi-purpose military systems, a huge battlespace is the norm. That is, the system is intended to work across a number of platforms, in diverse environmental and weather conditions, against many types of enemy systems, deceptions, and countermeasures, and from a variety of geometric and kinematic engagement conditions. In actual practice using experimental design, more than 25 possible test conditions are commonly encountered. The conceptual table drawn from an AGM-65 Maverick missile test is typical – many more than 12 conditions are possible, with multiple levels for each condition. To determine how many unique combinations are possible, simply multiply the numbers of contemplated levels together – for the Maverick example there are nearly 140,000 unique combinations. Testing all combinations of these 12 variables is clearly prohibitive in any real environment, and even running this many digital simulations would be daunting. The problem nonetheless remains – the battlespace is large and we are charged to explore it. The experimental design metric described here concerns the number of factors to be explored, the levels chosen for each factor, and the experimental design strategy used to spread them across the battlespace.

A best practice is for an integrated team of scientists, engineers, operators, designers, and maintenance technicians from the various interested organizations to spend one or two days brainstorming the factors included in the battlespace – how the system will be used, against what targets, in what environments, etc. An experienced facilitator, often a seasoned experimentalist, can greatly assist the team in this endeavor with simple tools like system operations process flow diagrams and Ishikawa or "fishbone" diagrams. The fruit of this brainstorming is a set of diagrams and a resulting list of possible test conditions and measures of performance that define the desired battlespace. Next, the team prioritizes these lists using the best available understanding, since it is seldom possible to consider everything. Finally, the team considers alternate design strategies to span the desired space (usually in a sequential campaign of experiments – see Eglin I). Many experimental design strategies are available, suited to a wide range of possible objectives and circumstances: comparative two-level designs, both 2-level and general factorial designs, fractional factorial designs, split-plot designs, response surface designs for nonlinear processes, several classes of so-called "optimal designs," and special-purpose designs such as Plackett-Burman, space-filling, and many sorts of orthogonal arrays, to name a few. The inset graphic in the original paper is a JMP-constructed scatterplot matrix showing the assignment of runs in an 83-run mixed-level factorial design suggested for a recent decoy target test at Eglin. The team should be advised by a trained and experienced experimentalist or statistician to guide their selection. Often there are several good choices that can be made for a given budget, each serving different experimental objectives.

Excellence Checklist for Spanning Battlespace:
- Evidence of in-depth, integrated brainstorming by all parties: fishbone diagrams, process flow diagrams
- Thorough list of candidate test conditions; the table should include:
  - Name of the test condition
  - Unit of measurement (real, physical values much preferred over labels)
  - Range of physical levels
  - Estimated priority of the condition in describing system performance
  - Expense/difficulty of controlling the level: easy, hard, or very hard to change
  - Proposed strategy of experimental control for each condition: constant, matrix variable, or noise (if noise, specify covariate, randomized, or random effect)
  - Levels chosen for the experimental design
- Name of the chosen design strategy to assign variable levels to runs, and N, the total number of runs
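To make the combinatorial arithmetic of the battlespace concrete, the short Python sketch below multiplies level counts and enumerates a full-factorial run list for a handful of hypothetical factors (the names and levels are placeholders, not the actual AGM-65 conditions, which are not reproduced here). The same arithmetic applied to roughly a dozen multi-level Maverick conditions is what yields the ~140,000 combinations cited above.

```python
from itertools import product

# Hypothetical factors and levels (placeholders, not the actual AGM-65 test conditions).
factors = {
    "launch_altitude_kft": [5, 15, 25],
    "launch_range_nmi": [2, 4, 8],
    "target_type": ["tank", "truck", "bunker"],
    "clutter": ["desert", "forest"],
    "time_of_day": ["day", "night"],
    "countermeasures": ["none", "smoke", "flares"],
}

# Size of the crossed battlespace: multiply the level counts together.
n_combinations = 1
for levels in factors.values():
    n_combinations *= len(levels)
print(f"Unique combinations: {n_combinations}")   # 3*3*3*2*2*3 = 324 for this toy list

# Enumerating the full-factorial run list is feasible here, prohibitive for ~12 real factors.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(runs), runs[0])
```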
Eglin IIB. Design Attains Power and Confidence for the Number of Trials.

In the presence of noise (measurement error, run-to-run variation) in outcomes, we face the inescapable risk of making a wrong inference about what is happening during a test. When we observe a shift in performance during one or more trials, is the shift indicative of a causal effect based on changing test conditions, or is the shift instead merely a product of the noise in the system? Statisticians call the mistake of declaring a change to be due to test conditions (when it is not) an error of Type I, denoted by the Greek letter α (alpha). Conversely, the mistake of failing to detect such a shift when it exists in reality is called an error of Type II, denoted β (beta). Making correct decisions in the face of noise is a hallmark of a well-designed experiment: 1-α is called Confidence and 1-β is called Power. In the world of testing to requirements and specifications, loosely speaking, we would like to be confident that the performance changes we associate with test conditions really are thus associated; we would also be loath to mistakenly miss important changes in performance because we lacked sufficient power in our designs.

A few illustrations will help make these concepts more concrete. In what follows, we fix confidence at some acceptable level (e.g., 95%) and consider effects on power. In a recent guided weapons test, the contractor portion of the test matrix included only 8 trials, all near the gravity-fall line of the weapon, requiring little in the way of course corrections from the guidance package. This test design suffered from two flaws – too few trials to create much power, and points clustered at the center of the battlespace, thereby failing to span the intended use envelope (Eglin metric II). As a consequence, a critical aerodynamic instability was not discovered until operational testing began. At the opposite extreme, a recent F-16 test to markedly reduce the magnitude of vibration in the ventral fins was conducted in over 320 test conditions, many of them repeated multiple times. In this case, the design was over-powered to detect the desired reduction, resulting in wasted resources. As a general rule, simulation and HWIL experiments often consist of too many trials, while expensive live tests suffer from too few trials and lack power. Statistical power is influenced by many factors, but one of the most important is the total number of runs contemplated in a design. As a general rule, more trials (properly designed) lead to higher power.

Excellence Checklist for Statistical Power:
- Design reports power for one or more key performance measures (e.g., accuracy, P(success), task time); best practice – power reported for a range of total events (N)
- Power reporting lists the other power analysis parameters:
  - Name of the experimental design strategy (factorial, optimal, response surface, etc.)
  - Size of design – number of factors and levels planned
  - N – total number of trials
  - Chosen level of α error for these power values
  - Expected process noise and the basis for the estimate (sigma, σ)
  - Shift in process performance to be detected and the basis for its importance (delta, δ)

The embedded graphic in the original paper illustrates two other factors that figure in the computation of power – the amount of movement in the response considered important for the system (the shift, denoted by the Greek letter δ, delta) and the amount of process noise (the Greek letter σ, sigma).
Tests falling in the lower-left quadrant (noisy process, small shift) require the most trials, while tests in the upper right (large shift, low noise) require fewer. Consider two examples, one from the lower left and one from the upper right, to illustrate these principles. From the lower left first: in B-1 Block D testing, one of the required capabilities was to demonstrate offset aim points of 50 feet or more for unguided weapons such as the Mk-82 500-pound bomb. In other words, the aircrew could designate a visible target as the offset aim point and command the fire control system to place the unguided bombs at a designated distance and angle from the aim point, e.g., 50 feet due west of it. From 20,000 feet, unguided weapon delivery accuracy on the B-1 is measured in hundreds of feet, making the process quite noisy. Conversely, the shift required to be detected was on the order of 50 feet. Tests of this sort require many trials to attain statistical power.

For the upper-right example, consider IR countermeasures. IR countermeasure tests, such as those of the Large Aircraft IR Countermeasure (LAIRCM) system, compare the miss distances (or hit performance) of man-portable IR surface-to-air missile systems against aircraft such as the C-17. Such tests are often conducted either in a hardware-in-the-loop facility such as the Guided Weapons Evaluation Facility (GWEF) or in open-air testing at Eglin and elsewhere. Other test variables under investigation often include different fielded versions of the missiles, geometric variables like range, aspect, and altitude of the engagement, different environmental conditions in both the ultraviolet and infrared, and perhaps even different jamming techniques or aircraft engine settings. With no IR countermeasures from these large targets, missile performance measured by miss distance is very small and consistent. When the countermeasures are working as designed, miss distances are very large and quite variable. Thus, in this case, the process noise without countermeasures is quite small and the shift to be detected is much larger, requiring relatively few trials to achieve statistical power.

As a general rule, power grows dramatically over the first few trials, and further gains become increasingly expensive as the number of trials increases. In other words, fewer additional trials are required to improve power from 30% to 40% than to move from 70% to 80%. This principle is depicted in the embedded graphic of the original paper: computations of power are given for a simple design with 3 variables, each at 2 levels – comprising 8 total combinations for each replicate of the design. To tie this example in with our previous ones, consider the B-1 Block D unguided bomb test. Such a 2^3 design (shorthand for 3 variables at 2 levels each) might consist of two altitudes (15K-20K ft), two airspeeds (500-600 kts), and two bomb types (Mk-82, Mk-84). The graphic4 was computed in Excel using output from Dr. Russell Lenth's Pie-Face Java applet for computing power (Lenth, 2006). Typically, minimum acceptable power is in excess of 80%, with a desired goal of 90-95% if achievable and affordable. To translate desired power in terms of this B-1 example, one can see that 80% power is achieved at about 32 events (4 replicates of the 8 conditions), while 95% power might require 56 events, or 7 replicates of the 8-trial design. In terms of sorties, assuming 12 passes per sortie, we are considering 3-5 sorties – probably not unaffordable in this particular test case, but perhaps not important enough a capability to devote these resources to in the age of JDAM. One of the beauties of being able to compute power versus required resources is that such resource-priority tradeoffs can now be made with objectivity, rigor, and a clear understanding of the competing risks to the total test program.

4 Menu choices in the applet: Balanced ANOVA > Three-Way ANOVA (additive model) > a=b=c=2; SD=1, delta=1, α=0.05, remaining defaults > Options > Graph > Power vs. Replicates > Show Data. Copy and paste to Excel and replace replicates with 8 × replicates to show total weapon delivery events. Note that the additive model is a liberal power estimate – only the three main effects and the intercept are in the model.
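A power-versus-trials table like the one just described can also be sketched in a few lines of Python. The sketch below is an approximation, not a reproduction of Lenth's applet: it treats the test of a single two-level factor in a replicated 2^3 factorial as a two-sample comparison with N/2 runs at each level, using the same shift-to-noise ratio (δ/σ = 1) as the footnote settings, so its numbers land close to, but not exactly on, the 32-event and 56-event figures quoted above.

```python
from statsmodels.stats.power import TTestIndPower

delta, sigma, alpha = 1.0, 1.0, 0.05      # shift to detect, process noise, Type I error
effect_size = delta / sigma
analysis = TTestIndPower()

# Power vs. total events N for a replicated 2^3 factorial (8 runs per replicate),
# approximating the main-effect test as a two-sample comparison of N/2 vs. N/2 runs.
for replicates in range(1, 9):
    n_total = 8 * replicates
    power = analysis.power(effect_size=effect_size,
                           nobs1=n_total / 2,
                           ratio=1.0,
                           alpha=alpha,
                           alternative='two-sided')
    print(f"replicates={replicates}  N={n_total:3d}  power={power:.2f}")

# Or solve directly for the runs per level needed to reach 90% power.
n_per_level = analysis.solve_power(effect_size=effect_size, power=0.90,
                                   alpha=alpha, ratio=1.0)
print(f"about {2 * n_per_level:.0f} total events for 90% power")
```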
Eglin III. Execute with Randomization and Blocking.

An often-neglected aspect of planning for designed experiments is controlling the fashion in which the design will be executed. Unbeknownst to the experimenter, nuisance factors change while the test is being conducted. These factors can include lot-to-lot variations in raw material, changes in the design of the system, learning, warm-up, wear-out, fatigue, boredom, weather, etc. Within a homogeneous unit of experimentation, the solution to background changes is randomization. That is, the order of the trials is randomized to ensure that changes in the experimental conditions do not line up with, or become confounded with, the background changes. As with many principles of experimentation, this one has its roots in agriculture. Plant varieties were randomized in their locations in fields to keep local changes in drainage, fertility, sun, and other factors from aligning with the changes in experimental conditions. Between fields, a comparable sequence of trials was run to isolate field-to-field differences. Randomization, then, was Fisher's expedient to "foil every device of the devil" (Fisher, 1935) in confusing the outcome of the experiment. Classical experimental designs break the total test design into sequences of trials that can be accommodated by the experimental set-up: the number of passes per mission, the number of mixtures the raw materials can accommodate, the number of troop evolutions that can be run during an exercise shift, etc.

Excellence Checklist for Test Execution:
- Name of the chosen design execution strategy to account for background change, e.g.:
  - Factorial in blocks
  - Split-plot design with easy- and hard-to-change factors
  - Analysis of Covariance
  - Mixed design – fixed and random variates in the model
- Describe the methods used to control background changes

More complex patterns of experimental execution can be implemented as well. Factors that change in the background, which cannot be controlled but can be measured, can be treated as "covariates," and the analysis can be augmented to include the Analysis of Covariance so that the effects of the nuisance factor can be isolated from both the effects of the intentional experimental factors and the estimate of run-to-run variation (the error term). Hard-to-change factors that would otherwise argue against complete randomization can be varied according to an experimental design called a split-plot design. Again, the agricultural roots are apparent: plots of land were divided into sections, with hard-to-change factors such as irrigation applied to the whole plot, while easy-to-change factors such as plant variety and weeding were randomized within the whole plots.
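The sketch below illustrates the two execution strategies just described, using the hypothetical B-1 2^3 factors from the power example: a completely randomized run order grouped into sorties (blocks), and a split-plot-style order in which the hard-to-change factor (bomb type) is randomized across sorties while the easy-to-change factors are randomized within each sortie. As the next paragraph notes, the restricted randomization of the split-plot order comes at a price.

```python
import random
from itertools import product

random.seed(2010)  # fixed seed so the run sheet is reproducible

# Hypothetical 2^3 factorial: altitude and airspeed are easy to change in flight;
# bomb type is hard to change (loaded before each sortie).
easy = list(product(["15K ft", "20K ft"], ["500 kts", "600 kts"]))
hard = ["Mk-82", "Mk-84"]

# Completely randomized order, then grouped into blocks of 4 passes per sortie.
runs = [(alt, spd, bomb) for (alt, spd) in easy for bomb in hard]
random.shuffle(runs)
for i in range(0, len(runs), 4):
    print(f"Sortie {i // 4 + 1}:", runs[i:i + 4])

# Split-plot-style order: randomize the hard-to-change (whole-plot) factor across
# sorties, then randomize the easy-to-change combinations within each sortie.
for i, bomb in enumerate(random.sample(hard, len(hard)), start=1):
    within = random.sample(easy, len(easy))
    print(f"Sortie {i} ({bomb}):", within)
```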
Split-plot designs are not without drawbacks, since the lack of complete randomization weakens the cause-and-effect linkage between the hard-to-change factor and the response, but such designs maintain a well-defined relationship for the easy-to-change factors as well as for the "whole plot" (hard-to-change) by "split plot" (easy-to-change) factor interactions.

Eglin IV. Design Enables Assessing Relationship: Separability of Causal Factors.

A unique feature of tests constructed with designed-experiments methods is the ability of the tester to link suspected cause(s) to the observed effects. That is, the test matrix is designed in such a way as to (mostly) unambiguously link changes in system performance to the changes in the test conditions that are associated or correlated with the performance change. It is too strong to claim "cause and effect" chains, since statistical sampling never establishes that strong a linkage, but association and correlation can certainly be detected. Cause-and-effect reasoning is the prerogative, and responsibility, of the subject matter expert: the physicist, engineer, operator, or scientist who can explain why the performance changed in the pattern observed in the test. Nonetheless, being able to link changes in performance to test conditions is a powerful feature of a well-designed experiment. We call this linkage "separability."5

5 Mathematically, this condition is referred to as orthogonality – that is, the design is constructed such that the patterns of test conditions that measure single factors acting alone (main effects) and those that measure factors acting together to change the response (interactions) can be clearly distinguished by careful arrangement of the test conditions. More specifically, the columns in the design matrix are mutually at right angles (orthogonal) to each other. As the number of factors and levels increases, we carefully depart from full orthogonality.

Excellence Checklist for Separability of Test Conditions:
- Name of the chosen design strategy to assign variable levels to runs; also:
  - Size of design – number of factors and levels planned
  - N – total number of trials
- If fewer than all combinations are run (a full factorial), report the design resolution – e.g., Resolution III, IV, V; also:
  - Describe the strategy to augment the fraction to resolve confounding – augmenting runs, foldover, predict-confirm, etc.
  - Report an estimate of N*, the number of augmenting runs required
- [Advanced] For other than full or classical 2-level fractional factorials, report:
  - Maximum and average variance inflation factor
  - Prediction error vs. fraction of the design space
  - Examination of the X'X matrix and the predominant design resolution
  - The model supported: e.g., main effects only, main effects plus two-factor interactions, quadratic, etc.

Example of "Best-Worst" Case Design (4 runs per shaded cell):
                             Clutter: Low-Desert    Clutter: Medium-Forest
Tgt Motion: Fixed, 0 kts             -                    Worst - 4
Tgt Motion: Moving, 30 kts        Best - 4                    -

Example of "Factorial" Design (2 runs per cell):
                             Clutter: Low-Desert    Clutter: Medium-Forest
Tgt Motion: Fixed, 0 kts             2                        2
Tgt Motion: Moving, 30 kts           2                        2

To understand when separability exists, let us first consider a toy example in which it does not. Consider a test design in which engineers examine target seeker performance in two cases – the best case, a moving target in a low-clutter environment, and the worst case, a fixed target in a moderate-clutter environment (the "Best-Worst" layout above). Let us say that four trials were conducted in each of the best and worst cases, for a total of eight trials. Suppose further that the seeker performs well in the best case (shaded green in the original figure) but poorly in the worst case (shaded red). Should the engineer
improve the seeker clutter performance or the ability to detect targets with low relative motion? A moment's thought shows that, in this test design, the effects of clutter and motion are bound together and cannot be separated. Statisticians refer to this condition as "confounding" – the pattern of tests is such that the individual and combined effects of the conditions cannot be separated, no matter how many trials are conducted. The solution to confounding is to cross the test conditions – examine more combinations so that the clutter and motion effects can be estimated separately. Such a design is referred to as a "factorial" or fully-crossed design in that we test all combinations of the two variables, as shown in the "Factorial" layout above. Note that we still have 8 total trials, but now we examine more combinations of battlespace conditions. With this factorial, three patterns of performance will lead us unambiguously to the conditions associated with poor performance, and therefore provide physical clues to assist the engineer in explaining the observed pattern of performance. In the three cases shown in the original figure (three 2x2 Clutter-by-Motion layouts with two runs per cell, labeled "Failure: Clutter," "Failure: Lack of Motion," and "Failure: Fixed Tgt in Clutter"), the red shading indicates that seeker performance is poor in clutter, against fixed targets, and against fixed targets in clutter, respectively. Any of these three cases could account for the observed performance in the "best-worst" scenario examined earlier. Where clutter or motion acts alone to degrade seeker performance, we refer to a "main effect"; where both clutter and motion are required to explain the poor seeker performance, we refer to an interaction – in this case, a two-factor interaction. This toy example illustrates the mathematical principles involved – extending these principles to many factors at many levels is less straightforward.

The mathematics behind this fourth metric are truly elegant and powerful, invented in the 1920s by Sir R. A. Fisher of Britain. He referred to this field of statistics as "The Design of Experiments" and laid out its principles in an eponymous book in 1935. Many variations on this theme have been derived in the ensuing 90 years. Let us consider the 2-variable case again, this time in a slightly different arrangement. For convenience, we code the levels of the variables as +/-1 and assign the letter A to clutter and B to motion. For simplicity, we examine only the four unique conditions formed by the two test conditions (the first replicate of the design above). In the matrix arrangement below, we see the test conditions in both coded and natural units. We see an invented column of responses – the time for the seeker to detect the target. We also see a new row, entitled "Estimates." This row speaks to how much changing the factors alone (A and B columns) and in combination (AB column) affects detection time. The AB (interaction) column is constructed by multiplying the A and B columns. The "Mean" column contains the average of all four runs.

Factorial Design in Coded (Geometric) View with Response
Case   A:Clutter   B:Motion   Mean    A    B   AB   DetectTime(s)
1      low         fixed        1    -1   -1    1        6
2      med         fixed        1     1   -1   -1       30
3      low         moving       1    -1    1   -1        6
4      med         moving       1     1    1    1       30
Estimates                      18    12    0    0
A brief glance at the table shows that in low clutter (A = -1) the detection time is 6 seconds, and in high clutter (A = +1) the detection time is 30 seconds. In other words, when A changes two coded units from -1 to +1, detection time increases by 24 seconds. How much does detection time change when A changes one coded unit? One half of 24 seconds, or 12 seconds. This change in the response due to a one-unit change in the test condition is referred to as a "regression coefficient." In one sense, factorial designs are highly efficient means of estimating these regression coefficients, or an underlying empirical input-output relationship (math model) relating the test conditions to the responses. In the coded units, the columns are also the equations used to make these estimates. That is, the estimate of the regression coefficient of A is simply half the difference between the average response at the high value of A and the average response at the low value of A; similarly for columns B and AB. The equation for the mean is the familiar summing of all four response values and averaging over the four runs: (30+30+6+6)/4 = 18. In four trials, we solve uniquely for four unknowns – the mean, the effects of the A and B variables, and the AB interaction. All estimates are independent of the others, unique, and the equations are clearly separable.6 No other arrangement of test conditions consisting of four runs is as efficient as this one. Because the effects of all the variables can be uniquely estimated, we refer to this design as having "full resolution."

6 In mathematical terms, the four columns/vectors are mutually orthogonal – the dot product of any two distinct columns is zero.

To investigate a design with less than full resolution, consider adding another variable, say visibility, as factor C.

Factorial Design in Coded (Geometric) View with Response (2x2x2 = 2^3 = 8 runs)
Case   A:Clutter   B:Motion   Mean    A    B   AB   C:Visibility    C   AC   BC   ABC   DetectTime(s)
1      low         fixed        1    -1   -1    1   low            -1    1    1    -1        6
2      med         fixed        1     1   -1   -1   low            -1   -1    1     1       30
3      low         moving       1    -1    1   -1   low            -1    1   -1     1        6
4      med         moving       1     1    1    1   low            -1   -1   -1    -1       30
5      low         fixed        1    -1   -1    1   high            1   -1   -1     1        6
6      med         fixed        1     1   -1   -1   high            1    1   -1    -1       30
7      low         moving       1    -1    1   -1   high            1   -1    1    -1        6
8      med         moving       1     1    1    1   high            1    1    1     1       30
Estimates                      18    12    0    0                   0    0    0     0

One can see in the expanded matrix of 2x2x2 = 2^3 = 8 runs that the new variable (C) has been made to have no effect. The new columns AC, BC, and ABC account for the possibility that the variable C might interact with the existing variables and with their interaction AB. We have added four runs (rows) and an additional four equations (columns), allowing us to solve for eight unknowns in eight runs. Again, all estimates are separable (mutually orthogonal). One final example – retain the three variables, but now reduce the matrix to four runs only.

One-Half Fractional Factorial Design in Coded (Geometric) View with Response
Case   A:Clutter   B:Motion   Mean    A    B   AB   C:Visibility    C   AC   BC   ABC   DetectTime(s)
1      low         fixed        1    -1   -1    1   high             1   -1   -1     1        6
2      med         fixed        1     1   -1   -1   low             -1   -1    1     1       30
3      low         moving       1    -1    1   -1   low             -1    1   -1     1        6
4      med         moving       1     1    1    1   high             1    1    1     1       30
Estimates                      18    12    0    0                    0    0   12    18

In this case we have retained the same model, but the estimate for BC is now the same as that for A, and the estimate for ABC is now the same as the overall mean. If one examines the patterns of pluses and minuses in the columns, one notes that the pattern for C is identical to AB, A = BC, B = AC, and Mean = ABC.
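These "Estimates" rows can be reproduced mechanically: in coded units, each estimate is simply the dot product of a column with the response vector, divided by the number of runs. The numpy sketch below uses the invented detection times from the half-fraction table above and shows numerically that the BC column returns the A estimate and the ABC column returns the mean, which is exactly the aliasing just described.

```python
import numpy as np

# One-half fraction from the table above, in coded units (C was generated as C = A*B).
A = np.array([-1,  1, -1,  1])
B = np.array([-1, -1,  1,  1])
C = A * B
y = np.array([ 6, 30,  6, 30])            # invented detection times, seconds

columns = {
    "Mean": np.ones(4), "A": A, "B": B, "AB": A * B,
    "C": C, "AC": A * C, "BC": B * C, "ABC": A * B * C,
}

# Each estimate is (column . y) / N, the regression coefficient in coded units.
estimates = {name: float(col @ y) / len(y) for name, col in columns.items()}
print(estimates)
# {'Mean': 18.0, 'A': 12.0, 'B': 0.0, 'AB': 0.0, 'C': 0.0, 'AC': 0.0, 'BC': 12.0, 'ABC': 18.0}

# The aliasing is visible in the columns themselves: BC is identical to A, ABC to Mean.
print(np.array_equal(columns["BC"], columns["A"]),
      np.array_equal(columns["ABC"], columns["Mean"]))
```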
With only four runs, we can create only four unique equations, meaning we abandon our ability to uniquely estimate the eight possible terms in our math model. Since the main effects A, B, and C are identical to (confounded with) the two-factor interactions BC, AC, and AB respectively, we say the design is of Resolution III. Many different design resolutions are possible: Resolution IV designs have main effects aliased with three-factor interactions (3+1=4) and two-factor interactions aliased with each other (2+2=4); Resolution V designs have main effects aliased with four-factor interactions (4+1=5) and two-factor interactions aliased with three-factor interactions (3+2=5).

To reinforce this point, let us return to our original "best-worst" case design with these new insights. Displayed below are the 4 replicates each of the best and worst cases. Had we tried to estimate a model with this design,7 we would find that the effects of A and B are bound together, but in an opposite manner than above: A = -B. With this data pattern, we might mistakenly think both A and B were involved when only A was an active variable. Similarly, the interaction AB is bound to the opposite of the overall mean. This would be a design of Resolution II – two main effects are confounded with each other.

"Best-Worst" Design in Coded (Geometric) View with Response
Case    A:Clutter   B:Motion   Mean    A    B   AB   DetectTime(s)
Best    low         moving       1    -1    1   -1       30
Best    low         moving       1    -1    1   -1       30
Best    low         moving       1    -1    1   -1       30
Best    low         moving       1    -1    1   -1       30
Worst   med         fixed        1     1   -1   -1        6
Worst   med         fixed        1     1   -1   -1        6
Worst   med         fixed        1     1   -1   -1        6
Worst   med         fixed        1     1   -1   -1        6
Estimates                       18   -12   12  -18

7 We would not think about model building if we lacked DOE, and we would not run such a design if we did know DOE.

To summarize, a key measure of separability is design resolution: full factorials (all combinations) are of full resolution (no confounding), while fractional factorials (fewer than all combinations) are of less than full resolution. Higher-resolution designs are generally easier to analyze and interpret, but any fractional factorial must be executed such that additional augmenting runs can be added to resolve uncertainties or to confirm our understanding of cause and effect.

There are many more expert-level metrics to be considered in this analysis phase of the experimental design, but it is worthwhile to understand that the analysis implications are implicit in the design and can be evaluated and assessed before collecting the first data point. A few can be mentioned to wrap up this concept paper. A relatively new development in DOE is the fraction of design space (FDS) plot. Such a plot shows the variance of the predicted value of the response as a function of the volume of the design space – the battlespace the design searches. One would like such a prediction variance to be uniformly low and well-behaved across the design region. The FDS plot shows exactly this relationship and is particularly useful for comparing competing designs intended to examine the same battlespace. Some newer design approaches have serious drawbacks when examined under this glass: space-filling designs and certain optimal designs are easily constructed via computer algorithms to fit almost any experimental constraint (see the original examples in this paper), but the FDS plot will reveal that small designs with many factors predict poorly. Another way of saying this is that they will fail to correspond to reality quite often – surely an undesirable feature of a designed experiment.
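The FDS idea can be sketched numerically without any special software: for a design matrix X and a chosen model, the scaled prediction variance at a point x0 is N * x0'(X'X)^-1 x0, and sampling many random points in the coded design space then sorting the values traces out the FDS curve. The Python sketch below is a rough illustration of that calculation under an assumed main-effects model, comparing a full 2^3 factorial with an arbitrary 5-run design; it is not a reproduction of any particular package's FDS plot.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1935)

def model_matrix(points):
    """Intercept plus three main effects -- the model both designs are asked to support."""
    pts = np.asarray(points, dtype=float)
    return np.column_stack([np.ones(len(pts)), pts])

def fds_spv(design, n_samples=5000):
    """Scaled prediction variance N * x0' (X'X)^-1 x0 at random points in the coded cube,
    sorted so that quantiles read like points on an FDS curve."""
    X = model_matrix(design)
    XtX_inv = np.linalg.inv(X.T @ X)
    x0 = model_matrix(rng.uniform(-1, 1, size=(n_samples, 3)))
    spv = len(X) * np.einsum("ij,jk,ik->i", x0, XtX_inv, x0)
    return np.sort(spv)

full_factorial = list(product([-1, 1], repeat=3))        # 8 orthogonal runs
tiny_design = rng.uniform(-1, 1, size=(5, 3))            # 5 arbitrary runs in the cube

for name, design in [("2^3 factorial (8 runs)", full_factorial),
                     ("arbitrary 5-run design", tiny_design)]:
    spv = fds_spv(design)
    print(f"{name}: median SPV = {np.median(spv):.2f}, 95th percentile = {np.quantile(spv, 0.95):.2f}")
```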
Alphabetic optimality criteria – A-, D-, G-, and I-optimality, among others – related to the design (X) matrix all tell their tales to expert experimenters and should be examined for any proposed design. Finally, the method of analysis proposed for a given design should be stated and justified. Designed experiments can be analyzed by classical methods such as multiple regression and the analysis of variance (ANOVA). They can also be examined with more complex methods such as Generalized Linear Models, Kriging (a form of spline fitting), Generalized Regression Splines (GASP), and Multivariate Adaptive Regression Splines (MARS), as well as variations on these methods including CART, Bayesian modeling, and more. Each design should have its proposed method of analysis stated so that the strengths and weaknesses, as well as the fitness of the design for the analysis method, may be considered.

As a last caution, we must remember that we are fitting proposed relationships from test conditions to responses in the presence of noise. It may well be that we are mistaken about the proposed relationship. In this case, predict-confirm runs are strongly indicated. That is, once we have settled on a proposed relationship, we should be prepared to make predictions of the performance we expect to see in the battlespace and actually run a series of trials (not many – usually a few percent) to show that we can successfully predict performance, especially at points in the battlespace we have not tested. DOE methods allow us to make specific predictions, as well as to bound our prediction uncertainty, for these trials. Executed prediction points should fall within our bounds, or we have reason to believe there is more to learn than what we have uncovered thus far. Predict-confirm points should be explicitly budgeted and scheduled.
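The prediction bounds mentioned above fall directly out of the fitted empirical model. The statsmodels sketch below is a minimal illustration using simulated data for a replicated version of the coded clutter/motion factorial (the coefficient values and the 2-second noise are invented); it fits the main-effects-plus-interaction model and reports the interval against which a later confirmation run at a chosen condition could be compared.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(53)

# Two replicates of the 2x2 clutter/motion factorial in coded units (simulated data).
A = np.array([-1, 1, -1, 1, -1, 1, -1, 1])            # clutter: low / med
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1])            # motion: fixed / moving
X = sm.add_constant(np.column_stack([A, B, A * B]))   # mean, A, B, AB model
y = 18 + 12 * A + rng.normal(scale=2.0, size=8)       # invented detection times, sigma = 2 s

fit = sm.OLS(y, X).fit()

# Predict at a condition of interest (med clutter, fixed target) and report the
# 95% interval a confirmation run would be compared against.
x_new = sm.add_constant(np.array([[1.0, -1.0, -1.0]]), has_constant='add')
pred = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])
# A confirmation shot whose detection time falls outside [obs_ci_lower, obs_ci_upper]
# signals there is more to learn than the fitted model has captured.
```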
Summary. This is a concept paper, to be sure. It has attempted to outline the beginnings of a set of metrics for use by the Department of Defense in grading its test programs. These metrics are particularly important in the beginning stages of our conversion of "best effort" tests to true designed experiments. The two cautionary tales at the commencement of this paper show that simply claiming that a series of test runs is a "designed experiment" does not indicate that it will be successful in that which we have set it to do: correctly explore a multi-dimensional battlespace and correctly distinguish the systems that work as desired from those that do not. And the problem is key: without a set of objective metrics, we risk sending systems into combat (as we sometimes do today) without a rigorous test program to show that they are fit for their intended uses. These metrics uncover whether the designs contemplated form a well-considered series (campaign) of investigations in increasingly expensive and credible simulations of reality; whether process decomposition has been thoroughly done, using a team approach to specify which test conditions should be considered and what measures constitute success; whether the design itself truly spans the battlespace and has enough trials to arrive at the correct answer with some degree of surety (confidence and power); and finally whether the method of analysis is suitable to the design class chosen and is likely to correctly link changes in the test conditions to associated changes in the response variable(s).

References

Box, G. E. P., Hunter, W. G., and Hunter, J. S., Statistics for Experimenters, Wiley & Sons, 2004.
Fisher, R. A., The Design of Experiments, Oxford University Press (re-issue), New York, 1990; originally published by Hafner, 1935.
Gilmore, J. M., DOT&E Memorandum, "Test and Evaluation Initiatives," 24 November 2009, https://acc.dau.mil/CommunityBrowser.aspx?id=312213&lang=en-US.
Hill, R., and Chambal, S., "A Framework for Improving Experimental Design and Statistical Methods for Test and Evaluation," AIAA U.S. Air Force T&E Days 2009, Albuquerque, NM.
Hutto, G., "Design of Experiments: Meeting the Central Challenge of Flight Test," AIAA 2005-7607, U.S. Air Force T&E Days, Nashville, TN, 6-8 December 2005.
Kass, R., The Logic of Warfighting Experiments, CCRP Publication Series, January 2006, ISBN-10: 1893723194.
Kotter, J., Leading Change, Harvard Business School Press, 1996.
Landman, D., Simpson, J., Mariani, R., Ortiz, F., and Britcher, C., "A High Performance Aircraft Wind Tunnel Test Using Response Surface Methodologies," AIAA 2005-7602, U.S. Air Force T&E Days, Nashville, TN, 2005.
Myers, R., and Montgomery, D., Response Surface Methodology, Wiley, 2002.
Montgomery, D. C., Design and Analysis of Experiments, 5th ed., John Wiley & Sons, New York, 2001.
Nair, V. N., ed., "Taguchi's Parameter Design: A Panel Discussion," Technometrics, Vol. 34, No. 2, May 1992, pp. 127-161.
Sanchez, S. M., and Sanchez, P. J., "Very Large Fractional Factorial and Central Composite Designs," ACM Transactions on Modeling and Computer Simulation, Vol. 15, No. 4, October 2005, pp. 362-377.
Simpson, J. R., and Wisnowski, J. W., "Streamlining Flight Test with the Design and Analysis of Experiments," Journal of Aircraft, Vol. 38, No. 6, November-December 2001, pp. 1110-1116.
Tucker, A., and Hutto, G., "Application of Design of Experiments to Flight Test: A Case Study," AIAA U.S. Air Force T&E Days 2008, Los Angeles, CA.
Whitlock, M. C., "Combining Probability from Independent Tests: The Weighted Z-Method Is Superior to Fisher's Approach," retrieved from http://www3.interscience.wiley.com/journal/118663174/abstract?CRETRY=1&SRETRY=0, January 2009.
Xu, H., Cheng, S. W., and Wu, C. F. J., "Optimal Projective Three-Level Designs for Factor Screening and Interaction Detection," Technometrics, Vol. 46, 2004, pp. 280-292.

Annotated Bibliography

Lenth, R. V. (2006-9). Java Applets for Power and Sample Size [computer software]. Retrieved August 15, 2006, from http://www.stat.uiowa.edu/~rlenth/Power. Dr. Russ Lenth has provided an exceptional service to the test community by creating and distributing this power analysis tool. In his words, here are some very right things one can do with Pie-Face. Use power prospectively for planning future studies. Software such as is provided on this website is useful for determining an appropriate sample size, or for evaluating a planned study to see if it is likely to yield useful information. It is easy to get caught up in statistical significance and such; but studies should be designed to meet scientific goals, and you need to keep those in sight at all times (in planning and analysis). The appropriate inputs to power/sample-size calculations are effect sizes that are deemed clinically important, based on careful consideration of the underlying scientific (not statistical) goals of the study.
Statistical considerations are used to identify a plan that is effective in meeting scientific goals – not the other way around. Do pilot studies. Investigators tend to try to answer all the world's questions with one study. However, you usually cannot do a definitive study in one step. It is far better to work incrementally. A pilot study helps you establish procedures, understand and protect against things that can go wrong, and obtain the variance estimates needed in determining sample size. A pilot study with 20-30 degrees of freedom for error is generally quite adequate for obtaining reasonably reliable sample-size estimates.

Montgomery, Douglas C., Design and Analysis of Experiments, 6th ed., 2004. Includes a review of elementary statistical methods, designs for experiments with a single factor, incomplete block designs, factorial designs, multifactor designs and nested arrangements, randomization restrictions, regression analysis as a methodology for the analysis of unplanned experiments, and response surface methodology. Probably the most widely used graduate and undergraduate text, Montgomery's work includes numerous worked, actual engineering examples, a wealth of web-supplementary material, and a satisfying depth of mathematical detail to show the underlying theory. We have used this book exclusively for more than 19 years to teach DOE, and it has few peers.

Myers, Raymond H., and Montgomery, Douglas C., Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 3rd ed., 2009, hardcover, 700 pp., $59.95. Covers the process of identifying and fitting an appropriate response surface model from experimental data. The book deals with the exploration and optimization of response surfaces and integrates three topics: statistical experimental design fundamentals, regression modeling techniques, and elementary optimization methods. Throughout, the text uses the elegant matrix algebra theory underlying these methods, so the book is recommended for those comfortable with matrix theory. The third edition is complete with updates that capture the important advances in the field of experimental design, including the generalized linear model, optimal design and design augmentation, space-filling designs, and fraction of design space prediction variance metrics that allow different designs to be compared and contrasted. With the first third devoted to classical design of experiments, this single volume can be used to teach both basic DOE and its extension into modeling nonlinear surfaces and system optimization.

Box, George E. P., Hunter, William G., and Hunter, J. Stuart, Statistics for Experimenters, 2nd ed., 2004, hardcover, 638 pp., ISBN 0-471-09315-7, $69.95. Every experimenter should have this classic text in his library. The authors are well known across the world, and their presentation of design types is extremely thorough. Box, in particular, began as a practicing industrial chemist and was drawn to DOE to solve the many vexing problems of achieving manufacturing mixtures with multiple, conflicting qualities like cost, yield, time, profit, effectiveness, etc. Rewritten and updated, this new edition adopts the same approach as the initial landmark edition through worked examples, understandable graphics, and appropriate use of the leading statistics packages. Box begins each section with a problem to be solved and then develops both the theory and practice of its solution, building on what has come before.
Updated with recent advances in the field of experimental design, Box discusses computer algorithmic approaches to design augmentation, split-plot designs for robust product optimization, multi-response optimization both graphically and numerically, Bayesian approaches to model selection, and the projection properties of complex Plackett-Burman screening designs that make them desirable for sparse processes dominated by a few major causal factors. Replete with physical, biological, social, chemical, and engineering examples, this book is an invaluable reference for all practicing experimentalists.