Experimental Design

SPH 247
Statistical Analysis of
Laboratory Data
April 2, 2010
Basic Principles of Experimental Investigation
• Sequential Experimentation
• Comparison
• Manipulation
• Randomization
• Blocking
• Simultaneous variation of factors
• Main effects and interactions
• Sources of variability
• Issues with two-color arrays
Sequential Experimentation
• No single experiment is definitive.
• Each experimental result suggests other experiments.
• Scientific investigation is iterative.
• “No experiment can do everything; every experiment should do something,” George Box.
[Figure: cycle diagram linking Plan Experiment, Perform Experiment, and Analyze Data from Experiment]
Comparison
• Usually absolute data are meaningless; only comparative data are meaningful.
• The level of mRNA in a sample of liver cells is not meaningful by itself.
• The comparison of the mRNA levels in samples from normal and diseased liver cells is meaningful.
Internal vs. External Comparison
• Comparison of experimental results with historical results is likely to mislead.
• Many factors other than the intended treatment can influence results.
• It is best to include controls or other comparisons in each experiment.
Manipulation
• Different experimental conditions need to be imposed by the experimenters, not just observed, if at all possible.
• For example, the rate of complications in coronary artery bypass graft surgery may depend on many factors that are not controlled and may be hard to measure.
Randomization
• Randomization limits the differences between groups that are due to irrelevant factors.
• Such differences will still exist, but they can be quantified by analyzing the randomization.
• This is a method of controlling for unknown confounding factors.
• Suppose that 50% of a patient population is female.
• A sample of 100 patients will not generally have exactly 50% females.
• Numbers of females between 40 and 60 would not be surprising.
• In two groups of 100, the disparity between the numbers of females in the two groups can be as big as 20 percentage points simply by chance.
• This also holds for factors we don’t know about.
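The chance imbalance described above can be checked with a short calculation. The following is a minimal sketch, assuming each patient is independently female with probability 0.5 (a binomial model); the names binom_pmf, p_40_to_60, and p_gap_20 are illustrative and not part of the lecture.

```python
from math import comb

def binom_pmf(k: int, n: int = 100, p: float = 0.5) -> float:
    """Probability of exactly k females among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Chance that one group of 100 has between 40 and 60 females (inclusive).
p_40_to_60 = sum(binom_pmf(k) for k in range(40, 61))

# Chance that two independent groups of 100 differ by 20 or more females.
pmf = [binom_pmf(k) for k in range(101)]
p_gap_20 = sum(
    pmf[x] * pmf[y]
    for x in range(101)
    for y in range(101)
    if abs(x - y) >= 20
)

print(f"P(40 <= females <= 60) = {p_40_to_60:.4f}")
print(f"P(gap of 20 or more between two groups) = {p_gap_20:.4f}")
```

Under this model a count between 40 and 60 is the usual case, while a gap of 20 between two groups is possible but uncommon.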
• Randomization does not exactly balance against any specific factor.
• To do that, one should employ blocking.
• Instead, it provides a way of quantifying possible imbalance, even of unknown factors.
• Randomization even provides an automatic method of analysis that depends on the design and randomization technique.
The Farmer from Whidbey Island
• Visited the University of Washington with a whalebone water dowser
• 10 Dixie cups, 5 with water, 5 empty, covered with plywood
• If he gets all 10 right, is chance a reasonable explanation?
 10 
   252
5
1
 .004
252
 The randomness is produced by the process of
randomly choosing which 5 of the 10 are to
contain water
 There are no other assumptions
• If the randomization had been to flip a coin for each of the 10 cups, then the probability of getting all 10 right by chance is different.
• There are $2^{10} = 1024$ ways for the randomization to come out, only one of which is right, so the chance is 1/1024 ≈ .001.
• The method of randomization matters.
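The two counts for the dowsing test can be reproduced directly. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from math import comb

# Design 1: exactly 5 of the 10 cups are chosen at random to hold water.
n_fixed = comb(10, 5)          # number of equally likely layouts
p_fixed = 1 / n_fixed          # chance of naming the right layout by luck

# Design 2: an independent fair coin flip decides each cup.
n_coin = 2 ** 10
p_coin = 1 / n_coin

print(n_fixed, round(p_fixed, 4))
print(n_coin, round(p_coin, 4))
```

The same ten correct answers are about four times less likely under the coin-flip design, which is why the method of randomization has to be recorded along with the result.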
Randomization Inference
• 20 tomato plants are divided into 10 groups of 2 placed next to each other in the greenhouse.
• In each group of 2, one plant is chosen at random to receive fertilizer A and the other to receive fertilizer B.
• The yield of each plant is measured.
Pair    1    2    3    4    5    6    7    8    9   10
A     132   82  109  143  107   66   95  108   88  133
B     140   88  112  142  118   64   98  113   93  136
diff    8    6    3   -1   11   -2    3    5    5    3
• The average difference is 4.1.
• Could this have happened by chance?
• Is it statistically significant?
• If A and B do not differ in their effects (the null hypothesis is true), then each plant’s yield would have been the same whether A or B was applied.
• The difference in each pair would then be the negative of what it was if the coin flip had come out the other way.
• In pair 1, the yields were 132 and 140.
• The difference was 8, but it could have been -8.
• With 10 coin flips, there are $2^{10} = 1024$ possible outcomes of + or – on the differences.
• These outcomes are possible outcomes of our action of randomization, and carry no assumptions.
• Of the 1024 possible outcomes, which are all equally likely under the null hypothesis, only 3 had greater values of the average difference, and only 4 (including the one observed) had the same value of the average difference.
• The probability of this happening by chance, counting half of the ties, is (3 + 4/2)/1024 = 5/1024 ≈ .005.
• This does not depend on any assumptions other than that the randomization was correctly done.
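The counts of 3 greater and 4 equal outcomes can be verified by listing all 1024 sign patterns. The sketch below uses the tomato yields from the table; the variable names and the use of sums instead of averages (which avoids rounding) are choices made here, not taken from the lecture.

```python
from itertools import product

yield_a = [132, 82, 109, 143, 107, 66, 95, 108, 88, 133]
yield_b = [140, 88, 112, 142, 118, 64, 98, 113, 93, 136]
diffs = [b - a for a, b in zip(yield_a, yield_b)]
observed = sum(diffs)  # 10 times the average difference

greater = equal = 0
# Under the null hypothesis each coin flip only changes the sign of a difference.
for signs in product((1, -1), repeat=len(diffs)):
    total = sum(s * d for s, d in zip(signs, diffs))
    if total > observed:
        greater += 1
    elif total == observed:
        equal += 1

n_outcomes = 2 ** len(diffs)
mid_p = (greater + equal / 2) / n_outcomes  # half of the ties are counted
print(greater, equal, n_outcomes, round(mid_p, 4))
```

With more pairs, full enumeration becomes slow, and a random sample of sign patterns is a common substitute.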
d  4.1
sd  3.872
4.1
4.1
t9 

 3.35
3.872 / 10 1.224
p  .0043 by t-test
p  .0049 by true randomization distribution
same range for simulation randomization distributions
The t-test can be thought of as an approximation
to the randomization distribution.
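The paired t statistic above can be recomputed from the ten differences. This is a minimal sketch with the standard library only; it stops at the statistic because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

diffs = [8, 6, 3, -1, 11, -2, 3, 5, 5, 3]  # B minus A for the 10 pairs

d_bar = mean(diffs)                 # average difference
s_d = stdev(diffs)                  # sample standard deviation (n - 1 divisor)
std_err = s_d / sqrt(len(diffs))    # standard error of the average
t_stat = d_bar / std_err            # compare with a t distribution on 9 df

print(f"mean={d_bar:.1f} sd={s_d:.3f} se={std_err:.3f} t={t_stat:.2f}")
```

If SciPy is available, `scipy.stats.t.sf(t_stat, 9)` gives the one-sided tail probability to compare with the p-value quoted above.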
Randomization in practice
• Whenever there is a choice, it should be made using a formal randomization procedure, such as Excel’s rand() function.
• This protects against unexpected sources of variability such as day, time of day, operator, reagent, etc.
Pair Number   First Sample Treatment
1             A or B?
2             A or B?
3             A or B?
4             A or B?
5             A or B?
6             A or B?
7             A or B?
8             A or B?
9             A or B?
10            A or B?
Pair Num   First Sample Treatment   random number
1          A or B?                  0.871413
2          A or B?                  0.786036
3          A or B?                  0.889785
4          A or B?                  0.081120
5          A or B?                  0.297614
6          A or B?                  0.540483
7          A or B?                  0.824491
8          A or B?                  0.624133
9          A or B?                  0.913187
10         A or B?                  0.001599
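• Enter =rand() in the first cell of the random number column.
• Copy down the column.
• Highlight the entire column.
• Ctrl+C (Edit/Copy)
• Edit/Paste Special/Values
• This fixes the random numbers so they do not recompute each time.
• =IF(C2<0.5,"A","B") goes in the treatment cell in the same row as the first random number (cell B2 if the random numbers start in C2); then copy down the column.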
Plant Pair   First Plant Treatment   random number
1            B                       0.871413
2            B                       0.786036
3            B                       0.889785
4            A                       0.081120
5            A                       0.297614
6            B                       0.540483
7            B                       0.824491
8            B                       0.624133
9            B                       0.913187
10           A                       0.001599
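The same assignment rule can be written outside a spreadsheet. The sketch below applies the "below 0.5 gives A" rule to the ten random numbers in the table and then draws a fresh assignment; the function name treatment_from_draw and the seed value are hypothetical choices for illustration.

```python
import random

def treatment_from_draw(u: float) -> str:
    """Same rule as the spreadsheet formula: below 0.5 gives A, otherwise B."""
    return "A" if u < 0.5 else "B"

# The ten random numbers shown in the table.
slide_draws = [0.871413, 0.786036, 0.889785, 0.081120, 0.297614,
               0.540483, 0.824491, 0.624133, 0.913187, 0.001599]
slide_treatments = [treatment_from_draw(u) for u in slide_draws]
print("".join(slide_treatments))

# A fresh randomization; the seed (hypothetical) is recorded so it can be reproduced.
rng = random.Random(20100402)
new_treatments = [treatment_from_draw(rng.random()) for _ in range(10)]
print("".join(new_treatments))
```

Recording the seed plays the same role as pasting the spreadsheet values: the assignment stays fixed and can be audited later.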
• To randomize run order, insert a column of random numbers, then sort on that column.
• More complex randomizations require more care, but this is quite important and worth the trouble.
• Randomization can be done in Excel, R, or anything that can generate random numbers.
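The sort-on-a-random-column recipe for run order looks like this in Python. It is a sketch only: the run labels and the seed are hypothetical.

```python
import random

runs = [f"sample {i}" for i in range(1, 9)]   # hypothetical run labels
rng = random.Random(247)                      # hypothetical seed, kept for the record

# Attach a random number to each run, then sort on it (the spreadsheet recipe).
keyed = [(rng.random(), run) for run in runs]
run_order = [run for _, run in sorted(keyed)]
print(run_order)
```

Calling `rng.shuffle(runs)` on a copy of the list would give a random order in one step; the explicit key column is shown here only to mirror the spreadsheet method.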
Blocking
• If some factor may interfere with the experimental results by introducing unwanted variability, one can block on that factor.
• In agricultural field trials, soil and other location effects can be important, so plots of land are subdivided to test the different treatments. This is the origin of the idea.
• If we are comparing treatments, the more alike the units to which we apply the treatments, the more sensitive the comparison.
• Within blocks, treatments should be randomized.
• Paired comparisons, as in the tomato plant example, are a simple example of randomized blocks.
Simultaneous Variation of Factors
• The simplistic idea of “science” is to hold all things constant except for one experimental factor, and then vary that one thing.
• This misses interactions and can be statistically inefficient.
• Multi-factor designs are often preferable.
Interactions
• Sometimes (often) the effect of one variable depends on the levels of another one.
• This cannot be detected by one-factor-at-a-time experiments.
• These interactions are often scientifically the most important.
• Experiment 1. I compare the room before and after I drop a liter of gasoline on the desk. Result: we all leave because of the odor.
• Experiment 2. I compare the room before and after I drop a lighted match on the desk. Result: no effect other than a small scorch mark.
• Experiment 3. I compare all four combinations of ±gasoline and ±match. Result: we are all killed.
• Large interaction effect
Statistical Efficiency
• Suppose I compare the expression of a gene in cell cultures of either keratinocytes or fibroblasts, confluent or nonconfluent, with or without a possibly stimulating hormone, with 2 cultures in each condition, requiring 16 cultures in all.
• I can compare the cell types as an average of 8 cultures vs. 8 cultures.
• I can do the same with the other two factors.
• This is more efficient than 3 separate experiments with the same controls, which would use 48 cultures to get the same 8 vs. 8 comparisons.
• I can also see whether the cell types react differently to hormone application (interaction).
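The culture counts in this example can be checked by listing the design. The sketch below enumerates the three two-level factors with two replicate cultures per condition; the level labels are taken from the text, while the variable names are illustrative.

```python
from collections import Counter
from itertools import product

cell_types = ["keratinocyte", "fibroblast"]
states = ["confluent", "nonconfluent"]
hormone = ["with hormone", "without hormone"]
replicates = 2

# One entry per culture: every combination of the three factors, twice.
cultures = [cond for cond in product(cell_types, states, hormone)
            for _ in range(replicates)]

print(len(cultures))                       # total cultures in the factorial design
print(Counter(c[0] for c in cultures))     # cultures per cell type
print(3 * 2 * 8)                           # three separate 8 vs. 8 experiments
```

Every factor splits the same 16 cultures into 8 vs. 8, which is where the saving over three separate experiments comes from.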
Fractional Factorial Designs
• When it is not known which of many factors may be important, fractional factorial designs can be helpful.
• With 7 factors each at 2 levels, ordinarily this would require $2^7 = 128$ experiments.
• This can be done in 8 experiments instead.
Run   F1  F2  F3  F4  F5  F6  F7
1     H   H   H   H   H   H   H
2     H   H   L   H   L   L   L
3     H   L   H   L   H   L   L
4     H   L   L   L   L   H   H
5     L   H   H   L   L   H   L
6     L   H   L   L   H   L   H
7     L   L   H   H   L   L   H
8     L   L   L   H   H   H   L
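The eight-run table is consistent with a standard construction in which F1 to F3 form a full factorial and the remaining columns are products of those (F4 = F1·F2, F5 = F1·F3, F6 = F2·F3, F7 = F1·F2·F3, coding H as +1 and L as -1). The lecture does not state these generators; they are inferred here because they reproduce the table. A minimal sketch:

```python
from itertools import product

def level(x: int) -> str:
    """Map the +1/-1 coding back to the H/L labels used in the table."""
    return "H" if x == 1 else "L"

design = []
# F1-F3 form a full 2^3 factorial; F4-F7 are products of those columns.
for f1, f2, f3 in product((1, -1), repeat=3):
    row = (f1, f2, f3, f1 * f2, f1 * f3, f2 * f3, f1 * f2 * f3)
    design.append(row)

rows = ["".join(level(x) for x in row) for row in design]
for run, text in enumerate(rows, start=1):
    print(run, " ".join(text))

# Balance: each factor is H in 4 runs and L in 4 runs.
columns = list(zip(*design))
balanced = all(sum(col) == 0 for col in columns)
# Orthogonality: every pair of factor columns has a zero dot product.
orthogonal = all(
    sum(a * b for a, b in zip(columns[i], columns[j])) == 0
    for i in range(7) for j in range(i + 1, 7)
)
print(balanced, orthogonal)
```

Because the columns are balanced and mutually orthogonal, the seven main effects are estimated free of one another, although in a design this small each main effect is aliased with interactions among the other factors.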
Main Effects and Interactions
• Factors: Cell Type (C), State (S), Hormone (H)
• The response is the expression of a gene.
• The main effect C of cell type is the difference in average gene expression level between cell types.
• For the interaction between cell type and state, compute the difference in average gene expression between cell types separately for confluent and nonconfluent cultures. The difference of these differences is the interaction.
• The three-way interaction CSH is the difference in the two-way interactions with and without the hormone stimulant.
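The definitions above translate directly into differences of averages. The sketch below uses made-up expression values purely for illustration (they are not data from the course), and the helper names avg, cell_diff, and cs_at are hypothetical.

```python
from statistics import mean

# Hypothetical mean expression per condition, keyed by (cell type, state, hormone).
# Cell type: K/F, state: C (confluent) / N (nonconfluent), hormone: + / -.
expr = {
    ("K", "C", "+"): 12.0, ("K", "C", "-"): 8.0,
    ("K", "N", "+"): 11.0, ("K", "N", "-"): 9.0,
    ("F", "C", "+"): 7.0,  ("F", "C", "-"): 6.0,
    ("F", "N", "+"): 6.0,  ("F", "N", "-"): 5.0,
}
NAMES = ("cell", "state", "hormone")

def avg(**fixed):
    """Average expression over all conditions matching the fixed factor levels."""
    return mean(v for k, v in expr.items()
                if all(k[NAMES.index(n)] == lv for n, lv in fixed.items()))

def cell_diff(**fixed):
    """Difference between cell types within the given levels of other factors."""
    return avg(cell="K", **fixed) - avg(cell="F", **fixed)

# Main effect C: difference in average expression between cell types.
main_c = cell_diff()

# Two-way CS interaction: difference of the cell-type differences across states.
inter_cs = cell_diff(state="C") - cell_diff(state="N")

def cs_at(h):
    """CS interaction computed only at hormone level h."""
    return cell_diff(state="C", hormone=h) - cell_diff(state="N", hormone=h)

# Three-way CSH: difference of the CS interactions with and without hormone.
inter_csh = cs_at("+") - cs_at("-")
print(main_c, inter_cs, inter_csh)
```

Some textbooks divide interaction contrasts by 2 or 4 so that they sit on the same scale as main effects; the sketch follows the plain difference-of-differences wording used here.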
Sources of Variability in Laboratory Analysis
• Intentional sources of variability are treatments and blocks.
• There are many other sources of variability:
• Biological variability between organisms or within an organism
• Technical variability of procedures like RNA extraction, labeling, hybridization, chips, etc.
Replication
• Almost always, biological variability is larger than technical variability, so most replicates should be biologically different, not just replicate analyses of the same samples (technical replicates).
• However, this can depend on the cost of the experiment vs. the cost of the sample.
• 2D gels are so variable that replication is required.
Quality Control
• It is usually a good idea to identify factors that contribute to unwanted variability.
• A study can be done in a given lab that examines the effects of day, time of day, operator, reagents, etc.
• This is almost always useful when starting with a new technology or in a new lab.
Possible QC Design
• Possible factors: day, time of day, operator, reagent batch
• At two levels each, this is $2^4 = 16$ experiments to be done over two days, with 4 each in the morning and afternoon of each day, covering two operators and two reagent batches.
• Analysis determines the contribution of each factor to overall variability.
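One way to lay out such a QC study is to run every operator and reagent-batch combination once in each of the four day and time-of-day sessions, in random order within the session. The sketch below assumes that layout; the labels and the seed are hypothetical.

```python
import random
from itertools import product

days = ["day 1", "day 2"]
times = ["morning", "afternoon"]
operators = ["operator 1", "operator 2"]      # hypothetical labels
batches = ["batch 1", "batch 2"]              # hypothetical labels

rng = random.Random(16)                       # hypothetical seed, kept for the record
schedule = {}
for day, time in product(days, times):
    # Each session runs every operator x batch combination once, in random order.
    session = list(product(operators, batches))
    rng.shuffle(session)
    schedule[(day, time)] = session

for (day, time), session in schedule.items():
    print(day, time, session)

total_runs = sum(len(s) for s in schedule.values())
print(total_runs)
```

Randomizing the order inside each session keeps run position from lining up with operator or batch.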
References
• Statistics for Experimenters, Box, Hunter, and Hunter, John Wiley
Exercise
• You have a clinical study in which 10 patients will get either the standard treatment or a new treatment.
• Randomize which 5 of the 10 get the new treatment so that all possible combinations can result. Use Excel or another formal randomization method.
• Randomize so that in each pair of patients entered by date, one has the standard and one the new treatment.
• What are the advantages of each method?
• Why is randomization important?
Course Website
• http://dmrocke.ucdavis.edu