Merck Uses STATLIA Parallelism Metric to Set Acceptance Criteria

Statistical Methods for Setting Acceptance Criteria on Parallelism in Bioassays
Daniel Joelsson
Senior Research Biologist
Merck & Co., Inc.
West Point, PA
Parallelism Acceptance Criteria
• Setting acceptance criteria on assay parameters is
critical for proper interpretation of the results
– Bioassays measure both potency and stability of samples
• Current methods for setting criteria on parallelism
– Rely heavily on the expertise of an experienced biometrician
– Often involve judgment calls on what data to include in an
analysis
• New method using bootstrapping
– Involves less subjectivity
– Rugged and statistically defensible criteria
Potency Assays and Parallelism
Potency is a shift in the X-axis compared to a reference standard
[Figure: dose-response curves for a sample and a reference standard; x-axis X Data, y-axis Y Data]
• Strictly speaking, curves must be the same exact shape for potency to have meaning
• This is usually called parallelism, but more accurately known as similarity
Parallelism in the real world
• Bioassays have high variability
• Replication commonly performed to control the mean
• Curves are samples from all possible curves
[Figure: two panels of dose-response curves; x-axis X Data, y-axis Y Data]
Are these curves parallel?
How do we know?
Measurements of parallelism
• How to measure parallelism?
• Several different methods available:
– Slope ratios
– Hypothesis testing
• F test and Chi-Square test
– Equivalence testing
• Each generates a ‘metric’ for parallelism
– How parallel are these two curves?
• What is the cutoff?
– How far off do the curves have to be before they are no longer parallel?
[Figure: two panels of dose-response curves; x-axis X Data, y-axis Y Data]
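As one illustration of a parallelism 'metric', the slope-ratio idea can be sketched as below. This is a minimal sketch, not the method used in the talk: it assumes straight-line fits over the linear region of each curve, and the function names and data values are hypothetical.

```python
def ls_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def slope_ratio(x, y_reference, y_sample):
    """Ratio of sample slope to reference slope; 1.0 means equal slopes."""
    return ls_slope(x, y_sample) / ls_slope(x, y_reference)


# Hypothetical log-dose and response values from the linear region.
log_dose = [0.0, 1.0, 2.0, 3.0]
reference = [90.0, 70.0, 50.0, 30.0]
sample = [88.0, 69.0, 50.0, 31.0]

ratio = slope_ratio(log_dose, reference, sample)
print(round(ratio, 3))
```

A ratio near 1 suggests similar slopes; how far from 1 is acceptable is exactly the cutoff question raised on this slide.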
Acceptance Criteria
• How do we set the cutoff?
• Ideally set by clinical data
– Not realistic in early development
• Assay capability cutoffs
– Based on 95% or 99% limits
– How do we set these?
– Run 2000 or so assays with the standard compared
to another replicate of the standard.
• “Parallel” by default
• Distribution of metric due solely to assay variability
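An assay capability cutoff of this kind is an empirical percentile of the metric values from runs that are parallel by default. The sketch below assumes the nearest-rank percentile convention (other conventions interpolate); the function name and the data are hypothetical.

```python
import math


def empirical_cutoff(values, percent=95):
    """Nearest-rank percentile: the smallest observed value with at least
    `percent` percent of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical metric values from standard-versus-standard runs.
metrics = [float(i) for i in range(1, 101)]
cutoff = empirical_cutoff(metrics, percent=95)
print(cutoff)
```

With only a few runs, the cutoff rests on a handful of values in the tail, which is why a large number of runs is needed for a stable limit.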
Bioassay for a monoclonal antibody project
[Diagram: assay schematic. Labels: Ligand, Competitive Antibody, Receptor, Serial Dilution. Plate layout: Reference Standard, Sample, Pos/Neg Controls. Luminescence Readout]
StatLIA:
Potency Calculation
Chi-Square parallelism metric
Traditional Method: Actual data from 95 runs
[Figure: Histogram of Chi-square values; y-axis Frequency]
End result: 95% cutoff at 7.28
‘Traditional Method’ has problems:
– Low sample number
• 95% cutoff based on only ~4 runs
• Eliminated outliers
– Some of these runs are probably NOT parallel
– Data ‘massaging’
[Figure: Histogram of 95th percentile values, Chi-square (df = 4); y-axis Frequency]
Simulated data with 100 runs. Large variability in 95% cutoff
Standard Curves
• Can we use the data we already have?
• Many replicates of the standard
– StatLIA reference set
– 30 representative assays
• Why 30?
[Figure: standard curves from the reference set; y-axis RLU]
Normalized Data
Normalized to a B-zero control (100% signal)
[Figure: normalized standard curves; y-axis % RLU]
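The normalization step can be sketched as a simple rescaling, assuming each raw reading is divided by the B-zero control signal from the same assay. The function name and the RLU values are hypothetical.

```python
def normalize_to_b0(rlu_values, b0_rlu):
    """Express raw RLU readings as a percentage of the B-zero control signal."""
    if b0_rlu <= 0:
        raise ValueError("B-zero signal must be positive")
    return [100.0 * rlu / b0_rlu for rlu in rlu_values]


# Hypothetical raw readings and B-zero control for one assay.
raw_rlu = [16000.0, 12000.0, 4000.0]
b0 = 16000.0

percent_rlu = normalize_to_b0(raw_rlu, b0)
print(percent_rlu)
```

Putting every assay on the same 0 to 100% scale is what allows standard curves from different runs to be compared directly.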
Combining standards
[Figure: normalized standard curves, with a second panel labeled "Sample 2 curves"; y-axis % RLU]
• Use StatLIA to generate chi-square value
• 435 possible combinations of curves
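The figure of 435 is the number of unordered pairs that can be drawn from 30 curves, 30 × 29 / 2. A quick check, with curve identifiers 1 to 30 used purely as placeholders:

```python
from itertools import combinations
import math

curve_ids = range(1, 31)  # placeholder labels for the 30 reference curves
pairs = list(combinations(curve_ids, 2))

print(len(pairs), math.comb(30, 2))
```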
Results of combinations
[Figure: Histogram of Chi-square value; x-axis Chi-square value, y-axis Frequency]
• Not good – bimodal distribution?
• 95% cutoff at ~62
Residuals
[Figure: fitted curve with data points]
Residual = The distance from each data point to the fitted curve
Due to the underlying variability of the assay
Distribution of residuals
[Figure: two histograms of residuals with normal overlay; y-axis Frequency]
Low end of the curve: Mean 0.8379, StDev 5.356, N 90
High end of the curve: Mean 0.2266, StDev 1.034, N 90
• Close to normally distributed
• Residuals larger at higher signal levels
– Need to do weighted regression
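One common way to weight a regression when the scatter changes along the curve is inverse-variance weighting. The sketch below assumes that choice; the slides do not say which weighting scheme was used, and the function names and observed/fitted values are hypothetical. The two standard deviations are the ones reported on this slide.

```python
def inverse_variance_weights(standard_deviations):
    """Weight each point by 1 / SD^2 (an assumed weighting scheme)."""
    return [1.0 / sd ** 2 for sd in standard_deviations]


def weighted_sse(observed, fitted, weights):
    """Weighted sum of squared residuals, the quantity a weighted fit minimizes."""
    return sum(w * (o - f) ** 2 for o, f, w in zip(observed, fitted, weights))


# Residual SDs from the slide: low end (high signal) and high end (low signal).
sds = [5.356, 1.034]
weights = inverse_variance_weights(sds)

# Hypothetical points, each exactly one SD away from its fitted value.
observed = [100.0 + 5.356, 10.0 + 1.034]
fitted = [100.0, 10.0]
print(round(weighted_sse(observed, fitted, weights), 6))
```

With these weights a one-SD miss counts the same at either end of the curve, so the noisier high-signal points do not dominate the fit.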
Bootstrap strategy
Take the mean at each concentration of the 30 reference curves
[Figure: mean standard curve; x-axis Concentration, y-axis Relative Signal]

Concentration      1      2      3      4      5      6      7      8
Mean           95.52  94.26  88.56  68.55  35.99  19.37  14.39  10.64
Bootstrap strategy
Randomly sample (with replacement) 3 of the 90 residuals at each
concentration and add to the mean to make a standard curve
[Figure: bootstrapped standard curve; x-axis Concentration, y-axis Relative Signal]

Concentration      1      2      3      4      5      6      7      8
Residual 1      9.46  -1.43   3.38   0.92   1.22   1.59   2.36   0.08
Residual 2      2.28   2.76  -6.56   6.75   1.58   0.50  -0.28   0.13
Residual 3     -5.50   0.02  -8.73   2.04  -4.89  -1.88   1.62   0.54
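The resampling step can be sketched as follows. The mean curve is taken from the earlier slide; the residual pools are a stand-in that holds only the three residuals shown above for each concentration, whereas the real pools hold 90 each. The function name and the seed are hypothetical.

```python
import random

# Mean relative signal at concentrations 1..8 (from the slide).
MEAN_CURVE = [95.52, 94.26, 88.56, 68.55, 35.99, 19.37, 14.39, 10.64]

# Stand-in pools: the three example residuals per concentration.
# The actual method draws from 90 residuals at each concentration.
RESIDUAL_POOLS = [
    [9.46, 2.28, -5.50],
    [-1.43, 2.76, 0.02],
    [3.38, -6.56, -8.73],
    [0.92, 6.75, 2.04],
    [1.22, 1.58, -4.89],
    [1.59, 0.50, -1.88],
    [2.36, -0.28, 1.62],
    [0.08, 0.13, 0.54],
]


def bootstrap_curve(means, pools, replicates=3, rng=random):
    """Build one simulated standard curve: at each concentration, draw
    `replicates` residuals with replacement and add them to the mean."""
    return [
        [mean + residual for residual in rng.choices(pool, k=replicates)]
        for mean, pool in zip(means, pools)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
curve_a = bootstrap_curve(MEAN_CURVE, RESIDUAL_POOLS, rng=rng)
curve_b = bootstrap_curve(MEAN_CURVE, RESIDUAL_POOLS, rng=rng)
print(curve_a[0], curve_b[0])
```

Because both curves share the same mean function, any difference between them comes only from the resampled residuals.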
Bootstrap strategy
• Using this strategy we can generate 8 x 10^46 possible standard curves
• That’s 4 x 10^46 pairs of curves!
• Each pair generates one chi-square metric
• We generated 2000 of those possible values
– Same mean function = parallel curves by default
– Differences are due to assay variability
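The curve count can be checked with a few lines of arithmetic, assuming ordered draws of 3 residuals from 90 at each of 8 concentrations. Half that count corresponds to splitting the curves into non-overlapping pairs; if any two distinct curves may be paired, the number of possible pairs is far larger.

```python
import math

residuals_per_concentration = 90
draws_per_concentration = 3
concentrations = 8

# Ordered draws with replacement: 90^3 per concentration, over 8 concentrations.
curves = residuals_per_concentration ** (draws_per_concentration * concentrations)

disjoint_pairs = curves // 2            # each curve used in exactly one pair
unordered_pairs = math.comb(curves, 2)  # any two distinct curves

print(f"{curves:.2e}")           # 7.98e+46
print(f"{disjoint_pairs:.2e}")   # 3.99e+46
print(f"{unordered_pairs:.2e}")
```

Either way, the 2000 pairs actually generated are a tiny sample of what is possible.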
Results
[Figure: Histogram of Chi-square values; x-axis Chi-square value, y-axis Frequency. Overlay: Chi-square with 4 df]
95th percentile is at 7.499
*Estimated cutoff based on 95 runs = 7.28
*Estimated chi-square distribution cutoff = 9.483
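For comparison with the estimated value above, the 95th percentile of a chi-square distribution with 4 degrees of freedom can be computed from its closed-form CDF, F(x) = 1 - exp(-x/2)(1 + x/2). The sketch below solves for it by bisection; the function names are hypothetical.

```python
import math


def chi2_df4_cdf(x):
    """CDF of the chi-square distribution with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def chi2_df4_quantile(p, lo=0.0, hi=100.0, tol=1e-10):
    """Invert the CDF by bisection (the CDF is increasing in x)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_df4_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


q95 = chi2_df4_quantile(0.95)
print(round(q95, 3))  # about 9.488
```

The bootstrapped 95th percentile (7.499) sits well below this theoretical value, so the metric's null distribution for this assay is not simply a chi-square with 4 df.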
Is the mean appropriate?
[Figure: two histograms of Chi-square value, y-axis Frequency. Panel titles, in order: Histogram of Low mean (“Low” mean), Histogram of High mean (“High” mean)]
95% cutoff = 5.932
95% cutoff = 11.21
Wider distribution with a low mean function.
Weighting plays a role.
Sampling the mean function
Added in an extra step – a new mean for each pair of curves
Randomly selected one of the 30 original curves as the mean
[Figure: Histogram of Chi-square value; x-axis Chi-square value, y-axis Frequency]
Distribution shifted slightly to the right
95th percentile moved to 8.701 from 7.499
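The extra step can be sketched as below: for each pair, one of the reference curves is picked at random to serve as the mean function, and both curves of the pair are built around it. The reference curves and residual pools here are small hypothetical stand-ins (the real method uses 30 curves and 90 residuals per concentration), and the function name is hypothetical.

```python
import random


def bootstrap_pair(reference_curves, pools, replicates=3, rng=random):
    """Pick a new mean curve for this pair, then build two curves around it."""
    mean_curve = rng.choice(reference_curves)

    def one_curve():
        return [
            [mean + residual for residual in rng.choices(pool, k=replicates)]
            for mean, pool in zip(mean_curve, pools)
        ]

    return mean_curve, one_curve(), one_curve()


# Hypothetical stand-ins: 3 reference curves at 3 concentrations.
reference_curves = [
    [100.0, 60.0, 10.0],
    [96.0, 55.0, 12.0],
    [98.0, 62.0, 9.0],
]
pools = [[-5.0, 0.0, 5.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
mean_curve, curve_a, curve_b = bootstrap_pair(reference_curves, pools, rng=rng)
print(mean_curve)
```

Both curves in a pair still share one mean function, so the pair remains parallel by default while the mean itself varies from pair to pair.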
Can we use less information?
[Figure: two histograms of Chi-square value; x-axis Chi-square value, y-axis Frequency]
Bootstrapping 20 curves: 95% cutoff = 7.017
Bootstrapping 10 curves: 95% cutoff = 7.683
Summary
95 actual runs of the assay
[Figure: Histogram of Chi-square value; y-axis Frequency]
95% cutoff = 7.28
Questionable data in the tail of the distribution.
2000 bootstrapped runs
[Figure: Histogram of Chi-square values; x-axis Chi-square value, y-axis Frequency]
95% cutoff = 8.701
Consequence – would fail 7.7% of assays instead of 5%
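The consequence of using the lower cutoff can be expressed as the fraction of null-distribution values that exceed it. A minimal sketch, with a hypothetical function name and made-up metric values rather than the 2000 bootstrapped values from the study:

```python
def failure_rate(values, cutoff):
    """Fraction of metric values strictly above the cutoff."""
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical metric values standing in for a bootstrapped null distribution.
null_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

rate = failure_rate(null_values, cutoff=7.28)
print(rate)  # 3 of the 10 made-up values exceed 7.28
```

Applied to the bootstrapped values, the same calculation with the cutoff of 7.28 is what gives a failure rate above the intended 5%.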
Conclusions
Bootstrapping to set a parallelism acceptance criterion has several advantages over traditional methods:
• Eliminates the need for multiple assays comparing the reference standard to itself
• Eliminates the subjectivity involved with selecting the appropriate runs to include in the analysis
• Based on empirically observed data
– no simulation necessary
• Can be used to set acceptance criteria very early in development (10 curves) and refined as more data is available
• Repeatable and defensible
Acknowledgements
Yue Wang - Senior Biometrician
Don Warakomski
Todd Ranheim
Scott Barmat