CIEE_Stat_Application_in_Research

Applications of
Statistics in Research
Bandit Thinkhamrop, Ph.D.(Statistics)
Department of Biostatistics and Demography
Faculty of Public Health
Khon Kaen University
Begin at the conclusion
Type of the study outcome: Key for
selecting appropriate statistical methods
Study outcome
– Dependent variable or response variable
– Focus on the primary study outcome if there is
more than one
Type of the study outcome
– Continuous
– Categorical (dichotomous, polytomous, ordinal)
– Numerical (Poisson) count
– Event-free duration
Continuous outcome
Primary target of estimation:
– Mean (SD)
– Median (Min:Max)
– Correlation coefficient: r and ICC
Modeling:
– Linear regression
The model coefficient = Mean difference
– Quantile regression
The model coefficient = Median difference
Example:
– Outcome = Weight, BP, score of ?, level of ?, etc.
– RQ: Factors affecting birth weight
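A minimal sketch of why the linear regression coefficient can be read as a mean difference. The birth weights and the binary predictor `smoker` are hypothetical values made up for illustration, not data from this lecture. With a single 0/1 predictor, the least-squares slope equals the difference between the two group means.

```python
from statistics import mean

# Hypothetical birth weights (grams); smoker = 1 if the mother smoked, else 0.
weights = [3200, 3400, 3100, 3600, 2800, 2900, 3000, 2700]
smoker = [0, 0, 0, 0, 1, 1, 1, 1]

def ols_slope_intercept(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = ols_slope_intercept(smoker, weights)

# Group means computed directly, for comparison with the model coefficient.
mean_nonsmoker = mean(w for w, s in zip(weights, smoker) if s == 0)
mean_smoker = mean(w for w, s in zip(weights, smoker) if s == 1)
mean_difference = mean_smoker - mean_nonsmoker

print(f"slope = {slope}, mean difference = {mean_difference}")
```

The intercept is the mean of the reference group (`smoker = 0`), and the slope is the mean difference between the groups.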
Categorical outcome
Primary target of estimation :
– Proportion or Risk
Modeling:
– Logistic regression
The model coefficient = Odds ratio (OR)
Example:
– Outcome = Disease (y/n), Dead(y/n),
cured(y/n), etc.
– RQ: Factors affecting low birth weight
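A minimal sketch of the odds ratio for a dichotomous outcome, using a hypothetical 2x2 table (the counts are assumptions for illustration only). For a logistic regression with this single binary predictor, the fitted coefficient would be the natural log of this odds ratio.

```python
import math

# Hypothetical 2x2 table: exposure = maternal smoking, outcome = low birth weight.
exposed_cases, exposed_noncases = 30, 70
unexposed_cases, unexposed_noncases = 15, 85

# Proportion (risk) of the outcome in each group.
risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)

# Odds = cases / non-cases; the odds ratio compares the two groups by division.
odds_exposed = exposed_cases / exposed_noncases
odds_unexposed = unexposed_cases / unexposed_noncases
odds_ratio = odds_exposed / odds_unexposed

# The logistic regression coefficient lives on the log-odds scale.
log_odds_ratio = math.log(odds_ratio)

print(f"risk: {risk_exposed:.2f} vs {risk_unexposed:.2f}")
print(f"OR = {odds_ratio:.2f}, log(OR) = {log_odds_ratio:.3f}")
```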
Numerical (Poisson) count outcome
Primary target of estimation :
– Incidence rate (e.g., rate per person time)
Modeling:
– Poisson regression
The model coefficient = Incidence rate ratio (IRR)
Example:
– Outcome =
Total number of falls
Total time at risk of falling
– RQ: Factors affecting falls among the elderly
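A minimal sketch of an incidence rate and an incidence rate ratio (IRR). The fall counts and person-years below are hypothetical, chosen only to show the arithmetic: the rate is the number of events divided by the total time at risk, and the IRR compares two rates by division.

```python
# Hypothetical follow-up data for two groups of elderly people.
falls_group_a, person_years_a = 12, 400.0
falls_group_b, person_years_b = 6, 500.0

# Incidence rate = total number of events / total time at risk.
rate_a = falls_group_a / person_years_a
rate_b = falls_group_b / person_years_b

# Incidence rate ratio: group A compared with group B.
irr = rate_a / rate_b

print(f"rate A = {rate_a * 100:.1f} per 100 person-years")
print(f"rate B = {rate_b * 100:.1f} per 100 person-years")
print(f"IRR = {irr:.2f}")
```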
Event-free duration outcome
Primary target of estimation :
– Median survival time
Modeling:
– Cox regression
The model coefficient = Hazard ratio (HR)
Example:
– Outcome = Overall survival, disease-free
survival, progression-free survival, etc.
– RQ: Factors affecting survival
The outcome determines the statistics
Outcome        Target of estimation                       Modeling
Continuous     Mean, Median                               Linear Reg.
Categorical    Proportion (Prevalence or Risk)            Logistic Reg.
Count          Rate per “space”                           Poisson Reg.
Survival       Median survival, Risk of events at T(t)    Cox Reg.
Statistics quantify errors for judgments
Parameter estimation
[95%CI]
Hypothesis testing
[P-value]
Caution about biases
Selection bias
Information bias
Confounding bias
Research Design
-Prevent them
-Minimize them
Caution about biases
Selection bias (SB)
Information bias (IB)
Confounding bias (CB)
If data available:
SB & IB can be assessed
CB can be adjusted using
multivariable analysis
Generate a mock data set
General format of the data layout
id    y    x1    x2    x3
1
2
3
4
5
…
n
Generate a mock data set
Continuous outcome example
id    y     x1    x2    x3
1     2     1     21    22
2     2     0     12    19
3     0     1     4     20
4     2     0     89    21
5     14    1     0     18
…
n     6     0     45    21
Mean (SD)
Common types of the statistical goals
Single measurements (no comparison)
Difference (compared by subtraction)
Ratio (compared by division)
Prediction (diagnostic test or predictive
model)
Correlation (examine a joint distribution)
Agreement (examine concordance or
similarity between pairs of observations)
Back to the conclusion
Type of outcome: Continuous, Categorical, Count, Survival
Appropriate statistical methods
– Continuous: Mean, Median
– Categorical: Proportion (Prevalence or Risk)
– Count: Rate per “space”
– Survival: Median survival, Risk of events at T(t)
Magnitude of effect
– 95% CI: Answer the research question
based on lower or upper limit of the CI
– P-value
Always report the magnitude of
effect and its confidence interval
Absolute effects:
– Mean, Mean difference
– Proportion or prevalence, Rate or risk, Rate or Risk difference
– Median survival time
Relative effects:
– Relative risk, Rate ratio, Hazard ratio
– Odds ratio
Other magnitude of effects:
– Correlation coefficient (r), Intra-class correlation (ICC)
– Kappa
– Diagnostic performance
– Etc.
Touch the variability (uncertainty)
to understand statistical inference
id          A     (x − X̄)    (x − X̄)²
1           2     -2         4
2           2     -2         4
3           0     -4         16
4           2     -2         4
5           14    10         100
Sum (Σ)     20    0          128
Mean (X̄)    4     0          32.0
Sum = 2+2+0+2+14 = 20
Mean = (2+2+0+2+14)/5 = 20/5 = 4
Median = 2 (middle value of the sorted data 0, 2, 2, 2, 14)
Variance = SD² = 128/(5 − 1) = 32.0
Standard deviation = SD = 5.66
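The calculation for data set A can be reproduced step by step. This is a small sketch using only the Python standard library; the variable names are illustrative. The sum of squared deviations is divided by n − 1, which matches the sample standard deviation returned by `statistics.stdev`.

```python
import math
from statistics import mean, median, stdev

a = [2, 2, 0, 2, 14]  # data set A from the slide

x_bar = mean(a)                              # 20 / 5
deviations = [x - x_bar for x in a]          # (x - mean)
squared = [d ** 2 for d in deviations]       # (x - mean)^2

# Sample variance: sum of squared deviations divided by n - 1.
variance_a = sum(squared) / (len(a) - 1)
sd_a = math.sqrt(variance_a)

print("sum =", sum(a), "mean =", x_bar, "median =", median(a))
print("deviations =", deviations, "sum =", sum(deviations))
print("squared deviations =", squared, "sum =", sum(squared))
print(f"variance = {variance_a}, SD = {sd_a:.2f}")
```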
Touch the variability (uncertainty)
to understand statistical inference
For data set A (2, 2, 0, 2, 14):
– Measure of central tendency: Mean = 4, Median = 2
– Measure of variation: SD = 5.66
Standard deviation (SD) = roughly the average distance between
each data item and the mean
$$ SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} $$
where n − 1 is the degrees of freedom
Same mean BUT different variation
id         A       B       C
1          2       0       3
2          2       3       4
3          0       4       4
4          2       5       4
5          14      8       5
Sum (Σ)    20      20      20
Mean       4       4       4
SD         5.66    2.91    0.71
Median     2       4       4
A: Heterogeneous data, skewed distribution
B: Heterogeneous data, symmetric distribution
C: Homogeneous data, symmetric distribution
Facts about Variation
Because of variability, repeated samples will
NOT yield the same statistic, such as a mean or
proportion:
– Statistics vary from study to study because of the
role of chance
– It is hard to believe that the statistic equals the parameter
– Thus we need statistical inference to estimate the
parameter based on the statistics obtained from a
study
Widely varying data = heterogeneous data
Heterogeneous data require a larger sample size
to achieve a conclusive finding
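A small simulation sketch of these facts. The population (mean 50, SD 10), the sample sizes, and the seed are arbitrary assumptions for illustration. Repeated samples give different means, and the means scatter less when each sample is larger.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 values.
population = [random.gauss(50, 10) for _ in range(10_000)]

def sample_means(n, repeats=200):
    """Draw `repeats` random samples of size n and return their means."""
    return [mean(random.sample(population, n)) for _ in range(repeats)]

means_small = sample_means(10)    # many studies with n = 10
means_large = sample_means(100)   # many studies with n = 100

spread_small = stdev(means_small)
spread_large = stdev(means_large)

print("first three means, n = 10: ", [round(m, 1) for m in means_small[:3]])
print("first three means, n = 100:", [round(m, 1) for m in means_large[:3]])
print(f"spread of means: n = 10 -> {spread_small:.2f}, n = 100 -> {spread_large:.2f}")
```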
The Histogram
id    A     B
1     2     4
2     2     3
3     0     5
4     2     4
5     14    4
[Figure: histograms of A and B, horizontal axis from 0 to 14]
The Frequency Curve
[Figure: frequency curves for the same data sets A and B as in the histogram slide, horizontal axis from 0 to 14]
Area Under The Frequency Curve
[Figure: area under the frequency curves for the same data sets A and B as in the histogram slide, horizontal axis from 0 to 14]
Central Limit Theorem
Distribution of the raw data: Right skew (X1), Symmetry (X2), Left skew (X3)
Distribution of the sample means X̄1, …, X̄n: Normally distributed
Central Limit Theorem
Distribution of the raw data: X1, X2, X3
Distribution of the sampling mean: X̄1, …, X̄n
Large sample → (Theoretical) Normal Distribution
Central Limit Theorem
Raw data: many X, summarized by X̄ and SD
Sampling means: many X̄ (X̄1, …, X̄n), summarized by their mean and SE
Standard deviation of the sampling mean = Standard error (SE)
Estimated by:
$$ SE = \frac{SD}{\sqrt{n}} $$
Standardized for whatever n: Mean = 0, Standard deviation = 1
Large sample → (Theoretical) Normal Distribution
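A simulation sketch of the Central Limit Theorem and the standard error. The exponential distribution, sample size, number of repeats, and seed are assumptions chosen for illustration. The raw data are right-skewed (exponential with mean 1 and SD 1), yet the standard deviation of the sample means comes out close to SD/√n.

```python
import math
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

n = 50          # size of each sample
repeats = 2000  # number of repeated samples

# Each sample is drawn from a right-skewed exponential distribution
# whose true mean and true SD are both 1.
sample_means = [
    mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

observed_se = stdev(sample_means)        # SD of the sampling means
theoretical_se = 1.0 / math.sqrt(n)      # SE = SD / sqrt(n)

print(f"mean of the sample means = {mean(sample_means):.3f}")
print(f"observed SE = {observed_se:.3f}, SD/sqrt(n) = {theoretical_se:.3f}")
```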
(Theoretical) Normal Distribution
99.73% of AUC
Mean ± 3SD
95.45% of AUC
Mean ± 2SD
68.27% of AUC
Mean ± 1SD
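The areas under the normal curve can be checked numerically. This is a short sketch using the error function from the Python standard library: for a normal distribution, the area within k standard deviations of the mean is erf(k/√2).

```python
import math

def area_within(k):
    """Area under the normal curve between mean - k*SD and mean + k*SD."""
    return math.erf(k / math.sqrt(2))

auc = {k: area_within(k) for k in (1, 2, 3)}

for k, area in auc.items():
    print(f"Mean ± {k}SD covers {area * 100:.2f}% of the area")
```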
Sample
n = 25
X̄ = 52
SD = 5
Population
Parameter estimation
[95%CI]
Hypothesis testing
[P-value]
Z = 2.58 (99% confidence)
Z = 1.96 (95% confidence)
Z = 1.64 (90% confidence)
$$ SE = \frac{SD}{\sqrt{n}} = \frac{5}{\sqrt{25}} = \frac{5}{5} = 1 $$
Sample
n = 25
X̄ = 52
SD = 5
SE = 1
Population
Parameter estimation
[95%CI]:
52 − 1.96(1) to 52 + 1.96(1)
50.04 to 53.96
We are 95% confident that the population mean lies between 50.04 and 53.96
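The interval can be reproduced as follows. This is a sketch based on the normal approximation used on the slide; `NormalDist` from the Python standard library supplies the Z value for a 95% interval (about 1.96).

```python
import math
from statistics import NormalDist

n, x_bar, sd = 25, 52.0, 5.0

se = sd / math.sqrt(n)               # standard error = 5 / 5 = 1
z = NormalDist().inv_cdf(0.975)      # two-sided 95% critical value

lower = x_bar - z * se
upper = x_bar + z * se

print(f"SE = {se}, Z = {z:.2f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```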
Sample
n = 25
X̄ = 52
SD = 5
SE = 1
Population
Hypothesis testing
H0: μ = 55
HA: μ ≠ 55
$$ Z = \frac{55 - 52}{1} = 3 $$
The sample mean of 52 lies 3 SE below the hypothesized mean of 55
Hypothesis testing
H0: μ = 55
HA: μ ≠ 55
$$ Z = \frac{55 - 52}{1} = 3 $$
P-value = 1 − 0.9973 = 0.0027
If the true mean in the population is 55, the chance of obtaining a sample mean of 52 or
more extreme is 0.0027.
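The same P-value can be obtained from the standard normal distribution. This is a sketch of a two-sided Z test, assuming as on the slide that the standard error is treated as known.

```python
import math
from statistics import NormalDist

n, x_bar, sd = 25, 52.0, 5.0
mu_0 = 55.0                          # mean under the null hypothesis

se = sd / math.sqrt(n)
z = abs(mu_0 - x_bar) / se           # distance from the null value in SE units

# Two-sided P-value: area in both tails beyond |Z|.
p_value = 2 * (1 - NormalDist().cdf(z))

print(f"Z = {z}, two-sided P-value = {p_value:.4f}")
```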
P-value is the magnitude of chance
NOT magnitude of effect
P-value < 0.05 = Significant findings
Small chance of being wrong in rejecting the null
hypothesis
If in fact there is no [effect], it is unlikely to get the
[effect] = [magnitude of effect] or more extreme
Significance DOES NOT MEAN importance
Any extra-large study can give a very small P-value even if the [magnitude of effect] is very
small
P-value is the magnitude of chance
NOT magnitude of effect
P-value > 0.05 = Non-significant findings
High chance of being wrong in rejecting the null
hypothesis
If in fact there is no [effect], an [effect] =
[magnitude of effect] or more extreme can
occur by chance.
Non-significance DOES NOT MEAN no
difference, equality, or no association
Any small study can give a very large P-value
even if the [magnitude of effect] is very large
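A sketch of how the P-value depends on sample size as well as on the magnitude of effect. The effects, SD, and sample sizes are hypothetical, and the helper `two_sided_p` is an illustrative one-sample Z test, not a function from any library.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided P-value of a one-sample Z test for a mean difference `effect`."""
    se = sd / math.sqrt(n)
    z = effect / se
    return math.erfc(abs(z) / math.sqrt(2))

# Very small effect, extra-large study.
p_large_study = two_sided_p(effect=1, sd=10, n=10_000)

# Very large effect, very small study.
p_small_study = two_sided_p(effect=10, sd=10, n=3)

print(f"effect = 1,  n = 10000: P = {p_large_study:.2e}")
print(f"effect = 10, n = 3:     P = {p_small_study:.3f}")
```

The tenfold larger effect is non-significant in the tiny study, while the small effect is highly significant in the large one, which is why the magnitude of effect and its confidence interval carry the message.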
P-value vs. 95%CI (1)
An example of a study with dichotomous outcome
A study compared cure rate between Drug A and Drug B
Setting:
Drug A = Alternative treatment
Drug B = Conventional treatment
Results:
Drug A: n1 = 50, Pa = 80%
Drug B: n2 = 50, Pb = 50%
Pa-Pb = 30% (95%CI: 26% to 34%; P=0.001)
P-value vs. 95%CI (2)
Pa > Pb
Pb > Pa
Pa-Pb = 30% (95%CI: 26% to 34%; P< 0.05)
P-value vs. 95%CI (3)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. page 99
P-value vs. 95%CI (4)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. page 99
There was a statistically
significant difference
between the two groups.
P-value vs. 95%CI (5)
Adapted from: Armitage, P. and Berry, G. Statistical methods in medical research. 3rd edition. Blackwell Scientific Publications, Oxford. 1994. page 99
There was no
statistically significant
difference between the
two groups.
P-value vs. 95%CI (6)
Safe tips:
– Always report the 95%CI with the p-value; do NOT report
the p-value alone
– Always interpret based on the lower or upper
limit of the confidence interval; the p-value is
optional
– Never interpret p-value > 0.05 as an indication
of no difference or no association; only the CI
can provide this message.
Q&A
Thank you