18a_DiseaseSamplingVar_IntervalModel

Adjusting for sampling variability in disease mapping using an interval soft data model
Sampling Variability
• Disease rates are a common measure of disease risk, since higher rates reflect a greater chance of contracting the disease.
• Mapping raw rates may obscure the underlying (latent) risk because of noise generated by sampling variability.
  - The noise is dominated by areas with small populations, since small changes in the observed cases produce large changes in the rate.
  - High rates based on small populations are likely to be artificially elevated because of the high variability in the estimates.
  - The problem particularly affects maps of rare diseases calculated from health data indexed at a small geographical resolution (sparse data).
Sampling Variability
• Ex: Study Area Rate = 2 / 2500 = 0.0008
• Subregional Rate = Cases / Population, for the nine sub-regions of the study area:
  - 1 / 1000 (0.001)
  - 0 / 500
  - 0 / 100
  - 0 / 500
  - 1 / 100 (0.01)
  - 0 / 75
  - 0 / 100
  - 0 / 75
  - 0 / 50
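The arithmetic in this example can be checked with a short sketch. The pairing of case counts with populations is read off the slide figure in reading order and is an assumption about its layout; the variable names are illustrative only.

```python
# (cases, population) pairs for the nine sub-regions, as read from the
# example figure. The pairing is an assumption about the figure layout.
subregions = [
    (1, 1000), (0, 500), (0, 100),
    (0, 500), (1, 100), (0, 75),
    (0, 100), (0, 75), (0, 50),
]

# Aggregating all sub-regions gives the study-area rate.
total_cases = sum(cases for cases, _ in subregions)
total_pop = sum(pop for _, pop in subregions)
study_area_rate = total_cases / total_pop

# Each sub-regional rate is cases divided by population.
subregion_rates = [cases / pop for cases, pop in subregions]

print(f"study area: {total_cases}/{total_pop} = {study_area_rate:.4f}")
for (cases, pop), rate in zip(subregions, subregion_rates):
    print(f"sub-region: {cases}/{pop} = {rate:.4f}")
```

A single case moves the rate of a sub-region of 100 people to 0.01, more than ten times the study-area rate, while the same single case in a sub-region of 1000 people gives 0.001.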
Sampling Variability
• Upscaling (aggregating sub-regions into larger areas) stabilizes the rates but causes a loss in the resolution of the data.
• ‘Spatial smoothing’ allows areas to borrow strength from neighboring regions to produce a more stable estimate of the value associated with each region. Ex:
  - Disk smoothing obtains a smoothed value for each region by averaging the rates of all regions falling within a defined neighborhood window.
  - Empirical Bayes and Bayesian hierarchical models ‘shrink’ rates toward a mean trend estimated from known covariates (Clayton and Kaldor, 1987).
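Disk smoothing as described above can be sketched in a few lines. This is a minimal illustration, assuming that each region is represented by a centroid, that the neighborhood window is a circle of fixed radius, and that rates are averaged without population weights; the function name `disk_smooth` and the toy data are hypothetical.

```python
import math

def disk_smooth(centroids, rates, radius):
    """Replace each region's rate with the unweighted mean rate of all
    regions whose centroid lies within `radius` (the region itself included)."""
    smoothed = []
    for xi, yi in centroids:
        window = [
            rate
            for (xj, yj), rate in zip(centroids, rates)
            if math.hypot(xi - xj, yi - yj) <= radius
        ]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Toy data: three neighboring regions and one isolated region.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
rates = [0.0, 0.01, 0.0, 0.002]
smoothed = disk_smooth(centroids, rates, radius=1.5)
print(smoothed)
```

The isolated region has no neighbors inside the window, so its rate is unchanged; the spike of 0.01 in the second region is pulled down by its two neighbors.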
Model Definitions
• Disease risk $X_{it}$ may be defined as the probability of the disease occurring in a population of size $n_{it}$ as $n_{it}$ approaches infinity:

  $$X_{it} = \lim_{n_{it} \to \infty} \frac{Y_{it}}{n_{it}}, \qquad i = 1, \dots, I; \quad t = 1, \dots, T$$

• In real-world situations, only a finite population $n_{it}$ may be sampled, producing a finite number of cases $Y_{it}$ and an observed rate $R_{it} = Y_{it} / n_{it}$.
• The difference between $R_{it}$ and $X_{it}$ may be expressed as:

  $$R_{it} = X_{it} + \varepsilon_{it}$$

  where $\varepsilon_{it}$ is the error due to sampling variability.
Model Definitions
• Given $X_{it}$ and $N_{it}$ measured without error and representative of the population at risk:

  $$Y_{it} = \operatorname{round}(X_{it} N_{it})$$

• Due to rounding error:

  $$X_{it} \in \left[ \frac{Y_{it} - 0.5}{N_{it}}, \; \frac{Y_{it} + 0.5}{N_{it}} \right]$$

• or

  $$X_{it} \in \left[ R_{it} - \frac{0.5}{N_{it}}, \; R_{it} + \frac{0.5}{N_{it}} \right]$$

• Smaller population sizes generate larger confidence intervals and higher variability around observed rates.
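The rounding interval can be computed directly from the case count and the population. The sketch below is illustrative; the function name `rounding_interval` is hypothetical. The source does not say whether the lower bound is truncated at zero, so no truncation is applied here.

```python
def rounding_interval(cases, population):
    """Bounds on the risk implied by cases = round(risk * population):
    the observed rate plus or minus 0.5 / population."""
    rate = cases / population
    half_width = 0.5 / population
    # Note: for zero cases the lower bound is negative; it is left
    # untruncated because the source does not specify a truncation.
    return rate - half_width, rate + half_width

# The same observed rate (0.01) at three population sizes.
for cases, population in [(1, 100), (10, 1000), (100, 10000)]:
    lower, upper = rounding_interval(cases, population)
    print(f"N = {population:>5}: [{lower:.5f}, {upper:.5f}]")
```

The interval width is 1 / population, so it shrinks by a factor of ten each time the population grows by a factor of ten.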
Model Definitions
• In most datasets, there will be additional errors associated with the measurement of cases given a sample of the population at risk.
• To include this additional uncertainty, we define a smoothing factor $\lambda$ such that:

  $$\lambda = 0.5 + \delta$$

  where $\delta$ is the error in observed cases due to random effects other than sample size.
• Therefore,

  $$X_{it} \in \left[ R_{it} - \frac{\lambda}{N_{it}}, \; R_{it} + \frac{\lambda}{N_{it}} \right]$$

• In the special case that $Y_{it}$ and $N_{it}$ are measured without error, $\delta = 0$ and $\lambda = 0.5$.
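The effect of the smoothing factor on the interval can be sketched as follows. The function name `soft_interval` and its `extra_error` parameter (the error in observed cases beyond rounding) are hypothetical names chosen for illustration.

```python
def soft_interval(cases, population, extra_error=0.0):
    """Interval around the observed rate whose half-width is the smoothing
    factor (0.5 + extra_error) divided by the population."""
    smoothing = 0.5 + extra_error
    rate = cases / population
    half_width = smoothing / population
    return rate - half_width, rate + half_width

# extra_error = 0 recovers the pure rounding interval (smoothing factor 0.5).
for extra_error in (0.0, 0.5, 1.5):
    lower, upper = soft_interval(1, 100, extra_error)
    print(f"smoothing factor {0.5 + extra_error:.1f}: [{lower:.4f}, {upper:.4f}]")
```

A larger smoothing factor widens every interval in proportion to 1 / population, so the least populated regions receive the widest intervals.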
Conceptual Example
[Figure: Random number generator, followed by $Y_{it} = \operatorname{round}(X_{it} N_{it})$ and $R_{it} = Y_{it} / N_{it}$]
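The conceptual example can be imitated with a small simulation. This is a sketch under stated assumptions: the latent risk is drawn from a uniform distribution with an arbitrary upper bound (the slide only says "random number generator"), and the function name `simulate`, the `max_risk` parameter, and the population sizes are hypothetical.

```python
import random

def simulate(populations, max_risk=0.01, seed=0):
    """For each population N, draw a latent risk X, derive the case count
    Y = round(X * N), and compute the observed rate R = Y / N."""
    rng = random.Random(seed)
    records = []
    for n in populations:
        risk = rng.uniform(0.0, max_risk)  # assumed distribution of X
        cases = round(risk * n)            # built-in round: ties go to even
        rate = cases / n
        records.append((n, risk, cases, rate))
    return records

records = simulate([50, 100, 500, 1000, 10000])
for n, risk, cases, rate in records:
    print(f"N = {n:>5}  X = {risk:.5f}  Y = {cases:>3}  R = {rate:.5f}  "
          f"|R - X| = {abs(rate - risk):.5f}  bound = {0.5 / n:.5f}")
```

In every row the gap between the observed rate and the latent risk stays within 0.5 / N, which is the rounding interval, and the bound is loosest for the smallest populations.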
Spatial Covariance
[Figure: experimental covariance (points) and fitted model (line) for (a) the observed rate field R(s) and (b) the risk field X(s)]
Model Fit
MSE: 1.52e-06
MSE: 6.43e-07
Observed Rate as Risk Estimate
MSE: 1.52e-06
MSE: 2.75e-07
Application: Spatio-temporal Mapping of HIV in North Carolina
North Carolina
[Figure: map of North Carolina showing urbanized areas, interstate highways (I-40, I-77, I-85, I-95), and the cities of Asheville, Charlotte, Greensboro, Raleigh, Fayetteville, and Wilmington; scale bar in miles]
North Carolina VCT Locations
[Figure: map of VCT sites in North Carolina; scale bar in miles]
Hard/Soft Data Cross-validation

Threshold (percentile     MSE,         MSE, soft data              Change in MSE from
of testing population)    hard data    (smoothing factor = 0.5)    hard to soft data (%)
0.6                       0.000053     0.000054                      1.7
0.7                       0.000051     0.000049                     -3.6
0.8                       0.000048     0.000043                    -11.6
0.9                       0.000054     0.000045                    -17.3
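The last column can be read as a percent change in MSE when moving from hard to soft data, so negative values mean the soft data reduced the error. The sketch below assumes the formula 100 × (soft − hard) / hard, which the source does not state explicitly. Because the tabulated MSE values are rounded, recomputing from them gives figures close to, but not identical to, those in the table.

```python
thresholds = [0.6, 0.7, 0.8, 0.9]
mse_hard = [0.000053, 0.000051, 0.000048, 0.000054]
mse_soft = [0.000054, 0.000049, 0.000043, 0.000045]

def percent_change(hard, soft):
    """Percent change in MSE from hard to soft data (assumed formula);
    negative values indicate a lower MSE with soft data."""
    return 100.0 * (soft - hard) / hard

changes = [percent_change(h, s) for h, s in zip(mse_hard, mse_soft)]
for threshold, change in zip(thresholds, changes):
    print(f"threshold {threshold}: {change:+.1f}%")
```

The sign pattern matches the table: a slight increase in MSE at the 0.6 threshold and progressively larger reductions at 0.7, 0.8, and 0.9.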
[Figure legend: hard data; soft data with smoothing factor = 0.5, 1.0, and 2.0]