How do we identify which patients to study?

Phenotype generation from EMR by tensor factorization
SEDI Durham Cohort
James Lu M.D. Ph.D.
Department of Electrical and Computer Engineering
Department of Medicine
Health System Under Pressure
3.2 Trillion / yr (~21% of GDP)
Operations
Novel technology
Where do I achieve cost arbitrage?
Where is my patient going to go next?
Can we reorganize patient flow?
Align incentives, risk sharing, quality metrics, reducing readmissions, Six Sigma/lean, …
Small molecules, medical devices, biologics, diagnostics, genomics, transcriptomics, …
How do we identify which patients to study?
Computable phenotypes are a top-down process
PheKB, Northwestern
Many variations of computable phenotypes require adjudication by physicians.
Expensive and time-consuming
Richesson, et al. 2013
EMR data is large and complicated
Durham County, 2007-2011
Patient level
• >240,000 patients
• Birthday
• Death (where available)
• Gender
• Race
• Ethnicity
• Indicator of presence/absence of particular diseases (computed)
Visit level
• 4.4 million patient visits
• Encounter date (start, end)
• Location (DHRH, DUH, DRH)
• Path (for example, ED -> inpatient)
• Inpatient/outpatient
Intervention level
• >60,000 types of observations
• Average of 18 measurements recorded per visit:
  • CPT
  • ICD9 diagnoses
  • ICD9 procedures
  • Lab values
  • Medications
  • Vitals
• Caveats:
• Temporal gaps – People are only patients when they are sick
• We want to incorporate all of this information
• Don’t want to be fooled by mistakes and bias
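One way to hold patient-, visit-, and intervention-level records together is as sparse counts over a multi-way index. The sketch below is only an illustration: the five modes, the field names, and the idea that one event is a single (patient, code, lab, medication, time bin) combination are assumptions, not the cohort's actual schema.

```python
from collections import Counter

# Hypothetical mode order; names are illustrative, not the cohort's schema.
MODES = ("patient", "code", "lab", "medication", "time_bin")


def build_count_tensor(events):
    """Turn event tuples into a sparse count tensor plus per-mode vocabularies.

    Returns (counts, vocab): counts maps an integer index tuple to the number
    of times that combination was observed; vocab[mode] maps each raw value
    to its integer index along that mode.
    """
    vocab = {mode: {} for mode in MODES}
    counts = Counter()
    for event in events:
        # setdefault assigns the next free index the first time a value is seen.
        index = tuple(
            vocab[mode].setdefault(value, len(vocab[mode]))
            for mode, value in zip(MODES, event)
        )
        counts[index] += 1
    return counts, vocab


# Made-up toy events, one tuple per observed combination.
events = [
    ("p1", "ICD9:250.00", "A1c", "metformin", 0),
    ("p1", "ICD9:250.00", "A1c", "metformin", 0),
    ("p1", "ICD9:401.9", "BMP", "lisinopril", 1),
    ("p2", "ICD9:250.00", "A1c", "metformin", 0),
]

counts, vocab = build_count_tensor(events)
shape = tuple(len(vocab[mode]) for mode in MODES)
```

Only observed combinations are stored, which matters when the dense tensor would have on the order of a billion cells and almost all of them are zero.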
Decompose each touch with the health care system into its parts
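● Each visit is a 5-D tensor (~1 billion elements)
● Tensor modes: Patient, Diagnosis/Billing Codes, Labs, Medications, Time
● Model as counts:
$$\mathcal{Y} \sim \mathrm{Pois}\left( \sum_{r=1}^{R} \lambda_r \cdot u_r^{(1)} \otimes \cdots \otimes u_r^{(K)} \right)$$
● Decompose into a sum of R rank-1 components, each the outer product of K mode vectors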
With Piyush Rai and Changwei Hui
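To make the count model on this slide concrete, the following minimal sketch evaluates the Poisson rate of a single cell as a weighted sum of rank-1 terms and scores a small sparse count tensor under those rates. It is not the authors' inference code: the function names, the three-mode toy tensor, and all numbers are invented, and no fitting is performed.

```python
import math
from itertools import product


def cp_rate(lambdas, factors, index):
    """Poisson rate of one cell: sum over r of lambda_r * prod over modes of u_r[i]."""
    total = 0.0
    for r, lam in enumerate(lambdas):
        term = lam
        for mode, i in enumerate(index):
            term *= factors[mode][r][i]
        total += term
    return total


def poisson_log_likelihood(counts, lambdas, factors):
    """Log-likelihood of a sparse count tensor {index: count}.

    Cells missing from `counts` are treated as observed zeros.
    """
    shape = [len(mode_vectors[0]) for mode_vectors in factors]
    ll = 0.0
    for index in product(*(range(n) for n in shape)):
        rate = cp_rate(lambdas, factors, index)
        y = counts.get(index, 0)
        if rate == 0.0:
            if y > 0:
                return float("-inf")  # a positive count is impossible at rate 0
            continue  # a zero count at rate 0 contributes log(1) = 0
        ll += y * math.log(rate) - rate - math.lgamma(y + 1)
    return ll


# Toy example (assumed): 3 modes of sizes 3 x 2 x 2 and R = 2 components.
# factors[mode][r] is the vector of component r along that mode.
lambdas = [2.0, 1.0]
factors = [
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],  # patient mode
    [[1.0, 0.0], [0.0, 1.0]],            # code mode
    [[0.5, 0.5], [1.0, 0.0]],            # medication mode
]
counts = {(0, 0, 0): 1, (2, 1, 0): 2}

log_lik = poisson_log_likelihood(counts, lambdas, factors)
total_rate = sum(
    cp_rate(lambdas, factors, idx) for idx in product(range(3), range(2), range(2))
)
```

Because each toy vector sums to one, the total expected count equals the sum of the weights, which is a convenient sanity check on the reconstruction.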
Computational phenotypes are a bottom-up process. Factors represent latent phenotypes
Evaluate 11,242 patients (~23 million data points) with morbidity outcomes in diabetes
Factor 2
Systemic Lupus Erythematosus
Shoulder Pain
Urate
Calcidiol
Side Effects from Statins
Jo-1
Alprazolam
Factor 10
Clinical Trial Participation
External Catheter Set
Malignant Neoplasm of Prostate
Evening Primrose Oil
Secondary Malignant Neoplasms of Bone
CEA
CA 15-3
Allopurinol
Patients are composites of common and rare latent phenotypes.
Patient by Factor Score Matrix, 40 most common phenotypes
ER/EKG
Standard Labs (e.g. CBC/BMP)
Kidney Disease
Hypertension
Surgical Patient
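A patient-by-factor score matrix can be read off a fitted factorization. The sketch below assumes that a patient's score on factor r is the factor weight times that patient's entry in the patient-mode vector; this scoring rule, the helper names, and all numbers are illustrative assumptions. The phenotype labels are borrowed from the figure only to make the output readable.

```python
def factor_scores(lambdas, patient_factors):
    """scores[p][r] = lambda_r * (patient-mode vector of factor r)[p]."""
    n_patients = len(patient_factors[0])
    return [
        [lam * u[p] for lam, u in zip(lambdas, patient_factors)]
        for p in range(n_patients)
    ]


def top_phenotypes(scores, names, k=2):
    """For each patient, the k factor names with the largest scores."""
    ranked = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda r: row[r], reverse=True)
        ranked.append([names[r] for r in order[:k]])
    return ranked


# Toy values (assumed): 3 factors, 3 patients.
names = ["Hypertension", "Kidney Disease", "Surgical Patient"]
lambdas = [3.0, 2.0, 1.0]
patient_factors = [
    [0.5, 0.5, 0.0],    # factor 0 across patients 0..2
    [0.0, 0.25, 0.75],  # factor 1
    [1.0, 0.0, 0.0],    # factor 2
]

scores = factor_scores(lambdas, patient_factors)
top = top_phenotypes(scores, names, k=2)
```

Each row of `scores` describes one patient as a mixture of latent phenotypes, which is the sense in which patients are composites of common and rare factors.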
Compare Outcome Prediction to Known Algorithm (UKPDS)
• UKPDS: UK Prospective Diabetes Study outcomes model, used to predict MI, death, and stroke
• 7 demographic + lab variables:
  • Age, ethnicity, smoking status
  • A1c, HDL, total cholesterol, and systolic BP
• Dataset
  • Original 7-variable model
  • All data
  • Non-negative matrix factorization
  • Tensor factorization
• Can we predict outcomes in the next year?
  • Death
  • AMI
  • Stroke
• Classification model:
  • Fit data with random forests
  • 10-fold cross-validation
With Joseph Lucas
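The evaluation setup on this slide (random forests scored with 10-fold cross-validation) might look like the following scikit-learn sketch. Everything specific is an assumption made for illustration: the features are synthetic stand-ins for factor scores, the outcome is simulated, the folds are stratified, and ROC AUC is used as the metric because the slides do not name one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a patient-by-factor score matrix (not cohort data).
n_patients, n_factors = 400, 10
X = rng.gamma(shape=1.0, scale=1.0, size=(n_patients, n_factors))

# Simulated binary one-year outcome, loosely driven by the first factor.
logits = 1.5 * X[:, 0] - 2.0
y = (rng.random(n_patients) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One ROC AUC value per held-out fold.
auc_per_fold = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over {len(auc_per_fold)} folds: {auc_per_fold.mean():.3f}")
```

Running the same loop once per feature set (the 7-variable inputs, all data, matrix-factorization scores, tensor-factorization scores) and once per outcome would give a like-for-like comparison of the kind the slide describes.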
Tensor-derived factors perform better than the original UKPDS model on all outcomes and provide comparable performance to the “all-data” model
Stroke is similar to Dat