Phenotype generation from EMR by tensor factorization
SEDI Durham Cohort
James Lu M.D. Ph.D.
Department of Electrical and Computer Engineering
Department of Medicine

Health System Under Pressure
3.2 Trillion / yr (~21% of GDP)
Operations
• Where do I achieve cost arbitrage?
• Where is my patient going to go next?
• Can we reorganize patient flow?
• Align incentives, risk sharing, quality metrics, reducing readmissions, six sigma / lean, …
Novel technology
• Small molecules, medical devices, biologics, diagnostics, genomics, transcriptomics, …
• How do we identify which patients to study?

Computable phenotypes are a top-down process
PheKB, Northwestern
• Many variations of computable phenotypes require adjudication by physicians.
• Expensive and time consuming
Richesson, et al. 2013

EMR data is large and complicated
Durham County, 2007-2011
Patient level (>240,000 patients)
• Birthday
• Death (where available)
• Gender
• Race
• Ethnicity
Visit level (4.4 million patient visits)
• Average 18 measurements recorded per visit
• Indicator of presence/absence of particular diseases (computed)
• Encounter date (start, end)
• Location (DHRH, DUH, DRH)
• Path (ED -> inpatient, for example)
• Inpatient / Outpatient
Intervention level (>60,000 types of observations)
• CPT
• ICD9 diagnoses
• ICD9 procedures
• Lab values
• Medications
• Vitals
Caveats:
• Temporal gaps: people are only patients when they are sick
• We want to incorporate all of this information
• Don’t want to be fooled by mistakes and bias

Decompose each touch with the health care system into its parts
• Each visit is a 5-D tensor (~1 billion elements)
• Modes: Patient, Diagnosis / Billing Codes, Labs, Medications, Time
[Figure: count tensor with Medications and Codes axes, written as a sum of rank-1 terms weighted by \lambda_r]
• Model as counts:
  $\mathcal{Y} \sim \mathrm{Pois}\left( \sum_{r=1}^{R} \lambda_r \cdot u_r^{(1)} \otimes \cdots \otimes u_r^{(K)} \right)$
• Decompose into a sum of R rank-1 components, each the outer product of K vectors (one per tensor mode)
With Piyush Rai and Changwei Hui

Computational phenotypes are a bottom-up process.
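The count model above can be made concrete with a small sketch. The slides do not say how the model is fit, so this example swaps in plain maximum-likelihood multiplicative updates (the tensor analogue of KL-divergence NMF) as a stand-in. It also uses a 3-mode tensor rather than the 5 modes on the slide, and the names `reconstruct` and `poisson_cp` are hypothetical, chosen only for illustration.

```python
import numpy as np

def reconstruct(lam, U):
    # sum_r lam[r] * U[0][:, r] (outer) U[1][:, r] (outer) U[2][:, r]
    return np.einsum("r,ir,jr,kr->ijk", lam, U[0], U[1], U[2])

def poisson_cp(Y, rank, n_iter=200, seed=0, eps=1e-10):
    """Fit Y ~ Pois(sum_r lam_r * u_r1 (outer) u_r2 (outer) u_r3).

    Assumption: multiplicative KL updates, not necessarily the
    inference procedure used in the original work.
    """
    rng = np.random.default_rng(seed)
    # Nonnegative factor matrices, one per mode, columns summing to 1.
    U = [rng.random((dim, rank)) + 0.1 for dim in Y.shape]
    U = [u / u.sum(axis=0) for u in U]
    lam = np.full(rank, Y.sum() / rank)
    # einsum specs that contract the ratio tensor with the other two modes.
    specs = ["ijk,jr,kr->ir", "ijk,ir,kr->jr", "ijk,ir,jr->kr"]
    for _ in range(n_iter):
        for mode, spec in enumerate(specs):
            ratio = Y / (reconstruct(lam, U) + eps)
            others = [U[m] for m in range(3) if m != mode]
            # Because the other factors have unit column sums, the usual
            # denominator of the multiplicative update equals 1.
            A = U[mode] * lam * np.einsum(spec, ratio, *others)
            lam = A.sum(axis=0)          # component weights
            U[mode] = A / (lam + eps)    # renormalize columns
    return lam, U
```

Under these assumptions, each column of a factor matrix is a nonnegative profile over one mode (for example, which codes or medications load on a component), and `lam[r]` is the total count mass explained by component `r`.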
Factors represent latent phenotypes
Evaluated 11,242 patients (~23 million data points) with morbidity outcomes in diabetes
Factor 2
• Systemic Lupus Erythematosus
• Shoulder Pain
• Urate
• Calcidiol
• Side Effects from Statins
• Jo-1
• Alprazolam
Factor 10
• Clinical Trial Participation
• External Catheter Set
• Malignant Neoplasm Prostate
• Evening Primrose Oil
• Secondary Malignant Neoplasms of Bone
• CEA
• AG 15-3
• Allopurinol

Patients are composites of common and rare latent phenotypes.
Patient by Factor Score Matrix, 40 most common phenotypes
[Figure: patient-by-factor score matrix; labeled factors: ER / EKG, Standard Labs (e.g. CBC / BMP), Kidney Disease, Hypertension, Surgical Patient]

Compare outcome prediction to known algorithm (UKPDS)
UKPDS: UK Prospective Diabetes Study outcomes model, used to predict MI, death, and stroke
7 demographic + lab variables: age, ethnicity, smoking status, A1c, HDL, total cholesterol, and systolic BP
Datasets compared:
• Original 7 variable model
• All Data
• Non-negative Matrix Factorization
• Tensor Factorization
Can we predict outcome in the next year?
• Death
• AMI
• Stroke
Classification model: fit data with Random Forests, 10-fold cross validation
With Joseph Lucas

Tensor-derived factors perform better than the original UKPDS model in all outcomes and provide comparable performance to the “all-data” model
Stroke is similar to Dat
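The evaluation described above (Random Forests with 10-fold cross validation) could be sketched as follows. This is only an illustration under stated assumptions. The data are synthetic stand-ins for patient-by-factor scores, the AUC metric and stratified folds are choices made here rather than details given in the slides, and `cv_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_auc(features, outcome, n_trees=100, seed=0):
    """Return the 10 per-fold AUC values for a random forest classifier."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    # Stratified folds keep the outcome rate similar in every fold
    # (an assumption; the slides only say "10 fold cross validation").
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return cross_val_score(model, features, outcome, cv=folds, scoring="roc_auc")

# Synthetic stand-in: 400 hypothetical patients, 8 hypothetical factor scores,
# with a binary one-year outcome driven by the first factor.
rng = np.random.default_rng(0)
scores = rng.random((400, 8))
risk = 1.0 / (1.0 + np.exp(-(4.0 * scores[:, 0] - 2.0)))
outcome = rng.binomial(1, risk)

aucs = cv_auc(scores, outcome, n_trees=50)
print(f"mean AUC over 10 folds: {aucs.mean():.3f}")
```

To compare feature sets in the way the slide describes, the same helper would be called once per feature matrix (the 7 UKPDS variables, all data, matrix-factorization scores, tensor-factorization scores) for each outcome.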