
Case-based reasoning using electronic health
records efficiently identifies eligible patients for
clinical trials
Riccardo Miotto and Chunhua Weng
Department of Biomedical Informatics
Columbia University, New York City
Acknowledgments
The research was supported by two grants:
1. R01 LM009886 “Bridging the semantic gap between
clinical research eligibility criteria and clinical data”
from The National Library of Medicine (PI: Weng)
2. UL1 TR000040 Clinical and Translational Science
Award to Columbia University (PI: Ginsberg)
The Top 5 Challenges for Clinical Research
1. Recruitment
2. Recruitment
3. Recruitment
4. Recruitment
5. Recruitment
The Opportunity
Samson W. Tu, Carol A. Kemper, Nancy M. Lane,
Robert W. Carlson, and Mark A. Musen. A
methodology for determining patients’ eligibility
for clinical trials. Methods of
Information in Medicine, 32(4):317–325, 1993.
EHRs contain rich information for identifying eligible
patients for clinical trials
As of 2015, EHRs have become pervasive in medicine
Our Initial Plan
To derive a computable representation of the clinical trial
eligibility criteria and to align it with patient EHR records
Free-text Eligibility Criteria
Text-based Knowledge Acquisition
Semantic Alignment
Potentially Eligible Patients
… and Attempts
… and Challenges
• Data quality issues (Weiskopf, JAMIA 2013)
  – Completeness
  – Concordance
  – Correctness
  – Plausibility
  – Currency
• Data representation heterogeneity: structured, unstructured
• Lack of common information models
• Concept granularity discrepancies even using the same T
• Incomplete knowledge, imprecise disease classification
• Workflow and practical issues: recent visit, belief in research, etc.
An Alternative
To derive a computable representation of the clinical trial
target patient (rather than of the eligibility criteria) and to find
patients similar to the target (rather than aligning the criteria with patient EHR records)
Our Conceptual Framework
1. A set of eligible patients or clinical trial participants is manually identified
  – their EHRs are aggregated to derive the “target patient”
2. The target patient is compared with any unseen patient in a clinical data warehouse to check eligibility status
3. For each patient, the framework returns a relevance score
  – the higher the score, the more likely the patient is eligible for the trial
4. Patients can be ranked by their relevance scores
  – potentially eligible patients are at the top of the list
• Manual identification performed by the investigator can be done quickly
Example Related Work
Methods: comparison to related work
• State of the art
  – train a binary classifier to determine whether a patient is eligible
  – the classifier is the “target patient”
• Limitation of the state of the art
  – training a binary classifier requires a list of participants and also a list of ineligible patients, and finding ineligible patients can be as difficult and laborious as finding eligible patients
• Alternative approach
  – generate the “target patient” by modeling only the trial participants
• Advantages
  – relies only on a patient representation built from comprehensive data, without formally representing eligibility criteria
A Pilot Study
• Study goal
  – to show the feasibility of using only a small number of trial participants to discover new potentially eligible patients
• Our implementation favored a simple design to ensure a focused and correct evaluation
  – more complicated implementations are more likely to introduce mistakes in the process
• Flexible customization of how to:
  – process and summarize patient EHR data
  – represent the clinical trial participants
  – discover potentially eligible patients
Patient EHR Processing (1)
• EHR data types
  – medication orders, ICD-9 diagnoses, laboratory results, clinical notes
• Each participant is represented by four vectors, one per data type
• Medication orders
  – count the occurrences of each code for every participant
• Laboratory results
  – count the occurrences of each test if results are categorical or expressed on different scales
  – average the test result values if they are expressed in the same unit of measure
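The counting and averaging rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the sparse-dictionary representation, and the rule that any non-numeric lab value is treated as categorical are all assumptions. It also assumes each test is either always numeric (in a single unit) or always categorical.

```python
from collections import Counter
from statistics import mean


def code_count_vector(codes):
    """Count how often each code (e.g. a medication order) occurs for one patient."""
    return dict(Counter(codes))


def lab_vector(results):
    """Summarize one patient's lab results as a sparse vector.

    `results` is a list of (test_name, value) pairs. Numeric values are
    assumed to share one unit per test and are averaged; any other value
    (a categorical result) is only counted.
    """
    numeric = {}
    counts = Counter()
    for test, value in results:
        if isinstance(value, (int, float)):
            numeric.setdefault(test, []).append(value)
        else:
            counts[test] += 1
    vector = {test: mean(values) for test, values in numeric.items()}
    vector.update(counts)
    return vector
```

The same counting function would apply to ICD-9 codes, since they follow the same rule as medication orders.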
Patient EHR Processing (2)
• ICD-9 codes
  – count the occurrences of each code for every participant
• Clinical notes
  – extract relevant tags
    • limited presence of stop words, matching to UMLS
  – use UMLS to normalize tags
    • aggregate synonyms and semantically similar concepts under the same tag
  – remove negated tags
  – remove temporally consecutive similar notes
  – topic modeling
    • unsupervised inference process that captures patterns of word co-occurrence within a heterogeneous set of notes to define topics
    • represent each note as a vector of topic probabilities
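Two of the note-cleaning steps above, tag normalization and removal of consecutive near-duplicate notes, might look roughly like the sketch below. Everything here is illustrative: the small `synonyms` dictionary merely stands in for a UMLS lookup, and the slides do not say how note similarity was measured, so the Jaccard overlap and the 0.8 threshold are assumptions.

```python
def normalize_tags(tags, synonyms):
    """Map each tag to a canonical form (hypothetical stand-in for UMLS normalization)."""
    return [synonyms.get(tag.lower(), tag.lower()) for tag in tags]


def jaccard(a, b):
    """Jaccard overlap between two collections of tags."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_consecutive_similar(notes, threshold=0.8):
    """Keep a note only if it differs enough from the previously kept note.

    `notes` is a chronologically ordered list of tag lists; `threshold`
    is an assumed cut-off, not a value from the source.
    """
    kept = []
    for tags in notes:
        if kept and jaccard(kept[-1], tags) >= threshold:
            continue
        kept.append(tags)
    return kept
```

Topic modeling itself is left out of the sketch because it needs a dedicated library and the slides do not name the model used.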
Target Patient (1)
• Participant EHR representations are aggregated to derive
the “target patient”
Target Patient (2)
• Target patient
  – for each data type, retain only the concepts frequently shared by the participants
  – motivated by the small number of participants available for some trials
  – a concept is frequent when it appears in at least 60% of the participants
  – average concept occurrences over all participants
• Target patient represented by four vectors
  – aggregated common patterns among the trial’s participants
  – same structure as a regular patient representation
  – favors direct comparisons with patient data
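The aggregation rule above (keep concepts present in at least 60% of participants, then average their values) can be sketched for a single data type as follows. The function name and the sparse-dictionary input are assumptions, and dividing by the total number of participants is one reading of "average over all participants"; the source does not spell out the denominator.

```python
from collections import Counter, defaultdict


def build_target(participant_vectors, min_share=0.6):
    """Aggregate per-participant vectors of one data type into a target vector.

    `participant_vectors` is a list of dicts mapping concept -> value.
    A concept is retained when it appears in at least `min_share` of the
    participants; its target value is the sum of its values divided by
    the number of participants (an assumed interpretation).
    """
    n = len(participant_vectors)
    if n == 0:
        return {}
    presence = Counter()
    totals = defaultdict(float)
    for vector in participant_vectors:
        for concept, value in vector.items():
            presence[concept] += 1
            totals[concept] += value
    return {
        concept: totals[concept] / n
        for concept, seen in presence.items()
        if seen / n >= min_share
    }
```

Calling this once per data type would give the four vectors of the target patient, with the same structure as a regular patient representation.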
Finding New Eligible Patients
• EHR data of unseen patients are matched to the “target patient” of a clinical trial
• Relevance score indicates the likelihood that the patient is eligible
  – pairwise cosine similarity within each data type
  – scores aggregated using a weighted linear combination (“w-comb”)
• The higher the relevance score, the more likely the patient is eligible
• List of patients sorted by score
  – the most likely candidates are at the top of the list to speed up manual review
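The scoring and ranking steps above can be sketched as cosine similarity per data type followed by a weighted sum. The data-type keys, the weights, and the function names are hypothetical; the slides do not report the weights used in "w-comb".

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(patient, target, weights):
    """Weighted linear combination of per-data-type cosine similarities.

    `patient` and `target` map data type -> sparse vector;
    `weights` maps data type -> weight (assumed values, not from the source).
    """
    return sum(
        weight * cosine(patient.get(data_type, {}), target.get(data_type, {}))
        for data_type, weight in weights.items()
    )


def rank_patients(patients, target, weights):
    """Return patient ids sorted from most to least relevant."""
    scored = [(relevance(vectors, target, weights), pid) for pid, vectors in patients.items()]
    return [pid for score, pid in sorted(scored, key=lambda item: -item[0])]
```

Setting a single weight to 1 and the rest to 0 reduces the score to one data type, which is a convenient way to compare single-source rankings against the combined one.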
Evaluation: study design
• Dataset
  – 13 clinical trials from Columbia University
  – 262 unique participants
  – an additional 30k patients extracted at random from the data warehouse
• 2-fold cross-validation
  – half of the participants to derive the “target patient”
  – the other half of the participants plus the 30k random patients to test
• Ranking experiment
  – for each trial, rank all the patients in the test set by their relevance score with the corresponding “target patient”
  – measure where in the list the participants in the test set are ranked
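The study design above can be sketched as a 2-fold split of a trial's participants plus a helper that reports where the held-out participants land in a ranked list. The function names and the fixed random seed are assumptions made for illustration.

```python
import random


def two_fold_splits(participant_ids, seed=0):
    """Split participants into two halves and return both (train, test) folds."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    first, second = ids[:half], ids[half:]
    return [(first, second), (second, first)]


def participant_ranks(ranked_ids, held_out_ids):
    """1-based positions of the held-out participants in a ranked patient list."""
    held_out = set(held_out_ids)
    return [pos for pos, pid in enumerate(ranked_ids, start=1) if pid in held_out]
```

In each fold, the training half would be used to derive the target patient, and the test half would be mixed with the randomly drawn patients before ranking.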
Evaluation: 13 multi-site trials
Evaluation: comparisons
• “w-comb” obtains the best results
  – takes advantage of the heterogeneous data types
AMIA 2014 Annual Symposium, 15 – 19 November 2014
Evaluation: AUC
• Area under the ROC Curve = 0.952
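For readers unfamiliar with the metric, the area under the ROC curve can be computed directly from relevance scores as the probability that a randomly chosen participant outscores a randomly chosen non-participant. The sketch below uses that pairwise definition; it is not necessarily how the reported value was obtained, and treating the randomly drawn patients as negatives is an assumption about the evaluation.

```python
def auc(positive_scores, negative_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in quadratic time, which is
    acceptable for a sketch but slow for very large test sets.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```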
Evaluation: results
• Precision-at-5 for each trial in each fold
  – every trial has at least one potentially eligible patient within the top 5 positions of the ranked list (P5 ≥ 0.2)
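Precision-at-5 is the fraction of the top five ranked patients who are known participants, so a single participant in the top five gives 1/5 = 0.2. A minimal sketch, with a hypothetical function name:

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked patients that are known participants."""
    relevant = set(relevant_ids)
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant)
    return hits / k
```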
Conclusions
• EHR data of clinical trial participants can be used to recommend new eligible patients
  – ranking results are consistent across multiple trials for different medical conditions
  – satisfactory results regardless of the number of participants used for training
• Potential application scenarios
  – stand-alone tool that constantly monitors a clinical data warehouse and alerts investigators when a new potentially eligible patient is identified
  – on-request tool that ranks all patients in a data warehouse to find eligible patients for a specific trial
  – component to integrate with approaches that process eligibility criteria
Ongoing Work
• Patient representation
  – test more sophisticated techniques to represent EHR data
  – model laboratory results accounting for the temporal trends of the values
  – model diagnoses and medications using a well-chosen probability distribution to handle the incompleteness problem of EHR data
• Improve the “target patient”
  – statistical model that can be trained by only estimating the distributions of participants associated with each trial
  – e.g., hidden Markov models, mixture models
• Extend the experimental framework
  – more and diverse clinical trials
  – more patients in the dataset
• Tackle the complexity of trial design, e.g., the need for more than one target model
Chunhua Weng, Ph.D.
[email protected]