Webinar 3 PowerPoint - Patient Matching Algorithm Challenge

Patient Matching Algorithm Challenge
Informational Webinar
Caitlin Ryan, PMP | IRIS Health Solutions LLC, Contract Support to ONC
Adam Culbertson, M.S., M.S. | HIMSS Innovator in Residence, ONC
Agenda
• ONC Overview
• Background on Matching
• About the Challenge
» Eligibility Requirements
» Registration
» Project Submissions
» Winners and Prizes
• Calculating Metrics
• Creating Test Data
• Challenge Q&A
Office of the National Coordinator for Health IT (ONC)
• The Office of the National Coordinator for Health Information Technology (ONC) is at the forefront of the administration's health IT efforts and is a resource to the entire health system, supporting the adoption of health information technology and the promotion of nationwide health information exchange to improve health care.
• ONC is organizationally located within the Office of the Secretary of the U.S. Department of Health and Human Services (HHS).
• ONC is the principal federal entity charged with coordination of nationwide efforts to implement and use the most advanced health information technology and the electronic exchange of health information.
ONC Challenges Overview
• Challenges are hosted under the statutory authority of Section 105 of the America COMPETES Reauthorization Act of 2010 (Pub. L. No. 111-358).
• ONC Tech Lab - Innovation
» Spotlight areas of high interest to ONC and HHS
» Direct attention to new market opportunities
» Continue work with start-up community and administer challenge
contests
» Increase awareness and uptake of new standards and data
ONC Roadmap
• Connecting Health and Care for the Nation: A Shared
Nationwide Interoperability Roadmap
» Released in 2015
» A 10 Year Vision to Achieve An Interoperable Health IT Infrastructure
» Section L: Accurate Individual Data Matching, states that patient
matching is a fundamental requirement for achieving interoperability.
Patient Matching Definition
Patient matching: Comparing data from multiple
sources to identify records that represent the
same patient. Also called merge-purge, record
linkage and entity resolution in other fields.
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
Significant Dates in (Patient) Matching
• 1918 – Soundex, US Patent 1261167
• 1946 – Dunn, "Record Linkage"
• 1959 – Newcombe, Kennedy, & Axford, "Automatic Linkage of Vital Records"
• 1969 – Fellegi & Sunter, "A Theory of Record Linkage"
• 2002 – Grannis et al., "Analysis of Identifier Performance Using a Deterministic Linkage Algorithm"
• 2008 – RAND Health Report, "Identity Crisis: An Examination of the Costs and Benefits of a Unique Patient Identifier for the US Health Care System"; Campbell, K. et al., "A Comparison of Link Plus, The Link King, and a 'Basic' Deterministic Algorithm"
• 2009 – Winkler, "Matching and Record Linkage"; Grannis et al., "Privacy and Security Solutions for Interoperable Health Information Exchange"; HIMSS, Patient Identity Integrity Toolkit and Patient Key Performance Indicators
• 2011 – HIMSS, "Patient Identity Integrity"; "A Framework for Cross-Organizational Patient Identity Management"
• 2014 – Audacious Inquiry and ONC, "Patient Identification and Matching Final Report"; Joffe et al., "A Benchmark Comparison of Deterministic and Probabilistic Methods for Defining Manual Review Datasets in Duplicate Records Reconciliation"; Dusetzina, Stacie B., et al., "Linking Data for Health Services Research: A Framework and Instructional Guide"
• 2015 – Kho, Abel N., et al., "Design and Implementation of a Privacy Preserving Electronic Health Record Linkage Tool"; HIMSS hires Innovator in Residence (IIR) focused on Patient Matching
Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st HIMSS 2016 Las Vegas, NV
The 5 Step Data Match Process
1. Data pre-processing: characterizes the data and ensures the elements have the same structure and the content follows the same format
2. Indexing: organizing the data to support better pairing (blocking, commonly with a blocking key)
3. Comparison: identifying the similarity between two records by producing comparison vectors
4. Classification: based on comparison results, record pairs are classified as matches, non-matches, or potential matches
5. Evaluation: comparing match results with the known ground truth or gold standard
Source: Christen, Peter. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, 2012
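The five steps above map naturally onto code. Below is a minimal, illustrative Python sketch of the pipeline; the field names, blocking key, and scoring threshold are assumptions for demonstration, not part of the challenge.

```python
# Illustrative sketch of the five-step data match process described above.
# Field names, blocking key, and threshold are hypothetical.
from itertools import combinations

def preprocess(record):
    """Step 1: normalize structure/format (stringify, strip, lowercase)."""
    return {k: str(v).strip().lower() for k, v in record.items()}

def index_by_blocking_key(records, key_field="zip"):
    """Step 2: group records by a blocking key to limit pair comparisons."""
    blocks = {}
    for rec in records:
        blocks.setdefault(rec.get(key_field, ""), []).append(rec)
    return blocks

def compare(a, b, fields=("first", "last", "dob")):
    """Step 3: build a comparison vector of field-level agreements."""
    return [1.0 if a.get(f) == b.get(f) else 0.0 for f in fields]

def classify(vector, threshold=2.0):
    """Step 4: classify a pair from its comparison vector."""
    return "match" if sum(vector) >= threshold else "non-match"

def evaluate(decisions, gold):
    """Step 5: compare decisions with a gold standard; return accuracy."""
    hits = sum(1 for pair, d in decisions.items() if gold.get(pair) == d)
    return hits / len(decisions) if decisions else 0.0

records = [preprocess(r) for r in [
    {"id": 1, "first": "John",   "last": "Smith", "dob": "1-1-1990", "zip": "20001"},
    {"id": 4, "first": "Johnny", "last": "Smith", "dob": "1-1-1990", "zip": "20001"},
]]
for block in index_by_blocking_key(records).values():
    for a, b in combinations(block, 2):
        print(a["id"], b["id"], classify(compare(a, b)))  # -> 1 4 match
```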
Problem
• Patient data matching has been noted as one of the key barriers to achieving interoperability in the Nation's roadmap for health IT
• Patient matching causes issues for over 50% of health information managers¹
• The problem will grow as the volume of health data sharing increases
• Data quality issues make matching more complicated
• Lack of knowledge about patient matching algorithm performance and limited adoption of metrics
1) https://ehrintelligence.com/news/patient-matching-issues-hindering-50-of-him-professionals
Data Quality
• Data quality is key
» Garbage in, garbage out
• Data entry errors compound data matching complexity
» Various algorithmic solutions exist to address these, but none are perfect
• Types of errors (see the sketch after this list):
» Missing or incomplete values
» Inaccurate data
» Fat-finger errors
» Out-of-date information
» Transposed names
» Misspelled names
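As an illustration of how pre-processing can absorb some of these error types, here is a small Python sketch; the normalization rules and missing-value tokens are assumptions, not challenge requirements.

```python
# Minimal sketch of pre-processing aimed at the error types listed above
# (missing values, fat-finger typos, transposed names). Rules are illustrative.
import re

def normalize_name(name):
    """Trim, lowercase, and drop punctuation so trivial typos don't block a match."""
    return re.sub(r"[^a-z ]", "", (name or "").strip().lower())

def name_variants(first, last):
    """Emit both field orders to absorb transposed first/last names."""
    f, l = normalize_name(first), normalize_name(last)
    return {(f, l), (l, f)}

def is_missing(value):
    """Flag missing/incomplete values rather than comparing them as equal."""
    return value is None or str(value).strip() in {"", "N/A", "UNK"}

# A transposed entry still overlaps with the correct ordering:
print(name_variants("Smith", "John") & name_variants("John", "Smith"))
```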
Solution
"If you can't measure it, you can't improve it.“
-Peter Drucker
12
ONC’s Patient Matching Algorithm Challenge
The goals of this challenge are to:
1. Bring about greater transparency and data on the performance of existing patient matching algorithms,
2. Spur the adoption of performance metrics among patient data matching algorithm vendors, and
3. Positively impact other aspects of patient matching, such as deduplication and linking to clinical data.
Website: www.patientmatchingchallenge.gov
Eligibility Requirements
There is no age requirement for this challenge. All members of a team
must meet the eligibility requirements.
• Shall have registered to participate in the Challenge under the requirements established by ONC.
• Shall have complied with all the stated requirements of the
Challenge.
• Businesses must be incorporated in, and maintain a primary place of business in, the United States; individuals must be citizens or permanent residents of the United States.
• Shall not be an HHS employee.
Eligibility Requirements (cont’d)
• May not be a federal entity or federal employee acting within
the scope of their employment.
• Federal grantees may not use federal funds to develop
COMPETES Act challenge applications unless consistent with
the purpose of their grant award.
• Federal contractors may not use federal funds from a contract
to develop COMPETES Act challenge applications or to fund
efforts in support of a COMPETES Act challenge submission.
• Participants must also agree to indemnify the Federal
Government against third party claims for damages arising
from or related to Challenge activities.
Challenge Process
• Register your team
• Contestants will unlock a test data set provided by ONC on which to run algorithms
• Run your algorithm
• Submit results for evaluation; they will be scored against an "answer key"
• Receive performance scores and appear on a Challenge leaderboard
• Repeat submissions until you are satisfied with the result, have hit 100 submissions, or the end date has passed
Challenge Process
Synthetic Data Set → Download Data in CSV File → Submit Linked Data → Scoring Server (Gold Standard) → Returns a Score → Submit Results to Leader Board
Registration
• Visit the challenge website and fill in all required fields of the
registration form
• Create a username and password (1 account per team)
• Enter a team name which will be used on the leader board
» Can be used to keep team identities private
• Acknowledge and agree to all terms and rules of the Challenge
Challenge Dataset
• Dataset synthetically generated by Just Associates using a proprietary software algorithm
• Based on real-world data in an MPI, with actual data discrepancies reflected in each field
• Known potential duplicate pairs mimic real-world scenarios
• Does not contain PHI
• Approximately 1M patient records
• Available early June; an email will be sent when the data set is made available
Challenge Dataset
• Fields include:
» Enterprise ID, LAST NAME, FIRST NAME, MIDDLE NAME, SUFFIX, DOB, GENDER, SSN, ADDRESS1, ADDRESS2, CITY, STATE, ZIP, PHONE, PHONE2, EMAIL, ALIAS, MOTHERS_MAIDEN_NAME, MRN (most SSNs are within the 800 range)
» Data format:
– CSV
– Also available as a FHIR bundle
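A minimal sketch of loading the challenge CSV with Python's standard library follows; the exact header spellings in the released file are assumptions based on the field list above.

```python
# Hedged sketch of loading the challenge CSV. Header names follow the
# field list above; the released file's exact casing/spelling may differ.
import csv

def load_records(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# records = load_records("challenge_dataset.csv")          # assumed file name
# print(records[0]["Enterprise ID"], records[0]["LAST NAME"])  # assumed columns
```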
Challenge Dataset
Submitted records:
1 | John Smith    | 1-1-1990  | 202-223-9910 | Washington, DC
2 | Carol Jones   | 2-4-1973  | 230-298-0001 | Bethesda, MD
3 | Bobby Johnson | 3-09-1955 | 340-345-9234 | Arlington, VA
4 | Johnny Smith  | 1-1-1990  | 202-223-9910 | Washington, DC

Scoring Server groups the linked records:
1 | John Smith    | 1-1-1990  | 202-223-9910 | Washington, DC
4 | Johnny Smith  | 10-1-1990 | 202-223-9910 | Washington, DC
2 | Carol Jones   | 2-4-1973  | 230-298-0001 | Bethesda, MD
3 | Bobby Johnson | 3-09-1955 | 340-345-9234 | Arlington, VA
Submission Process
• One dataset will be provided to all participants
• Participants will submit their matches to the ONC scoring server
• The answer key, separate from the dataset provided to participants, will be
used to score submissions.
• Submission data format (a sketch of producing this format follows):
» CSV, e.g.:
– Enterprise ID 1, Enterprise ID 4, 0.90
» Optionally a FHIR bundle
• Submit:
» Enterprise ID, Enterprise ID linked to
» Optionally, a confidence score for probabilistic algorithms
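Here is a minimal Python sketch of writing a submission in the comma-separated layout shown above; the file name and the pairs themselves are illustrative.

```python
# Sketch of producing a submission file in the format shown above:
# one row per linked pair, with an optional confidence score.
import csv

matched_pairs = [
    ("1", "4", 0.90),  # Enterprise ID, Enterprise ID linked to, confidence
]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for left_id, right_id, score in matched_pairs:
        writer.writerow([left_id, right_id, f"{score:.2f}"])
```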
Project Submissions
• Teams will submit the results (matched records) of their algorithm tests
» The submission period* will be open for 3 months
» Up to 100 submissions per individual/team
» Submissions may be made at any time during the "submission period"
» The Challenge will open on June 12th at 12:00 p.m. EST
» Submissions will be allowed until 11:59 p.m. on the last day of the submission period
*Submission period dates have not been determined. Once the test data set is available, we will add these dates to the challenge website.
Project Submissions
• Calculation:
» Precision
» Recall
• Tradeoffs exist between precision and recall
» F-Score is the harmonic mean of precision and recall:
F-Score = 2 × (Precision × Recall) / (Precision + Recall)
Returned Results
• Participants will receive:
» F-Score
» Precision
» Recall
» Run ID
• Month one will include a beta period
» Newly found matches will be manually reviewed to determine match status
» Previous submissions will be rescored with the updated answer key, and leaderboards updated
» After the beta period, all future submissions will be scored against the updated answer key only
Leader Board Example
Winners and Prizes
The Total Prize Purse for this challenge is $75,000
Judging will be based upon the empirical evaluation of the performance
of the algorithms.
Highest F-Score
1st- $25,000
2nd- $20,000
3rd- $15,000
Best in Category ($5,000 per category):
• Precision
• Recall
• Best first F-Score run
Best in Category
• Best F-Score
» 1st Place
» 2nd Place
» 3rd Place
• Best 1st Run F-Score: awarded to the contestant/team whose first submission to the scoring server results in the highest F-Score.
• Precision: best precision with recall >= 90%
• Recall: best recall with precision >= 90%
Metrics for Algorithm Performance
Patient Matching Goal
The ideal outcome of any matching exercise is correctly answering this one question hundreds or thousands of times: Are these two things the same?
» Correctly identifying all the true positives and true negatives while minimizing the number of errors: false positives and false negatives
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
Patient Matching Terminology
• True Positive- The two records represent the
same patient
• True Negative- The two records don't represent
the same patient
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
Patient Matching Terminology
• False Negative: The algorithm misses a record that
should be matched
• False Positive: The algorithm links two records that don't actually match
Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st HIMSS 2016 Las Vegas, NV
Evaluation
EHR A    | EHR B    | Truth (Gold Standard) | Algorithm | Match Type
Jonathan | Jonathan | Match                 | Match     | True Positive (good)
Jonathan | Sally    | Non-Match             | Non-Match | True Negative (good)
Jonathan | Sally    | Non-Match             | Match     | False Positive (bad)
Jonathan | Jon      | Match                 | Non-Match | False Negative (bad)
Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
Evaluation

                      Truth: Positive    Truth: Negative
Algorithm: Positive   True Positive      False Positive
Algorithm: Negative   False Negative     True Negative

Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)

Source: Culbertson, A. Patient Matching A-Z, Wednesday, March 2nd, HIMSS 2016, Las Vegas, NV
Evaluation
• Calculation:
» Precision
» Recall
• Tradeoffs exist between precision and recall
» F-Score is the harmonic mean of precision and recall (a sketch of the calculation follows):
F-Score = 2 × (Precision × Recall) / (Precision + Recall)
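As a short Python sketch of how these three scores follow from the definitions above, treat the answer key and a submission as sets of linked ID pairs; the pair data below is illustrative.

```python
# Computing precision, recall, and F-Score from sets of linked ID pairs.
def metrics(submitted, answer_key):
    tp = len(submitted & answer_key)   # true positives
    fp = len(submitted - answer_key)   # false positives
    fn = len(answer_key - submitted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)  # harmonic mean
    return precision, recall, f_score

# Pairs stored order-independently via frozenset:
submitted = {frozenset({"1", "4"}), frozenset({"2", "3"})}
answer_key = {frozenset({"1", "4"})}
print(metrics(submitted, answer_key))  # (0.5, 1.0, 0.666...)
```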
Creating Test Data Sets
Development of Test Data Set
Patient Database → Select Potential Matches (aka Adjudication Pool) → Manual Reviewers 1, 2, and 3 → Human-Reviewed Match Decisions (Answer Key == Ground Truth Data Set) → Compare Algorithm Against Test Data Set
Source: Culbertson, A. & Miller, K., Patient Matching EHR Ailments: Going from Placebo to Cure, Tuesday, March 1st HIMSS 2016 Las Vegas, NV
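One common way to combine the three manual reviewers' decisions into an answer key is a majority vote. The sketch below illustrates that idea with hypothetical data; the challenge's actual adjudication rules are not specified here.

```python
# Illustrative majority vote over three reviewers' decisions for one pair.
from collections import Counter

def adjudicate(reviews):
    """reviews: list of 'match' / 'non-match' decisions for a record pair."""
    decision, _votes = Counter(reviews).most_common(1)[0]
    return decision

print(adjudicate(["match", "match", "non-match"]))  # -> match
```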
Development of Ground Truth Sets
• Identify a data set that reflects a real-world use case
• Develop potential duplicates
• Human adjudication review and classification
» Match or Non-Match
• Estimate truth
» Pooled methods using multiple matching methods
Issues In Establishing Ground Truth
• The first step in evaluation is to determine why the evaluation is being conducted
• Different truth for different applications
» Security applications vs. patient health records
• What is the cost of missing a match?
» Security: lives are lost
» Health: patient safety events, missed medications, allergies, etc., even death. But this is the situation today.
• What is the cost of wrongly identifying a match?
» Security: passenger is inconvenienced / delayed
» Health: patient safety events, wrong medication or treatment, liability, death
• Criteria for truth must be carefully established and well understood
» E.g., the question posed to annotators must be carefully phrased
Issues In Establishing Ground Truth (cont’d)
• Different truth for different applications
» Credit check
» Security applications
» Customer support
» De-duplication of mailing lists
• What is the cost of missing a match?
» New record entered into database
» Irritated customer
» Lives are lost
• Criteria for truth must be carefully established and well-understood by
annotators
» Question posed to annotators must be carefully phrased
Issues In Establishing Ground Truth (cont’d)
• How much time / expertise is available to judge (or discount) false positives?
• Needs to reflect the real-world use case
• Evaluation results are only as good as the truth on which they are based
» And only as appropriate as the evaluation is to the task that will be performed with the operational system
• Absolute recall is impossible to measure without a completely known test set (i.e., "You don't know what you're missing.")
» Estimate with pooled results (see the sketch below)
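The pooling idea can be sketched as taking the union of several matchers' adjudicated outputs as a stand-in for the unknown full set of true matches; all data below is illustrative.

```python
# Estimating recall against a pooled stand-in for the unknown truth.
matcher_outputs = [
    {frozenset({"1", "4"}), frozenset({"2", "7"})},  # e.g., deterministic matcher
    {frozenset({"1", "4"}), frozenset({"3", "9"})},  # e.g., probabilistic matcher
]
pooled_truth = set().union(*matcher_outputs)         # adjudicated pool approximates truth

my_matches = {frozenset({"1", "4"})}
est_recall = len(my_matches & pooled_truth) / len(pooled_truth)
print(f"estimated recall: {est_recall:.2f}")         # 0.33
```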
Examples
B Smith → Bill Smythe → William Smythe → W Smith ??
DOB: 10/12/1972 → October 11, 1972 → December 10, 1972 → 12/10/72 → October 12, 1927
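For illustration, the name variants above can be scored with the Python standard library's difflib; production matchers typically use purpose-built string comparators (e.g., Jaro-Winkler), so this is only a sketch.

```python
# Scoring adjacent name variants from the example chain with difflib.
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

chain = ["B Smith", "Bill Smythe", "William Smythe", "W Smith"]
for a, b in zip(chain, chain[1:]):
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```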
Get Involved
• Webinars on how to participate and challenge overview
» May 24th
• Kicking off Patient Data Matching Algorithm Challenge in June
• Participant Discussion Board
• Website: www.patientmatchingchallenge.com
Acknowledgments
Thank you to the following individuals and organizations for their involvement in the planning and development of this challenge:
» Debbie Bucci and the ONC Team
» HIMSS North America, Tom Leary, HIMSS
» Greg Downing, HHS Idea Lab, ONC
» Jerry and Beth Just and the Just Associates Team
» Keith Miller and Andy Gregorowicz, MITRE
» Caitlin Ryan, IRIS Health Solutions
» Capital Consulting Corporation Team
Additional Questions
For additional questions or information, contact:
Adam Culbertson, [email protected]
Debbie Bucci, [email protected] (preferred)
Phone: 202-690-0213
Thank you for your interest!
The ONC Team
@ONC_HealthIT
@HHSONC