Incorporating Single-Site and Network-based Data Quality Assessment in the SAFTINet Distributed Research Network

Lisa Schilling, MD, MSPH
Department of Medicine
University of Colorado School of Medicine
Colorado Health Outcomes Program

Addressing Variations in Data Quality to Facilitate Multi-Institutional Comparative Effectiveness Research
AcademyHealth 2013 Annual Research Meeting
Monday, June 24, 2013, 11:30A-1:00P

Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network)

Acknowledging Collaborations
OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

Partners and Collaborators
• University of Colorado School of Medicine
• American Academy of Family Physicians (AAFP)
• Ohio State University (OSU) Department of Biomedical Informatics
• Colorado Community Managed Care Network and the Colorado Associated Community Health Information Enterprise (CCMCN/CACHIE)
• Salud Family Health Centers
• Metro Community Provider Network (MCPN)
• Denver Health and Hospital Authority (DHHA)
• Cherokee Health Systems (CHS)
• Colorado Department of Health Care Policy & Financing (HCPF)
• QED Clinical d/b/a CINA
• Observational Medical Outcomes Partnership (OMOP)
• Recombinant Data Corporation, Inc

SAFTINet DRN Architecture

Partner Network

Why ROSITA?
• ROSITA: Reusable OMOP and SAFTINet Interface Adaptor
• ROSITA: The only bilingual Muppet
• Converts EHR data into a research limited data set
1. Replaces local codes with standardized codes
2. Replaces direct identifiers with random identifiers
3. Supports clear-text and encrypted record linkage
4. Provides data quality metrics
5. Pushes data sets to grid node for distributed queries

ROSITA: From EHR to CER data
(Diagram labels: "Or flat files", "Concept mapping")

Profiling Data
• Profiling and mapping EHR or surrogate EHR data is huge!
– Difficult to validate
– Continuous responsibility
– Low cost, low burden
• Three large data “sources” to assess:
– The “raw” source data from ETL (input)
– The post-processed data in OMOP CDM V4 (ROSITA output)
– Cross-grid comparisons (SAFTINet query portal)

Single site data quality assessment using ROSITA
• ROSITA reporting system based on JasperServer Community Edition
• LZ (landing zone) = Data with site-specific values & coding schema
• OMOP = Data transformed into CDM V4 format and OMOP concept IDs
Three types of DQ profiling reports:
1. LZ (original data)
2. OMOP (transformed data)
3. LZ-OMOP comparison (what changed during transformation?)

Data profiling using OMOP OSCAR
• Rules-based data profiling system originally created by OMOP investigators
• Calculates different DQ stats based on data type
• DQ results are stored in a dedicated table in ROSITA
• DQ results are pushed to the grid node for multisite queries

SAFTINet OSCAR rules by variable type
Statistic Type
1 – Count (of records)
2 – Mean
3 – Standard Deviation
4 – Minimum
5 – 25th Percentile
6 – Median
7 – 75th Percentile
8 – Maximum
9 – Number of NULL Values
10 – Number of Empty String Values *
11 – Count (distinct)
Variable Type columns: Continuous (Numeric, Date, ID) and Categorical
* This will return the count of empty string values when the underlying column being analyzed is of type VARCHAR. Otherwise it will return 0.
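As an illustration of the statistic types listed above, here is a minimal Python sketch of how such a profile might be computed for one column. It is not the OSCAR or ROSITA implementation. The function name `profile_column`, the `numeric` flag, and the constant names are hypothetical. It also assumes that the record count includes NULL rows, that the distinct count ignores NULLs (as SQL `COUNT(DISTINCT ...)` does), that the standard deviation is the sample form, and that percentiles follow the default method of Python's `statistics.quantiles`. OSCAR may differ on any of these points.

```python
from statistics import mean, median, quantiles, stdev

# Statistic type codes 1-11, numbered as in the table above.
(COUNT, MEAN, STD_DEV, MINIMUM, P25, MEDIAN,
 P75, MAXIMUM, N_NULL, N_EMPTY, N_DISTINCT) = range(1, 12)


def profile_column(values, numeric=False):
    """Return {statistic_type: statistic_value} for one column.

    `values` is a list in which None stands for a SQL NULL.
    Set `numeric=True` to add the distribution statistics (types 2-8).
    """
    present = [v for v in values if v is not None]
    stats = {
        COUNT: len(values),                  # assumption: NULL rows are counted
        N_NULL: len(values) - len(present),
        # Only string values can be empty strings; other types contribute 0.
        N_EMPTY: sum(1 for v in present if isinstance(v, str) and v == ""),
        N_DISTINCT: len(set(present)),       # assumption: NULLs are ignored
    }
    # stdev and quantiles need at least two non-NULL values.
    if numeric and len(present) >= 2:
        q1, _, q3 = quantiles(present, n=4)  # assumption: default method
        stats.update({
            MEAN: mean(present),
            STD_DEV: stdev(present),         # assumption: sample std. deviation
            MINIMUM: min(present),
            P25: q1,
            MEDIAN: median(present),
            P75: q3,
            MAXIMUM: max(present),
        })
    return stats


# An ID-like column: only statistic types 1, 9, 10 and 11 are produced.
print(profile_column([101, 102, 103, None]))  # {1: 4, 9: 1, 10: 0, 11: 3}

# A numeric column additionally gets statistic types 2-8.
print(profile_column([1, 2, 3, 4, 5], numeric=True))
```

Returning a dictionary keyed by statistic type keeps each entry close to one row of a results table with `statistic_type` and `statistic_value` columns.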
Example – Continuous ID
The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the visit_occurrence table:

Column Name          Value
source_schema_name   omop
source_table_name    visit_occurrence
variable_name        visit_occurrence_id
variable_type        1 – Continuous
data_type            3 – ID
variable_formula     visit_occurrence_id

And the following results would be generated for this rule:

source_schema_name  source_table_name  variable_name        variable_type  variable_value  statistic_type  statistic_value
omop                visit_occurrence   visit_occurrence_id  1              NULL            1               564
omop                visit_occurrence   visit_occurrence_id  1              NULL            9               0
omop                visit_occurrence   visit_occurrence_id  1              NULL            10              0
omop                visit_occurrence   visit_occurrence_id  1              NULL            11              564

Variable type: 1 = categorical, 2 = continuous
Stat type: 1 = count by record, 9 = # of null, 10 = # of empty string, 11 = count distinct

OSCAR Results in ROSITA

Jasper Reports
• JREPORT001 Landing Zone Key Summary Statistics:
– Summary statistics for Landing Zone database; includes, for select fields:
– Number of records in every table
– Summary statistics (mean, minimum/maximum, number of missing) on numeric fields
– Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses)
• JREPORT002 Random records for manual chart validation:
– Selection of records for chart review: randomly selects 25 visit occurrence records (and all associated records from person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database.
• JREPORT003 OMOP Database Key Summary Statistics:
– Summary statistics for OMOP database
– Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx

ROSITA DQ Reporting

Care site statistics by visits / by patients (LZ)

Care site statistics by visits / by patients (OMOP)

Drug Exposure statistics (LZ)

Network-based DQ comparison queries (Future work)
1. Implement DQ query in DCQL
2. Submit query to grid
3. Returns DQ statistics across all nodes

Conclusions
• A rules-based data quality assessment infrastructure enables standardized DQ measures across all data partners
• Three levels of DQ assessment:
– LandingZone (raw)
– OMOP (transformed)
– Grid (multi-site)
• Exploring graphical visualizations (in R) as rapid screening

DQ Visualization: Thousands of data values in one glance...

Thank you! Questions?
[email protected]