The Conditional Independence Assumption in Probabilistic Record Linkage Methods

Stephen Sharp
National Records of Scotland
Ladywell Road
Edinburgh EH12 7TF
[email protected]

The record linkage problem
• Given two files A and B, the aim is to find record pairs which refer to the same person.
• This is done on the basis of linking fields common to the two files, such as first name, last name, date of birth and postcode.
• With four linking fields, the data matrix therefore looks like this:

Source of record   Linking field 1   Linking field 2   Linking field 3   Linking field 4
File A             A1                A2                A3                A4
File B             B1                B2                B3                B4

What is the assumption of conditional independence?
• The likelihood that the two records refer to the same person is measured by a log likelihood ratio, where ABi denotes the outcome of comparing Ai with Bi:

$$w = \log \frac{P(AB_1, AB_2, AB_3, AB_4 \mid \text{match})}{P(AB_1, AB_2, AB_3, AB_4 \mid \text{nonmatch})}$$

• This is much easier to work out if the observations are independent conditional on match status, because the joint probabilities then factorise and the ratio becomes a sum over fields:

$$w = \sum_i \log \frac{P(AB_i \mid \text{match})}{P(AB_i \mid \text{nonmatch})}$$

Why is the assumption of conditional independence important?
• It keeps the number of parameters manageable: linear rather than exponential in the number of linking fields
• It enables the use of frequency-based agreement weights
• It speeds up computing time
• It improves the stability of parameter estimation
• But it is almost always wrong, e.g. gender is almost wholly predictable from first name
• But does it matter?

Who adopts the conditional independence assumption?
• Rec Link (US Census Bureau) – yes
• Link Plus (US Centers for Disease Control and Prevention) – yes
• GRLS/Fundy (Statistics Canada) – yes
• ORLS – yes (probably)
• RELAIS (Italian Statistical Institute) – no

Two questions
• To what extent is the assumption violated in real data sets?
• How much effect does it have on the output of linkage software?

What does the assumption look like in practice?
A = Agree, D = Disagree; M = Match, N = Non-match

Linkage score   Field 1   Field 2   Field 3   Field 4   Match status
High            A         A         A         A         M
High            A         A         A         A         M
High            A         A         A         A         M
High            A         A         A                   M
……
Medium          A         D         A                   M
Medium          D         A
Medium          A         D         A                   N
Medium          D         A         A                   M
……
Low             D         D         D         A         N
Low             D         D         A         D         N
Low             A         D         D         D         N
Low             D         D         D
Low             D         A         D
……

Calculating the correlations between linkage fields
• Run 1 – Rec Link – a 10% sample of the 2001 Scottish Census and the 2001 Census Coverage Survey (CCS) – one blocking field and seven linkage fields
• Run 2 – Link Plus – a sample of the Scottish NHSCR database and HESA records of Scottish students studying in England or Wales

Run 1 - tetrachoric correlations for matches in the Census/CCS data – medium linkage scores only

Matches (N < 1707)
                  first name  last name  house no  dob year  dob mon  dob day  post code  gender
first name              1.00      -0.11      0.09     -0.33    -0.46    -0.40      -0.17   -0.12
last name              -0.11       1.00      0.01     -0.27    -0.36    -0.40      -0.29   -0.08
house number            0.09       0.01      1.00     -0.43    -0.32    -0.41      -0.03   -0.13
year of birth          -0.33      -0.27     -0.43      1.00     0.17     0.24      -0.12   -0.02
month of birth         -0.46      -0.36     -0.32      0.17     1.00     0.47      -0.07   -0.04
day of birth           -0.40      -0.40     -0.41      0.24     0.47     1.00      -0.13   -0.01
post code              -0.17      -0.29     -0.03     -0.12    -0.07    -0.13       1.00   -0.05
gender                 -0.12      -0.08     -0.13     -0.02    -0.04    -0.01      -0.05    1.00

Run 1 - tetrachoric correlations for non-matches in the Census/CCS data – medium linkage scores only

Non-matches (N < 303)
                  first name  last name  house no  dob year  dob mon  dob day  post code  gender
first name              1.00       0.19     -0.34     -0.15     0.13    -0.25      -0.70    0.22
last name               0.19       1.00     -0.14     -0.32    -0.10    -0.46      -0.46   -0.11
house number           -0.34      -0.14      1.00      0.03    -0.40     0.00       0.22   -0.18
year of birth          -0.15      -0.32      0.03      1.00    -0.03     0.22       0.02   -0.13
month of birth          0.13      -0.10     -0.40     -0.03     1.00    -0.03      -0.33   -0.08
day of birth           -0.25      -0.46      0.00      0.22    -0.03     1.00       0.17   -0.16
post code              -0.70      -0.46      0.22      0.02    -0.33     0.17       1.00   -0.16
gender                  0.22      -0.11     -0.18     -0.13    -0.08    -0.16      -0.16    1.00

Run 2 - tetrachoric correlations for matches in the
NHSCR/HESA data – medium linkage scores only

Matches (N < 450)
                  first name  last name  birth date  post code  gender
first name              1.00      -0.07       -0.07      -0.65    0.10
last name              -0.07       1.00       -0.03      -0.03   -0.01
date of birth          -0.07      -0.03        1.00      -0.26   -0.01
post code              -0.65      -0.03       -0.26       1.00   -0.13
gender                  0.10      -0.01       -0.01      -0.13    1.00

Run 2 - tetrachoric correlations for non-matches in the NHSCR/HESA data – medium linkage scores only

Non-matches (N < 131)
                  first name  last name  birth date  post code  gender
first name              1.00       0.01       -0.66      -0.15   -0.54
last name               0.01       1.00        0.24      -0.44   -0.06
date of birth          -0.66       0.24        1.00      -0.49    0.19
post code              -0.15      -0.44       -0.49       1.00   -0.07
gender                 -0.54      -0.06        0.19      -0.07    1.00

So the assumption of independence is significantly violated. Does it matter?
• Runs 3, 4 and 5 all use the Census/CCS data and Link Plus, but with different treatments of the date of birth
• Run 3 – a rule specific to the date format, treating the date as one field (so not assuming independence) but with “intelligence”
• Run 4 – day, month and year treated as three separate fields (and therefore as independent)
• Run 5 – day, month and year concatenated and treated as one field (so not assuming independence) but with no “intelligence”

Is run 4 worse than runs 3 and 5?

Fig 1: Census/CCS data with three date treatments (true links plotted against false links; series: Run 3 - one component, date specific rule; Run 4 - three components, exact rule; Run 5 - one component, exact rule)

Run 6 – the Clackmannanshire data

Fig 2: Clackmannanshire Census/CCS data (true links plotted against false links; series: Rec Link, Link Plus, Relais)

Conclusions
• This is work in progress, with limited amounts of data currently available
• No evidence that the assumption of conditional independence has negative effects on output quality
• Future intentions include bringing in more packages, such as RELAIS v2.2, and a wider variety of data sets where training data are available
• For the moment, any views on the methods used and/or the findings so far?
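
As a supplementary illustration of the linkage score defined at the start of the talk, the following is a minimal Python sketch of how a score could be computed once conditional independence is assumed. It treats each field comparison as a simple agree/disagree outcome, as in the A/D table. The per-field probabilities and the names `M_PROBS`, `U_PROBS`, `field_weight`, `linkage_score` and `parameter_counts` are hypothetical and chosen for illustration only; they are not taken from Rec Link, Link Plus or any other package mentioned in the talk.

```python
import math

# Hypothetical illustrative values, not estimates from the Census/CCS or NHSCR/HESA data.
# M_PROBS[f] = P(field f agrees | match); U_PROBS[f] = P(field f agrees | nonmatch)
M_PROBS = {"first_name": 0.95, "last_name": 0.96, "date_of_birth": 0.98, "postcode": 0.85}
U_PROBS = {"first_name": 0.01, "last_name": 0.005, "date_of_birth": 0.001, "postcode": 0.002}


def field_weight(agrees, m, u):
    """Log likelihood ratio contributed by one field's agree/disagree outcome."""
    if agrees:
        return math.log(m / u)
    return math.log((1.0 - m) / (1.0 - u))


def linkage_score(pattern, m_probs=M_PROBS, u_probs=U_PROBS):
    """Sum the per-field weights; the sum is only valid under conditional independence."""
    return sum(field_weight(pattern[f], m_probs[f], u_probs[f]) for f in m_probs)


def parameter_counts(n_fields):
    """Free parameters per match class for binary outcomes:
    n with independence, 2**n - 1 for the unrestricted joint distribution."""
    return n_fields, 2 ** n_fields - 1


if __name__ == "__main__":
    pattern = {"first_name": True, "last_name": True, "date_of_birth": False, "postcode": True}
    print(linkage_score(pattern))
    print(parameter_counts(4))
```

The `parameter_counts` helper makes the "linear rather than exponential" point concrete: with binary agreement outcomes, the independence model needs one probability per field in each match class, whereas an unrestricted joint model needs one for every agreement pattern except the last.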