matches - Office for National Statistics

The Conditional Independence Assumption in
Probabilistic Record Linkage Methods
Stephen Sharp
National Records of Scotland
Ladywell Road
Edinburgh EH12 7TF
[email protected]
The record linkage problem
• Given two files A and B, the aim is to find record pairs
which refer to the same person.
• This is done on the basis of linking fields common to the
two files such as first name, last name, date of birth and
postcode
• With four linking fields, the data matrix therefore looks like

Source of record  Linking field 1  Linking field 2  Linking field 3  Linking field 4
File A            A1               A2               A3               A4
File B            B1               B2               B3               B4
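As a minimal sketch of how a record pair is reduced to field-by-field comparisons, the Python below builds an agreement pattern for one pair. The field names, the exact-equality rule and the two records are invented for illustration; real linkage packages typically offer richer comparison rules.

```python
def agreement_pattern(record_a, record_b, fields):
    """Return a tuple of booleans, True where the two records agree exactly."""
    return tuple(record_a[f] == record_b[f] for f in fields)

# Hypothetical linking fields and records (not taken from any real file).
fields = ["first_name", "last_name", "date_of_birth", "postcode"]
a = {"first_name": "ANNE", "last_name": "SMITH",
     "date_of_birth": "1970-03-14", "postcode": "ZZ1 1AA"}
b = {"first_name": "ANNE", "last_name": "SMITH",
     "date_of_birth": "1970-03-14", "postcode": "ZZ9 9ZZ"}

pattern = agreement_pattern(a, b, fields)
print(pattern)
```

Each element of the pattern plays the role of one AB_i comparison in the slides that follow.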
What is the assumption of conditional
independence?
• The likelihood that the two records refer to the same
person is measured by a log likelihood ratio
P AB1 AB2 AB3 AB4 | match
w  log
P AB1 AB2 AB3 AB4 | nonmatch
What is the assumption of conditional
independence?
• This is much easier to work out if the observations are
independent conditional on match status because now
P ABi | match
w   log
P ABi | nonmatch
i
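The sum of per-field log likelihood ratios can be sketched in a few lines of Python. This assumes binary agree/disagree comparisons. The names `m` (probability of agreement given a match) and `u` (probability of agreement given a non-match) follow a common convention but do not appear in the slides, and the probabilities used are made up for illustration.

```python
import math

def match_weight(pattern, m, u):
    """Sum per-field log likelihood ratios under conditional independence.

    pattern: iterable of booleans (True = fields agree)
    m[i]: assumed P(agree on field i | match)
    u[i]: assumed P(agree on field i | nonmatch)
    """
    w = 0.0
    for agrees, m_i, u_i in zip(pattern, m, u):
        if agrees:
            w += math.log(m_i / u_i)
        else:
            w += math.log((1 - m_i) / (1 - u_i))
    return w

# Illustrative (invented) probabilities for two linking fields.
m = [0.9, 0.8]
u = [0.1, 0.2]
print(match_weight((True, True), m, u))    # both fields agree
print(match_weight((False, False), m, u))  # both fields disagree
```

Agreement on a field adds a positive term and disagreement a negative one whenever m_i exceeds u_i, so the total weight rises with the number of agreeing fields.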
Why is the assumption of conditional
independence important?
• It keeps the numbers of parameters manageable – linear
rather than exponential relation to the number of linking
fields
• Enables the use of frequency based agreement weights
• Speeds up computing time
• Improves stability of parameter estimation
• But it is almost always wrong, e.g. gender is almost wholly
predictable from first name
• But does it matter?
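To make the "linear rather than exponential" point concrete, the sketch below counts the free probabilities needed for k linking fields, assuming each comparison is a binary agree/disagree outcome and ignoring the overall match proportion. The function names are illustrative only.

```python
def n_params_full(k):
    """Free parameters without independence: one probability per agreement
    pattern (2**k patterns, minus one because they sum to 1) for each of the
    two classes, match and nonmatch."""
    return 2 * (2 ** k - 1)

def n_params_independent(k):
    """Free parameters with conditional independence: one agreement
    probability per field for each of the two classes."""
    return 2 * k

for k in (4, 7):
    print(k, n_params_full(k), n_params_independent(k))
```

With four fields this gives 30 parameters against 8, and with seven fields 254 against 14.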
Who adopts the conditional independence assumption?
• Rec Link (US Census Bureau) – yes
• Link Plus (US Centers for Disease Control and
Prevention) – yes
• GRLS/Fundy (Statistics Canada) – yes
• ORLS – yes (probably)
• RELAIS (Italian Statistical Institute) – no
Two questions
• To what extent is the assumption violated in real data
sets?
• How much effect does it have on the output of
linkage software?
What does the assumption look like in practice?
A = Agree, D = Disagree; M = Match, N = Non-match

Linkage score  Fields 1 to 4  Match status
High           A A A A        M
High           A A A A        M
High           A A A A        M
High           A A A          M
……
Medium         A D A          M
Medium         D A
Medium         A D A          N
Medium         D A A          M
……
Low            D D D A        N
Low            D D A D        N
Low            A D D D        N
Low            D D D
Low            D A D
……
A N N D N
Calculating the correlations between linkage fields
• Run 1 – Rec Link – a 10% sample of the 2001
Scottish Census and the 2001 Census Coverage
Survey (CCS) – one blocking field and seven linkage fields
• Run 2 – Link Plus – a sample of the Scottish NHSCR
database and HESA records of Scottish students
studying in England or Wales
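The slides do not say how the tetrachoric correlations were computed. As a rough sketch, the Python below uses the well-known cosine-pi approximation from a 2×2 table of agreement counts for two fields. It is an approximation, not the full maximum likelihood estimate, and the counts in the example are invented.

```python
import math

def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    a = pairs agreeing on both fields
    b = pairs agreeing on field 1 only
    c = pairs agreeing on field 2 only
    d = pairs disagreeing on both fields
    """
    if b * c == 0:
        if a * d == 0:
            raise ValueError("correlation undefined for this table")
        return 1.0
    if a * d == 0:
        return -1.0
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))

# Invented counts: agreement on one field tends to go with
# disagreement on the other, giving a negative correlation.
print(tetrachoric_approx(10, 40, 40, 10))
```

A table in which the odds ratio equals 1 gives a correlation of zero, which is what conditional independence between two fields would imply.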
Run 1 - tetrachoric correlations for matches in the
Census/CCS data – medium linkage scores only
Matches, N < 1707

Field           first name  last name  house no  dob year  dob mon  dob day  post code  gender
first name            1.00      -0.11      0.09     -0.33    -0.46    -0.40      -0.17   -0.12
last name            -0.11       1.00      0.01     -0.27    -0.36    -0.40      -0.29   -0.08
house number          0.09       0.01      1.00     -0.43    -0.32    -0.41      -0.03   -0.13
year of birth        -0.33      -0.27     -0.43      1.00     0.17     0.24      -0.12   -0.02
month of birth       -0.46      -0.36     -0.32      0.17     1.00     0.47      -0.07   -0.04
day of birth         -0.40      -0.40     -0.41      0.24     0.47     1.00      -0.13   -0.01
post code            -0.17      -0.29     -0.03     -0.12    -0.07    -0.13       1.00   -0.05
gender               -0.12      -0.08     -0.13     -0.02    -0.04    -0.01      -0.05    1.00
Run 1 - tetrachoric correlations for non-matches in the
Census/CCS data – medium linkage scores only
Non-matches, N < 303

Field           first name  last name  house no  dob year  dob mon  dob day  post code  gender
first name            1.00       0.19     -0.34     -0.15     0.13    -0.25      -0.70    0.22
last name             0.19       1.00     -0.14     -0.32    -0.10    -0.46      -0.46   -0.11
house number         -0.34      -0.14      1.00      0.03    -0.40     0.00       0.22   -0.18
year of birth        -0.15      -0.32      0.03      1.00    -0.03     0.22       0.02   -0.13
month of birth        0.13      -0.10     -0.40     -0.03     1.00    -0.03      -0.33   -0.08
day of birth         -0.25      -0.46      0.00      0.22    -0.03     1.00       0.17   -0.16
post code            -0.70      -0.46      0.22      0.02    -0.33     0.17       1.00   -0.16
gender                0.22      -0.11     -0.18     -0.13    -0.08    -0.16      -0.16    1.00
Run 2 - tetrachoric correlations for matches in the
NHSCR/HESA data – medium linkage scores only
Matches, N < 450

Field           first name  last name  birth date  post code  gender
first name            1.00      -0.07       -0.07      -0.65    0.10
last name            -0.07       1.00       -0.03      -0.03   -0.01
date of birth        -0.07      -0.03        1.00      -0.26   -0.01
post code            -0.65      -0.03       -0.26       1.00   -0.13
gender                0.10      -0.01       -0.01      -0.13    1.00
Run 2 - tetrachoric correlations for non-matches in the
NHSCR/HESA data – medium linkage scores only
Non-matches, N < 131

Field           first name  last name  birth date  post code  gender
first name            1.00       0.01       -0.66      -0.15   -0.54
last name             0.01       1.00        0.24      -0.44   -0.06
date of birth        -0.66       0.24        1.00      -0.49    0.19
post code            -0.15      -0.44       -0.49       1.00   -0.07
gender               -0.54      -0.06        0.19      -0.07    1.00
So the assumption of independence is significantly
violated. Does it matter?
• Runs 3, 4 and 5 all use the Census/CCS data with
Link Plus but treat the date of birth differently
• Run 3 – specific to date format treating the date as
one field (so not assuming independence) but with
“intelligence”
• Run 4 – day, month and year treated as three
separate fields (and therefore as independent)
• Run 5 – day, month and year concatenated and
treated as one field (so not assuming independence)
but with no “intelligence”
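The contrast between Runs 4 and 5 can be sketched as follows, assuming an exact-match rule in both cases. The function names and dates are illustrative. The date-specific rule used in Run 3 is not described in the slides, so it is not sketched here.

```python
from datetime import date

def compare_components(a, b):
    """Run 4 style: day, month and year compared as three separate fields,
    each contributing its own (assumed independent) agreement outcome."""
    return (a.day == b.day, a.month == b.month, a.year == b.year)

def compare_concatenated(a, b):
    """Run 5 style: the whole date compared as a single field,
    agreeing only if every component is identical."""
    return a.isoformat() == b.isoformat()

# Invented dates that differ only in the day.
a = date(1970, 3, 14)
b = date(1970, 3, 15)
print(compare_components(a, b))
print(compare_concatenated(a, b))
```

With separate components a one-digit discrepancy in the day still earns agreement credit for month and year, whereas the concatenated field records a single disagreement.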
Is run 4 worse than runs 3 and 5?
Fig 1: Census/CCS data with three date treatments
[Plot of true links (vertical axis, 5500 to 7500) against false links (horizontal axis, 0 to 300) for three series: Run 3 – one component, date specific rule; Run 4 – three components, exact rule; Run 5 – one component, exact rule.]
Run 6 – the Clackmannanshire data
Fig 2: Clackmannanshire Census/CCS data
[Plot of true links (vertical axis, 625 to 655) against false links (horizontal axis, 0 to 10) for three packages: Rec Link, Link Plus and Relais.]
Conclusions
• Work in progress and limited amounts of data
currently available
• No evidence that the assumption of conditional
independence has negative effects on output quality
• Future intentions include bringing in more packages
such as RELAIS v2.2 and a wider variety of data sets
where training data are available
• For the moment, any views on the methods used
and/or findings so far?