NESUG 2009
Coders' Corner
Using SQL Joins to Perform Weighted Matches on Multiple Identifiers
Jedediah J. Teres, MDRC, New York, NY
ABSTRACT
Matching observations from different data sources is problematic without a reliable shared identifier. Using
multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create
a match score based on the number of matching identifiers. This score can be weighted to favor certain matches
or sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper will
describe a technique for using a match score in the context of a SQL join to find matches when not all
identifiers are exactly the same.
INTRODUCTION
Using a MERGE statement in a DATA step to combine data sets is a reliable and intuitive technique. However,
match-merges require a common variable on all data sets being combined and exact matches on this variable.
This paper explores methods of combining data sets using multiple variables as matching criteria in PROC SQL.
Knowledge of using a match-merge in a DATA step as well as the use of the SELECT statement in PROC SQL is
assumed.
SAMPLE DATA SETS
Two SAS data sets are used for illustration purposes in this paper: REF and CHK.
Data set REF:
Obs  FNAME    LNAME      DOB         SSN          ID
  1  George   Harison    02/25/1943  123-45-6789  98765
  2  Paul     McCartney  06/18/1942  234-56-7890  87654
  3  John     Lemon      10/09/1940  345-67-8901  76543
  4  Ringo    Starr      02/25/1943  456-78-9012  65432
  5  Pete     Townshend  05/19/1945  567-89-0123  54321
  6  Roger    Daltrey    03/01/1944  678-90-1234  43210
  7  John     Entwistle  10/09/1944  789-01-2345  32109
  8  Keith    Moon       08/23/1946  890-12-3456  21098
Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member.
Data set CHK:
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harrison   02/25/1948  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES
  3  John     Lennon     10/09/1940  345-67-8907  76543  THE BEATLES
  4  Richard  Starkey    02/25/1943  456-78-9012  65432  THE BEATLES
  5  Pete     Townsend   05/19/1945  567-89-0123  54321  THE WHO
  6  Roger    Daltery    03/01/1944  678-90-1234  43210  THE WHO
  7  Jon      Entwistle  09/10/1944  789-01-2346  32109  THE WHO
  8  Keith    Moon       08/23/1946  890-21-3456  21098  THE WHO
  9  Roger    Daltrey    04/30/1955  987-65-4231  71395
 10  Saul     McCartney  06/18/1942  234-56-8901  82651
Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual. The
data set CHK contains the same variables as the data set REF plus an additional variable BAND. The goal is to
merge the data set CHK to the data set REF in order to pick up the BAND variable for the people in the sample.
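To follow along, the two sample data sets could be created with DATA steps along the lines of the sketch below. The variable types and lengths are assumptions, since the paper does not state them: SSN and ID are stored here as character values, DOB as a SAS date, and BAND is left missing for the last two CHK observations, as in the listing.

```sas
/* Sketch only: variable types and lengths are assumed, not taken from the paper. */
data ref ;
  infile datalines truncover ;
  length fname $ 10 lname $ 12 dob 8 ssn $ 11 id $ 5 ;
  input fname lname dob :mmddyy10. ssn id ;
  format dob mmddyy10. ;
  datalines ;
George Harison 02/25/1943 123-45-6789 98765
Paul McCartney 06/18/1942 234-56-7890 87654
John Lemon 10/09/1940 345-67-8901 76543
Ringo Starr 02/25/1943 456-78-9012 65432
Pete Townshend 05/19/1945 567-89-0123 54321
Roger Daltrey 03/01/1944 678-90-1234 43210
John Entwistle 10/09/1944 789-01-2345 32109
Keith Moon 08/23/1946 890-12-3456 21098
;
run ;

data chk ;
  infile datalines truncover ;
  length fname $ 10 lname $ 12 dob 8 ssn $ 11 id $ 5 band $ 11 ;
  /* The & modifier lets BAND contain a single embedded blank. */
  input fname lname dob :mmddyy10. ssn id band & ;
  format dob mmddyy10. ;
  datalines ;
George Harrison 02/25/1948 123-45-6789 98765 THE BEATLES
Paul McCartney 06/13/1942 234-56-7890 87654 THE BEATLES
John Lennon 10/09/1940 345-67-8907 76543 THE BEATLES
Richard Starkey 02/25/1943 456-78-9012 65432 THE BEATLES
Pete Townsend 05/19/1945 567-89-0123 54321 THE WHO
Roger Daltery 03/01/1944 678-90-1234 43210 THE WHO
Jon Entwistle 09/10/1944 789-01-2346 32109 THE WHO
Keith Moon 08/23/1946 890-21-3456 21098 THE WHO
Roger Daltrey 04/30/1955 987-65-4231 71395
Saul McCartney 06/18/1942 234-56-8901 82651
;
run ;
```

TRUNCOVER is used so that the two CHK lines without a BAND value leave BAND missing instead of reading the next line.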
USING A MATCH-MERGE TO COMBINE DATA SETS
Perhaps the most straightforward method of combining two data sets is using a match-merge in a DATA step.
However, successfully combining data sets requires a common identifier between the two files. Here, a match-merge is used to combine data sets REF and CHK. SSN is specified as the common identifier.
proc sort data = ref ;
  by ssn ;
run ;
proc sort data = chk ;
  by ssn ;
run ;
data merged1 ;
  merge ref (in = ref)
        chk (in = chk) ;
  by ssn ;
  if ref and chk ;
run ;
The resulting data set contains only 5 observations. The original data set REF had 8, so 3 observations did not
have a match in CHK.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harrison   02/25/1948  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES
  3  Richard  Starkey    02/25/1943  456-78-9012  65432  THE BEATLES
  4  Pete     Townsend   05/19/1945  567-89-0123  54321  THE WHO
  5  Roger    Daltery    03/01/1944  678-90-1234  43210  THE WHO
A closer inspection of the data reveals that there are several SSNs that do not match exactly.
Data set REF:
FNAME    LNAME      SSN
George   Harison    123456789
Paul     McCartney  234567890
John     Lemon      345678901
Ringo    Starr      456789012
Pete     Townshend  567890123
Roger    Daltrey    678901234
John     Entwistle  789012345
Keith    Moon       890123456
Data set CHK:
FNAME    LNAME      SSN
George   Harrison   123456789
Paul     McCartney  234567890
John     Lennon     345678907
Richard  Starkey    456789012
Pete     Townsend   567890123
Roger    Daltery    678901234
Jon      Entwistle  789012346
Keith    Moon       890213456
Roger    Daltrey    987654231
Saul     McCartney  234568901
The inconsistencies between values of SSN between the two files are fairly minor and seem like plausible data
entry errors. With a data set this small, values could be hardcoded so they match on both files. In a larger data
set, that would not be a viable solution, so it's well worth considering what other options are available.
Since there are other identifiers available on the file, a combination of criteria for the match-merge would
intuitively result in a greater number of matched observations. A match-merge using SSN and LNAME is shown
below:
proc sort data = ref ;
  by ssn lname ;
run ;
proc sort data = chk ;
  by ssn lname ;
run ;
data merged2 ;
  merge ref (in = ref)
        chk (in = chk) ;
  by ssn lname ;
  if ref and chk ;
run ;
The resulting data set, however, has only 1 observation.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES
Why did the match-merge produce so few results? Upon further inspection, it becomes clear that only Paul
McCartney has the same SSN and spelling of his last name on both files.
Data set REF:
FNAME    LNAME      SSN
George   Harison    123456789
Paul     McCartney  234567890
John     Lemon      345678901
Ringo    Starr      456789012
Pete     Townshend  567890123
Roger    Daltrey    678901234
John     Entwistle  789012345
Keith    Moon       890123456
Data set CHK:
FNAME    LNAME      SSN
George   Harrison   123456789
Paul     McCartney  234567890
John     Lennon     345678907
Richard  Starkey    456789012
Pete     Townsend   567890123
Roger    Daltery    678901234
Jon      Entwistle  789012346
Keith    Moon       890213456
Roger    Daltrey    987654231
Saul     McCartney  234568901
Using multiple criteria in the match-merge restricted the pool of matches, rather than expanding it. Therefore, if
the goal is to use multiple identifiers to combine these data sets, the DATA step is not sufficient.
USING AN SQL JOIN TO COMBINE DATA SETS
Using a JOIN in SQL offers more flexibility when combining data sets.
The basic syntax for combining data sets using a join in PROC SQL follows.
proc sql ;
create table inner_join1 as
select ref.*, band
from ref inner join chk
on (ref.ssn eq chk.ssn) ;
quit ;
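The inner join returns the same five sample members as the match-merge on SSN, where only observations in both
REF and CHK were kept. Because the query selects ref.*, the identifier values shown come from REF; in the
match-merge they came from CHK, the last data set named in the MERGE statement.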
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  3  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  4  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  5  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO
SSN and LNAME can both be used as criteria for the inner join. Unlike the match-merge, which required that both
SSN and LNAME match, the join can specify that either SSN or LNAME match.
proc sql ;
create table inner_join2 as
select ref.*, band
from ref inner join chk
on (ref.ssn eq chk.ssn) or (ref.lname eq chk.lname) ;
quit ;
The join produces a larger data set. There are still some problems, however. There are now duplicate records for
Paul McCartney and Roger Daltrey, because they were also matched to Saul McCartney and a different Roger
Daltrey in the data set CHK. Also, John Lennon (John Lemon in REF) is not included in the join, because neither
his SSN nor his last name matches exactly.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  3  Paul     McCartney  06/18/1942  234-56-7890  87654
  4  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  5  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  6  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO
  7  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO
  8  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210
Since using two identifiers did not produce a match for everyone in the sample, what would happen if all
identifiers were used as criteria in the join? The code below allows for matches on any of the identifiers in the data
sets.
proc sql ;
create table inner_join3 as
select ref.*, band
from ref inner join chk
on ((ref.ssn eq chk.ssn) or (ref.id eq chk.id) or
(ref.fname eq chk.fname) or (ref.lname eq chk.lname) or
(ref.dob eq chk.dob)) ;
quit ;
This query creates the following data set:
Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  3  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  4  Paul     McCartney  06/18/1942  234-56-7890  87654
  5  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES
  6  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  7  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  8  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210
 10  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES
 11  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO
 12  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO
Now everyone in REF is accounted for, but at the cost of having four sets of duplicate records on the file. Clearly,
a method of refining the matching process is necessary.
CREATING A MATCH SCORE
A match score is a variable created during the join. In this case, it is equal to the number of matching
identifiers. Because SAS evaluates a true comparison as 1 and a false comparison as 0, adding the comparisons
together counts the matches. Instead of requiring every identifier to be equal, it is then possible to specify a
tolerable range of error (e.g., matching on three out of five identifiers). Here, a match score, MS, is created that
is equal to the sum of all matching identifiers.
proc sql ;
create table inner_join4 as
select ref.*, band,
((ref.ssn eq chk.ssn) + (ref.id eq chk.id) +
(ref.fname eq chk.fname) + (ref.lname eq chk.lname) +
(ref.dob eq chk.dob)) as ms
from ref inner join chk
on ((ref.ssn eq chk.ssn) or (ref.id eq chk.id) or
(ref.fname eq chk.fname) or (ref.lname eq chk.lname) or
(ref.dob eq chk.dob)) ;
quit ;
The match score is shown below.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND         MS
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES   3
  2  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES   1
  3  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES   4
  4  Paul     McCartney  06/18/1942  234-56-7890  87654                2
  5  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES   3
  6  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES   3
  7  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO       4
  8  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO       4
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210                2
 10  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES   1
 11  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO       2
 12  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO       4
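As a minimal illustration of how the comparisons add up, the DATA step below hard-codes the identifier values for Paul McCartney (REF) and Saul McCartney (CHK) as character literals and writes the resulting score to the log. Treating every identifier as a character literal is a simplification for demonstration only.

```sas
/* Sketch: values are hard-coded literals for one REF/CHK pair.        */
/* Each comparison evaluates to 1 when true and 0 when false.          */
data _null_ ;
  ms = ('234-56-7890' eq '234-56-8901')   /* SSN   */
     + ('87654'       eq '82651')         /* ID    */
     + ('Paul'        eq 'Saul')          /* FNAME */
     + ('McCartney'   eq 'McCartney')     /* LNAME */
     + ('06/18/1942'  eq '06/18/1942') ;  /* DOB   */
  put ms= ;
run ;
```

Only LNAME and DOB agree, so the log shows ms=2, the score listed for that pair above.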
Examining the frequency distribution of the match score can give some insight into where the appropriate cutoff should be.
                            Cumulative  Cumulative
MS   Frequency   Percent    Frequency   Percent
 1       2        16.67         2         16.67
 2       3        25.00         5         41.67
 3       3        25.00         8         66.67
 4       4        33.33        12        100.00
In this case, it's unclear exactly where to draw the line. Matches with a score of 1 are clearly not valid, but at least
one match with a score of 2 is correct.
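A frequency table like the one above can be produced with PROC FREQ. The sketch below assumes the joined table is named INNER_JOIN4, as in the preceding query.

```sas
/* One-way frequency of the match score created in the join. */
proc freq data = inner_join4 ;
  tables ms ;
run ;
```

The same step, pointed at a different table and score variable, works for any of the later queries.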
CREATING A WEIGHTED MATCH SCORE
In practice, some identifiers are more important than others. It is much more likely that two people share a birth
date than a Social Security number. Accordingly, it is possible, and perhaps preferable, to weight some logical
comparisons more heavily than others when calculating a match score. The syntax for creating a weighted match
score follows.
proc sql ;
create table inner_join5 as
select ref.*, band, ((2*(ref.ssn eq chk.ssn)) +
(2*(ref.id eq chk.id)) + (ref.fname eq chk.fname) +
(ref.lname eq chk.lname) + (ref.dob eq chk.dob)) as wms
from ref inner join chk
on ((ref.ssn eq chk.ssn) or (ref.id eq chk.id) or
(ref.fname eq chk.fname) or (ref.lname eq chk.lname) or
(ref.dob eq chk.dob))
order by ref.lname, ref.fname, wms ;
quit ;
Here the match score is weighted so that it favors matches on SSN and ID, a secondary numeric identifier.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210                 2
  2  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO        6
  3  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES    1
  4  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO        3
  5  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    1
  6  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    5
  7  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES    4
  8  Paul     McCartney  06/18/1942  234-56-7890  87654                 2
  9  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES    6
 10  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO        5
 11  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES    5
 12  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO        6
Examining the frequency of weighted match scores reveals that there is a clear cutoff point of 3; any match with a
score less than 3 is invalid.
                            Cumulative  Cumulative
WMS  Frequency   Percent    Frequency   Percent
 1       2        16.67         2         16.67
 2       2        16.67         4         33.33
 3       1         8.33         5         41.67
 4       1         8.33         6         50.00
 5       3        25.00         9         75.00
 6       3        25.00        12        100.00
Once a cutoff point is established, it can be built into the PROC SQL statement. SAS code for including the cutoff
criteria in the PROC SQL code follows.
proc sql ;
create table inner_join6 as
select ref.*, band, ((2*(ref.ssn eq chk.ssn)) +
(2*(ref.id eq chk.id)) + (ref.fname eq chk.fname) +
(ref.lname eq chk.lname) + (ref.dob eq chk.dob)) as wms
from ref inner join chk
on ((ref.ssn eq chk.ssn) or (ref.id eq chk.id) or
(ref.fname eq chk.fname) or (ref.lname eq chk.lname) or
(ref.dob eq chk.dob))
where calculated wms ge 3
order by ref.lname, ref.fname ;
quit ;
Because the variable is created in the query, the keyword "calculated" must be specified.
The resulting data set contains all 8 sample members from the data set REF and the BAND value.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO        6
  2  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO        3
  3  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    5
  4  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES    4
  5  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES    6
  6  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO        5
  7  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES    5
  8  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO        6
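One way to confirm that no sample member was lost to the cutoff is to list the REF observations that do not appear in the result. The sketch below assumes that SSN is unique within REF, which holds for the sample data, and that the result table is named INNER_JOIN6 as above.

```sas
/* Sketch: list REF members with no match at or above the cutoff. */
/* Assumes SSN uniquely identifies each observation in REF.       */
proc sql ;
  select ref.*
    from ref
    where ref.ssn not in (select ssn from inner_join6) ;
quit ;
```

With the sample data this query should select no rows, since all 8 sample members are present in the result.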
Of course, the cutoff is not always so clear, even when weighting the match score to favor identifiers that are
more likely to be unique. The query can be written in such a way so that it is not necessary to determine a cutoff.
AUTOMATICALLY SELECTING THE BEST MATCH
In the code below, a GROUP BY clause has been added to the query and the WHERE clause has been replaced
with a HAVING clause; the logical expression has been replaced as well. The HAVING clause keeps only the
observations whose match score equals the highest match score within their group, as specified in the GROUP BY
clause. In this case, the observation with the highest WMS for each value of REF.SSN is output.
proc sql ;
create table inner_join7 as
select ref.*, band, ((2*(ref.ssn eq chk.ssn)) +
(2*(ref.id eq chk.id)) + (ref.fname eq chk.fname) +
(ref.lname eq chk.lname) + (ref.dob eq chk.dob)) as wms
from ref inner join chk
on ((ref.ssn eq chk.ssn) or (ref.id eq chk.id) or
(ref.fname eq chk.fname) or (ref.lname eq chk.lname) or
(ref.dob eq chk.dob))
group by ref.ssn
having calculated wms eq max(calculated wms)
order by ref.lname, ref.fname ;
quit ;
The listing below shows the groupings specified, separated by blank lines; the highest WMS value in each group is
marked with an asterisk.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210                 2
  2  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO        6 *

  3  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES    1
  4  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO        3 *

  5  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    1
  6  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    5 *

  7  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES    4 *

  8  Paul     McCartney  06/18/1942  234-56-7890  87654                 2
  9  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES    6 *

 10  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO        5 *

 11  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES    5 *

 12  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO        6 *
The observation with the highest value of WMS within each group specified is kept. The resulting data set is
shown below.
Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO        6
  2  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO        3
  3  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES    5
  4  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES    4
  5  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES    6
  6  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO        5
  7  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES    5
  8  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO        6
Note that in cases where there is a "tie," and multiple observations have the highest value within a group, all
observations with that value will be kept. In general, this technique is more flexible than choosing a cutoff from a
frequency table because it allows for variation within groups of identifiers. In order to create the most useful score
possible, it is necessary to know your data as well as possible.
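Because ties are kept, it may be worth checking the result for sample members that still have more than one best match. The sketch below counts rows per SSN in the table created above (INNER_JOIN7) and reports any SSN that appears more than once; it assumes SSN is unique within REF.

```sas
/* Sketch: flag sample members whose highest score is shared by  */
/* more than one CHK observation (ties).                         */
proc sql ;
  select ssn, count(*) as n_best
    from inner_join7
    group by ssn
    having count(*) gt 1 ;
quit ;
```

Any SSN returned by this query would need a manual review or an additional tie-breaking rule.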
CONCLUSIONS
PROC SQL joins offer a tremendous amount of flexibility when combining data sets using multiple identifiers.
Creating a match score with certain comparisons weighted to the programmer's specifications is a powerful and
useful tool for combining data sets.
ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Jedediah Teres
MDRC
16 East 34th St, 19th Floor
New York, NY 10016
(212) 340-8807
[email protected]
www.mdrc.org