NESUG 2009 Coders' Corner

Using SQL Joins to Perform Weighted Matches on Multiple Identifiers
Jedediah J. Teres, MDRC, New York, NY

ABSTRACT
Matching observations from different data sources is problematic without a reliable shared identifier. Using multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create a match score based on the number of matching identifiers. This score can be weighted to favor certain matches or sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper will describe a technique for using a match score in the context of a SQL join to find matches in cases when all identifiers are not exactly the same.

INTRODUCTION
Using a MERGE statement in a DATA step to combine data sets is a reliable and intuitive technique. However, match-merges require a common variable on all data sets being combined and exact matches on this variable. This paper explores methods of combining data sets using multiple variables as matching criteria in PROC SQL. Knowledge of using a match-merge in a DATA step as well as the use of the SELECT statement in PROC SQL is assumed.

SAMPLE DATA SETS
Two SAS data sets are used for illustration purposes in this paper: REF and CHK.

Data set REF:

Obs  FNAME    LNAME      DOB         SSN          ID
  1  George   Harison    02/25/1943  123-45-6789  98765
  2  Paul     McCartney  06/18/1942  234-56-7890  87654
  3  John     Lemon      10/09/1940  345-67-8901  76543
  4  Ringo    Starr      02/25/1943  456-78-9012  65432
  5  Pete     Townshend  05/19/1945  567-89-0123  54321
  6  Roger    Daltrey    03/01/1944  678-90-1234  43210
  7  John     Entwistle  10/09/1944  789-01-2345  32109
  8  Keith    Moon       08/23/1946  890-12-3456  21098

Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member.
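
For readers who want to reproduce the examples, the REF listing above could be entered with a DATA step along the following lines. This is only a minimal sketch: the variable lengths, the choice to store SSN and ID as character values, and the use of the MMDDYY10. informat and format for DOB are assumptions, since the paper shows only the printed listing and not how the data set was built.

```sas
/* Sketch only: types, lengths, and the date informat are assumptions. */
data ref ;
   /* Declare variables in listing order; SSN and ID assumed character. */
   length fname $ 10 lname $ 12 dob 8 ssn $ 11 id $ 5 ;
   /* Modified list input reads DOB with a date informat. */
   input fname lname dob : mmddyy10. ssn id ;
   format dob mmddyy10. ;
   datalines ;
George Harison   02/25/1943 123-45-6789 98765
Paul   McCartney 06/18/1942 234-56-7890 87654
John   Lemon     10/09/1940 345-67-8901 76543
Ringo  Starr     02/25/1943 456-78-9012 65432
Pete   Townshend 05/19/1945 567-89-0123 54321
Roger  Daltrey   03/01/1944 678-90-1234 43210
John   Entwistle 10/09/1944 789-01-2345 32109
Keith  Moon      08/23/1946 890-12-3456 21098
;
run ;
```

Whatever types are chosen, the same identifier should have the same type in both data sets, because the joins later in the paper compare the two files' values directly with EQ.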

Data set CHK:

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harrison   02/25/1948  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES
  3  John     Lennon     10/09/1940  345-67-8907  76543  THE BEATLES
  4  Richard  Starkey    02/25/1943  456-78-9012  65432  THE BEATLES
  5  Pete     Townsend   05/19/1945  567-89-0123  54321  THE WHO
  6  Roger    Daltery    03/01/1944  678-90-1234  43210  THE WHO
  7  Jon      Entwistle  09/10/1944  789-01-2346  32109  THE WHO
  8  Keith    Moon       08/23/1946  890-21-3456  21098  THE WHO
  9  Roger    Daltrey    04/30/1955  987-65-4231  71395
 10  Saul     McCartney  06/18/1942  234-56-8901  82651

Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual. The data set CHK contains the same variables as the data set REF plus an additional variable BAND. The goal is to merge the data set CHK to the data set REF in order to pick up the BAND variable for the people in the sample.

USING A MATCH-MERGE TO COMBINE DATA SETS
Perhaps the most straightforward method of combining two data sets is using a match-merge in a DATA step. However, successfully combining data sets requires a common identifier between the two files. Here, a match-merge is used to combine data sets REF and CHK. SSN is specified as the common identifier.

   proc sort data = ref ;
      by ssn ;
   run ;

   proc sort data = chk ;
      by ssn ;
   run ;

   data merged1 ;
      merge ref (in = ref)
            chk (in = chk) ;
      by ssn ;
      if ref and chk ;
   run ;

The resulting data set contains only 5 observations. The original data set REF had 8, so 3 observations did not have a match in CHK.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harrison   02/25/1948  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES
  3  Richard  Starkey    02/25/1943  456-78-9012  65432  THE BEATLES
  4  Pete     Townsend   05/19/1945  567-89-0123  54321  THE WHO
  5  Roger    Daltery    03/01/1944  678-90-1234  43210  THE WHO

A closer inspection of the data reveals that there are several SSNs that do not match exactly.

REF                              CHK
FNAME   LNAME      SSN           FNAME    LNAME      SSN
George  Harison    123456789     George   Harrison   123456789
Paul    McCartney  234567890     Paul     McCartney  234567890
John    Lemon      345678901     John     Lennon     345678907
Ringo   Starr      456789012     Richard  Starkey    456789012
Pete    Townshend  567890123     Pete     Townsend   567890123
Roger   Daltrey    678901234     Roger    Daltery    678901234
John    Entwistle  789012345     Jon      Entwistle  789012346
Keith   Moon       890123456     Keith    Moon       890213456
                                 Roger    Daltrey    987654231
                                 Saul     McCartney  234568901

The inconsistencies in the values of SSN between the two files are fairly minor and seem like plausible data entry errors. With a data set this small, values could be hardcoded so they match on both files. In a larger data set, that would not be a viable solution, so it's well worth considering what other options are available. Since there are other identifiers available on the file, a combination of criteria for the match-merge would intuitively result in a greater number of matched observations. A match-merge using SSN and LNAME is shown below:

   proc sort data = ref ;
      by ssn lname ;
   run ;

   proc sort data = chk ;
      by ssn lname ;
   run ;

   data merged2 ;
      merge ref (in = ref)
            chk (in = chk) ;
      by ssn lname ;
      if ref and chk ;
   run ;

The resulting data set, however, has only 1 observation.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  Paul     McCartney  06/13/1942  234-56-7890  87654  THE BEATLES

Why did the match-merge produce so few results? Upon further inspection, it becomes clear that only Paul McCartney has the same SSN and spelling of his last name on both files.

REF                              CHK
FNAME   LNAME      SSN           FNAME    LNAME      SSN
George  Harison    123456789     George   Harrison   123456789
Paul    McCartney  234567890     Paul     McCartney  234567890
John    Lemon      345678901     John     Lennon     345678907
Ringo   Starr      456789012     Richard  Starkey    456789012
Pete    Townshend  567890123     Pete     Townsend   567890123
Roger   Daltrey    678901234     Roger    Daltery    678901234
John    Entwistle  789012345     Jon      Entwistle  789012346
Keith   Moon       890123456     Keith    Moon       890213456
                                 Roger    Daltrey    987654231
                                 Saul     McCartney  234568901

Using multiple criteria in the match-merge restricted the pool of matches, rather than expanding it. Therefore, if the goal is to use multiple identifiers to combine these data sets, the DATA step is not sufficient.

USING AN SQL JOIN TO COMBINE DATA SETS
Using a JOIN in SQL offers more flexibility when combining data sets. The basic syntax for combining data sets using a join in PROC SQL follows.

   proc sql ;
      create table inner_join1 as
      select ref.*, band
      from ref inner join chk
      on (ref.ssn eq chk.ssn) ;
   quit ;

This inner join returns the same five matched observations as the match-merge on SSN, where only observations in both REF and CHK were kept. Because the query selects ref.*, the identifier values shown are those from REF.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  3  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  4  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  5  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO

SSN and LNAME can both be used as criteria for the inner join, and here we can specify that either SSN or LNAME match; recall that the match-merge required that both SSN and LNAME match.

   proc sql ;
      create table inner_join2 as
      select ref.*, band
      from ref inner join chk
      on (ref.ssn eq chk.ssn) or
         (ref.lname eq chk.lname) ;
   quit ;

The join produces a larger data set. There are still some problems, however. There are now duplicate records for Paul McCartney and Roger Daltrey, because they were also matched to Saul McCartney and a different Roger Daltrey in the data set CHK.
Also, John Lennon (entered as John Lemon in REF) is not included in the join.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  3  Paul     McCartney  06/18/1942  234-56-7890  87654
  4  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  5  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  6  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO
  7  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO
  8  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210

Since using two identifiers did not produce a match for everyone in the sample, what would happen if all identifiers were used as criteria in the join? The code below allows for matches on any of the identifiers in the data sets.

   proc sql ;
      create table inner_join3 as
      select ref.*, band
      from ref inner join chk
      on ((ref.ssn eq chk.ssn) or
          (ref.id eq chk.id) or
          (ref.fname eq chk.fname) or
          (ref.lname eq chk.lname) or
          (ref.dob eq chk.dob)) ;
   quit ;

This query creates the following data set:

Obs  FNAME    LNAME      DOB         SSN          ID     BAND
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  2  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES
  3  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES
  4  Paul     McCartney  06/18/1942  234-56-7890  87654
  5  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES
  6  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES
  7  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO
  8  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210
 10  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES
 11  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO
 12  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO

Now everyone in REF is accounted for, but at the cost of having four sets of duplicate records on the file. Clearly, a method of refining the matching process is necessary.

CREATING A MATCH SCORE
A match score is a variable created during the join.
In this case, it is equal to the number of matching identifiers. Instead of requiring two numbers to be equal, it is also possible to specify a tolerable range of error (e.g., matching on three out of five identifiers). Here, a match score, MS, is created that is equal to the sum of all matching identifiers. In SAS, each logical comparison evaluates to 1 when true and 0 when false, so adding the comparisons together counts the identifiers that match.

   proc sql ;
      create table inner_join4 as
      select ref.*, band,
             ((ref.ssn eq chk.ssn) +
              (ref.id eq chk.id) +
              (ref.fname eq chk.fname) +
              (ref.lname eq chk.lname) +
              (ref.dob eq chk.dob)) as ms
      from ref inner join chk
      on ((ref.ssn eq chk.ssn) or
          (ref.id eq chk.id) or
          (ref.fname eq chk.fname) or
          (ref.lname eq chk.lname) or
          (ref.dob eq chk.dob)) ;
   quit ;

The match score is shown below.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND         MS
  1  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  3
  2  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  1
  3  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES  4
  4  Paul     McCartney  06/18/1942  234-56-7890  87654               2
  5  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES  3
  6  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES  3
  7  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO      4
  8  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO      4
  9  Roger    Daltrey    03/01/1944  678-90-1234  43210               2
 10  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES  1
 11  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO      2
 12  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO      4

Evaluating a frequency of a match score can give some insight into where the appropriate cutoff should be.

MS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
 1          2    16.67                     2               16.67
 2          3    25.00                     5               41.67
 3          3    25.00                     8               66.67
 4          4    33.33                    12              100.00

In this case, it's unclear exactly where to draw the line. Matches with a score of 1 are clearly not valid, but at least one match with a score of 2 is correct.

CREATING A WEIGHTED MATCH SCORE
In practice, some identifiers are more important than others.
It is much more likely that two people share a birth date than a Social Security number. Accordingly, it is possible, and perhaps preferable, to weight some logical comparisons more heavily than others when calculating a match score. The syntax for creating a weighted match score follows.

   proc sql ;
      create table inner_join5 as
      select ref.*, band,
             ((2*(ref.ssn eq chk.ssn)) +
              (2*(ref.id eq chk.id)) +
              (ref.fname eq chk.fname) +
              (ref.lname eq chk.lname) +
              (ref.dob eq chk.dob)) as wms
      from ref inner join chk
      on ((ref.ssn eq chk.ssn) or
          (ref.id eq chk.id) or
          (ref.fname eq chk.fname) or
          (ref.lname eq chk.lname) or
          (ref.dob eq chk.dob))
      order by ref.lname, ref.fname, wms ;
   quit ;

Here the match score is weighted so that it favors matches on SSN and ID, a secondary numeric identifier.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210               2
  2  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO      6
  3  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES  1
  4  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO      3
  5  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  1
  6  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  5
  7  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES  4
  8  Paul     McCartney  06/18/1942  234-56-7890  87654               2
  9  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES  6
 10  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO      5
 11  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES  5
 12  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO      6

Examining the frequency of weighted match scores reveals that there is a clear cutoff point of 3; any match with a score less than 3 is invalid.

WMS  Frequency  Percent  Cumulative Frequency  Cumulative Percent
  1          2    16.67                     2               16.67
  2          2    16.67                     4               33.33
  3          1     8.33                     5               41.67
  4          1     8.33                     6               50.00
  5          3    25.00                     9               75.00
  6          3    25.00                    12              100.00

Once a cutoff point is established, it can be built into the PROC SQL statement. SAS code for including the cutoff criteria in the PROC SQL code follows.

   proc sql ;
      create table inner_join6 as
      select ref.*, band,
             ((2*(ref.ssn eq chk.ssn)) +
              (2*(ref.id eq chk.id)) +
              (ref.fname eq chk.fname) +
              (ref.lname eq chk.lname) +
              (ref.dob eq chk.dob)) as wms
      from ref inner join chk
      on ((ref.ssn eq chk.ssn) or
          (ref.id eq chk.id) or
          (ref.fname eq chk.fname) or
          (ref.lname eq chk.lname) or
          (ref.dob eq chk.dob))
      where calculated wms ge 3
      order by ref.lname, ref.fname ;
   quit ;

Because the variable is created in the query, the keyword "calculated" must be specified. The resulting data set contains all 8 sample members from the data set REF and the BAND value.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO      6
  2  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO      3
  3  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  5
  4  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES  4
  5  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES  6
  6  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO      5
  7  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES  5
  8  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO      6

Of course, the cutoff is not always so clear, even when weighting the match score to favor identifiers that are more likely to be unique. The query can be written in such a way that it is not necessary to determine a cutoff.

AUTOMATICALLY SELECTING THE BEST MATCH
In the code below, a GROUP BY clause has been added to the query and the WHERE clause has been replaced with a HAVING clause; the logical expression has been replaced as well. The HAVING clause keeps only the rows whose match score equals the highest match score within each group specified in the GROUP BY clause. In this case, each observation with the highest value within each value of SSN is output.

   proc sql ;
      create table inner_join7 as
      select ref.*, band,
             ((2*(ref.ssn eq chk.ssn)) +
              (2*(ref.id eq chk.id)) +
              (ref.fname eq chk.fname) +
              (ref.lname eq chk.lname) +
              (ref.dob eq chk.dob)) as wms
      from ref inner join chk
      on ((ref.ssn eq chk.ssn) or
          (ref.id eq chk.id) or
          (ref.fname eq chk.fname) or
          (ref.lname eq chk.lname) or
          (ref.dob eq chk.dob))
      group by ref.ssn
      having calculated wms eq max(calculated wms)
      order by ref.lname, ref.fname ;
   quit ;

The listing below shows the groupings specified, separated by blank lines; the highest WMS value in each group is marked with an asterisk.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210               2
  2  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO      6 *

  3  John     Entwistle  10/09/1944  789-01-2345  32109  THE BEATLES  1
  4  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO      3 *

  5  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  1
  6  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  5 *

  7  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES  4 *

  8  Paul     McCartney  06/18/1942  234-56-7890  87654               2
  9  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES  6 *

 10  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO      5 *

 11  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES  5 *

 12  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO      6 *

The observation with the highest value of WMS within each group specified is kept. The resulting data set is shown below.

Obs  FNAME    LNAME      DOB         SSN          ID     BAND         WMS
  1  Roger    Daltrey    03/01/1944  678-90-1234  43210  THE WHO      6
  2  John     Entwistle  10/09/1944  789-01-2345  32109  THE WHO      3
  3  George   Harison    02/25/1943  123-45-6789  98765  THE BEATLES  5
  4  John     Lemon      10/09/1940  345-67-8901  76543  THE BEATLES  4
  5  Paul     McCartney  06/18/1942  234-56-7890  87654  THE BEATLES  6
  6  Keith    Moon       08/23/1946  890-12-3456  21098  THE WHO      5
  7  Ringo    Starr      02/25/1943  456-78-9012  65432  THE BEATLES  5
  8  Pete     Townshend  05/19/1945  567-89-0123  54321  THE WHO      6

Note that in cases where there is a "tie," and multiple observations have the highest value within a group, all observations with that value will be kept. In general, this technique is more flexible than choosing a cutoff from a frequency table because it allows for variation within groups of identifiers. In order to create the most useful score possible, it is necessary to know your data as well as possible.

CONCLUSIONS
PROC SQL joins offer a tremendous amount of flexibility when combining data sets using multiple identifiers. Creating a match score with certain comparisons weighted to the programmer's specifications is a powerful and useful tool for combining data sets.

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION
Jedediah Teres
MDRC
16 East 34th St, 19th Floor
New York, NY 10016
(212) 340-8807
[email protected]
www.mdrc.org