Hot Fuzz: using SAS “fuzzy” data matching
techniques to identify duplicate and
related customer records.
Stuart Edwards, Collection House Group
Collection House Group; what do we do?
• Debt Collection; purchased debt, receivables and outsourced collections.
• Banking and Finance, Insurance, Government, Telecom and Utility debt.
• More than 250,000 purchased accounts currently active; aggregate face value exceeding $5 billion.
• Call-centre based; CSOs locate customers and negotiate an outcome.
Skip tracing; the art of finding people.
• Debt is usually sold / outsourced as the customer has defaulted and “skipped”.
• New “Leads” must be identified and worked to contact the customer.
• Leads data can be gathered from multiple providers.
• Our database of existing or previous customers is a free source of leads.
• Cross-referencing is employed to search this data for linkages.
Why do we perform cross-referencing?
• Many customers have multiple debts.
  [Diagram: Customer 1 (07 1234 5678), Customer 2 and an <Unknown> party linked to Credit Card (Account 1), Loan (Account 2), Utility (Account 3), Telco (Account 4) and Motor Loan (Account 7) as Primary, Co-borrower or Guarantor.]
• Makes skip-tracing more efficient and more timely.
• Fuzzy-matching uncovers non-standard matches – e.g. name changes (due to marriage), mis-spellings and co-habitants.
• Linking accounts enables consistent negotiation.
• Cross-referencing results may indicate a customer’s propensity to pay.
What attributes help us to determine a match?
• Need to identify both perfect and potential matches.
  Name: J Smith; DOB: 12/05/1968; 10 Queen St, Brisbane
  Name: John Smith; DOB: 05/12/1968; 4 Merthyr Rd, New Farm
  (Same person?)
• Input data can be of varying quality!
• There are a number of attributes upon which we can match.
  First Name, Middle Name(s), Surname, Maiden Name, Monikers, Initials
  Drivers Licence, Credit Bureau Reference
  Date of Birth, Inverted DD / MM
  Address, DPID, X/Y Coordinates
  Telephone number(s), Email address
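The “Inverted DD / MM” attribute can be illustrated with the two dates of birth from the example above. This is a minimal sketch only: the data set and variable names (Date_Check, DOB, DOB_X, DOB_X_Inverted) are hypothetical and are not taken from the production code.

```sas
data Date_Check;
  /* Dates of birth from the two example records */
  DOB   = '12MAY1968'd;   /* 12/05/1968 */
  DOB_X = '05DEC1968'd;   /* 05/12/1968 */
  format DOB DOB_X DOB_X_Inverted ddmmyy10.;

  /* Perfect match on date of birth */
  Exact_Match = (DOB = DOB_X);

  /* Swap day and month of DOB_X, but only when the swap gives a valid month */
  if day(DOB_X) <= 12 then
    DOB_X_Inverted = mdy(day(DOB_X), month(DOB_X), year(DOB_X));

  /* Potential match: the dates agree once day and month are inverted */
  Inverted_Match = (DOB = DOB_X_Inverted);

  put Exact_Match= Inverted_Match=;
run;
```

For these two dates the log shows Exact_Match=0 and Inverted_Match=1, which is the kind of potential match a straight equality test would miss.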
SAS offers a range of methods for matching data.

Method: Data step match-merge.
  data c1;
    merge a1 b1;
    by <Common Variable>;
  run;
  Benefits: Simple, Base SAS syntax. Fast execution.
  Disadvantages: Not suited to many-to-many matches.

Method: Proc SQL join.
  proc sql;
    create table c1 as
    select a1.*, b1.*
    from a1 left join b1
    on a1.<Common Variable> = b1.<Common Variable>;
  quit;
  Benefits: Generates a Cartesian product; good for many-to-many joins.
  Disadvantages: Slow when merging large data sets. Can fill up disk space if not optimised.

Method: SAS DataFlux.
  Benefits: Optimised for this task. Matching logic in-built.
  Disadvantages: More expensive than our budget for the features we required.

Method: Hash table matching.
  data c1;
    declare Hash HX (dataset: "Match_Data"
    .......etc
  Benefits: In-memory execution. Enables matching on multiple key variables.
  Disadvantages: Complex syntax to write and debug. Can fill up memory with large data sets.
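The many-to-many limitation of the match-merge can be seen on a toy example. The sketch below assumes two hypothetical data sets, a1 and b1, that each hold two rows with the same key value; the variable names Key, ID_A and ID_B are illustrative only.

```sas
/* Two rows per key on each side: a many-to-many relationship */
data a1;
  input Key $ ID_A;
  datalines;
K1 1
K1 2
;
run;

data b1;
  input Key $ ID_B;
  datalines;
K1 10
K1 20
;
run;

/* Match-merge pairs rows by position within the BY group: 2 rows */
data merge_out;
  merge a1 b1;
  by Key;
run;

/* SQL join builds every combination within the key: 4 rows */
proc sql;
  create table join_out as
  select a1.Key, a1.ID_A, b1.ID_B
  from a1 inner join b1
  on a1.Key = b1.Key;
quit;
```

The match-merge returns only the pairs (1, 10) and (2, 20), and SAS writes a note that more than one data set has repeats of BY values. The SQL join returns all four candidate pairs, which is what cross-referencing needs.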
Hash matching was chosen as the preferred strategy.
• Many-to-many matching is required.
• Data volumes are large – 1M+ records matched against 1M+ records.
• In-memory processing – reliable with acceptable speed.
• Matching is performed on common variables.
• Crude filters are applied to eliminate obvious non-matches.

*Match Base_Data to Match_Data via Hash Table;
data Match_Output;
  *Define the match variables in the PDV without reading any rows;
  if 0 then set Match_Data;
  *Declare the Hash key (the dataset: argument loads Match_Data into memory);
  declare Hash HX (dataset: "Match_Data", multidata: 'y', hashexp: 12);
  rc = HX.DefineKey ("KEY1");
  rc = HX.DefineData (<Match Variables to Write>);
  rc = HX.DefineDone ();
  *No explicit add() loop is needed because dataset: already populates the table;
  *Loop through the records in the base data set;
  do until (eofX);
    set Base_Data end = eofX;
    *Look up the first entry stored under this key;
    rc = HX.find ();
    if (rc = 0) then do;
      if <Crude Filter Conditions> then output;
      *Step through any further entries that share the key;
      HX.has_next(result: r);
      do while (r ne 0);
        rc = HX.find_next();
        if <Crude Filter Conditions> then output;
        HX.has_next(result: r);
      end;
    end;
  end;
  stop;
run;
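The lookup pattern above can be tried end to end on a few rows. This is a minimal sketch under stated assumptions: the key values (J520, S530, B650) are placeholders rather than real generated keys, the variables ID_X and Name_X stand in for the match variables, and the crude filter simply drops a record matched to itself. The dataset: argument loads the hash table here, and find() / find_next() are looped on their return codes.

```sas
/* Records to be searched (hypothetical) */
data Match_Data;
  infile datalines dlm=',';
  input KEY1 :$8. ID_X Name_X :$12.;
  datalines;
J520,46132,R Jones
J520,45944,Robbie Jones
S530,40001,A Smith
;
run;

/* Records we want leads for (hypothetical) */
data Base_Data;
  infile datalines dlm=',';
  input KEY1 :$8. ID Name :$12.;
  datalines;
J520,46873,Robert Jones
B650,47000,Ann Brown
;
run;

data Match_Output;
  /* Define ID_X and Name_X in the PDV without reading any rows */
  if 0 then set Match_Data;

  /* multidata: 'y' keeps every record that shares a key */
  declare hash HX (dataset: "Match_Data", multidata: 'y');
  rc = HX.defineKey("KEY1");
  rc = HX.defineData("ID_X", "Name_X");
  rc = HX.defineDone();

  do until (eofX);
    set Base_Data end = eofX;
    rc = HX.find();                  /* first entry for this key, if any */
    do while (rc = 0);
      if ID ne ID_X then output;     /* crude filter: skip self-matches */
      rc = HX.find_next();           /* next entry with the same key */
    end;
  end;
  stop;
  drop rc;
run;
```

Match_Output should hold two rows, pairing ID 46873 with 46132 and with 45944; the Ann Brown record has no entry under its key and produces no output.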
Multiple keys are generated from transformed attributes to perform the hash matches.
• The Soundex function is used to give similar-sounding names a “fuzzy” value.
    SX_Name = soundex(<Name>);
    e.g. JOHN-PAUL and JONNY PAOLO both give J514
• Multiple key values are combined.
    Key_Cat = catt(SX_Name, SX_Surname);
    e.g. J514T2435253
• The MD5 function generates a Hash Key.
    Key_Name = put(md5(Key_Cat),$hex8.);
    e.g. BC590E6B
• The hash key is read into memory and records are matched on the key.
    rc = HX.DefineKey ("Key_Name");
• Multiple hash results are combined by SQL Union.
    proc sql;
      create table Hash_Combined as
      select * from Hash_Result1
      UNION
      select * from Hash_Result2
      ;
    quit;
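The key-generation steps can be run together on the two example names. This is a sketch only: the surname JONES, the data set name Keys and the variable lengths are assumptions made for illustration.

```sas
data Keys;
  length Name Surname $20 SX_Name SX_Surname $20 Key_Cat $40 Key_Name $8;
  infile datalines dlm=',';
  input Name Surname;

  /* Phonetic code for each name part */
  SX_Name    = soundex(Name);
  SX_Surname = soundex(Surname);

  /* Combine the codes, dropping trailing blanks */
  Key_Cat = catt(SX_Name, SX_Surname);

  /* First 4 bytes of the MD5 digest written as 8 hex characters */
  Key_Name = put(md5(Key_Cat), $hex8.);

  put Name= SX_Name= Key_Cat= Key_Name=;
  datalines;
JOHN-PAUL,JONES
JONNY PAOLO,JONES
;
run;
```

Both rows give SX_Name=J514 and therefore the same Key_Cat and Key_Name, so the two spellings would meet on the same hash key. Key_Cat is hashed at its full declared length, so the key variable should have the same length in every data set that is matched.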
Filtering is undertaken to remove unlikely match pairs.
• Crude filtering applied to exclude obvious non-matches.

ID    | ID_X  | Key   | Key Type          | Name         | Name_X       | Address           | Address_X         | Distance | Score | Match Type
46873 | 46132 | 1FGH8 | Surname + Initial | Robert Jones | R Jones      | 1 King St, Sydney | 1 King St, Sydney | 0 km     |       |
46873 | 45944 | 2J9KX | Name + DOB        | Robert Jones | Robbie Jones | 1 King St, Sydney | 5 New Rd, Pyrmont | 2.1 km   |       |
46873 | 40185 | J6R7F | Name + DPID       | Robert Jones | Mary Jones   | 1 King St, Sydney | 1 King St, Sydney | 0 km     |       |
46873 | 42922 | L13A9 | Name              | Robert Jones | Bob Jones    | 1 King St, Sydney | 18 Old St, Bondi  | 6.2 km   |       |

• PAF data is used to cleanse and validate addresses.
  o DPID (premises-specific delivery ID) used in Key generation.
  o X/Y Coordinates, to calculate geographical distance.
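A distance between two addresses can be derived from coordinates with the Base SAS GEODIST function. The sketch below assumes the X/Y coordinates are held as latitude and longitude in decimal degrees; the coordinate values, the variable names and the 5 km cut-off are illustrative only and are not taken from PAF data or from the production filter.

```sas
data Distance_Check;
  /* Illustrative coordinates in decimal degrees */
  Lat   = -33.8688;  Long   = 151.2093;
  Lat_X = -33.8700;  Long_X = 151.1950;

  /* 'D' = inputs are in degrees, 'K' = result in kilometres */
  Distance_Km = geodist(Lat, Long, Lat_X, Long_X, 'DK');

  /* Hypothetical crude filter: keep pairs within 5 km of each other */
  Within_5Km = (Distance_Km <= 5);

  put Distance_Km= 8.1 Within_5Km=;
run;
```

If the coordinates were stored in a projected grid rather than as latitude and longitude, a plain Euclidean distance on the X and Y values would be the equivalent calculation.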
Enterprise Miner is used to derive a propensity-to-match score using available points of ID.
• Customer records with many points of ID are used to seed a model.
  [Example record: JOHN D SMITH; 28/05/1975; QLD1678904; 5 Wombat St, Kangaroo Pt QLD 4069; 7 Echidna Rd, Sherwood QLD 4075; 0412 345 678]
• Credit Bureau data verifies the match and sets the Target variable (match or no match).
• Points of ID are selectively removed.
• Hash matching is conducted on the sampled data.
• The propensity-to-match model is built on the validated target variable.
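The modelling idea can be sketched outside Enterprise Miner. In the example below PROC LOGISTIC is swapped in as a simple stand-in for the Enterprise Miner model nodes, and the seed data is simulated; the data set names, the input variables (Same_DOB, Same_Phone, Distance_Km) and the coefficients used to simulate the target are all hypothetical.

```sas
/* Simulated seed pairs: flags for which points of ID agree */
data Seed_Pairs;
  call streaminit(2015);
  do Pair_ID = 1 to 500;
    Same_DOB    = rand('bernoulli', 0.4);
    Same_Phone  = rand('bernoulli', 0.3);
    Distance_Km = round(10 * rand('uniform'), 0.1);

    /* Target: 1 = verified match, 0 = no match (simulated here) */
    lin    = -1 + 2*Same_DOB + 1.5*Same_Phone - 0.2*Distance_Km;
    p      = 1 / (1 + exp(-lin));
    Target = rand('bernoulli', p);
    output;
  end;
  drop lin p;
run;

/* Fit the propensity-to-match model and score the pairs */
proc logistic data=Seed_Pairs;
  model Target(event='1') = Same_DOB Same_Phone Distance_Km;
  score data=Seed_Pairs out=Scored_Pairs;
run;
```

Scored_Pairs carries the predicted probability of a match in the variable P_1, which could then be rescaled into a score for ranking candidate pairs.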
Enterprise Miner is used to derive a propensity-to-match score using available points of ID.
• Results of model nodes determine the worth of input variables.
• Models are compared and the best is used to generate score code.
• Score logic is applied to records as given to deliver a match likelihood score.

ID    | ID_X  | Key   | Key Type          | Name         | Name_X       | Address           | Address_X         | Distance | Score | Match Type
46873 | 46132 | 1FGH8 | Surname + Initial | Robert Jones | R Jones      | 1 King St, Sydney | 1 King St, Sydney | 0 km     | 920   | EXACT
46873 | 45944 | 2J9KX | Name + DOB        | Robert Jones | Robbie Jones | 1 King St, Sydney | 5 New Rd, Pyrmont | 2.1 km   | 615   | PROBABLE
46873 | 40185 | J6R7F | Name + DPID       | Robert Jones | Mary Jones   | 1 King St, Sydney | 1 King St, Sydney | 0 km     | 276   | FAMILY
46873 | 42922 | L13A9 | Name              | Robert Jones | Bob Jones    | 1 King St, Sydney | 18 Old St, Bondi  | 6.2 km   | 102   | POSSIBLE
Filtered results are displayed to our Collections Agents through the CRM.
• Pertinent information is displayed to CSOs for assessment.

Account: 46021358721    Type:
  Name: ROBERT JONES
  DOB: 12/04/1977
  Address: 1 KING ST, SYDNEY, NSW 2000
  D/L: N/A
  Home Ph: 02 2134 5768
  Mobile Ph: N/A
  Work Ph: N/A
  Client: BIG FINANCE (BF1234)
  Product: CREDIT CARD
  Balance: $5,241
  Last Paid: -
  Last Contact: 15/03/2015
  Current Intention: Located, no active commitment.

Account: 46021358721    Type:
  Name: ROBBIE JONES
  DOB: 12/04/1977
  Address: 5 NEW RD, PYRMONT, NSW 2006
  D/L: NSW268427
  Home Ph: N/A
  Mobile Ph: 0411 223 344
  Work Ph: N/A
  Client: POWER2 (PT5678)
  Product: ELECTRICITY SUPPLY
  Balance: $0
  Last Paid: 12/01/2015
  Last Contact: 12/01/2015
  Current Intention: Account Paid in Full.
Project successes and lessons learned.
• Time for more than 20 FTE saved by eliminating manual XREF.
• The new cross-reference approach identified new match pairs.
• Filtering has to strike a balance between too many and too few “possible” matches.
• Performing many-to-many matches on large datasets can be resource hungry and result in crashes.
• Hash matching proved the most reliable solution for us.
• Enterprise Miner derived score models allowed more qualitative assessment.
References to useful Hash Table resources.
• SAS Hash Object Programming Made Easy.
  Michele M. Burlew (SAS Press, 2012).
• Think FAST! Use Memory Tables (Hashing) for Faster Merging.
  SUGI 31, Paper 244-31 (Gregg Snell, Data Savant Consulting, KS).
• An Introduction to Hash Tables.
  SAS Canada User Groups, 2013 (Shaun Kaufmann, Farm Credit Canada).
• Table Look-Up by Direct Addressing: Key-Indexing – Bitmapping – Hashing.
  SUGI 26, Paper 8-26 (Paul M. Dorfman, CitiCorp AT&T Universal Card, Jacksonville, FL).
• Choosing the Right Technique to Merge Large Data Sets Efficiently.
  SAS Global Forum 2009, Paper 071-2009 (Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA).
Questions?
Stuart Edwards
Head of Analytics, Collection House Group
[email protected]