Introduction – Entity Match Service
To incorporate as much institutional data as possible into our central alumni and donor database (hereafter referred to
as “CADS”), we’ve developed a comprehensive suite of automated entity match services. The CADS database
contains millions of entity records. The Entity Match Service Suite determines whether one of those records
corresponds to the person represented by the input data and, if so, which record. Unfortunately, looking only for
exact matches on attributes such as Name, Address, and Telephone will miss many true matches, potentially
causing duplicate records to be created in CADS. An exact match can fail for many reasons:
ambiguity in data (Thomas Smith and Tom Smith may represent the same person); unformatted data (the same
address may be written in multiple ways); missing data elements; out-of-date information; or unrecognized partial
matches. Beyond the difficulties of exact matching, the sheer volume of data in CADS
requires that the number of candidate records be narrowed before matching.
To address these challenges, the Entity Match Web Service is divided into four general steps:
1. Receiving and accepting the data
2. Deciding which CADS entities to match against (Blocking)
3. Matching the data against the chosen CADS entities (Preliminary Match)
4. Using the Name data to refine and confirm the match results (Secondary Match)
Step-by-Step Description
In Step 1, the service receives any or all of the following input: First Name, Middle Name, Last Name, Address,
Phone, and Email. If the input fails to meet the minimum requirements, or if the service is unavailable, an error will
be generated. The minimum requirements for the Entity Match Service Suite are:
During Step 2, the service determines which CADS entities will be looked at as potential matches. Instead of trying
to match all the provided input fields against 1.4 million CADS entities, the service uses CADS data to choose a small
subset to pass on to Step 3. The three criteria used to make this choice are:
In Step 3, each piece of input data is matched against its counterpart for each CADS entity identified during Step 2.
An exact match is attempted on all names, addresses, phones, and emails on the CADS record. Each time a match is
found on an attribute, the CADS entity’s match score is increased; each time a non-match is found, the entity’s score
is decreased. For a detailed look at how the match score is calculated, please see the Appendix (Weights and Matching).
By the end of Step 3, all potential matches are assigned a match confidence value. The values are determined based
on the aggregate score calculated during the match process. The various values and their score thresholds are as
follows:
• Low: match score less than or equal to 0
• Medium: match score greater than 0 but less than or equal to 8.8
• High: match score greater than 8.8
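The thresholds above can be expressed as a small classification function. This is a minimal sketch rather than the service's actual code; the names `Confidence` and `classify` are illustrative assumptions, and only the cut-off values 0 and 8.8 come from the list above.

```python
from enum import Enum


class Confidence(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


# Cut-offs taken from the list above; the constant names are hypothetical.
LOW_CEILING = 0.0     # scores at or below this value are Low
MEDIUM_CEILING = 8.8  # scores above LOW_CEILING and at or below this value are Medium


def classify(score: float) -> Confidence:
    """Map an aggregate match score to a match confidence value."""
    if score <= LOW_CEILING:
        return Confidence.LOW
    if score <= MEDIUM_CEILING:
        return Confidence.MEDIUM
    return Confidence.HIGH
```

Note that both boundaries are inclusive on the lower band: a score of exactly 0 is Low, and a score of exactly 8.8 is Medium.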
Based on the confidence values assigned to each CADS entity in the pool of potential matches, the following actions
are taken:
Highest Confidence Value in Result Set | Action Performed
No results                             | Pass input data to Entity Create Service; automatically create new CADS entity
Low                                    | Pass input data to Entity Create Service; automatically create new CADS entity
Medium                                 | Pass Medium result(s) to Exception Interface
High                                   | Pass High result(s) to Step 4
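The routing rule can be sketched as follows. The function name `route`, the dictionary-based result set, and the returned tuple are assumptions made for illustration; only the four outcomes are taken from the source.

```python
from typing import Dict, List, Tuple

# Ordering of confidence values, lowest to highest.
RANK = {"Low": 0, "Medium": 1, "High": 2}


def route(results: Dict[str, str]) -> Tuple[str, List[str]]:
    """Choose the next destination from the highest confidence value present.

    `results` maps a (hypothetical) CADS entity id to its confidence value.
    Returns the destination name and the entity ids forwarded to it.
    """
    if not results:
        return ("Entity Create Service", [])
    top = max(results.values(), key=RANK.__getitem__)
    if top == "Low":
        # Low results are not forwarded; a new entity is created instead.
        return ("Entity Create Service", [])
    forwarded = sorted(eid for eid, conf in results.items() if conf == top)
    destination = "Exception Interface" if top == "Medium" else "Step 4"
    return (destination, forwarded)
```

Only the results at the highest confidence level are forwarded, so a Medium result is dropped whenever the same result set also contains a High result.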
Step 4 takes a final look at the name input and compares it to the name data present on each
matched CADS record. Because Step 2 looks at attributes other than name when choosing entities to pass to Step 3,
it’s possible for two members of the same household to both be assigned a High match confidence value. It’s also
possible that the name input might include a misspelling or use a nickname not recorded on a CADS entity’s record.
To compensate for these possibilities, the service runs all High confidence results through a decision tree made up
of three name checks:
The first name check compares the calculated Oracle SoundEx value of the input name with the known SoundEx
value of all names found on the matched entity’s record. This screens out members of the same household who
have different first names.
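To make the first name check concrete, here is a minimal sketch of the standard American Soundex algorithm in Python. The service itself uses Oracle's SoundEx function; this reimplementation is an assumption for illustration and may differ from Oracle's behavior in edge cases. The helper `passes_soundex_check` is a hypothetical name.

```python
from typing import Iterable

# Consonant groups used by standard Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:  # adjacent letters with the same code count once
                code += digit
            prev = digit
        elif ch not in "HW":   # vowels (and Y) separate repeated codes; H and W do not
            prev = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def passes_soundex_check(input_first: str, record_names: Iterable[str]) -> bool:
    """True if the input name's code equals the code of any name on the record."""
    target = soundex(input_first)
    return bool(target) and any(soundex(n) == target for n in record_names)
```

With this sketch, "Jon" and "John" both encode to J500 and so pass the check, while "Thomas" (T520) and "Tom" (T500) do not, which is one reason the later checks are needed.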
The second name check employs the Jaro-Winkler similarity metric to measure how closely the
input First Name resembles each First Name found on the matched entity’s record. Comparing the names character-by-character prevents names that sound different due to a misspelling from erroneously excluding an entity that is an
otherwise correct match.
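The following is a minimal sketch of the Jaro-Winkler similarity in Python, using the conventional prefix scale of 0.1 and a prefix limit of four characters. The source does not state the pass threshold the service uses, so `threshold=0.9` and the helper name `passes_jaro_winkler_check` are assumptions for illustration only.

```python
from typing import Iterable


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def passes_jaro_winkler_check(input_first: str, record_first_names: Iterable[str],
                              threshold: float = 0.9) -> bool:
    """True if any recorded first name is at least `threshold` similar (assumed cut-off)."""
    needle = input_first.strip().casefold()
    return any(jaro_winkler(needle, n.strip().casefold()) >= threshold
               for n in record_first_names)
```

For example, the misspelling "Robet" has a different Soundex code from "Robert" but still scores well above 0.9 here, so an otherwise correct match is not excluded.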
The third name check looks up the input First Name in a custom-built synonym table; if any synonyms are
found, it compares those synonyms to all First Names found on the matched entity’s record.
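A synonym lookup of this kind can be sketched with a dictionary. The table contents below are invented examples, not entries from the service's actual synonym table, and `passes_synonym_check` is a hypothetical name.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical sample entries; the real synonym table is custom-built for CADS.
SAMPLE_SYNONYMS: Dict[str, Set[str]] = {
    "tom": {"thomas", "tommy"},
    "thomas": {"tom", "tommy"},
    "bob": {"robert", "rob", "bobby"},
    "robert": {"bob", "rob", "bobby"},
}


def passes_synonym_check(input_first: str, record_first_names: Iterable[str],
                         table: Optional[Dict[str, Set[str]]] = None) -> bool:
    """True if any synonym of the input name appears among the record's first names."""
    table = SAMPLE_SYNONYMS if table is None else table
    synonyms = table.get(input_first.strip().lower(), set())
    if not synonyms:
        return False  # no synonyms on file, so this check cannot pass
    return any(name.strip().lower() in synonyms for name in record_first_names)
```

This check covers nickname pairs such as Tom and Thomas, which neither a phonetic code nor a character-level comparison reliably catches.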
Any single High confidence match that passes any one of the name checks is considered to represent the same
individual as the input data. The input data is passed to the Entity Update Service, which will automatically apply any
new or different input data to the applicable CADS entity record.
Any High confidence match that cannot pass the three name checks is assigned a new confidence value of “High*”.
This conditional high value indicates that although the CADS entity was a numerically strong candidate, there is
insufficient name evidence for the Entity Match Service to decide that the CADS entity and the entity represented by
the input data are truly the same individual. These records are passed to the Exception Interface where an expert
can make the final determination regarding the match.
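The outcome of Step 4 can be summarized in a short sketch. It assumes the three checks are tried in the order described and that evaluation stops at the first check that passes; the original decision tree diagram is not reproduced here, so that ordering is inferred from the text. The function name and the stand-in checks in the usage lines are hypothetical.

```python
from typing import Callable, Iterable, Sequence

# A name check takes the input first name and the record's first names.
NameCheck = Callable[[str, Sequence[str]], bool]


def resolve_high_match(input_first: str, record_first_names: Sequence[str],
                       checks: Iterable[NameCheck]) -> str:
    """Return 'High' if any name check passes, otherwise the conditional 'High*'."""
    for check in checks:
        if check(input_first, record_first_names):
            return "High"   # treated as the same individual; goes to Entity Update Service
    return "High*"          # insufficient name evidence; goes to Exception Interface


# Usage with simple stand-in checks (real checks would be phonetic, similarity, synonym):
exact = lambda name, names: name.lower() in (n.lower() for n in names)
never = lambda name, names: False
print(resolve_high_match("Ann", ["ANN"], [never, exact]))
print(resolve_high_match("Ann", ["Bob"], [never, exact]))
```

Passing the checks in as callables keeps the sketch independent of how each check is implemented.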
Example of Match Process
Appendix – Weights and Matching
During the match process, each attribute is assigned a positive or negative match score, also known as a “weight.”
The value of each weight was precalculated during the development process of the Entity Match Service Suite; the
array of positive and negative weights is specially calibrated for the CADS dataset.
Attribute   | Weight (Positive Match) | Weight (Data Not Present in Input OR Not Present in CADS) | Weight (Negative Match)
First Name  | +1.1202                 | 0.0000                                                    | -4.7506
Middle Name | +1.0071                 | 0.0000                                                    | -0.4485
Last Name   | +2.0433                 | 0.0000                                                    | -5.2233
Address     | +7.3102                 | 0.0000                                                    | -2.2687
Phone       | +6.7998                 | 0.0000                                                    | -1.1676
Email       | +7.3328                 | 0.0000                                                    | -2.3590
Highest Possible Score: 25.6133
Lowest Possible Score: -16.2177
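As an illustration of how the weights combine, the sketch below sums the per-attribute weights for a set of match outcomes. The weight values come from the table above; the function name `match_score`, the attribute keys, and the use of `True`/`False`/`None` to encode the three outcomes are assumptions.

```python
from typing import Mapping, Optional

# (positive weight, negative weight) per attribute, copied from the table above.
WEIGHTS = {
    "first_name":  (1.1202, -4.7506),
    "middle_name": (1.0071, -0.4485),
    "last_name":   (2.0433, -5.2233),
    "address":     (7.3102, -2.2687),
    "phone":       (6.7998, -1.1676),
    "email":       (7.3328, -2.3590),
}


def match_score(outcomes: Mapping[str, Optional[bool]]) -> float:
    """Sum the weights for one candidate entity.

    True means the attribute matched, False means it did not match, and
    None (or a missing key) means the data was absent from the input or
    from CADS, which contributes 0.
    """
    total = 0.0
    for attribute, (positive, negative) in WEIGHTS.items():
        outcome = outcomes.get(attribute)
        if outcome is True:
            total += positive
        elif outcome is False:
            total += negative
    return round(total, 4)


# Names all differ, contact details all match:
names_differ = {"first_name": False, "middle_name": False, "last_name": False,
                "address": True, "phone": True, "email": True}
print(match_score(names_differ))
```

When every attribute is a negative match, the sum reproduces the stated minimum of -16.2177. The six positive weights as printed sum to 25.6134, one ten-thousandth above the stated maximum of 25.6133, which presumably reflects rounding in the published figures.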
Results Skewed Toward Address, Phone, and Email
Because far fewer people share an Address, Phone, and/or Email than share a name, these three attributes carry a
high positive weight. In other words, two sets of entity information that share an Address, Phone, and/or Email are
statistically likely to represent the same person. This does not mean, however, that two sets of
entity information that do not share an Address, Phone, and/or Email are strongly predisposed not to represent the
same person. The power of the Address, Phone, and Email match is one of positive correlation.
Match   | First Name | Middle Name | Last Name | Address | Phone   | Email   | Score    | Status
Match 1 | +1.1202    | +1.0071     | +2.0433   | +7.3102 | +6.7998 | +7.3328 | 25.6133  | High
Match 2 | -4.7506    | +1.0071     | +2.0433   | +7.3102 | +6.7998 | +7.3328 | 19.7426  | High
Match 3 | -4.7506    | -0.4485     | +2.0433   | +7.3102 | +6.7998 | +7.3328 | 18.2870  | High
Match 4 | -4.7506    | -0.4485     | -5.2233   | +7.3102 | +6.7998 | +7.3328 | 11.0204  | High
Match 5 | -4.7506    | -0.4485     | -5.2233   | -2.2687 | +6.7998 | +7.3328 | 1.4415   | Medium
Match 6 | -4.7506    | -0.4485     | -5.2233   | -2.2687 | -1.1676 | +7.3328 | -6.5259  | Low
Match 7 | -4.7506    | -0.4485     | -5.2233   | -2.2687 | -1.1676 | -2.3590 | -16.2177 | Low
Results Skewed Toward Names
Unlike Address, Phone, and Email, the power of the First Name and Last Name match is one of negative correlation.
Since many different people can share the same first and/or last name, a positive match for those attributes isn’t
very powerful. A negative match on those attributes is powerful because it is much more likely to indicate that the
two sets of entity information being compared do not represent the same person. For example, it is much more
likely that many different people are named John Smith than that one person is named both Robert Johnson and
Martin Bridges.
Match    | First Name | Middle Name | Last Name | Address | Phone   | Email   | Score    | Status
Match 8  | +1.1202    | +1.0071     | +2.0433   | +7.3102 | +6.7998 | +7.3328 | 25.6133  | High
Match 9  | +1.1202    | +1.0071     | +2.0433   | +7.3102 | +6.7998 | -2.3590 | 15.9216  | High
Match 10 | +1.1202    | +1.0071     | +2.0433   | +7.3102 | -1.1676 | -2.3590 | 7.9542   | Medium
Match 11 | +1.1202    | +1.0071     | +2.0433   | -2.2687 | -1.1676 | -2.3590 | -1.6247  | Low
Match 12 | +1.1202    | +1.0071     | -5.2233   | -2.2687 | -1.1676 | -2.3590 | -8.8913  | Low
Match 13 | +1.1202    | -0.4485     | -5.2233   | -2.2687 | -1.1676 | -2.3590 | -10.3469 | Low
Match 14 | -4.7506    | -0.4485     | -5.2233   | -2.2687 | -1.1676 | -2.3590 | -16.2177 | Low
Taking a Closer Look at 50/50 Matches
Match 11 shows both categories at their weakest: negative matches for Address, Phone, and Email; and positive
matches for First Name, Middle Name, and Last Name. With a total score of only -1.6247, the positive and negative
values for the two categories nearly cancel each other out.
Match 4, on the other hand, shows both categories at their most powerful: positive matches for Address, Phone,
and Email; and negative matches for First Name, Middle Name, and Last Name. In this case, because Address,
Phone, and Email are so heavily weighted, the result is still a High status match.
Match    | First Name | Middle Name | Last Name | Address | Phone   | Email   | Score   | Status
Match 4  | -4.7506    | -0.4485     | -5.2233   | +7.3102 | +6.7998 | +7.3328 | 11.0204 | High
Match 11 | +1.1202    | +1.0071     | +2.0433   | -2.2687 | -1.1676 | -2.3590 | -1.6247 | Low
Summary – Weights and Matching
In summary, when determining whether two sets of entity information represent the same person:
• Differing names are more meaningful than matching names
• Matching Address, Phone and/or Email are more meaningful than differing Address, Phone, and/or Email
• Matching Address, Phone, and/or Email are more meaningful than differing names