Adjudicator Agreement and System Rankings for Person Name Search
Mark Arehart, Chris Wolf, Keith Miller
The MITRE Corporation
{marehart, cwolf, keith}@mitre.org
Summary
Matching multicultural name variants is knowledge-intensive
Building a ground-truth dataset requires tedious adjudication
Guidelines are not comprehensive; adjudicators often disagree
Previous evaluations: multiple adjudication plus voting
Results of this study:
  High agreement; multiple adjudication not needed
  Nearly the same payoff for much less effort
2
Dataset
Watchlist, ~71K
  Deceased-persons lists
  Mixed cultures
  1.1K variants for 404 base names
  Avg. 2.8 variants per base record
Queries, 700
  404 base names
  296 randomly selected from the watchlist
  Subset of 100 randomly selected for this study
3
Method
Adjudication pools as in TREC: pooled output from 13 algorithms
Four judges completed the pools (1,712 pairs, excluding exact matches)
Compare system rankings under different versions of the ground truth
Type            | Criteria for true match
1 Consensus     | Tie or majority vote (baseline)
2 Union         | Judged true by anyone
3 Intersection  | Judged true by everyone
4 Single        | Judgments from a single adjudicator (4 versions)
5 Random        | Randomly chosen adjudicator per item (1,000 versions)
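A minimal sketch of how the consensus, union, and intersection truth sets could be compiled from per-pair judgments; the judgments dictionary and the name pairs in it are hypothetical.

```python
# Compiling consensus, union, and intersection ground truth from the
# per-pair judgments of four adjudicators. The data below is hypothetical.

# judgments[pair] -> one boolean per adjudicator (True = judged a true match)
judgments = {
    ("Abdul Rahman", "Abd al-Rahman"): [True, True, True, False],
    ("Yusuf", "Joseph"):               [True, False, False, False],
}

def consensus(votes):
    # Tie or majority vote counts as a true match (the baseline GT).
    return 2 * sum(votes) >= len(votes)

def union(votes):
    # Judged true by any adjudicator.
    return any(votes)

def intersection(votes):
    # Judged true by every adjudicator.
    return all(votes)

gt_consensus    = {pair: consensus(v)    for pair, v in judgments.items()}
gt_union        = {pair: union(v)        for pair, v in judgments.items()}
gt_intersection = {pair: intersection(v) for pair, v in judgments.items()}
```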
4
Adjudicator Agreement Measures
For two adjudicators, the pairwise judgments form a 2x2 table:

             Judge 2: +    Judge 2: -
Judge 1: +       a             b
Judge 1: -       c             d

overlap = a / (a + b + c)
p+ = 2a / (2a + b + c)
p- = 2d / (2d + b + c)
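These measures are easy to compute from the four cell counts; the sketch below also includes Cohen's kappa, on the assumption that this is the kappa reported for the adjudicator pairs on the next slide. The example counts are hypothetical.

```python
# Pairwise agreement measures from a 2x2 judgment table:
# a = both judges say match, b and c = the judges disagree, d = both say non-match.

def agreement_measures(a, b, c, d):
    overlap = a / (a + b + c)          # agreement on positives, ignoring joint negatives
    p_pos   = 2 * a / (2 * a + b + c)  # positive specific agreement (p+)
    p_neg   = 2 * d / (2 * d + b + c)  # negative specific agreement (p-)
    return overlap, p_pos, p_neg

def cohen_kappa(a, b, c, d):
    # Chance-corrected agreement (assumed to be the kappa on the next slide).
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

print(agreement_measures(a=50, b=10, c=5, d=100))  # hypothetical counts
print(cohen_kappa(a=50, b=10, c=5, d=100))
```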
5
Adjudicator Agreement
[Bar chart: kappa, overlap, p+, and p- for each adjudicator pair (AB, BD, AD, BC, AC, CD). Lowest agreement is A~B (kappa 0.57); highest is C~D (kappa 0.78).]
6
So far…
  Test watchlist and query list
  Results from 13 algorithms
  Adjudications by 4 volunteers
  Ways of compiling alternate ground-truth sets
Still need…
  A way to compare system rankings under the alternative ground truths
7
Comparing System Rankings
Two complete rankings of systems A-E:
  Ranking 1: A  B  C  D  E
  Ranking 2: A  C  B  E  D
How similar?
Kendall’s tau
Spearman’s rank correlation
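A quick sketch, assuming SciPy is available, of comparing the two complete rankings above with both measures.

```python
# Rank correlation between the two complete rankings shown above.
from scipy.stats import kendalltau, spearmanr

ranking1 = ["A", "B", "C", "D", "E"]
ranking2 = ["A", "C", "B", "E", "D"]

# Convert each ranking to rank positions over a common system order.
systems = sorted(set(ranking1))
r1 = [ranking1.index(s) for s in systems]
r2 = [ranking2.index(s) for s in systems]

tau, _ = kendalltau(r1, r2)
rho, _ = spearmanr(r1, r2)
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```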
8
Significance Testing
Not all differences are significant (duh)
F1-measure: harmonic mean of precision and recall
  Not a proportion or a mean of independent observations
  Not amenable to traditional significance tests
  Like other IR measures, e.g. MAP
Bootstrap resampling (sketched below):
  Sample with replacement from the data
  Compute the difference for many trials
  Produces a distribution of differences
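A rough sketch of the bootstrap for the F1 difference between two systems; the per-query TP/FP/FN counts and the pooled F1 computation are assumptions made for illustration.

```python
import random

def f1(queries):
    # Pool the counts over the sampled queries, then compute F1.
    tp = sum(q["tp"] for q in queries)
    fp = sum(q["fp"] for q in queries)
    fn = sum(q["fn"] for q in queries)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_diffs(sys_a, sys_b, trials=1000):
    # sys_a and sys_b are parallel lists of per-query count dicts.
    n = len(sys_a)
    diffs = []
    for _ in range(trials):
        idx = [random.randrange(n) for _ in range(n)]  # resample queries with replacement
        diffs.append(f1([sys_a[i] for i in idx]) - f1([sys_b[i] for i in idx]))
    return diffs  # distribution of F1 differences

# Hypothetical per-query counts for two systems on three queries.
sys_a = [{"tp": 3, "fp": 1, "fn": 0}, {"tp": 2, "fp": 2, "fn": 1}, {"tp": 4, "fp": 0, "fn": 2}]
sys_b = [{"tp": 2, "fp": 1, "fn": 1}, {"tp": 2, "fp": 3, "fn": 1}, {"tp": 3, "fp": 1, "fn": 3}]
diffs = bootstrap_diffs(sys_a, sys_b)
print(sum(d <= 0 for d in diffs) / len(diffs))  # share of trials where the difference disappears or flips
```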
9
Incomplete Ranking
Not all differences are significant → a partial ordering
[Diagram: two partial orderings of systems A-E]
How similar?
10
Evaluation Statements
Partial ordering 1: A > B = C > D = E
  A>B  A>C  A>D  A>E  B=C  B>D  B>E  C>D  C>E  D=E

Partial ordering 2: B > A > C > D = E
  A<B  A>C  A>D  A>E  B>C  B>D  B>E  C>D  C>E  D=E
11
Similarity
n systems → n(n-1)/2 evaluation statements
Sensitivity: proportion of relations with a significant difference (see the sketch below)
  Ordering 1: A>B  A>C  A>D  A>E  B=C  B>D  B>E  C>D  C>E  D=E   (Sens = 80%)
  Ordering 2: A<B  A>C  A>D  A>E  B>C  B>D  B>E  C>D  C>E  D=E   (Sens = 90%)
Reversal rate: proportion of reversed relations (here A>B vs. A<B): 10%
Total disagreement: 20%
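A small sketch of these three quantities computed from two sets of evaluation statements; the ">"/"<"/"=" encoding is an assumption, but the printed values match the example above.

```python
# Each ground truth yields a relation (">", "<", or "=") for every system pair.
gt1 = {("A", "B"): ">", ("A", "C"): ">", ("A", "D"): ">", ("A", "E"): ">",
       ("B", "C"): "=", ("B", "D"): ">", ("B", "E"): ">",
       ("C", "D"): ">", ("C", "E"): ">", ("D", "E"): "="}
gt2 = dict(gt1, **{("A", "B"): "<", ("B", "C"): ">"})  # the second ordering above

def sensitivity(relations):
    # Proportion of pairs with a significant difference (i.e. not a tie).
    return sum(r != "=" for r in relations.values()) / len(relations)

def disagreement(r1, r2):
    # Proportion of pairs where the two ground truths give different relations.
    return sum(r1[p] != r2[p] for p in r1) / len(r1)

def reversal_rate(r1, r2):
    # Proportion of pairs where a strict relation is flipped outright.
    flipped = {">": "<", "<": ">"}
    return sum(r2[p] == flipped.get(r1[p]) for p in r1) / len(r1)

print(sensitivity(gt1), sensitivity(gt2))               # 0.8, 0.9
print(disagreement(gt1, gt2), reversal_rate(gt1, gt2))  # 0.2, 0.1
```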
12
Comparisons With Baseline
Disagreement and reversal rate are measured against the consensus baseline.

Truth set     | Sensitivity | Disagreement | Reversal
Consensus     | 0.744       | n/a          | n/a
Union         | 0.782       | 0.064        | 0
Intersection  | 0.538       | 0.423        | 0.038
Judge A       | 0.769       | 0.051        | 0
Judge B       | 0.705       | 0.038        | 0
Judge C       | 0.756       | 0.115        | 0
Judge D       | 0.692       | 0.179        | 0

Intersection sensitivity is notably low.
Among the single judges, B has the highest agreement with the consensus and D the lowest.
No reversals except with the intersection GT (one algorithm).
13
GT Comparisons
[Chart comparing the ground-truth versions: consensus, union, intersection, and judges A-D (y-axis from 0.3 to 0.8).]
14
Comparison With Random
1000 GT versions created by randomly selecting a judge for each item
Consensus sensitivity = 74.4%
Average random sensitivity = 72.9% (significant difference at 0.05)
Average disagreement with consensus = 7.3%
  About 5% disagreement expected by chance (actually more)
  About 2.3% remainder (actually less) attributable to the GT method
No reversals in any of the 1000 sets
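A brief sketch of how the 1000 random-judge ground-truth versions could be generated, reusing the hypothetical judgments structure from the earlier sketch; the comparison itself then proceeds exactly as for the other truth sets.

```python
import random

# judgments[pair] -> one boolean per adjudicator (hypothetical, as before)
judgments = {("Abdul Rahman", "Abd al-Rahman"): [True, True, True, False],
             ("Yusuf", "Joseph"):               [True, False, False, False]}

def random_judge_gt(judgments, rng):
    # Take one randomly chosen adjudicator's verdict for every pair.
    return {p: rng.choice(v) for p, v in judgments.items()}

rng = random.Random(0)
random_versions = [random_judge_gt(judgments, rng) for _ in range(1000)]
# Each version would then be scored like any other truth set: rerun the
# significance tests, derive the evaluation statements, and compare them
# (sensitivity, disagreement, reversals) against the consensus statements.
```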
15
Conclusion
Multiple adjudicators judging everything → expensive
A single adjudicator → variability in sensitivity
Multiple adjudicators, randomly dividing the pool:
  Slightly less sensitivity
  No reversals of results
  Much less labor
  Differences wash out, approximating the consensus
Practically the same result for less effort
16