Using Spatial, Temporal and Evidence-status

J Forensic Sci, January 2014, Vol. 59, No. 1
doi: 10.1111/1556-4029.12269
Available online at: onlinelibrary.wiley.com
PAPER
CRIMINALISTICS
Yan Yang,1 B.Eng.; Avi Koffman,2 B.Sc.; Gil Hocherman,2 M.Sc.; and
Lawrence M. Wein,3 Ph.D.
Using Spatial, Temporal and Evidence-status
Data to Improve Ballistic Imaging Performance*
ABSTRACT: Firearms identification imaging systems help solve crimes by comparing newly acquired images of cartridge casings or bullets
to a database of images obtained from past crime scenes. We formulate an optimization problem that bases its matching decisions not only on
the similarity between pairs of images, but also on the time and spatial location of each new acquisition and each database entry. The objective
is to maximize the detection probability subject to a constraint on the false positive rate. We use data on all cartridge casing matches detected
in Israel during 2006–2008 to estimate most of the model parameters. We estimate matching accuracy from two different studies and predict
that the optimal use of extraneous information would increase the detection probability from 0.931 to 0.987 and from 0.707 to 0.844, respectively. These improvements are achieved by favoring pairs of images that are closer together in space and time.
KEYWORDS: forensic science, ballistic imaging, optimization, statistics, spatial data, temporal data
The Bureau of Alcohol, Tobacco, Firearms and Explosives
(ATF) developed the National Integrated Ballistic Information
Network (NIBIN) to help state and local law enforcement agencies to solve gun crimes (1). NIBIN uses computerized imaging
technology to maintain a database on cartridge casings and bullets that are either recovered from crime scenes (called “evidence”) or test-fired from weapons that are recovered by (or
surrendered to) law enforcement officers (called “nonevidence”).
NIBIN rapidly computes similarity scores between a newly
acquired casing or bullet and the entries in the database, and––
using software developed by a single vendor, Forensic Technology Inc.––generates a list of the (e.g., 10) most promising
matches, which are subsequently analyzed by a forensic firearms
examiner to obtain confirmed hits. In this manner, NIBIN can
potentially identify a cold hit between a nonevidence acquisition
and an earlier crime or discover links between crimes, both of
which generate new leads to assist in crime solving. This system, if used properly (i.e., data entered in a thorough and timely
manner, and confirmed hits integrated with all other investigative
information) can be very useful to local police departments
(Appendix A of reference [2]). However, NIBIN is used very
inconsistently by various US municipalities and rarely employed
for nonlocal (e.g., interstate) searches (1). Some European countries (e.g., [3]) have implemented similar systems.
Ballistic imaging technology––particularly for bullets––does
not perform as well (e.g., as measured by receiver operating
characteristic curves) as some biometric technologies, such as
fingerprints and DNA matching, that are also used for forensic
purposes (2,4). The goal of this study is to examine whether ballistic imaging performance can be improved by combining it
with other spatial, temporal, and categorical data that are collected along with the ballistic image. More specifically, we introduce and optimize a threshold-based system (i.e., potential hits
are defined by similarity scores above a certain threshold rather
than by being ranked in the top 10) that allows the threshold
used to compare a newly acquired casing or bullet with one in
the database to depend on the time interval between the two
events (i.e., acquisition or crime), the spatial distance between
the two events, and whether the new acquisition is evidence or
nonevidence. The rationale behind this approach is that crime
guns and their crimes cluster in space and time, and evidence
acquisitions are more apt to have been involved in a previous
crime than nonevidence acquisitions.
We use spatial, temporal, and categorical data on all Israeli
matches during 2006–2008, along with published performance
data for matching cartridge casings, to calibrate our model and
assess the potential improvement in performance of our approach.
Materials and Methods
1 Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, 94305.
2 Division of Identification & Forensic Science, Israel Police Investigation Department, National Police Headquarters, Jerusalem, 91906, Israel.
3 Graduate School of Business, Stanford University, Stanford, CA, 94305.
*Supported by the Graduate School of Business, Stanford University (Y.Y. and L.M.W.).
Received 9 May 2012; and in revised form 20 Sept. 2012; accepted 13 Oct. 2012.
© 2013 American Academy of Forensic Sciences
Model
Our model focuses on cartridge casings but has been applied
to bullets in (5). The evidence status of a new acquisition, or
arrival, is denoted by the subscript i = 0 for nonevidence recovered by law enforcement officers and i = 1 for evidence
obtained from a crime scene. The probability that an arrival is of
evidence status i is qi for i = 0,1 (definitions of all mathematical symbols can be found in Table 1). Each new arrival is matched against a database consisting of evidence images; because nonevidence guns have been confiscated, nonevidence images are not added to the database for future matching.

TABLE 1––Definitions of all variables and parameter values. The subscript i describes the evidence status of a new acquisition (i = 0 for nonevidence and i = 1 for evidence). Subscript j describes the spatial proximity between a random acquisition and a random database entry.

| Parameters | Definition | Values |
|---|---|---|
| μF, σF | Intragun score (F(x)) parameters | Lognormal: μF = 6.79, σF = 0.59 (9); μF = 6.50, σF = 1.26 (10) |
| μG, σG | Intergun score (G(y)) parameters | Lognormal: μG = 4.52, σG = 0.50 (9); μG = 4.81, σG = 0.48 (10) |
| qi | Arrival evidence-status probability | q0 = 0.504, q1 = 0.496 |
| rj | Spatial proximity probability for arrival-database pair | r0 = 0.101, r1 = 0.899 |
| pij | Spatial proximity probability for matching pair of evidence status i | p00 = 0.832, p10 = 0.875 |
| N | Evidence database size | 11,350 |
| th | Historical threshold | 442.6 |
| tija | Thresholds to be optimized | Table 2, Fig. 8 |
| gi(n) | PMF for true matches | Fig. 4 |
| ḡi(m) | PMF for detected matches | Fig. 4 |
| hm(a) | Age PDF of matches | Lognormal: μa = 4.90, σa = 1.69 |
| h(a) | Age PDF of database records | Fig. 7 |

Our model incorporates spatial and temporal information
about pairs of images, which consist of an arrival and a database
entry. As explained later, the nature of the Israeli spatial data
leads us to describe the spatial proximity between a random arrival and a random database entry by a categorical variable
denoted by the subscript j = 0,…, J. Let rj be the probability
that a random arrival and a random database entry have spatial
proximity status j for j = 0,…, J.
The temporal proximity between an arrival and a random
entry in a database is described by a continuous random variable
with the subscript a for age, which is the time of acquisition of
the arrival minus the time of acquisition of a random database
entry (i.e., newly acquired evidence is added to the database
immediately after it undergoes matching). Let h(a) be the probability density function (PDF) of the random age, which is
assumed to be independent of time (i.e., as the database grows).
In practice, ballistic imaging software involves an initial filtering step (e.g., by caliber, firing pin shape, and the number, twist
rate, and width of the rifling) followed by an investigation of
multiple aspects of the cartridge casing or bullet (2). For example, the matching of casings incorporates similarity scores for the
breech face, firing pin, and ejector mark, and two-dimensional
bullet matching examines all possible rotations of bullets (2).
Although the mathematical modeling of multimodal matching is
tractable (see (6,7) for examples in biometrics), these detailed
data are owned by the software vendor, which has published only
aggregate performance curves. As it is not possible to identify
joint PDFs for similarity scores of the multiple aspects (e.g.,
breech face, firing pin, ejector mark) from aggregate performance
curves, we assume in our model that an aggregate similarity score
is generated as a result of comparing each arrival to each database
entry, and let F(x) be the intragun cumulative distribution function (CDF) of the similarity score between two images emanating
from the same gun, and G(y) be the intergun CDF of similarity
scores between two images from different guns. Consequently,
our results cannot be operationalized––that is, it is not possible to
compute an aggregate similarity score (e.g., as appears in the
vertical axis of Fig. 8) from a trio of similarity scores for breech
face, firing pin, and ejector mark––without the joint similarity
score PDFs for the multiple aspects.
Before introducing our decision variables, we describe how the
probability of a true match between an arrival and a database entry
is affected by the evidence status and spatial and temporal proximity. One complicating factor in ballistic imaging that does not typically arise in biometric matching is that an arrival can match
multiple entries in the evidence database. Let gi(n) be the probability that an arrival of evidence status i has n true matches (during our statistical analysis, we differentiate between true matches
and detected matches) in the database, for i = 0,1 and n = 0,1,2,
…. Let a true match with arrival evidence status i (i.e., the arrival
in this matching pair has evidence status i) have probability pij of
being in spatial proximity category j, for i = 0,1 and j = 0,…, J.
Also, let hm(a) be the PDF of the age of true matches between a
random arrival and a random database entry.
As noted earlier, the matching software for ballistic images is
rank based with a candidate list of a fixed size, where the top
(e.g., 10) matches generated by a new arrival are forwarded to a
forensic firearms examiner for final verification. In our model,
we use a threshold-based system, where similarity scores above
the threshold are forwarded for human verification; this system
generates a candidate list of variable size. In (5), we show that
the performance is very similar for these two systems, and later,
we discuss the implications of our threshold-based assumption.
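The difference between the two candidate-list rules can be illustrated with a minimal sketch. The function names and the example scores below are hypothetical and are not taken from the vendor software; the sketch only shows that a rank-based list has a fixed size while a threshold-based list has a variable size.

```python
def rank_based(scores, list_size=10):
    """Return indices of the `list_size` highest similarity scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order[:list_size]


def threshold_based(scores, threshold):
    """Return indices of all similarity scores strictly above the threshold."""
    return [k for k, s in enumerate(scores) if s > threshold]


# Hypothetical similarity scores of one arrival against five database entries.
scores = [120.0, 980.5, 455.2, 610.0, 88.1]
top_two = rank_based(scores, list_size=2)          # fixed-size candidate list
above = threshold_based(scores, threshold=443.0)   # variable-size candidate list
```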
As the goal of this study is to exploit spatial, temporal, and
evidence-status data to improve performance, we let the similarity score threshold depend on the evidence status of the arrival
and the spatial proximity category and age of the pair being
matched. We denote our decision variables by tija. Because i and
j are categorical and the age a is a continuous quantity, we
restrict tija’s dependence on a to take on a specific functional
form. After testing linear, power, and exponential functions, we
settled on the exponential function,
$$t_{ija} = a_{ij}\, e^{b_{ij} a} + c_{ij}. \tag{1}$$
Hence, our optimization is over the 6(J + 1) variables, $\{a_{ij}, b_{ij}, c_{ij};\ i = 0, 1;\ j = 0, \ldots, J\}$.
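The exponential threshold in Eq (1) can be sketched as a small function. The parameter values below are purely illustrative assumptions (they are not the fitted values), chosen with negative a_ij and b_ij so that the threshold rises with age and levels off at c_ij.

```python
from math import exp


def threshold(age_days, a_ij, b_ij, c_ij):
    """Eq (1): t_ija = a_ij * exp(b_ij * age) + c_ij."""
    return a_ij * exp(b_ij * age_days) + c_ij


# HYPOTHETICAL parameters, keyed by (evidence status i, spatial category j).
PARAMS = {
    (0, 0): (-300.0, -0.001, 450.0),
    (0, 1): (-400.0, -0.001, 700.0),
    (1, 0): (-220.0, -0.001, 360.0),
    (1, 1): (-340.0, -0.001, 590.0),
}


def forward_to_examiner(score, i, j, age_days):
    """True if the similarity score exceeds the threshold for this pair."""
    return score > threshold(age_days, *PARAMS[(i, j)])
```

With these assumed values, a score of 400 would be forwarded for a recent intrastation evidence pair but not for an old interstation nonevidence pair.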
We are now in a position to formulate our optimization problem. The objective of our optimization problem is to maximize
the probability that if at least one true match for an arrival exists
in the database, then we detect at least one true match; we call
this quantity the detection probability, and refer to one minus
this quantity as the false negative rate. Note that this probability
is much more easily calculated than the expected number of true
matches detected, which may be a more natural performance
measure. However, because each database entry has already been
through the matching process, it is often the case that if an arrival has more than one match, it is already known that these
matching entries in the database are also linked to each other.
To derive the objective function, we first observe that the probability of at least one true match in the database for an arrival is $\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n)$. Also, a true match of age a involving an arrival of evidence status i and a pair of spatial proximity category j will go undetected if its similarity score (which has intragun CDF F(x)) is less than the threshold tija, which has probability F(tija). Integrating over the age PDF of matches, a true match involving an arrival of evidence status i and a pair of spatial proximity category j goes undetected with probability $\int_0^\infty F(t_{ija})\, h_m(a)\, da$. It follows that our objective function is
$$\max_{a_{ij},\, b_{ij},\, c_{ij}} \; \frac{\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n) \left[ 1 - \left( \sum_{j=0}^{J} p_{ij} \int_0^\infty F(t_{ija})\, h_m(a)\, da \right)^{n} \right]}{\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n)}. \tag{2}$$
As with most problems of this type, our goal is to maximize the detection probability subject to some type of constraint on the false positives. In our model, a false positive occurs whenever a candidate that is not a true match is forwarded to a forensic examiner. Because the rank-based approach forwards 10 candidates per arrival to a forensic examiner (although Israel forwards the top 30 candidates, 10 each from firing pin, breech face, and ejector mark, we consider 10 in total, because there may be significant overlap in the three candidate lists), a natural constraint for our threshold-based approach would be to force the expected size of the candidate list (i.e., the expected number of similarity scores that exceed the threshold) per arrival to be no larger than 10. Because this quantity is very difficult to compute, we take an alternative approach and require the expected number of false positives per arrival to be no more than the expected number of false positives per arrival generated by a constant threshold system (i.e., the threshold does not vary with i, j, or a) when the expected candidate list size is 10. In (5), we show that this false positive inequality constraint behaves nearly the same as if we used an inequality constraint on the mean candidate list size. To derive our false positive inequality constraint, we let th, which is estimated from data in our statistical analysis, be the threshold used in a constant threshold system that generates an expected candidate list size of 10. Because nearly all database entries are not matches to a given arrival, we assume that there are N nonmatches in the database for every arrival, each of which ends up on the candidate list with probability 1 − G(th) under the constant threshold system. Hence, the right side of the constraint, which is the expected number of false positives per arrival under the constant threshold system, is N[1 − G(th)]. After summing over i and j and integrating over age a, we find that the left side of the inequality constraint, which is the expected number of false positives per arrival with threshold tija, is $N \sum_{i=0}^{1} \sum_{j=0}^{J} q_i r_j \int_0^\infty [1 - G(t_{ija})]\, h(a)\, da$. Canceling the database size, N, from both sides of the constraint, we obtain our false positive constraint,
$$\sum_{i=0}^{1} \sum_{j=0}^{J} q_i r_j \int_0^\infty [1 - G(t_{ija})]\, h(a)\, da \;\le\; 1 - G(t_h), \tag{3}$$
and our optimization problem is given by Eqs (1–3).
We solve Eqs (1–3) using a sequential quadratic programming algorithm (via the fmincon function in MATLAB [8]). Because the optimization problem does not possess the second-order properties required to guarantee that the algorithm converges to a global optimum, we compared local optima resulting from various starting points in the large (12- or 18-dimensional) decision variable space to increase the likelihood that we are achieving a near-optimal solution.
Statistical Analysis Overview
For our application to Israeli cartridge casings, we have nine quantities to estimate, which naturally divide into four groups: (i) the similarity score CDFs F(x) and G(y), (ii) the probabilities qi, rj, and pij, (iii) the probability mass function (PMF) gi(n) and the historical threshold th, and (iv) the age PDFs h(a) and hm(a). All parameter values appear in Table 1, and the derivations of these values, along with graphs of the five probability distributions, are given below.
Estimating the Similarity Score PDFs
We assume that the intragun and intergun similarity score CDFs, F(x) and G(y), are lognormal and estimate the values of the four parameters, which are denoted by μF, σF, μG, and σG. There is not a definitive estimate in the literature on ballistic imaging matching performance for cartridge casings because the results depend upon a variety of factors, including the types of firearms and ammunition. Consequently, we derive two sets of parameter values from two different studies. We first estimate these parameter values using the lower left performance curve in fig. 12 of reference 9, which already incorporates an initial filtering step (restricting to 9 mm Luger cartridge casings) and multiple measurements (the curve is generated using similarity scores for breech face, firing pin, and ejector mark). This performance curve plots the probability that a true match is ranked among the top 10 scores when the database contains one true match and N nonmatches, as N varies from 0 to $10^6$. If the intragun similarity score PDF is f(x), then this probability is
$$\sum_{M=0}^{9} \frac{N!}{M!\,(N-M)!} \int_0^\infty \bigl(1 - G(t)\bigr)^{M}\, G(t)^{N-M}\, f(t)\, dt. \tag{4}$$
Using seven points along the performance curve in fig. 12 of reference 9, we derive the least-squares estimates μF = 6.7912, σF = 0.5927, μG = 4.5207, σG = 0.5016. Fig. 1 compares the seven data points to the performance curve predicted (via Eq (4)) by the lognormal distributions, and Fig. 2 shows the resulting PDFs.
FIG. 1––Actual points (x) on the performance curve for casings from data in (9) versus the performance curve generated by the best-fit lognormal distribution.
FIG. 2––The lognormal similarity score PDFs, intragun f(x) and intergun g(y), for casings.
We also estimate the parameter values from (10), which uses only breech face and firing pin information (i.e., no ejector mark information), has a database of 600 casings, and considers 32 arrivals that have a mate with the same ammunition type (Remington) in the database. The probability that an arrival's mate ranks in position k is
$$P(k) = \frac{599!}{(k-1)!\,(600-k)!} \int_0^\infty \bigl(1 - G(t)\bigr)^{k-1}\, G(t)^{N-k}\, f(t)\, dt,$$
where N = 600 is the size of the database in (10). From Fig. 1 of reference (10), we use P(1) = 18/32, P(2) = 2/32, P(3) = P(4) = 1/32, and $\sum_{k=30}^{600} P(k) = 8/32$. The least-squares fit to these probabilities is μF = 6.50, σF = 1.26, μG = 4.81, σG = 0.48.
Estimating the Probabilities qi, rj, and pij
The evidence database from Israel contains 14,979 entries covering the entire country during 1980–2000, and its average size during 2006–2008 was 11,350 entries. The arrivals data consist of all arrivals between January 1, 2006 and December 31, 2008. There were 7138 arrivals during this time period, and 697 of these arrivals matched at least one entry in the database. An arrival can match multiple entries in the database, and there were a total of 1364 matching pairs (i.e., matches between an arrival and a database entry). Of the 7138 arrivals, 3598 were nonevidence, and 3540 were evidence, yielding q0 = 0.504, q1 = 0.496.
If we had data on the precise location of each arrival and each database entry, we could measure the Euclidean distance of each match and allow the threshold to be a specified functional form of the Euclidean distance, as in Eq (1). However, we only have data on which of the 89 Israeli police precincts collected each arrival and each database entry. Because many pairs of precincts generated no matches during 2006–2008, we use only two spatial categories (i.e., J = 1): intralocation and interlocation. To fully exploit the spatial information, it is desirable to merge two locations if it results in a more favorable ratio of intralocation matches to interlocation matches. Each of the 89 Israeli police precincts has an 89-dimensional vector stating the number of matches during 2006–2008 that it has with each precinct. In our analysis, two locations are good candidates for merging if the dot product of their vectors is large; we refer to this dot product as the two locations' similarity and solve the graph partitioning problem that maximizes the average similarity within each group. We solve this problem using the k-means algorithm (11) with k = 20, which merges the 89 precincts into 20 groups so as to maximize the average similarity within each group. The k-means algorithm is randomized, and solutions can vary from run to run. We solve the problem 1000 times and group precincts into stations if they are classified in the same group in >90% of the solutions. We leave all other precincts isolated to prevent overfitting. This procedure results in 18 merged groups containing 55 precincts and 34 isolated precincts, for a total of 52 merged locations that we refer to as stations. The resulting 52 × 52 matrix of matches appears in Fig. 3. This merging process increases the intralocation matching proportion from 0.551 to 0.832 for nonevidence and from 0.595 to 0.875 for evidence.
If we let ak and dk denote the number of arrivals from station k and the number of database records from station k, then $r_0 = \sum_{k=1}^{52} a_k d_k \big/ \bigl[ \bigl(\sum_{l=1}^{52} a_l\bigr) \bigl(\sum_{l=1}^{52} d_l\bigr) \bigr] = 0.101$ and r1 = 0.899, where the subscript j = 0 corresponds to intrastation, and j = 1 corresponds to interstation. Of the 697 arrivals with matches, 107 were nonevidence arrivals, and 590 were evidence arrivals. Of the 107 nonevidence arrivals generating matches, 89 were intrastation, giving p00 = 89/107 = 0.832 and p01 = 0.168. Similarly, we have p10 = 516/590 = 0.875 and p11 = 0.125.
FIG. 3––All nonzero entries in the matching matrix for the 52 merged stations. Each entry is the number of matches during 2006–2008 between each
pair of merged stations. Stations A–R are the merged stations. Shaded
squares are the intrastation matchings.
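The probability estimates in this section are simple ratios of counts, which the following sketch reproduces. The counts for qi and pij are those quoted in the text; the per-station arrival and record counts used for r0 are hypothetical toy values, because the 52-station counts are not listed here.

```python
def spatial_proximity_probs(arrivals, records):
    """r0 = sum_k a_k d_k / (sum_l a_l * sum_l d_l), and r1 = 1 - r0."""
    same_station = sum(a * d for a, d in zip(arrivals, records))
    r0 = same_station / (sum(arrivals) * sum(records))
    return r0, 1.0 - r0


# Evidence-status probabilities from the arrival counts in the text.
q0 = 3598 / 7138   # nonevidence arrivals
q1 = 3540 / 7138   # evidence arrivals

# Intrastation proportions among arrivals that generated matches.
p00 = 89 / 107     # nonevidence
p10 = 516 / 590    # evidence

# HYPOTHETICAL counts for three stations, only to exercise the r0 formula.
toy_arrivals = [10, 30, 60]
toy_records = [200, 300, 500]
r0, r1 = spatial_proximity_probs(toy_arrivals, toy_records)
```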
Estimating the Historical Threshold th and the PMF gi(n)
From the raw data pertaining to the matches and arrivals, we can construct the PMF ḡi(m), which is the probability of detecting m matches from an arrival that has evidence status i; this PMF is not to be confused with gi(n), which is the PMF for true (i.e., detected plus undetected) matches. The observed PMF ḡi(m) (Fig. 4) allows us to estimate the historical threshold th for a constant threshold policy that generates an average candidate list size of 10, by assuming that the mean number of true positives is much less than the database size (i.e., $\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\, g_i(n) \ll N$), yielding
$$\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\, g_i(n) + N\,[1 - G(t_h)] = 10. \tag{5}$$
Setting N = 11,350, which is the average value of the database size during 2006–2008, in Eq (5) gives th = 442.6.
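As a rough check of Eq (5), the threshold can be recovered by inverting the lognormal intergun CDF. The sketch below is not the authors' computation: it assumes that the mean number of true matches per arrival can be approximated by the observed 1364 matching pairs divided by 7138 arrivals, and it uses the parameter values fitted from (9). It also evaluates F(th), the probability that a single true match falls below the threshold.

```python
from math import exp, log
from statistics import NormalDist

MU_G, SIGMA_G = 4.5207, 0.5016   # intergun lognormal parameters from (9)
MU_F, SIGMA_F = 6.7912, 0.5927   # intragun lognormal parameters from (9)
N = 11_350                       # average database size, 2006-2008
TARGET_LIST_SIZE = 10            # expected candidate list size

# ASSUMPTION: mean true matches per arrival is approximated by the
# observed matching pairs per arrival (1364 pairs over 7138 arrivals).
mean_matches = 1364 / 7138


def historical_threshold(mean_matches, n=N, target=TARGET_LIST_SIZE):
    """Solve mean_matches + n * [1 - G(t_h)] = target for t_h (Eq 5)."""
    tail = (target - mean_matches) / n        # required value of 1 - G(t_h)
    z = NormalDist().inv_cdf(1.0 - tail)      # standard normal quantile
    return exp(MU_G + SIGMA_G * z)            # corresponding lognormal quantile


t_h = historical_threshold(mean_matches)
# Probability that one true match scores below t_h under the intragun CDF.
p_miss = NormalDist(MU_F, SIGMA_F).cdf(log(t_h))
```

Under these assumptions the result lands within a few units of the reported th = 442.6, and the single-match miss probability is roughly 0.12.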
The most challenging part of our estimation procedure is to estimate gi(n), which is the PMF of true matches, from three quantities: ḡi(m), which is the observed PMF of detected matches, the historical threshold th under a constant threshold system, and the intragun similarity score CDF F(x). Let Pmn be the probability that m matches are detected from an arrival in the Israeli database, given that n matches to this arrival exist and the arrival has evidence status i (the argument i in Pmn is suppressed for ease of presentation). It follows that $\bar{g}_i(m) = \sum_{n=m}^{\infty} P_{mn}\, g_i(n)$. However, for practical purposes, we truncate this system of equations at m = n = 14 (i.e., we set gi(n) = 0 for n ≥ 15) because ḡi(m) is only nonzero for m ≤ 14 in our data set. This truncated system of equations can be expressed as
$$\begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1,14} \\ 0 & P_{22} & \cdots & P_{2,14} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_{14,14} \end{pmatrix} \begin{pmatrix} g_i(1) \\ g_i(2) \\ \vdots \\ g_i(14) \end{pmatrix} = \begin{pmatrix} \bar{g}_i(1) \\ \bar{g}_i(2) \\ \vdots \\ \bar{g}_i(14) \end{pmatrix}. \tag{6}$$
Because ḡi(m) is known and the system in Eq (6) is invertible, our estimation problem for gi(n) reduces to determining the probabilities Pmn, which are a function of m, n, and the known false negative probability F(th) under the constant threshold system, which is denoted by p.
Conditioned on n matches existing in the database, the probability Pmn that a new arrival detects m of them depends on two things. First, it depends on the current knowledge about how these n entries are related, which can range from not realizing that any of them are matched to each other (i.e., there are n singletons) to realizing that they are all matched to each other (i.e., they are in a single group of n matches). Second, it depends on which matching groups, if any, the new arrival is detected to belong to. Hence, to derive Pmn, we need to construct a detailed dynamic model that tracks the evolution of the n true matches as they sequentially arrive to the system and undergo a matching process (with false negative probability p = F(th)) with the prior arrivals.
This model is a hidden Markov model (HMM) because we cannot observe the state (or transition probabilities among states) (12). More generally, the state of the HMM is defined by the current matched groupings of the true matches that have already arrived, where the groupings are given in the ascending order of the size of the groups; for example, state 1,1,2 means that there are a total of four true matches currently in the database, two of them are connected (i.e., have been correctly detected to be a match) and the remaining two are isolated (i.e., have not been correctly matched with any of the other three true matches). We use the notation P(A|B) to represent the probability that, conditioned on the current matchings being in state B, after a new arrival undergoes the matching process with the prior arrivals, the new state is A (where, by construction, the sum of the numbers in A is always one greater than the sum of the numbers in B).
For illustrative purposes, we show how to use the HMM to compute Pm3 for m = 0,1,2,3, and then we provide a broad description of a general algorithm for any value of n. The initial state of the HMM is 1, which refers to the first of the n true matches having already entered the database. For this example where n = 3, we need to track the HMM through n more arrivals (i.e., until after the fourth arrival) because the fourth arrival has n = 3 true matches in the database. These dynamics are described by the following transitions.
• Transitions caused by the second arrival: P(1,1|1) = p, P(2|1) = 1 − p.
• Transitions caused by the third arrival: P(1,1,1|1,1) = p², P(1,2|1,1) = 2p(1 − p), P(3|1,1) = (1 − p)², P(1,2|2) = p², P(3|2) = 1 − p².
• Transitions caused by the fourth arrival: P(1,1,1,1|1,1,1) = p³, P(1,1,2|1,1,1) = 3p²(1 − p), P(1,3|1,1,1) = 3(1 − p)²p, P(4|1,1,1) = (1 − p)³, P(1,1,2|1,2) = p³, P(1,3|1,2) = p(1 − p²), P(2,2|1,2) = p²(1 − p), P(4|1,2) = (1 − p)(1 − p²), P(1,3|3) = p³, P(4|3) = 1 − p³.
Hence, the hidden states after the second arrival are (1,1) and (2), the hidden states after the third arrival are (1,1,1), (1,2), and (3), and the hidden states after the fourth arrival are (1,1,1,1), (1,1,2), (1,3), (2,2), and (4). The HMM transition probabilities derived above can be written as the following stochastic matrices, denoted by Mk,
$$M_1 = \begin{pmatrix} p & 1-p \end{pmatrix}, \qquad M_2 = \begin{pmatrix} p^2 & 2p(1-p) & (1-p)^2 \\ 0 & p^2 & 1-p^2 \end{pmatrix},$$
$$M_3 = \begin{pmatrix} p^3 & 3p^2(1-p) & 3(1-p)^2 p & 0 & (1-p)^3 \\ 0 & p^3 & p(1-p^2) & p^2(1-p) & (1-p^2)(1-p) \\ 0 & 0 & p^3 & 0 & 1-p^3 \end{pmatrix}.$$
Note that the product M1M2 gives the probabilities of arriving at the various states that the fourth true match will see upon arrival: P(1,1,1) = p³, P(1,2) = 3p²(1 − p), P(3) = (1 − p)²(1 + 2p).
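The product M1M2 can be checked numerically with a short sketch. The value p = 0.12 is a hypothetical false negative probability chosen only for illustration, and the helper names are not from the paper; the matrices are the transition matrices for the n = 3 example.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


def transition_matrices(p):
    """Return M1, M2, M3 for false negative probability p (n = 3 example)."""
    q = 1.0 - p
    m1 = [[p, q]]
    m2 = [[p**2, 2 * p * q, q**2],
          [0.0, p**2, 1 - p**2]]
    m3 = [[p**3, 3 * p**2 * q, 3 * q**2 * p, 0.0, q**3],
          [0.0, p**3, p * (1 - p**2), p**2 * q, (1 - p**2) * q],
          [0.0, 0.0, p**3, 0.0, 1 - p**3]]
    return m1, m2, m3


p = 0.12  # HYPOTHETICAL false negative probability, for illustration only
m1, m2, m3 = transition_matrices(p)

# Distribution over the states (1,1,1), (1,2), (3) seen by the fourth arrival.
state_dist = mat_mul(m1, m2)[0]
closed_form = [p**3, 3 * p**2 * (1 - p), (1 - p)**2 * (1 + 2 * p)]
```

Each row of M1, M2, and M3 sums to one, which is a quick way to catch a transcription error in any entry.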
Finally, we group all the transitions from the fourth arrival
according to how many matches are actually found. For example, the transition from (1,1,1) to (1,3) means that two matches
are found. Using the law of total probability (i.e., conditioning
on all possible states of the current matching and then summing the joint probabilities; see pg 6 of [13]) yields our final
result:
108
JOURNAL OF FORENSIC SCIENCES
FIG. 4––The PMFs for the evidence true matches (- - -), evidence detected
matches (…), nonevidence true matches (-.-), and nonevidence detected
matches (—).
…,137. To estimate the PDF hm (a), we solve a maximum likelihood estimation problem with uncertain data. Let the known
ages be denoted by the 1227-dimensional vector X, and the age of
records with uncertain age be given by the 137-dimensional vector Z. If we denote the lognormal parameters by (la, ra), then the
Q
Q137 R tj þ365
likelihood function is 1227
f ðZj ÞdZj . Choosing
i¼1 f ðXi Þ
j¼1 tj
(la, ra) to maximize the log-likelihood function, which is
P137
P1227
i¼1 log½f ðXi Þþ
j¼1 log½Fðtj þ365ÞFðtj Þ, yields la = 4.904
and ra = 1.687. The frequency distribution of these 1364 ages
and the resulting lognormal are plotted in Fig. 5.
We only know the year in which each database entry was
acquired, and to estimate the age PDF h(a) for all database
entries, we assume that the age of an entry is December 31,
2010 minus the acquisition date of the entry. However, we only
know the year in which each entry in the Israeli database was
acquired. We estimate h(a) by fitting a piecewise cubic hermite
interpolating polynomial (14) to the yearly aggregates to estimate
an increasing smooth CDF (Fig. 6) and then numerically differentiating it to yield a PDF (Fig. 7).
Results
P03 ¼ Pð1; 1; 1; 1j1; 1; 1ÞPð1; 1; 1Þ þ Pð1; 1; 2j1; 2ÞPð1; 2Þ
þ Pð1; 3j3ÞPð3Þ ¼ p3 ;
P13 ¼ Pð1; 1; 2j1; 1; 1ÞPð1; 1; 1Þ
þ Pð2; 2j1; 2ÞPð1; 2Þ ¼ 3p4 ð1 pÞ;
P23 ¼ Pð1; 3j1; 1; 1ÞPð1; 1; 1Þ
P33
þ Pð1; 3j1; 2ÞPð1; 2Þ ¼ 3p3 ð1 pÞ2 ð1 þ 2pÞ;
¼ Pð4j1; 1; 1ÞPð1; 1; 1Þ þ Pð4j1; 2ÞPð1; 2Þ
þ Pð4j3ÞPð3Þ ¼ ð1 pÞ3 ð6p3 þ 6p2 þ 3p þ 1Þ
For a general value of n, the calculation of Pmn for m = 0,1,…,
n can be carried out by the following algorithm.
Construct the HMM through the n + 1st arrival, and derive
the transition
Q matrices Mk for k = 1, …, n.
Compute n1
k¼1 Mk , which gives the probability distribution
for the various hidden states that the n + 1st true match sees
upon arrival.
In the transition matrix Mn, find the number of matches m
detected for each possible transition A ? B. Using the law of
total probability, multiply the transition probability from A to
B
in Mn by the probability of observing state A, which is
Qn1
k¼1 Mk , and add these products over all possible transitions
to get Pmn.
We first report results using the matching performance estimated from (9). Under the constant threshold policy (i.e., which
employs the threshold th) for cartridge casings in Israel, the
probability that at least one true match for an arrival is detected,
given that at least one true match exists, is 0.931. This detection
probability increases to 0.987 under the optimal policy derived
from Eqs (1–3), which represents a 81.4% reduction (from 0.069
to 0.013) in the false negative rate. The optimal thresholds from
Eq (1) are given in the last row of Table 2 and are higher for
interstation matches, nonevidence arrivals, and older ages
(Fig. 8). By optimizing each of the three types of information in
isolation and in pairs (Table 2), we find that optimizing age
offers slightly more improvement than optimizing spatial information, while optimizing evidence status provides very little
improvement (e.g., optimizing only evidence status increases the
detection probability by just 0.005 over the constant threshold
policy). In addition, the impact of optimizing age and spatial
information is subadditive.
Running this algorithm for n = 1, …, 14 results in the gi (n)
PMF plotted in Fig. 4.
Estimating the Age Distributions
We assume that the age PDF of true matches is the same as
the observed age PDF of detected matches in the Israeli database
and estimate hm(a) by a lognormal. Of the 1364 matching pairs
during 2006–2008, 1227 (or 90.0%) of them have database
entries that occurred during 2006–2008, in which case we know
the exact age in days. For the remaining 137 (or 10.0%) matching pairs, we only know the year when the database entry was
acquired, and so the age is in the interval [tj,tj + 365] for j = 1,
FIG. 5––Frequency distribution of ages of the 1364 matching pairs in the
Israeli database, containing 1227 exact ages and 137 ages that are randomly sampled from the correct year, along with the best-fit lognormal
PDF, hm(a).
FIG. 8––Optimal thresholds for Israeli cartridge casings.
FIG. 6––Age CDF fit to raw Israeli data.
Using the matching performance parameter values derived
from (10), we find that the detection probability for the constant
threshold policy is 0.707, and the detection probability for the
optimal policy is 0.844. The absolute improvement in detection
probability is larger using (10) rather than (9) (0.844–
0.707 = 0.137 vs. 0.987–0.931 = 0.056), although the percentage reduction in the false negative rate is considerably smaller
(46.8% vs. 81.4%).
Discussion
FIG. 7––Age PDF h (a) for all database records.
TABLE 2––Detection probability for all combinations of optimizing evidence
status, spatial category and age information. The subscripts of tija are
suppressed if they are not being optimized. The matching performance is
based on data in (9).
Optimized Information     Thresholds                                  Detection Probability
None                      th = 443                                    0.931
Evidence                  t0 = 486, t1 = 416                          0.936
Spatial                   t0 = 321, t1 = 528                          0.965
Age                       ta = 391.1e^{0.00152a} + 537.4              0.972
Evidence, spatial         t00 = 354, t01 = 562,                       0.967
                          t10 = 301, t11 = 502
Evidence, age             t0a = 302.7e^{0.00149a} + 507.9,            0.974
                          t1a = 345.9e^{0.00153a} + 587.3
Spatial, age              t0a = 283.3e^{0.00100a} + 435.7,            0.986
                          t1a = 420.4e^{0.00128a} + 674.0
Evidence, spatial, age    t00a = 292.6e^{0.00121a} + 459.0,           0.987
                          t01a = 408.5e^{0.00136a} + 691.6,
                          t10a = 223.1e^{0.00142a} + 364.9,
                          t11a = 339.2e^{0.00158a} + 587.6
Our main result is that exploiting information––particularly
spatiotemporal information––that is extraneous to the ballistic
imaging process can improve the performance of ballistic imaging systems. The magnitude of improvement is difficult to estimate due to the lack of a definitive estimate for matching
performance in the literature. While the increase in detection
probability from 0.931 to 0.987 is modest due to the high base-level detection probability (derived from matching performance
data in [9]), this improvement is impressive when viewed as an
81.4% reduction in the false negative rate. The absolute
improvement in detection probability from 0.707 to 0.844 is
somewhat larger when using the matching performance data in
(10), although the reduction in the false negative rate is only
46.8%. These results suggest that crime guns and their crimes
do indeed cluster in space and time, and this information can be
exploited to solve more crimes. Although Israel performs nationwide searches, we can also analyze the counterfactual scenario
in which Israel only performs intraprecinct searches. Even if all
intraprecinct matches are detected, the detection probability
(using matching performance data in [9]) is only 0.729, which––
when compared to 0.931–– reveals the benefit of performing
nationwide searches in Israel.
Other Potential Applications
As mentioned earlier, our model can also be directly applied
to bullets rather than cartridge casings. More generally, because
the benefits of our approach stem almost entirely from exploiting
the spatial and temporal clustering of crimes committed by crime
guns (Table 2), our approach may be most beneficial for countries that––like Israel––cover a small geographic area and do not
suffer from long delays in data entry. Some European countries
may fit this profile: many are comparable in size to Israel, and,
for example, the UK’s ballistic imaging system does not appear
to have any backlogs (3). In contrast, the US may not be an
ideal setting for our approach: it is much larger geographically
than Israel, and new images are not added to the NIBIN database in a timely manner (1). Indeed, the temporal clustering suggests that NIBIN performance might improve if new images
were entered into the NIBIN database in a last-in first-out
(LIFO) manner rather than in first-in first-out (FIFO) order.
The US database associated with NIBIN is divided into 47
partitions that are grouped into 12 regions, and so there are three
possible geographical approaches to matching: each arrival
undergoes only intrapartition searches, only intraregion searches,
or national searches. Under a constant threshold policy, there is
a tradeoff inherent in these three approaches: as the system
expands from performing only intrapartition searches to performing intraregion searches and on to performing national searches,
it gains improved coverage but experiences deteriorating matching performance (due to the increased database size). In theory,
an optimized approach to national searches can largely bypass
this tradeoff: by setting higher threshold levels for more distant
(e.g., inter-regional) searches, it can achieve full coverage and
perhaps suffer only a small degradation in matching accuracy.
While an optimized approach to national searches would––by
construction––perform at least as well as the constant threshold
intrapartition policy that is in widespread use in the US (i.e.,
using infinite thresholds for interpartition matches is feasible in
the national approach and would reduce to an approach that
employs only intrapartition searches), the key issue is to assess
the magnitude of this improvement. Calculations in (5) compare
the performance of these three geographical approaches in the
US for both cartridge casings and bullets. However, these
numerical results are highly speculative because of the lack of
publicly available spatial data (e.g., we do not know what fraction of matches are intrapartition vs. intra-regional vs. interregional) and are not reported here. As noted in (2), progress in
this area is problematic because the sole vendor, Forensic Technology Inc., has much of the necessary data, and hence, the
National Institute of Standards and Technology, which works on
certain technical aspects of ballistic imaging (15), may be in the
best position to perform or enable future research.
With additional US data, the model in Eqs (1–3) could also
be applied in several other ways. The spatial categories are quite
general and could be used to exploit spatial patterns in the illegal
gun market in the US (16). The proposed Reference Ballistic
Image Database (RBID), which would maintain a national database from firings of newly manufactured and imported guns,
could be accommodated in our model by introducing a third type
of evidence status (i.e., i = 2 would correspond to new guns).
Although a national RBID was deemed to be impractical due to
its large database size as well as other factors (e.g., gun wear
over time, differences caused by ammunition, filling the database
with guns that are extremely unlikely to be involved in a crime)
(2), this issue could be revisited using our approach, which
would allow very high thresholds for new guns. Moreover, if
RBID incorporated point-of-sale data, then our approach could
use lower thresholds for the minuscule fraction of retailers that
sell the majority of crime guns in the US (17). Note that as ballistic imaging is a search process that is followed by human verification, the retailers who sell many crime guns would be
unaffected by the increased false positive rate associated with
their lower thresholds.
Limitations of Our Analysis
Due to the large amount of Israeli data, we have precise estimates of all the parameters in Table 1 (e.g., the standard errors
for q0, r0, and p00 are 0.0059, 0.0012, and 0.0362, respectively),
with the exception of the parameters related to matching accuracy; that is, the uncertainty in our results is driven almost
entirely by our estimates of µF, σF, µG, and σG. The biggest
shortcoming of our analysis is that the actual problem deals with
similarity scores that are based on multimodal (breech face, firing pin, ejector mark) measurements that are possibly correlated
and possibly repeated (Israel acquires two samples from each
evidence and nonevidence gun that is recovered) and that come
from a variety of gun and ammunition types. Due to the lack of
raw similarity score data and the range of matching performance
estimates in the literature, we use two different data sets that
present aggregate performance: a performance curve (Fig. 12 of
reference [9]) for combined breech face, firing pin, and ejector
mark scores for 9 mm caliber guns, and results (Fig. 2 of reference [10]) from an experiment for combined breech face and firing pin scores with Remington ammunition. Hence, the
respective detection probabilities of 0.931 and 0.707 under the
constant threshold policy are not necessarily an accurate prediction of Israel’s current performance, although our results (see
also [5]) suggest that the optimal policy consistently outperforms
the current threshold policy. More generally, ballistic imaging
technology is in a state of flux, with the vendor recently introducing three-dimensional ballistic imaging matching systems for
both cartridge casings and bullets (18), for which very little
published performance data exist (19,20). In addition, ballistic
imaging systems are likely to perform better in controlled experiments than in the field. Moreover, as mentioned earlier, our
results cannot be operationalized (i.e., the aggregate similarity
scores on the vertical axis of Fig. 8 cannot be computed from
raw similarity scores for breech face, firing pin and ejector
mark) unless one gains access to data for the joint PDF of
breech face, firing pin, and ejector mark scores. Although Forensic Technology Inc. has these data, they are not in the public
domain.
A second limitation is that we use a threshold-based approach
rather than the rank-based approach that is in current use. If similarity scores were independent of gun and ammunition type,
then a threshold-based approach would perform at least as well
as a rank-based approach. However, if similarity scores vary by
gun and ammunition type (which is likely to be the case), then
our optimal policy may not work well. There are two ways to
adapt our ideas to this setting: (i) have the threshold tijk also
depend on the gun and ammunition type, or (ii) change the optimization problem to a rank-based system, where the decision
variables are changed from tijk to multiplicative scaling factors
qijk (i.e., a similarity score s would be transformed to qijk s),
which need not vary by gun or ammunition type. The former
approach is feasible (e.g., it has been used for fingerprints with
different image qualities [21]) but tedious, and the latter
approach is preferable.
The final limitation of the Israeli analysis is the implicit
assumption that the probabilities qi, rj, and pij do not vary over
time. During 2006–2008 in Israel, these quantities were reasonably stable. Nonetheless, there are three concerns. The first is whether
criminals adapt their behavior as a result of the ballistic imaging
system. In Boston, criminals were found not to increase their use
of revolvers, which do not eject cartridge casings, as a result of
the implementation of a casing imaging system (2), and so this
concern may be unfounded, particularly given the system’s lack
of transparency from the criminal’s viewpoint. The second concern is changes in the mobility patterns of crime guns, which
could occur for a variety of reasons. A third concern is changes
in police procedures (e.g., spatial reallocation of law enforcement resources). The latter two concerns can be partially mitigated by periodically updating the estimates of these
probabilities.
Conclusion
We develop a data-driven approach to improve the performance of ballistic imaging systems and predict that it could
increase the detection probability in Israel, using matching data
for cartridge casings from 2006 to 2008. This improvement is
achieved by requiring a very close match for pairs of images that
are distant in space and/or time. This approach may have potential applications in other countries that are of comparable size to
Israel (e.g., European countries). An assessment of an optimized
national approach for the US seems worth pursuing, but the US
Department of Justice and/or the National Institute of Standards
and Technology would need to gather the necessary data to
enable such an assessment.
Acknowledgments
Supported by the Graduate School of Business, Stanford University (Y.Y. and L.M.W.).
References
1. Office of the Inspector General, U.S. Department of Justice. The Bureau of
Alcohol, Tobacco, Firearms and Explosives’ National Integrated Ballistic
Information Network program; 2005 Audit Report 05-30. Washington,
DC: Office of the Inspector General, U.S. Department of Justice, 2005.
2. Cork DL, Rolph JE, Meieran ES, Petrie CV, editors. Ballistic imaging.
Washington, DC: National Academies Press, 2008.
3. http://www.nabis.police.uk/home.asp (accessed on October 8, 2012).
4. Committee on Identifying the Needs of the Forensic Sciences Community,
National Research Council. Strengthening forensic science in the United
States: a path forward. Washington, DC: National Academies Press, 2009.
5. Yang Y. Three data-driven operations research analyses in the public
sector (dissertation). Stanford, CA: Stanford University, 2012.
6. Prabhakar S, Jain AK. Decision-level fusion in fingerprint verification.
Pattern Recogn 2002;35:861–74.
7. Baveja M, Wein LM. An effective two-finger, two-stage biometric strategy for the US-VISIT Program. Oper Res 2009;57:1068–81.
8. http://www.mathworks.com/products/matlab/index.html (accessed on October
8, 2012).
9. http://www.forensictechnology.com/Default.aspx?app=LeadgenDownload
&shortpath=docs /LargeDatabaseFinal.pdf (accessed on October 8, 2012).
10. De Kinder J, Tulleners F, Thiebaut H. Reference ballistic imaging database performance. Forensic Sci Int 2004;140:207–15.
11. http://www.ece.ecsb.edu/ hespanha/software/grPartition.html (accessed on
October 8, 2012).
12. Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis:
probabilistic models of proteins and nucleic acids. Cambridge, U.K.:
Cambridge University Press, 1998.
13. Karlin S, Taylor HM. A first course in stochastic processes, 2nd edn.
New York, NY: Academic Press, 1975.
14. Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM
J Numer Anal 1980;17:238–46.
15. Vorburger TV, Yen JH, Bachrach B, Renegar TB, Filliben JJ, Ma L
et al. Surface topography analysis for a feasibility assessment of a
national ballistics imaging database. Gaithersburg, MD: National Institute
of Standards and Technology, 2007; Internal Report 7362.
16. Wintemute GJ, Romero MP, Wright MA, Grassel KM. The life cycle of
crime guns: a description based on guns recovered from young people in
California. Ann Emerg Med 2004;43:733–42.
17. Wintemute GJ, Braga AA. Opportunities for state-level action to reduce
firearm violence: proceeding from the evidence. Am J Public Health
2011;101:e1–3.
18. http://www.forensictechnology.com/IBISTRAX/ (accessed on October 8,
2012).
19. Roberge D, Beauchamp A. The use of BulletTRAX-3D in a study of
consecutively manufactured barrels. AFTE J 2006;38:166–72.
20. Brinck TB. Comparing the performance of IBIS and BulletTRAX-3D
technology using bullets fired through 10 consecutively rifled barrels.
J Forensic Sci 2008;53:677–82.
21. Wein LM, Baveja M. Using fingerprint image quality to improve the
identification performance of the U.S. Visitor and Immigrant Status Indicator Technology Program. Proc Natl Acad Sci U.S.A. 2005;102:
7772–5.
Additional information and reprint requests:
Lawrence M. Wein, Ph.D.
Jeffrey S. Skoll Professor of Management Science
Graduate School of Business
Stanford University
655 Knight Way
Stanford, CA 94305
E-mail: [email protected]
SUPPORTING MATERIAL
We estimate the similarity score PDFs F (x) and G(y) in §1 and the Israeli casings
parameter values in §2. Justification for using a threshold-based approach is presented in §3
and a comparison of the false positive constraint and the list size constraint is performed
in §4.
1 Estimating the Similarity Score PDFs
We first estimate the lognormal (using natural logarithms, not logarithms to the base 10)
parameters (µf , σf , µg , σg ) using the lower left graph in Fig. 12 of [1], which plots the
probability that a true match is ranked among the top 10 scores when the database contains
one true match and N nonmatches, as N varies from 0 to $10^6$. If the intra-gun similarity
score PDF is f (x), then this probability is
\[
\sum_{M=0}^{9} \frac{N!}{M!\,(N-M)!} \int_0^\infty \bigl(1 - G(t)\bigr)^{M} G(t)^{N-M} f(t)\, dt. \tag{1}
\]
Using seven points along the performance curve in Fig. 12 of [1], we derive the least-squares
estimates µf = 6.7912, σf = 0.5927, µg = 4.5207, σg = 0.5016. Fig. 2 compares the seven
data points to the performance curve predicted (via (1)) by the lognormal distributions, and
Fig. 3 shows the resulting PDFs.
In [2], 32 arrivals are compared to a database of 600 images that include the arrivals.
The probability that an arrival’s mate ranks in position k is
\[
P(k) = \frac{599!}{(k-1)!\,(600-k)!} \int_0^\infty \bigl(1 - G(t)\bigr)^{k-1} G(t)^{N-k} f(t)\, dt. \tag{2}
\]
From Fig. 1 of [2], we use $P(1) = 18/32$, $P(2) = 2/32$, $P(3) = P(4) = 1/32$ and
$\sum_{k=30}^{600} P(k) = 8/32$.
The least-squares fit to these probabilities is µf = 6.50, σf = 1.26, µg = 4.81, σg = 0.48.
2 Estimating Israeli Casings Parameter Values
We estimate the probabilities qi , rj and pij in §2.1, the PMF gi (n) and the historical threshold
th in §2.2, and the age PDFs h(a) and hm (a) in §2.3.
2.1 Estimating the Probabilities qi , rj and pij
The evidence database contains 14,979 entries covering the entire country during 1980-2000.
The arrivals data consist of all arrivals between January 1, 2006 and December 31, 2008.
There were 7138 arrivals during this time period, and 697 of them generated matches. We
start by estimating the three probabilities, qi , rj and pij . Of the 7138 arrivals, 3598 were
nonevidence and 3540 were evidence, yielding q0 = 0.504, q1 = 0.496.
We know which of the 89 Israeli police precincts collected each arrival and each database
entry. Because many pairs of precincts generated no matches during 2006-2008, we use only
two categories (i.e., J = 1): intra-location and inter-location. However, to fully exploit the
spatial information, it is desirable to merge two locations if it results in a more favorable ratio
of intra-location matches to inter-location matches. Each of the 89 Israeli police precincts
has an 89-dimensional vector stating the number of matches during 2006-2008 that it has
with each precinct. In our analysis, two locations are good candidates for merging if the dot
product of their vectors is large; we refer to this dot product as the two locations’ similarity,
and solve the graph partitioning problem that maximizes the average similarity within each
group. We solve this problem using the k−means algorithm [4] with k = 20, which merges
the 89 precincts into 20 groups so as to maximize the average similarity within each group.
The k−means algorithm is random, and solutions can vary from run to run. We solve the
problem 1000 times, and group precincts into stations if they are classified in the same group
in > 90% of the solutions. We left all other precincts as isolated to prevent overfitting. This
procedure results in 18 merged groups containing 55 precincts, and 34 isolated precincts, for
a total of 52 “new” stations. The resulting 52 × 52 matrix of matches appears in Fig. 6.
This merging process increases the intra-location matching proportion from 0.551 to 0.832
for nonevidence, and from 0.595 to 0.875 for evidence.
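The consensus step of this merging procedure can be sketched as follows. This is a minimal illustration rather than the code used in the analysis: it assumes that the repeated clustering runs are already available as lists of group labels (one label per precinct per run), and it assumes that co-classification is applied transitively through a union-find structure, which the text does not specify. The function name `consensus_groups` and the toy labels are hypothetical.

```python
from itertools import combinations

def consensus_groups(runs, min_share=0.9):
    """Merge items that share a cluster label in more than min_share of runs.

    runs: list of label lists, one list per clustering run (hypothetical input).
    Returns a list of groups; singletons correspond to "isolated" items.
    Assumption: merging is transitive (union-find), which the text does not state.
    """
    n_items = len(runs[0])
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(range(n_items), 2):
        together = sum(1 for labels in runs if labels[a] == labels[b])
        if together > min_share * len(runs):
            parent[find(a)] = find(b)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())

# Toy example: 4 precincts, 10 runs. Precincts 0 and 1 always co-cluster;
# precincts 2 and 3 co-cluster in only 5 of the 10 runs.
toy_runs = [[0, 0, 1, 1]] * 5 + [[0, 0, 1, 2]] * 5
result = consensus_groups(toy_runs)
print(result)
```

In the toy example only the first pair clears the 90% bar, so the other two precincts remain isolated, mirroring the distinction between merged groups and isolated precincts described above.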
If we let ak and dk denote the number of arrivals from station k and the number of
database records from station k, then
\[
r_0 = \frac{\sum_{k=1}^{55} a_k d_k}{\bigl(\sum_{l=1}^{55} a_l\bigr)\bigl(\sum_{l=1}^{55} d_l\bigr)} = 0.101
\]
and $r_1 = 0.899$,
where the subscript j = 0 corresponds to intra-station and j = 1 corresponds to interstation. Of the 697 matches, 107 were from nonevidence arrivals and 590 were from evidence
arrivals. Of the 107 nonevidence arrivals generating matches, 89 were intra-station, giving
p00 = 89/107 = 0.832 and p01 = 0.168. Similarly, we have p10 = 516/590 = 0.875 and
p11 = 0.125.
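As a small numerical sketch of these estimates, the snippet below evaluates the intra-station share r0 for a set of station counts and recomputes q0, p00 and p10 from the counts stated in the text. The station-level counts `a` and `d` are hypothetical placeholders (the per-station Israeli counts are not reproduced here), and the function name `intra_station_share` is an illustrative choice.

```python
def intra_station_share(arrivals, records):
    """Fraction of (arrival, database record) pairs that share a station.

    arrivals[k] and records[k] play the roles of a_k and d_k for station k.
    """
    total_pairs = sum(arrivals) * sum(records)
    same_station_pairs = sum(a_k * d_k for a_k, d_k in zip(arrivals, records))
    return same_station_pairs / total_pairs

# Hypothetical counts for three stations (not the Israeli data).
a = [50, 30, 20]
d = [400, 100, 500]
r0 = intra_station_share(a, d)
r1 = 1.0 - r0

# Quantities reported in the text, recomputed from the stated counts.
q0 = 3598 / 7138   # share of nonevidence arrivals
p00 = 89 / 107     # intra-station share of nonevidence matches
p10 = 516 / 590    # intra-station share of evidence matches
print(round(r0, 3), round(q0, 3), round(p00, 3), round(p10, 3))
```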
2.2 Estimating the Historical Threshold th and the PMF gi (n)
The observed PMF $g_i^*(m)$ (Fig. 7) allows us to estimate the historical threshold th for
a constant threshold policy that generates an average candidate list size of 10, by assuming that the mean number of true positives is much less than the database size (i.e.,
$\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\, g_i^*(n) \ll N$), yielding
\[
\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\, g_i^*(n) + N\,[1 - G(t_h)] = 10. \tag{3}
\]
Setting N = 11,350, which is the average value of the database size during 2006-2008, in (3)
gives th = 442.6.
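The calculation of th can be checked with a short sketch. It assumes the lognormal inter-gun parameters estimated in §1 from [1] and takes the mean number of true positives to be 0.214, the value quoted in §3; the function name `constant_threshold` is hypothetical, and the standard library's `NormalDist` stands in for whatever software was actually used.

```python
from math import exp
from statistics import NormalDist

MU_G, SIGMA_G = 4.5207, 0.5016   # inter-gun lognormal parameters from Section 1

def constant_threshold(n_database, mean_true_positives, list_size=10.0):
    """Solve mean_true_positives + N * (1 - G(t)) = list_size for t."""
    false_positive_budget = list_size - mean_true_positives
    tail = false_positive_budget / n_database      # required value of 1 - G(t)
    z = NormalDist().inv_cdf(1.0 - tail)           # standard normal quantile
    return exp(MU_G + SIGMA_G * z)                 # corresponding lognormal quantile

t_h = constant_threshold(11350, 0.214)
print(round(t_h, 1))
```

Under these assumptions the result should land close to the reported value of 442.6; small differences would reflect rounding of the fitted parameters.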
In the remainder of this subsection, we estimate gi (n), which is the PMF of true
matches, from three quantities: gi∗ (m), which is the observed PMF of detected matches, the
historical threshold th under a constant threshold system, and the intra-gun similarity score
CDF F (x). Let Pmn be the probability that m matches are detected from an arrival in the
Israeli database, given that n matches to this arrival exist and the arrival has evidence status
i (the argument i in Pmn is suppressed for ease of presentation). It follows that
$g_i^*(m) = \sum_{n=m}^{\infty} P_{mn}\, g_i(n)$. However, for practical purposes, we truncate this system of equations at
m = n = 14 (i.e., we set gi (n) = 0 for n ≥ 15) because gi∗ (m) is only nonzero for m ≤ 14 in
our data set. This truncated system of equations can be expressed as
\[
\begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1,14} \\
0 & P_{22} & \cdots & P_{2,14} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_{14,14}
\end{pmatrix}
\begin{pmatrix} g_i(1) \\ g_i(2) \\ \vdots \\ g_i(14) \end{pmatrix}
=
\begin{pmatrix} g_i^*(1) \\ g_i^*(2) \\ \vdots \\ g_i^*(14) \end{pmatrix}. \tag{4}
\]
Because gi∗ (m) is known and the system of equations in (4) is invertible, our estimation
problem for gi (n) reduces to determining the probabilities Pmn , which are a function of m, n
and the known false negative probability F (th ) under the constant threshold system, which
is denoted by p.
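Because the matrix in (4) is upper triangular, the system can be solved by back substitution once the Pmn are known. The sketch below illustrates this on a hypothetical 3 × 3 system; the matrix entries and the helper name `solve_upper_triangular` are placeholders for illustration, not values from the Israeli data.

```python
def solve_upper_triangular(P, g_star):
    """Back-substitute for g in P g = g_star, with P upper triangular.

    P[m][n] holds P_{m+1, n+1}; entries below the diagonal are ignored.
    """
    size = len(g_star)
    g = [0.0] * size
    for m in range(size - 1, -1, -1):
        # Contribution of the already-solved entries g(m+1), ..., g(size-1).
        known = sum(P[m][n] * g[n] for n in range(m + 1, size))
        g[m] = (g_star[m] - known) / P[m][m]
    return g

# Hypothetical 3 x 3 illustration (not the 14 x 14 system in the text).
P = [[0.9, 0.2, 0.1],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 0.5]]
g_true = [0.6, 0.3, 0.1]
g_star = [sum(P[m][n] * g_true[n] for n in range(3)) for m in range(3)]
g_est = solve_upper_triangular(P, g_star)
print(g_est)
```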
Conditioned on n matches existing in the database, the probability Pmn that a new
arrival detects m of them depends on two things. First, it depends on the current knowledge
about how these n entries are related, which can range from not realizing that any of them
are matched to each other (i.e., there are n singletons) to realizing that they are all matched
to each other (i.e., they are in a single group of n matches). Second, it depends on which
matching groups, if any, the new arrival is detected to belong to. Hence, to derive Pmn , we need
to construct a detailed dynamic model that tracks the evolution of the n true matches as
they sequentially arrive to the system and undergo a matching process (with false negative
probability p = F (th )) with the prior arrivals.
This model is a Hidden Markov Model (HMM) because we cannot observe the state (or
transition probabilities among states) [5]. More generally, the state of the HMM is defined
by the current matched groupings of the true matches that have already arrived, where
the groupings are given in the ascending order of the size of the groups; e.g., state 1,1,2
means that there are a total of four true matches currently in the database, two of them
are connected (i.e., have been correctly detected to be a match) and the remaining two are
isolated (i.e., have not been correctly matched with any of the other three true matches).
We use the notation P (A|B) to represent the probability that, conditioned on the current
matchings being in state B, after a new arrival undergoes the matching process with the
prior arrivals, the new state is A (where, by construction, the sum of the numbers in A is
always one greater than the sum of the numbers in B).
For illustrative purposes, we show how to use the HMM to compute Pm3 for m =
0, 1, 2, 3, and then we provide a broad description of a general algorithm for any value of n.
The initial state of the HMM is 1, which refers to the first of the n true matches having
already entered the database. For this example, where n = 3, we need to track the HMM
through n more arrivals (i.e., until after the fourth arrival) because the fourth arrival has
n = 3 true matches in the database. These dynamics are described by the following transitions.
Transitions caused by the second arrival: $P(1,1 \mid 1) = p$, $P(2 \mid 1) = 1 - p$.
Transitions caused by the third arrival: $P(1,1,1 \mid 1,1) = p^2$, $P(1,2 \mid 1,1) = 2p(1-p)$,
$P(3 \mid 1,1) = (1-p)^2$, $P(1,2 \mid 2) = p^2$, $P(3 \mid 2) = 1 - p^2$.
Transitions caused by the fourth arrival: $P(1,1,1,1 \mid 1,1,1) = p^3$, $P(1,1,2 \mid 1,1,1) = 3p^2(1-p)$,
$P(1,3 \mid 1,1,1) = 3(1-p)^2 p$, $P(4 \mid 1,1,1) = (1-p)^3$, $P(1,1,2 \mid 1,2) = p^3$, $P(1,3 \mid 1,2) = p(1-p^2)$,
$P(2,2 \mid 1,2) = p^2(1-p)$, $P(4 \mid 1,2) = (1-p)(1-p^2)$, $P(1,3 \mid 3) = p^3$, $P(4 \mid 3) = 1 - p^3$.
Hence, the hidden states after the second arrival are (1, 1) and (2), the hidden states
after the third arrival are (1, 1, 1), (1, 2) and (3), and the hidden states after the fourth arrival
are (1, 1, 1, 1), (1, 1, 2), (1, 3), (2, 2) and (4). The HMM transition probabilities derived above
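can be written as the following stochastic matrices, denoted by $M_k$:
\[
M_1 = \begin{pmatrix} p & 1-p \end{pmatrix}, \qquad
M_2 = \begin{pmatrix} p^2 & 2p(1-p) & (1-p)^2 \\ 0 & p^2 & 1-p^2 \end{pmatrix},
\]
\[
M_3 = \begin{pmatrix}
p^3 & 3p^2(1-p) & 3(1-p)^2 p & 0 & (1-p)^3 \\
0 & p^3 & p(1-p^2) & p^2(1-p) & (1-p^2)(1-p) \\
0 & 0 & p^3 & 0 & 1-p^3
\end{pmatrix}.
\]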
Note that the product $M_1 M_2$ equals the probability of arriving at the various states that
the fourth true match will see upon arrival: $P(1,1,1) = p^3$, $P(1,2) = 3p^2(1-p)$,
$P(3) = (1-p)^2(1+2p)$.
Finally, we group all the transitions from the fourth arrival according to how many
matches are actually found. For example, the transition from (1, 1, 1) to (1, 3) means that
two matches are found. Using the law of total probability (i.e., conditioning on all possible
states of the current matching and then summing the joint probabilities; see pg 6 of [6])
yields our final result:
\begin{align*}
P_{03} &= P(1,1,1,1 \mid 1,1,1)P(1,1,1) + P(1,1,2 \mid 1,2)P(1,2) + P(1,3 \mid 3)P(3) = p^3, \\
P_{13} &= P(1,1,2 \mid 1,1,1)P(1,1,1) + P(2,2 \mid 1,2)P(1,2) = 3p^4(1-p), \\
P_{23} &= P(1,3 \mid 1,1,1)P(1,1,1) + P(1,3 \mid 1,2)P(1,2) = 3p^3(1-p)^2(1+2p), \\
P_{33} &= P(4 \mid 1,1,1)P(1,1,1) + P(4 \mid 1,2)P(1,2) + P(4 \mid 3)P(3) = (1-p)^3(6p^3 + 6p^2 + 3p + 1).
\end{align*}
For a general value of n, the calculation of Pmn for m = 0, 1, . . . , n can be carried out
by the following algorithm.
1 - Construct the HMM through the (n + 1)st arrival, and derive the transition matrices $M_k$
for k = 1, . . . , n.
2 - Compute $\prod_{k=1}^{n-1} M_k$, which gives the probability distribution for the various hidden states
that the (n + 1)st true match sees upon arrival.
3 - In the transition matrix $M_n$, find the number of matches m detected for each possible
transition A → B. Using the law of total probability, multiply the transition probability
from A to B in $M_n$ by the probability of observing state A, which is given by $\prod_{k=1}^{n-1} M_k$, and add these
products over all possible transitions to get $P_{mn}$.
Running this algorithm for n = 1, . . . , 14 results in the gi (n) PMF plotted in Fig. 7.
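The three steps above can be sketched in a few lines of code. This is an illustrative reimplementation rather than the original code: instead of building the matrices Mk explicitly, it propagates a probability distribution over states, which is equivalent to forming the matrix products. It relies on the property implied by the transitions listed earlier, namely that a group of size s is missed with probability p^s. The function names are hypothetical, and exact rational arithmetic is used only so that the n = 3 closed forms can be compared exactly.

```python
from collections import defaultdict
from fractions import Fraction

def arrival_outcomes(state, p):
    """Outcomes when one new arrival is compared with the groups in `state`.

    A group of size s is missed with probability p**s (every pairwise
    comparison fails). Returns {(found, missed_groups): probability}, where
    `found` is the total size of the groups that were detected.
    """
    outcomes = {(0, ()): Fraction(1)}
    for size in state:
        miss = p ** size
        updated = defaultdict(Fraction)
        for (found, missed), prob in outcomes.items():
            updated[(found, missed + (size,))] += prob * miss
            updated[(found + size, missed)] += prob * (1 - miss)
        outcomes = updated
    return outcomes

def detected_match_pmf(n, p):
    """Return [P_0n, ..., P_nn] for false negative probability p."""
    dist = {(1,): Fraction(1)}            # the first true match is in the database
    for _ in range(n - 1):                # arrivals 2, ..., n
        nxt = defaultdict(Fraction)
        for state, prob in dist.items():
            for (found, missed), q in arrival_outcomes(state, p).items():
                # Detected groups merge with the arrival into one larger group.
                new_state = tuple(sorted(missed + (found + 1,)))
                nxt[new_state] += prob * q
        dist = nxt
    pmf = [Fraction(0)] * (n + 1)
    for state, prob in dist.items():      # arrival n + 1: count matches found
        for (found, _), q in arrival_outcomes(state, p).items():
            pmf[found] += prob * q
    return pmf

p = Fraction(1, 10)
pmf3 = detected_match_pmf(3, p)
closed_form = [p**3,
               3 * p**4 * (1 - p),
               3 * p**3 * (1 - p)**2 * (1 + 2 * p),
               (1 - p)**3 * (6 * p**3 + 6 * p**2 + 3 * p + 1)]
print(pmf3 == closed_form, sum(pmf3))
```

Tracking states as sorted tuples of group sizes keeps the state space equal to the set of integer partitions, which stays small for n up to 14.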
2.3 Estimating the Age Distributions
Of the 1364 matching pairs during 2006-2008, 1227 (or 90.0%) of them have database entries
that occurred during 2006-2008, in which case we know the exact age in days. For the remaining 137 (or 10.0%) matching pairs, we only know the year when the database entry was
acquired, and so the age is in the interval [tj , tj + 365] for j = 1, . . . , 137. To estimate the
PDF hm (a), we solve a maximum likelihood estimation problem with uncertain data. Let
the known ages be denoted by the 1227-dimensional vector X, and the age of records with
uncertain age be given by the 137-dimensional vector Z. If we denote the lognormal parameters by (µa , σa ), then the likelihood function is
\[
\prod_{i=1}^{1227} f(X_i) \prod_{j=1}^{137} \int_{t_j}^{t_j+365} f(Z_j)\, dZ_j .
\]
Choosing (µa , σa ) to maximize the log-likelihood function, which is
\[
\sum_{i=1}^{1227} \log[f(X_i)] + \sum_{j=1}^{137} \log[F(t_j + 365) - F(t_j)],
\]
yields µa = 4.904 and σa = 1.687. The frequency distribution of these 1364
ages and the resulting lognormal are plotted in Fig. 8.
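A minimal sketch of this interval-censored log-likelihood is given below. The ages are hypothetical (the Israeli data are not reproduced here), the function names are illustrative, and the coarse grid search is only a stand-in for a proper numerical optimizer; the floor on the interval probability is a numerical guard added for the sketch.

```python
from math import log, pi
from statistics import NormalDist

STD_NORMAL = NormalDist()

def lognormal_logpdf(x, mu, sigma):
    """Log of the lognormal density f(x) with parameters (mu, sigma)."""
    z = (log(x) - mu) / sigma
    return -log(x * sigma) - 0.5 * log(2 * pi) - 0.5 * z * z

def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF F(x); zero for non-positive x."""
    return STD_NORMAL.cdf((log(x) - mu) / sigma) if x > 0 else 0.0

def log_likelihood(mu, sigma, exact_ages, interval_starts, width=365.0):
    """Exact ages contribute log f(X_i); censored ages contribute
    log[F(t_j + width) - F(t_j)]."""
    total = sum(lognormal_logpdf(x, mu, sigma) for x in exact_ages)
    for t in interval_starts:
        prob = lognormal_cdf(t + width, mu, sigma) - lognormal_cdf(t, mu, sigma)
        total += log(max(prob, 1e-300))   # guard against underflow to zero
    return total

# Hypothetical ages in days (not the Israeli data).
exact = [12.0, 45.0, 130.0, 400.0, 900.0]
starts = [730.0, 1095.0]

# Coarse grid search over (mu, sigma) as a stand-in for a numerical optimizer.
grid = [(m / 10, s / 10) for m in range(30, 71) for s in range(5, 31)]
mu_hat, sigma_hat = max(grid, key=lambda ms: log_likelihood(ms[0], ms[1], exact, starts))
print(mu_hat, sigma_hat)
```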
To estimate the age PDF h(a) for all database entries, we assume that the age of an
entry is December 31, 2010 minus the acquisition date of the entry. However, we only know
the year in which each entry in the Israeli database was acquired. We estimate h(a) by fitting
a piecewise cubic Hermite interpolating polynomial [7] to the yearly aggregates to estimate
an increasing smooth CDF (Fig. 9), and then numerically differentiating it to yield a PDF
(Fig. 10).
3 Comparison of a Constant Threshold System and a Rank-based System
In this section, we argue that for a simplified setting in which there is never more than
one match in the database for an arrival, the rank-based system and the constant threshold
system perform very similarly. Equation (1) gives the detection probability for a rank-based
system that generates a candidate list of size 10 when the database has N entries. In
approximating the detection probability for the constant threshold system that generates an
average candidate list size of 10, we make use of the fact that the average number of true
positives in the Israeli data is 0.214: 90.24% of the candidate lists have no true matches, and
the average number of true matches in the other 9.76% of lists is 2.19. Substituting 0.214 for
the first term in equation (3) and solving for the threshold gives
$t_h \approx G^{-1}\!\left(1 - \frac{9.786}{N}\right)$. The
detection probability for the constant threshold system in this simplified setting is 1 − F (th ),
which can be approximated by
\[
1 - F\!\left(G^{-1}\!\left(1 - \frac{9.786}{N}\right)\right). \tag{5}
\]
Using the cartridge casings lognormal distributions derived in §1 and plotting the detection
probabilities in equations (1) and (5) as functions of the database size N reveal that the
performance curves are almost identical (Fig. 11).
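The constant threshold curve in equation (5) can be evaluated directly. The sketch below assumes the lognormal parameters estimated in §1 from [1]; the function name `threshold_detection_probability` is hypothetical, and the rank-based curve from equation (1), which requires numerical integration, is not reproduced.

```python
from math import exp, log
from statistics import NormalDist

STD = NormalDist()
MU_F, SIGMA_F = 6.7912, 0.5927   # intra-gun (true match) scores, Section 1
MU_G, SIGMA_G = 4.5207, 0.5016   # inter-gun (nonmatch) scores, Section 1

def threshold_detection_probability(n_database, false_positive_budget=9.786):
    """Evaluate 1 - F(G^{-1}(1 - budget / N)) for lognormal F and G."""
    z_g = STD.inv_cdf(1.0 - false_positive_budget / n_database)
    t_h = exp(MU_G + SIGMA_G * z_g)                  # threshold G^{-1}(...)
    return 1.0 - STD.cdf((log(t_h) - MU_F) / SIGMA_F)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(threshold_detection_probability(n), 3))
```

The printed values decrease with N, reflecting the deterioration in matching performance as the database grows.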
4 The False Positive Constraint vs. the List Size Constraint
Here we explore the difference between constraining the expected number of false positives,
as in equation (3) in the main text, and constraining the expected list size, which is much
more difficult analytically. Our concern is that our choice of the false positive constraint
may favor the optimal policy because the solution to (1)-(4) in the main text may raise
the average number of true matches in the candidate list. In this case, restricting only the
average number of false positives would increase the average list size beyond 10. Calculations
in this section suggest that the effect of our choice of constraint on the results is extremely
small.
For Israeli cartridge casings, we know that the probability that there is at least one
true match in the candidate list (taking into account arrivals that have no true matches in
the database) is 0.0976 for the constant threshold policy and $\frac{0.987}{0.931}(0.0976) = 0.1035$ for the
the mean number of true matches in the candidate lists with at least one true match is 2.19
for the constant threshold policy, we do not know this quantity for the optimal policy. By
assuming that this value is 5.0, which should be a significant overestimate, the mean number
of true matches in the candidate list increases to 5 × 0.1035 = 0.5175. To approximate the
new detection probability under a list size constraint, we should re-solve (1)-(4) in the main
text with the right side of the constraint reduced by 0.5175-0.214. To be conservative, we
reduce the right side of the constraint by 0.5175. The new detection probability is 0.9870,
compared to 0.9872 under the original constraint.
References
[1] Beauchamp A, Roberge D. Model of the behavior of the IBIS correlation scores
in a large database of cartridge scores. Unpublished manuscript, 2005. Accessed at
www.forensictechnology.com/Default.aspx?app=LeadgenDownload&shortpath=docs/
LargeDatabaseFinal.pdf on September 2, 2011.
[2] De Kinder J, Tulleners F, Thiebaut H. Reference ballistic imaging database performance. Forensic Science International 2004;140:207-215.
[3] Roberge D, Beauchamp A. The use of BulletTRAX-3D in a study of consecutively
manufactured barrels. AFTE Journal 2006;38:166-172.
[4] Hespanha J. grPartition - a MATLAB function for graph partitioning. 2004. Accessed
at www.ece.ucsb.edu/~hespanha/software/grPartition.html on September 8, 2011.
[5] Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: probabilistic
models of proteins and nucleic acids. Cambridge U. Press, Cambridge, UK, 1998.
[6] Karlin S, Taylor HM. A first course in stochastic processes, second edition. Academic
Press, New York, 1975.
[7] Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM J. Numerical
Analysis 1980;17:238-246.
Figure 2: Actual points (x) on the performance curve for casings from data in [1] vs. the
performance curve generated by the best-fit lognormal distribution.
Figure 3: The lognormal similarity score PDFs, intra-gun f(x) and inter-gun g(y), for casings.
Figure 4: Actual (based on BulletTRAX-3D data in [3]) vs. lognormal inter-gun similarity
score PDF for bullets.
Figure 5: The lognormal similarity score PDFs, intra-gun f(x) and inter-gun g(y), for bullets.
Figure 6: All non-zero entries in the matching matrix for the 52 merged stations. Each entry
is the number of matches during 2006-2008 between each pair of merged stations. Stations
A-R are the merged stations. Yellow squares are the intra-station matchings.
Figure 7: The PMFs for the true ($g_i(n)$ in blue) and detected ($g_i^*(m)$ in red) matches for
arrivals of evidence status i (i = 0 is nonevidence (—) and i = 1 is evidence (- - -)).
Figure 8: Frequency distribution of ages of the 1364 matching pairs in the Israeli database,
containing 1227 exact ages and 137 ages that are randomly sampled from the correct year,
along with the best-fit lognormal PDF, $h_m(a)$.
Figure 9: Age CDF fit to raw Israeli data.
Figure 10: Age PDF h(a) for all database records.
Figure 11: Comparison of the performance of the rank-based system and the constant threshold system.