J Forensic Sci, January 2014, Vol. 59, No. 1
doi: 10.1111/1556-4029.12269
Available online at: onlinelibrary.wiley.com

PAPER
CRIMINALISTICS

Yan Yang,1 B.Eng.; Avi Koffman,2 B.Sc.; Gil Hocherman,2 M.Sc.; and Lawrence M. Wein,3 Ph.D.

Using Spatial, Temporal and Evidence-status Data to Improve Ballistic Imaging Performance*

ABSTRACT: Firearms identification imaging systems help solve crimes by comparing newly acquired images of cartridge casings or bullets to a database of images obtained from past crime scenes. We formulate an optimization problem that bases its matching decisions not only on the similarity between pairs of images, but also on the time and spatial location of each new acquisition and each database entry. The objective is to maximize the detection probability subject to a constraint on the false positive rate. We use data on all cartridge casing matches detected in Israel during 2006–2008 to estimate most of the model parameters. We estimate matching accuracy from two different studies and predict that the optimal use of extraneous information would increase the detection probability from 0.931 to 0.987 and from 0.707 to 0.844, respectively. These improvements are achieved by favoring pairs of images that are closer together in space and time.

KEYWORDS: forensic science, ballistic imaging, optimization, statistics, spatial data, temporal data

The Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF) developed the National Integrated Ballistic Information Network (NIBIN) to help state and local law enforcement agencies to solve gun crimes (1). NIBIN uses computerized imaging technology to maintain a database on cartridge casings and bullets that are either recovered from crime scenes (called “evidence”) or test-fired from weapons that are recovered by (or surrendered to) law enforcement officers (called “nonevidence”).
NIBIN rapidly computes similarity scores between a newly acquired casing or bullet and the entries in the database, and–– using software developed by a single vendor, Forensic Technology Inc.––generates a list of the (e.g., 10) most promising matches, which are subsequently analyzed by a forensic firearms examiner to obtain confirmed hits. In this manner, NIBIN can potentially identify a cold hit between a nonevidence acquisition and an earlier crime or discover links between crimes, both of which generate new leads to assist in crime solving. This system, if used properly (i.e., data entered in a thorough and timely manner, and confirmed hits integrated with all other investigative information) can be very useful to local police departments (Appendix A of reference [2]). However, NIBIN is used very inconsistently by various US municipalities and rarely employed for nonlocal (e.g., interstate) searches (1). Some European countries (e.g., [3]) have implemented similar systems. Ballistic imaging technology––particularly for bullets––does not perform as well (e.g., as measured by receiver operating characteristic curves) as some biometric technologies, such as fingerprints and DNA matching, that are also used for forensic purposes (2,4). The goal of this study is to examine whether ballistic imaging performance can be improved by combining it with other spatial, temporal, and categorical data that are collected along with the ballistic image. More specifically, we introduce and optimize a threshold-based system (i.e., potential hits are defined by similarity scores above a certain threshold rather than by being ranked in the top 10) that allows the threshold used to compare a newly acquired casing or bullet with one in the database to depend on the time interval between the two events (i.e., acquisition or crime), the spatial distance between the two events, and whether the new acquisition is evidence or nonevidence. 
The rationale behind this approach is that crime guns and their crimes cluster in space and time, and evidence acquisitions are more apt to have been involved in a previous crime than nonevidence acquisitions. We use spatial, temporal, and categorical data on all Israeli matches during 2006–2008, along with published performance data for matching cartridge casings, to calibrate our model and assess the potential improvement in performance of our approach.

1 Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, 94305.
2 Division of Identification & Forensic Science, Israel Police Investigation Department, National Police Headquarters, Jerusalem, 91906, Israel.
3 Graduate School of Business, Stanford University, Stanford, CA, 94305.
*Supported by the Graduate School of Business, Stanford University (Y.Y. and L.M.W.).
Received 9 May 2012; and in revised form 20 Sept. 2012; accepted 13 Oct. 2012.
© 2013 American Academy of Forensic Sciences

JOURNAL OF FORENSIC SCIENCES

Materials and Methods

Model

Our model focuses on cartridge casings but has been applied to bullets in (5). The evidence status of a new acquisition, or arrival, is denoted by the subscript $i = 0$ for nonevidence recovered by law enforcement officers and $i = 1$ for evidence obtained from a crime scene. The probability that an arrival is of evidence status $i$ is $q_i$ for $i = 0,1$ (definitions of all mathematical symbols can be found in Table 1). Each new arrival is matched against a database consisting of evidence images; because nonevidence guns have been confiscated, nonevidence images are not added to the database for future matching.

TABLE 1––Definitions of all variables and parameter values. The subscript i describes the evidence status of a new acquisition (i = 0 for nonevidence and i = 1 for evidence). Subscript j describes the spatial proximity between a random acquisition and a random database entry.

| Parameters | Definition | Values |
|---|---|---|
| $\mu_F, \sigma_F$ | Intragun score ($F(x)$) parameters | Lognormal: $\mu_F = 6.79$, $\sigma_F = 0.59$ (9); $\mu_F = 6.50$, $\sigma_F = 1.26$ (10) |
| $\mu_G, \sigma_G$ | Intergun score ($G(y)$) parameters | Lognormal: $\mu_G = 4.52$, $\sigma_G = 0.50$ (9); $\mu_G = 4.81$, $\sigma_G = 0.48$ (10) |
| $q_i$ | Arrival evidence-status probability | $q_0 = 0.504$, $q_1 = 0.496$ |
| $r_j$ | Spatial proximity probability for arrival-database pair | $r_0 = 0.101$, $r_1 = 0.899$ |
| $p_{ij}$ | Spatial proximity probability for matching pair of evidence status $i$ | $p_{00} = 0.832$, $p_{10} = 0.875$ |
| $N$ | Evidence database size | 11,350 |
| $t_h$ | Historical threshold | 442.6 |
| $t_{ija}$ | Thresholds to be optimized | Table 2, Fig. 8 |
| $g_i(n)$ | PMF for true matches | Fig. 4 |
| $\tilde{g}_i(m)$ | PMF for detected matches | Fig. 4 |
| $h_m(a)$ | Age PDF of matches | Lognormal: $\mu_a = 4.90$, $\sigma_a = 1.69$ |
| $h(a)$ | Age PDF of database records | Fig. 7 |

Our model incorporates spatial and temporal information about pairs of images, which consist of an arrival and a database entry. As explained later, the nature of the Israeli spatial data leads us to describe the spatial proximity between a random arrival and a random database entry by a categorical variable denoted by the subscript $j = 0,\ldots,J$. Let $r_j$ be the probability that a random arrival and a random database entry have spatial proximity status $j$ for $j = 0,\ldots,J$. The temporal proximity between an arrival and a random entry in a database is described by a continuous random variable with the subscript $a$ for age, which is the time of acquisition of the arrival minus the time of acquisition of a random database entry (i.e., newly acquired evidence is added to the database immediately after it undergoes matching). Let $h(a)$ be the probability density function (PDF) of the random age, which is assumed to be independent of time (i.e., as the database grows).
In practice, ballistic imaging software involves an initial filtering step (e.g., by caliber, firing pin shape, and the number, twist rate, and width of the rifling) followed by an investigation of multiple aspects of the cartridge casing or bullet (2). For example, the matching of casings incorporates similarity scores for the breech face, firing pin, and ejector mark, and two-dimensional bullet matching examines all possible rotations of bullets (2). Although the mathematical modeling of multimodal matching is tractable (see (6,7) for examples in biometrics), these detailed data are owned by the software vendor, which has published only aggregate performance curves. As it is not possible to identify joint PDFs for similarity scores of the multiple aspects (e.g., breech face, firing pin, ejector mark) from aggregate performance curves, we assume in our model that an aggregate similarity score is generated as a result of comparing each arrival to each database entry, and let F(x) be the intragun cumulative distribution function (CDF) of the similarity score between two images emanating from the same gun, and G(y) be the intergun CDF of similarity scores between two images from different guns. Consequently, our results cannot be operationalized––that is, it is not possible to compute an aggregate similarity score (e.g., as appears in the vertical axis of Fig. 8) from a trio of similarity scores for breech face, firing pin, and ejector mark––without the joint similarity score PDFs for the multiple aspects. Before introducing our decision variables, we describe how the probability of a true match between an arrival and a database entry is affected by the evidence status and spatial and temporal proximity. One complicating factor in ballistic imaging that does not typically arise in biometric matching is that an arrival can match multiple entries in the evidence database. 
Let $g_i(n)$ be the probability that an arrival of evidence status $i$ has $n$ true matches in the database, for $i = 0,1$ and $n = 0,1,2,\ldots$ (in our statistical analysis, we differentiate between true matches and detected matches). Let a true match with arrival evidence status $i$ (i.e., the arrival in this matching pair has evidence status $i$) have probability $p_{ij}$ of being in spatial proximity category $j$, for $i = 0,1$ and $j = 0,\ldots,J$. Also, let $h_m(a)$ be the PDF of the age of true matches between a random arrival and a random database entry.

As noted earlier, the matching software for ballistic images is rank based with a candidate list of a fixed size, where the top (e.g., 10) matches generated by a new arrival are forwarded to a forensic firearms examiner for final verification. In our model, we use a threshold-based system, where similarity scores above the threshold are forwarded for human verification; this system generates a candidate list of variable size. In (5), we show that the performance is very similar for these two systems, and later, we discuss the implications of our threshold-based assumption.

As the goal of this study is to exploit spatial, temporal, and evidence-status data to improve performance, we let the similarity score threshold depend on the evidence status of the arrival and the spatial proximity category and age of the pair being matched. We denote our decision variables by $t_{ija}$. Because $i$ and $j$ are categorical and the age $a$ is a continuous quantity, we restrict the dependence of $t_{ija}$ on $a$ to take on a specific functional form. After testing linear, power, and exponential functions, we settled on the exponential function,

$$t_{ija} = a_{ij} e^{b_{ij} a} + c_{ij} \tag{1}$$

Hence, our optimization is over the $6(J + 1)$ variables, $\{a_{ij}, b_{ij}, c_{ij};\ i = 0,1;\ j = 0,\ldots,J\}$. We are now in a position to formulate our optimization problem.
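As a minimal sketch of Eq (1), the following Python snippet evaluates an age-dependent threshold for a single (i, j) pair and counts the decision variables for J = 1. The function name `threshold` and the coefficient values are hypothetical placeholders chosen only for illustration; they are not the fitted values from the optimization.

```python
import math

def threshold(age_days, a_ij, b_ij, c_ij):
    """Eq (1): t_ija = a_ij * exp(b_ij * age) + c_ij for one (i, j) pair."""
    return a_ij * math.exp(b_ij * age_days) + c_ij

# Hypothetical coefficients (illustrative only). With a_ij < 0 and b_ij < 0,
# the threshold rises with age and levels off at c_ij.
A_IJ, B_IJ, C_IJ = -300.0, -0.0015, 500.0

ages = [0, 365, 3650]  # ages in days
thresholds = [threshold(age, A_IJ, B_IJ, C_IJ) for age in ages]

# Three coefficients (a, b, c) for each evidence status i in {0, 1}
# and each spatial category j in {0, ..., J}.
J = 1
n_decision_vars = 3 * 2 * (J + 1)

for age, t in zip(ages, thresholds):
    print(f"age = {age:5d} days -> threshold = {t:.1f}")
print("decision variables:", n_decision_vars)
```

With these placeholder values, the threshold is lowest for very recent database entries and increases toward `C_IJ` as the pair ages, which is the qualitative shape one would expect if matches cluster in time.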
The objective of our optimization problem is to maximize the probability that if at least one true match for an arrival exists in the database, then we detect at least one true match; we call this quantity the detection probability, and refer to one minus this quantity as the false negative rate. Note that this probability is much more easily calculated than the expected number of true matches detected, which may be a more natural performance measure. However, because each database entry has already been through the matching process, it is often the case that if an arrival has more than one match, it is already known that these matching entries in the database are also linked to each other.

To derive the objective function, we first observe that the probability of at least one true match in the database for an arrival is $\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n)$. Also, a true match of age $a$ involving an arrival of evidence status $i$ and a pair of spatial proximity category $j$ will go undetected if its similarity score (which has intragun CDF $F(x)$) is less than the threshold $t_{ija}$, which has probability $F(t_{ija})$. Integrating over the age PDF of matches, a true match involving an arrival of evidence status $i$ and a pair of spatial proximity category $j$ goes undetected with probability $\int_0^{\infty} F(t_{ija}) h_m(a)\,da$. It follows that our objective function is

$$\max_{a_{ij},\, b_{ij},\, c_{ij}} \ \frac{\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n) \left[ 1 - \left( \sum_{j=0}^{J} p_{ij} \int_0^{\infty} F(t_{ija}) h_m(a)\,da \right)^{n} \right]}{\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} g_i(n)} \tag{2}$$

YANG ET AL. . USING DATA TO IMPROVE BALLISTIC IMAGING PERFORMANCE

As with most problems of this type, our goal is to maximize the detection probability subject to some type of constraint on the false positives. In our model, a false positive occurs whenever a candidate that is not a true match is forwarded to a forensic examiner. Because the rank-based approach forwards 10 candidates per arrival to a forensic examiner (although Israel forwards the top 30 candidates, 10 each from firing pin, breech face, and ejector mark, we consider 10 in total, because there may be significant overlap in the three candidate lists), a natural constraint for our threshold-based approach would be to force the expected size of the candidate list (i.e., the expected number of similarity scores that exceed the threshold) per arrival to be no larger than 10. Because this quantity is very difficult to compute, we take an alternative approach and require the expected number of false positives per arrival to be no more than the expected number of false positives per arrival generated by a constant threshold system (i.e., the threshold does not vary with $i$, $j$, or $a$) when the expected candidate list size is 10. In (5), we show that this false positive inequality constraint behaves nearly the same as if we used an inequality constraint on the mean candidate list size.

To derive our false positive inequality constraint, we let $t_h$, which is estimated from data in our statistical analysis, be the threshold used in a constant threshold system that generates an expected candidate list size of 10. Because nearly all database entries are not matches to a given arrival, we assume that there are $N$ nonmatches in the database for every arrival, each of which ends up on the candidate list with probability $1 - G(t_h)$ under the constant threshold system. Hence, the right side of the constraint, which is the expected number of false positives per arrival under the constant threshold system, is $N[1 - G(t_h)]$. After summing over $i$ and $j$ and integrating over age $a$, we find that the left side of the inequality constraint, which is the expected number of false positives per arrival with threshold $t_{ija}$, is $N \sum_{i=0}^{1} \sum_{j=0}^{J} q_i r_j \int_0^{\infty} [1 - G(t_{ija})] h(a)\,da$. Canceling the database size, $N$, from both sides of the constraint, we obtain our false positive constraint,

$$\sum_{i=0}^{1} \sum_{j=0}^{J} q_i r_j \int_0^{\infty} [1 - G(t_{ija})]\, h(a)\,da \ \le\ 1 - G(t_h) \tag{3}$$

and our optimization problem is given by Eqs (1–3). We solve Eqs (1–3) using a sequential quadratic programming algorithm (via the fmincon function in MATLAB [8]). Because the optimization problem does not possess the second-order properties required to guarantee that the algorithm converges to a global optimum, we compared local optima resulting from various starting points in the large (12- or 18-dimensional) decision variable space to increase the likelihood that we are achieving a near-optimal solution.

Statistical Analysis

Overview

For our application to Israeli cartridge casings, we have nine quantities to estimate, which naturally divide into four groups: (i) the similarity score CDFs $F(x)$ and $G(y)$, (ii) the probabilities $q_i$, $r_j$, and $p_{ij}$, (iii) the probability mass function (PMF) $g_i(n)$ and the historical threshold $t_h$, and (iv) the age PDFs $h(a)$ and $h_m(a)$. All parameter values appear in Table 1, and the derivations of these values, along with graphs of the five probability distributions, are given below.

Estimating the Similarity Score PDFs

We assume that the intragun and intergun similarity score CDFs, $F(x)$ and $G(y)$, are lognormal and estimate the values of the four parameters, which are denoted by $\mu_F$, $\sigma_F$, $\mu_G$, and $\sigma_G$. There is not a definitive estimate in the literature on ballistic imaging matching performance for cartridge casings because the results depend upon a variety of factors, including the types of firearms and ammunition. Consequently, we derive two sets of parameter values from two different studies.

We first estimate these parameter values using the lower left performance curve in fig. 12 of reference 9, which already incorporates an initial filtering step (restricting to 9 mm Luger cartridge casings) and multiple measurements (the curve is generated using similarity scores for breech face, firing pin, and ejector mark). This performance curve plots the probability that a true match is ranked among the top 10 scores when the database contains one true match and $N$ nonmatches, as $N$ varies from 0 to $10^6$. If the intragun similarity score PDF is $f(x)$, then this probability is

$$\sum_{M=0}^{9} \frac{N!}{M!\,(N-M)!} \int_0^{\infty} (1 - G(t))^{M}\, G(t)^{N-M} f(t)\,dt \tag{4}$$

Using seven points along the performance curve in fig. 12 of reference 9, we derive the least-squares estimates $\mu_F = 6.7912$, $\sigma_F = 0.5927$, $\mu_G = 4.5207$, $\sigma_G = 0.5016$. Fig. 1 compares the seven data points to the performance curve predicted (via Eq (4)) by the lognormal distributions, and Fig. 2 shows the resulting PDFs.

FIG. 1––Actual points (x) on the performance curve for casings from data in (9) versus the performance curve generated by the best-fit lognormal distribution.

FIG. 2––The lognormal similarity score PDFs, intragun f(x) and intergun g(y), for casings.

We also estimate the parameter values from (10), which uses only breech face and firing pin information (i.e., no ejector mark information), has a database of 600 casings, and considers 32 arrivals that have a mate with the same ammunition type (Remington) in the database. The probability that an arrival’s mate ranks in position $k$ is

$$P(k) = \frac{599!}{(k-1)!\,(600-k)!} \int_0^{\infty} (1 - G(t))^{k-1}\, G(t)^{N-k} f(t)\,dt$$

where $N = 600$ is the database size. From Fig. 1 of reference (10), we use $P(1) = 18/32$, $P(2) = 2/32$, $P(3) = P(4) = 1/32$ and $\sum_{k=30}^{600} P(k) = 8/32$. The least-squares fit to these probabilities is $\mu_F = 6.50$, $\sigma_F = 1.26$, $\mu_G = 4.81$, $\sigma_G = 0.48$.

Estimating the Probabilities $q_i$, $r_j$, and $p_{ij}$

The evidence database from Israel contains 14,979 entries covering the entire country during 1980–2000, and its average size during 2006–2008 was 11,350 entries. The arrivals data consist of all arrivals between January 1, 2006 and December 31, 2008. There were 7138 arrivals during this time period, and 697 of these arrivals matched at least one entry in the database. An arrival can match multiple entries in the database, and there were a total of 1364 matching pairs (i.e., matches between an arrival and a database entry). Of the 7138 arrivals, 3598 were nonevidence, and 3540 were evidence, yielding $q_0 = 0.504$, $q_1 = 0.496$.

If we had data on the precise location of each arrival and each database entry, we could measure the Euclidean distance of each match and allow the threshold to be a specified functional form of the Euclidean distance, as in Eq (1). However, we only have data on which of the 89 Israeli police precincts collected each arrival and each database entry. Because many pairs of precincts generated no matches during 2006–2008, we use only two spatial categories (i.e., $J = 1$): intralocation and interlocation. To fully exploit the spatial information, it is desirable to merge two locations if it results in a more favorable ratio of intralocation matches to interlocation matches. Each of the 89 Israeli police precincts has an 89-dimensional vector stating the number of matches during 2006–2008 that it has with each precinct. In our analysis, two locations are good candidates for merging if the dot product of their vectors is large; we refer to this dot product as the two locations’ similarity and solve the graph partitioning problem that maximizes the average similarity within each group. We solve this problem using the k-means algorithm (11) with k = 20, which merges the 89 precincts into 20 groups so as to maximize the average similarity within each group. The k-means algorithm is random, and solutions can vary from run to run. We solve the problem 1000 times and group precincts into stations if they are classified in the same group in >90% of the solutions. We left all other precincts as isolated to prevent overfitting. This procedure results in 18 merged groups containing 55 precincts and 34 isolated precincts, for a total of 52 merged locations that we refer to as stations. The resulting 52 × 52 matrix of matches appears in Fig. 3. This merging process increases the intralocation matching proportion from 0.551 to 0.832 for nonevidence and from 0.595 to 0.875 for evidence.

FIG. 3––All nonzero entries in the matching matrix for the 52 merged stations. Each entry is the number of matches during 2006–2008 between each pair of merged stations. Stations A–R are the merged stations. Shaded squares are the intrastation matchings.

If we let $a_k$ and $d_k$ denote the number of arrivals from station $k$ and the number of database records from station $k$, then

$$r_0 = \frac{\sum_{k=1}^{52} a_k d_k}{\left(\sum_{l=1}^{52} a_l\right)\left(\sum_{l=1}^{52} d_l\right)} = 0.101$$

and $r_1 = 0.899$, where the subscript $j = 0$ corresponds to intrastation, and $j = 1$ corresponds to interstation. Of the 697 matches, 107 were from nonevidence arrivals, and 590 were from evidence arrivals. Of the 107 nonevidence arrivals generating matches, 89 were intrastation, giving $p_{00} = 89/107 = 0.832$ and $p_{01} = 0.168$. Similarly, we have $p_{10} = 516/590 = 0.875$ and $p_{11} = 0.125$.

Estimating the Historical Threshold $t_h$ and the PMF $g_i(n)$

From the raw data pertaining to the matches and arrivals, we can construct the PMF $\tilde{g}_i(m)$, which is the probability of detecting $m$ matches from an arrival that has evidence status $i$; this PMF is not to be confused with $g_i(n)$, which is the PMF for true (i.e., detected plus undetected) matches. The observed PMF $\tilde{g}_i(m)$ (Fig.
4) allows us to estimate the historical threshold th for a constant threshold policy that generates an average candidate list size of 10, by assuming that the mean number of true positives less than the database size (i.e., P1is much P1 q ng ðnÞ\ \NÞ, yielding i i i¼0 n¼1 1 X i¼0 qi 1 X ngi ðnÞ þ N½1 Gðth Þ ¼ 10 ð5Þ n¼1 Setting N = 11,350, which is the average value of the database size during 2006–2008, in Eq (5) gives th = 442.6. The most challenging part of our estimation procedure is to estimate gi(n), which is the PMF of true matches, from three quantities: gi ðmÞ, which is the observed PMF of detected matches, the historical threshold th under a constant threshold system, and the intragun similarity score CDF F(x). Let Pmn be the probability that m matches are detected from an arrival in the Israeli database, given that n matches to this arrival exist and the arrival has evidence status i (the argument i in Pmn is suppressed Pfor ease of presentation). It follows that gi ðmÞ ¼ 1 n¼m Pmn gi ðnÞ. However, for practical purposes, we truncate this system of equations at m = n = 14 (i.e., we set gi (n) = 0 for n ≥ 15) because gi ðmÞ is only nonzero for m ≤ 14 in our data set. This truncated system of equations can be expressed as 0 P11 B 0 B B .. @ . 0 P12 P22 .. . 0 .. . P1;14 P2;14 .. . P14;14 10 CB CB CB A@ gi ð1Þ gi ð2Þ .. . gi ð14Þ 1 0 C B C B C¼B A @ gi ð1Þ gi ð2Þ .. . 1 C C C A ð6Þ the size of the groups; for example, state 1,1,2 means that there are a total of four true matches currently in the database, two of them are connected (i.e., have been correctly detected to be a match) and the remaining two are isolated (i.e., have not been correctly matched with any of the other three true matches). 
We use the notation P(A|B) to represent the probability that, conditioned on the current matchings being in state B, after a new arrival undergoes the matching process with the prior arrivals, the new state is A (where, by construction, the sum of the numbers in A is always one greater than the sum of the numbers in B). For illustrative purposes, we show how to use the HMM to compute Pm3 for m = 0,1,2,3, and then we provide a broad description of a general algorithm for any value of n. The initial state of the HMM is 1, which refers to the first of the n true matches having already entered the database. For this example where n = 3, we need to track the HMM through n more arrivals (i.e., until after the fourth arrival) because the fourth arrival has n = 3 true matches in the database. These dynamics are described by the following transitions. • Transitions caused by the second arrival: P(1,1|1) = p, P(2|1) = 1 p. = p2, • Transitions caused by the third arrival: P(1,1,1|1,1) 2 P(1,2|1,1) = 2p(1 p), P(3|1,1) = (1 p) , P(1,2|2) = p2, P(3|2) = 1 p2. • Transitions caused 2by the fourth arrival: P(1,1,1,1|1,1,1) =2 p3, P(1,1,2|1,1,1) = 3p (1 p), P(1,3|1,1,1) = 3(1 p) p, P(4|1,1,1) = (1 p)3, P(1,1,2|1,2) = p3, P(1,3|1,2) = p(1 p2), P(2,2|1,2) = p2 (1 p), P(4|1,2) = (1 p)(1 p2), P(1,3|3) = p3, P(4|3) = 1 p3. Hence, the hidden states after the second arrival are (1,1) and (2), the hidden states after the third arrival are (1,1,1), (1,2), and (3), and the hidden states after the fourth arrival are (1,1,1,1), (1,1,2), (1,3), (2,2), and (4). The HMM transition probabilities derived above can be written as the following stochastic matrices, denoted by Mk, gi ð14Þ because gi ðmÞ is known and the system of Eq (6) is invertible, our estimation problem for gi(n) reduces to determining the probabilities Pmn, which are a function of m, n, and the known false negative probability F(th) under the constant threshold system, which is denoted by p. 
Conditioned on n matches existing in the database, the probability Pmn that a new arrival detects m of them depends on two things. First, it depends on the current knowledge about how these n entries are related, which can range from not realizing that any of them are matched to each other (i.e., there are n singletons) to realizing that they are all matched to each other (i.e., they are in a single group of n matches). Second, it depends on which matching groups, if any, the new arrival is detected to belong. Hence, to derive Pmn, we need to construct a detailed dynamic model that tracks the evolution of the n true matches as they sequentially arrive to the system and undergo a matching process (with false negative probability p = F(th)) with the prior arrivals. This model is a hidden Markov model (HMM) because we cannot observe the state (or transition probabilities among states) (12). More generally, the state of the HMM is defined by the current matched groupings of the true matches that have already arrived, where the groupings are given in the ascending order of 107 M1 ¼ ð p 1 p Þ; M2 ¼ p2 0 2pð1 pÞ ð1 pÞ2 ; p2 1 p2 0 1 p3 3p2 ð1 pÞ 3ð1 pÞ2 p 0 ð1 pÞ3 M3 ¼ @ 0 p3 pð1 p2 Þ p2 ð1 pÞ ð1 p2 Þð1 pÞ A 0 1 p3 0 0 p3 Note that the product M1M2 equals the probability of arriving at the various states that the fourth true match will see upon arrival: P(1,1,1) = p3, P(1,2) = 3p2 (1 p), P(3) = (1 p)2 (1 + 2p) . Finally, we group all the transitions from the fourth arrival according to how many matches are actually found. For example, the transition from (1,1,1) to (1,3) means that two matches are found. Using the law of total probability (i.e., conditioning on all possible states of the current matching and then summing the joint probabilities; see pg 6 of [13]) yields our final result: 108 JOURNAL OF FORENSIC SCIENCES FIG. 
4––The PMFs for the evidence true matches (- - -), evidence detected matches (…), nonevidence true matches (-.-), and nonevidence detected matches (—). …,137. To estimate the PDF hm (a), we solve a maximum likelihood estimation problem with uncertain data. Let the known ages be denoted by the 1227-dimensional vector X, and the age of records with uncertain age be given by the 137-dimensional vector Z. If we denote the lognormal parameters by (la, ra), then the Q Q137 R tj þ365 likelihood function is 1227 f ðZj ÞdZj . Choosing i¼1 f ðXi Þ j¼1 tj (la, ra) to maximize the log-likelihood function, which is P137 P1227 i¼1 log½f ðXi Þþ j¼1 log½Fðtj þ365ÞFðtj Þ, yields la = 4.904 and ra = 1.687. The frequency distribution of these 1364 ages and the resulting lognormal are plotted in Fig. 5. We only know the year in which each database entry was acquired, and to estimate the age PDF h(a) for all database entries, we assume that the age of an entry is December 31, 2010 minus the acquisition date of the entry. However, we only know the year in which each entry in the Israeli database was acquired. We estimate h(a) by fitting a piecewise cubic hermite interpolating polynomial (14) to the yearly aggregates to estimate an increasing smooth CDF (Fig. 6) and then numerically differentiating it to yield a PDF (Fig. 7). Results P03 ¼ Pð1; 1; 1; 1j1; 1; 1ÞPð1; 1; 1Þ þ Pð1; 1; 2j1; 2ÞPð1; 2Þ þ Pð1; 3j3ÞPð3Þ ¼ p3 ; P13 ¼ Pð1; 1; 2j1; 1; 1ÞPð1; 1; 1Þ þ Pð2; 2j1; 2ÞPð1; 2Þ ¼ 3p4 ð1 pÞ; P23 ¼ Pð1; 3j1; 1; 1ÞPð1; 1; 1Þ P33 þ Pð1; 3j1; 2ÞPð1; 2Þ ¼ 3p3 ð1 pÞ2 ð1 þ 2pÞ; ¼ Pð4j1; 1; 1ÞPð1; 1; 1Þ þ Pð4j1; 2ÞPð1; 2Þ þ Pð4j3ÞPð3Þ ¼ ð1 pÞ3 ð6p3 þ 6p2 þ 3p þ 1Þ For a general value of n, the calculation of Pmn for m = 0,1,…, n can be carried out by the following algorithm. Construct the HMM through the n + 1st arrival, and derive the transition Q matrices Mk for k = 1, …, n. 
Compute n1 k¼1 Mk , which gives the probability distribution for the various hidden states that the n + 1st true match sees upon arrival. In the transition matrix Mn, find the number of matches m detected for each possible transition A ? B. Using the law of total probability, multiply the transition probability from A to B in Mn by the probability of observing state A, which is Qn1 k¼1 Mk , and add these products over all possible transitions to get Pmn. We first report results using the matching performance estimated from (9). Under the constant threshold policy (i.e., which employs the threshold th) for cartridge casings in Israel, the probability that at least one true match for an arrival is detected, given that at least one true match exists, is 0.931. This detection probability increases to 0.987 under the optimal policy derived from Eqs (1–3), which represents a 81.4% reduction (from 0.069 to 0.013) in the false negative rate. The optimal thresholds from Eq (1) are given in the last row of Table 2 and are higher for interstation matches, nonevidence arrivals, and older ages (Fig. 8). By optimizing each of the three types of information in isolation and in pairs (Table 2), we find that optimizing age offers slightly more improvement than optimizing spatial information, while optimizing evidence status provides very little improvement (e.g., optimizing only evidence status increases the detection probability by just 0.005 over the constant threshold policy). In addition, the impact of optimizing age and spatial information is subadditive. Running this algorithm for n = 1, …, 14 results in the gi (n) PMF plotted in Fig. 4. Estimating the Age Distributions We assume that the age PDF of true matches is the same as the observed age PDF of detected matches in the Israeli database and estimate hm(a) by a lognormal. 
Of the 1364 matching pairs during 2006–2008, 1227 (or 90.0%) have database entries that occurred during 2006–2008, in which case we know the exact age in days. For the remaining 137 (or 10.0%) matching pairs, we only know the year when the database entry was acquired, and so the age is in the interval $[t_j, t_j + 365]$ for j = 1, …, 137.

Using the matching performance parameter values derived from (10), we find that the detection probability for the constant threshold policy is 0.707, and the detection probability for the optimal policy is 0.844. The absolute improvement in detection probability is larger using (10) rather than (9) (0.844 − 0.707 = 0.137 vs. 0.987 − 0.931 = 0.056), although the percentage reduction in the false negative rate is considerably smaller (46.8% vs. 81.4%).

FIG. 5: Frequency distribution of ages of the 1364 matching pairs in the Israeli database, containing 1227 exact ages and 137 ages that are randomly sampled from the correct year, along with the best-fit lognormal PDF, $h_m(a)$.

FIG. 6: Age CDF fit to raw Israeli data.

FIG. 7: Age PDF h(a) for all database records.

FIG. 8: Optimal thresholds for Israeli cartridge casings.

TABLE 2: Detection probability for all combinations of optimizing evidence status, spatial category and age information. The subscripts of tija are suppressed if they are not being optimized. The matching performance is based on data in (9).

Optimized Information | Thresholds | Detection Probability
None | th = 443 | 0.931
Evidence | t0 = 486, t1 = 416 | 0.936
Spatial | t0 = 321, t1 = 528 | 0.965
Age | ta = 391.1e0.00152a + 537.4 | 0.972
Evidence, spatial | t00 = 354, t01 = 562, t10 = 301, t11 = 502 | 0.967
Evidence, age | t0a = 302.7e0.00149a + 507.9, t1a = 345.9e0.00153a + 587.3 | 0.974
Spatial, age | t0a = 283.3e0.00100a + 435.7, t1a = 420.4e0.00128a + 674.0 | 0.986
Evidence, spatial, age | t00a = 292.6e0.00121a + 459.0, t01a = 408.5e0.00136a + 691.6, t10a = 223.1e0.00142a + 364.9, t11a = 339.2e0.00158a + 587.6 | 0.987

Discussion

Our main result is that exploiting information (particularly spatiotemporal information) that is extraneous to the ballistic imaging process can improve the performance of ballistic imaging systems. The magnitude of improvement is difficult to estimate due to the lack of a definitive estimate for matching performance in the literature. While the increase in detection probability from 0.931 to 0.987 is modest due to the high base-level detection probability (derived from matching performance data in [9]), this improvement is impressive when viewed as an 81.4% reduction in the false negative rate. The absolute improvement in detection probability from 0.707 to 0.844 is somewhat larger when using the matching performance data in (10), although the reduction in the false negative rate is only 46.8%. These results suggest that crime guns and their crimes do indeed cluster in space and time, and this information can be exploited to solve more crimes. Although Israel performs nationwide searches, we can also analyze the counterfactual scenario in which Israel only performs intraprecinct searches. Even if all intraprecinct matches are detected, the detection probability (using matching performance data in [9]) is only 0.729, which, when compared to 0.931, reveals the benefit of performing nationwide searches in Israel.
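As a quick check of the arithmetic behind the comparisons quoted above, the following sketch recomputes the absolute gain in detection probability and the relative reduction in the false negative rate for both data sets. It assumes the unrounded optimal detection probability of 0.9872 reported in the supporting material for the data in (9); the function name is illustrative and does not come from the paper.

```python
def fn_reduction(p_const, p_opt):
    """Return (absolute gain in detection probability,
    relative reduction in the false negative rate)."""
    fn_const = 1.0 - p_const  # false negative rate, constant threshold policy
    fn_opt = 1.0 - p_opt      # false negative rate, optimal policy
    return p_opt - p_const, (fn_const - fn_opt) / fn_const


# Matching performance from (9); 0.9872 is the unrounded value (assumption stated above).
gain9, red9 = fn_reduction(0.931, 0.9872)
# Matching performance from (10).
gain10, red10 = fn_reduction(0.707, 0.844)

print(f"(9):  gain = {gain9:.3f}, false negative reduction = {100 * red9:.1f}%")
print(f"(10): gain = {gain10:.3f}, false negative reduction = {100 * red10:.1f}%")
```

The two measures rank the data sets in opposite order because the relative reduction divides by the baseline false negative rate, which is much smaller for (9) than for (10).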
Other Potential Applications

As mentioned earlier, our model can also be directly applied to bullets rather than cartridge casings. More generally, because the benefits of our approach stem almost entirely from exploiting the spatial and temporal clustering of crimes committed by crime guns (Table 2), our approach may be most beneficial for countries that, like Israel, cover a small geographic area and do not suffer from long delays in data entry. Some European countries may fit this profile: many are comparable in size to Israel, and, for example, the UK’s ballistic imaging system does not appear to have any backlogs (3). In contrast, the US may not be an ideal setting for our approach: it is much larger geographically than Israel, and new images are not added to the NIBIN database in a timely manner (1). Indeed, the temporal clustering suggests that NIBIN performance might improve if new images were entered into the NIBIN database in a last-in first-out (LIFO) manner rather than in first-in first-out (FIFO) order.

The US database associated with NIBIN is divided into 47 partitions that are grouped into 12 regions, and so there are three possible geographical approaches to matching: each arrival undergoes only intrapartition searches, only intraregion searches, or national searches. Under a constant threshold policy, there is a tradeoff inherent in these three approaches: as the system expands from performing only intrapartition searches to performing intraregion searches and on to performing national searches, it gains improved coverage but experiences deteriorating matching performance (due to the increased database size). In theory, an optimized approach to national searches can largely bypass this tradeoff: by setting higher threshold levels for more distant (e.g., inter-regional) searches, it can achieve full coverage and perhaps suffer only a small degradation in matching accuracy.
While an optimized approach to national searches would––by construction––perform at least as well as the constant threshold intrapartition policy that is in widespread use in the US (i.e., using infinite thresholds for interpartition matches is feasible in the national approach and would reduce to an approach that employs only intrapartition searches), the key issue is to assess the magnitude of this improvement. Calculations in (5) compare the performance of these three geographical approaches in the US for both cartridge casings and bullets. However, these numerical results are highly speculative because of the lack of publicly available spatial data (e.g., we do not know what fraction of matches are intrapartition vs. intra-regional vs. interregional) and are not reported here. As noted in (2), progress in this area is problematic because the sole vendor, Forensic Technology Inc., has much of the necessary data, and hence, the National Institute of Standards and Technology, which works on certain technical aspects of ballistic imaging (15), may be in the best position to perform or enable future research. With additional US data, the model in Eqs (1–3) could also be applied in several other ways. The spatial categories are quite general and could be used to exploit spatial patterns in the illegal gun market in the US (16). The proposed Reference Ballistic Image Database (RBID), which would maintain a national database from firings of newly manufactured and imported guns, could be accommodated in our model by introducing a third type of evidence status (i.e., i = 2 would correspond to new guns). Although a national RBID was deemed to be impractical due to its large database size as well as other factors (e.g., gun wear over time, differences caused by ammunition, filling the database with guns that are extremely unlikely to be involved in a crime) (2), this issue could be revisited using our approach, which would allow very high thresholds for new guns. 
Moreover, if RBID incorporated point-of-sale data, then our approach could use lower thresholds for the minuscule fraction of retailers that sell the majority of crime guns in the US (17). Note that as ballistic imaging is a search process that is followed by human verification, the retailers who sell many crime guns would be unaffected by the increased false positive rate associated with their lower thresholds.

Limitations of Our Analysis

Due to the large amount of Israeli data, we have precise estimates of all the parameters in Table 1 (e.g., the standard errors for q0, r0, and p00 are 0.0059, 0.0012, and 0.0362, respectively), with the exception of the parameters related to matching accuracy; that is, the uncertainty in our results is driven almost entirely by our estimates of μF, σF, μG, and σG. The biggest shortcoming of our analysis is that the actual problem deals with similarity scores that are based on multimodal (breech face, firing pin, ejector mark) measurements that are possibly correlated and possibly repeated (Israel acquires two samples from each evidence and nonevidence gun that is recovered) and that come from a variety of gun and ammunition types. Due to the lack of raw similarity score data and the range of matching performance estimates in the literature, we use two different data sets that present aggregate performance: a performance curve (Fig. 12 of reference [9]) for combined breech face, firing pin, and ejector mark scores for 9 mm caliber guns, and results (Fig. 2 of reference [10]) from an experiment for combined breech face and firing pin scores with Remington ammunition. Hence, the respective detection probabilities of 0.931 and 0.707 under the constant threshold policy are not necessarily an accurate prediction of Israel’s current performance, although our results (see also [5]) suggest that the optimal policy consistently outperforms the current threshold policy.
More generally, ballistic imaging technology is in a state of flux, with the vendor recently introducing three-dimensional ballistic imaging matching systems for both cartridge casings and bullets (18), for which very little published performance data exist (19,20). In addition, ballistic imaging systems are likely to perform better in controlled experiments than in the field. Moreover, as mentioned earlier, our results cannot be operationalized (i.e., the aggregate similarity scores on the vertical axis of Fig. 8 cannot be computed from raw similarity scores for breech face, firing pin and ejector mark) unless one gains access to data for the joint PDF of breech face, firing pin, and ejector mark scores. Although Forensic Technology Inc. has these data, they are not in the public domain. A second limitation is that we use a threshold-based approach rather than the rank-based approach that is in current use. If similarity scores were independent of gun and ammunition type, then a threshold-based approach would perform at least as well as a rank-based approach. However, if similarity scores vary by gun and ammunition type (which is likely to be the case), then our optimal policy may not work well. There are two ways to adapt our ideas to this setting: (i) have the threshold tijk also depend on the gun and ammunition type, or (ii) change the optimization problem to a rank-based system, where the decision variables are changed from tijk to multiplicative scaling factors qijk (i.e., a similarity score s would be transformed to qijk s), which need not vary by gun or ammunition type. The former approach is feasible (e.g., it has been used for fingerprints with different image qualities [21]) but tedious, and the latter approach is preferable. The final limitation of the Israeli analysis is the implicit assumption that the probabilities qi, rj, and pij do not vary over time. During 2006–2008 in Israel, these quantities were reasonably stable. 
Nonetheless, there are three concerns. The first is that criminals may adapt their behavior as a result of the ballistic imaging system. In Boston, criminals were found not to increase their use of revolvers, which do not eject cartridge casings, as a result of the implementation of a casing imaging system (2), and so this concern may be unfounded, particularly given the system’s lack of transparency from the criminal’s viewpoint. The second concern is changes in the mobility patterns of crime guns, which could occur for a variety of reasons. A third concern is changes in police procedures (e.g., spatial reallocation of law enforcement resources). The latter two concerns can be partially mitigated by periodically updating the estimates of these probabilities.

Conclusion

We develop a data-driven approach to improve the performance of ballistic imaging systems and predict that it could increase the detection probability in Israel, using matching data for cartridge casings from 2006 to 2008. This improvement is achieved by requiring a very close match for pairs of images that are distant in space and/or time. This approach may have potential applications in other countries that are of comparable size to Israel (e.g., European countries). An assessment of an optimized national approach for the US seems worth pursuing, but the US Department of Justice and/or the National Institute of Standards and Technology would need to gather the necessary data to enable such an assessment.

Acknowledgments

Supported by the Graduate School of Business, Stanford University (Y.Y. and L.M.W.).

References

1. Office of the Inspector General, U.S. Department of Justice. The Bureau of Alcohol, Tobacco, Firearms and Explosives’ National Integrated Ballistic Information Network program; 2005 Audit Report 05-30. Washington, DC: Office of the Inspector General, U.S. Department of Justice, 2005.
2. Cork DL, Rolph JE, Meieran ES, Petrie CV, editors. Ballistic imaging. Washington, DC: National Academies Press, 2008.
3. http://www.nabis.police.uk/home.asp (accessed on October 8, 2012).
4. Committee on Identifying the Needs of the Forensic Sciences Community, National Research Council. Strengthening forensic science in the United States: a path forward. Washington, DC: National Academies Press, 2009.
5. Yang Y. Three data-driven operations research analyses in the public sector (dissertation). Stanford, CA: Stanford University, 2012.
6. Prabhakar S, Jain AK. Decision-level fusion in fingerprint verification. Pattern Recogn 2002;35:861–74.
7. Baveja M, Wein LM. An effective two-finger, two-stage biometric strategy for the US-VISIT Program. Oper Res 2009;57:1068–81.
8. http://www.mathworks.com/products/matlab/index.html (accessed on October 8, 2012).
9. http://www.forensictechnology.com/Default.aspx?app=LeadgenDownload&shortpath=docs/LargeDatabaseFinal.pdf (accessed on October 8, 2012).
10. De Kinder J, Tulleners F, Thiebaut H. Reference ballistic imaging database performance. Forensic Sci Int 2004;140:207–15.
11. http://www.ece.ecsb.edu/ hespanha/software/grPartition.html (accessed on October 8, 2012).
12. Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, U.K.: Cambridge University Press, 1998.
13. Karlin S, Taylor HM. A first course in stochastic processes, 2nd edn. New York, NY: Academic Press, 1975.
14. Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM J Numer Anal 1980;17:238–46.
15. Vorburger TV, Yen JH, Bachrach B, Renegar TB, Filliben JJ, Ma L et al. Surface topography analysis for a feasibility assessment of a national ballistics imaging database. Gaithersburg, MD: National Institute of Standards and Technology, 2007; Internal Report 7362.
16. Wintemute GJ, Romero MP, Wright MA, Grassel KM.
The life cycle of crime guns: a description based on guns recovered from young people in California. Ann Emerg Med 2004;43:733–42.
17. Wintemute GJ, Braga AA. Opportunities for state-level action to reduce firearm violence: proceeding from the evidence. Am J Public Health 2011;101:e1–3.
18. http://www.forensictechnology.com/IBISTRAX/ (accessed on October 8, 2012).
19. Roberge D, Beauchamp A. The use of BulletTRAX-3D in a study of consecutively manufactured barrels. AFTE J 2006;38:166–72.
20. Brinck TB. Comparing the performance of IBIS and BulletTRAX-3D technology using bullets fired through 10 consecutively rifled barrels. J Forensic Sci 2008;53:677–82.
21. Wein LM, Baveja M. Using fingerprint image quality to improve the identification performance of the U.S. Visitor and Immigrant Status Indicator Technology Program. Proc Natl Acad Sci U.S.A. 2005;102:7772–5.

Additional information and reprint requests:
Lawrence M. Wein, Ph.D.
Jeffrey S. Skoll Professor of Management Science
Graduate School of Business
Stanford University
655 Knight Way
Stanford, CA 94305
E-mail: [email protected]

SUPPORTING MATERIAL

We estimate the similarity score PDFs f(x) and g(y) in §1 and the Israeli casings parameter values in §2. Justification for using a threshold-based approach is presented in §3, and a comparison of the false positive constraint and the list size constraint is performed in §4.

1 Estimating the Similarity Score PDFs

We first estimate the lognormal (using natural logarithms, not logarithms to the base 10) parameters (µf, σf, µg, σg) using the lower left graph in Fig. 12 of [1], which plots the probability that a true match is ranked among the top 10 scores when the database contains one true match and N nonmatches, as N varies from 0 to $10^6$. If the intra-gun similarity score PDF is f(x), then this probability is

$$\sum_{M=0}^{9} \int_0^\infty \frac{N!}{M!\,(N-M)!}\,(1-G(t))^{M}\,G(t)^{N-M}\,f(t)\,dt. \tag{1}$$

Using seven points along the performance curve in Fig. 12 of [1], we derive the least-squares estimates µf = 6.7912, σf = 0.5927, µg = 4.5207, σg = 0.5016. Fig. 2 compares the seven data points to the performance curve predicted (via (1)) by the lognormal distributions, and Fig. 3 shows the resulting PDFs.

In [2], 32 arrivals are compared to a database of 600 images that include the arrivals. The probability that an arrival’s mate ranks in position k is

$$P(k) = \int_0^\infty \frac{599!}{(k-1)!\,(600-k)!}\,(1-G(t))^{k-1}\,G(t)^{N-k}\,f(t)\,dt, \tag{2}$$

where N = 600 is the database size. From Fig. 1 of [2], we use $P(1) = \frac{18}{32}$, $P(2) = \frac{2}{32}$, $P(3) = P(4) = \frac{1}{32}$ and $\sum_{k=30}^{600} P(k) = \frac{8}{32}$. The least-squares fit to these probabilities is µf = 6.50, σf = 1.26, µg = 4.81, σg = 0.48.

2 Estimating Israeli Casings Parameter Values

We estimate the probabilities $q_i$, $r_j$ and $p_{ij}$ in §2.1, the PMF $g_i(n)$ and the historical threshold $t_h$ in §2.2, and the age PDFs h(a) and $h_m(a)$ in §2.3.

2.1 Estimating the Probabilities $q_i$, $r_j$ and $p_{ij}$

The evidence database contains 14,979 entries covering the entire country during 1980-2000. The arrivals data consist of all arrivals between January 1, 2006 and December 31, 2008. There were 7138 arrivals during this time period, and 697 of them generated matches.

We start by estimating the three probabilities, $q_i$, $r_j$ and $p_{ij}$. Of the 7138 arrivals, 3598 were nonevidence and 3540 were evidence, yielding q0 = 0.504, q1 = 0.496.

We know which of the 89 Israeli police precincts collected each arrival and each database entry. Because many pairs of precincts generated no matches during 2006-2008, we use only two categories (i.e., J = 1): intra-location and inter-location. However, to fully exploit the spatial information, it is desirable to merge two locations if it results in a more favorable ratio of intra-location matches to inter-location matches. Each of the 89 Israeli police precincts has an 89-dimensional vector stating the number of matches during 2006-2008 that it has with each precinct.
In our analysis, two locations are good candidates for merging if the dot product of their vectors is large; we refer to this dot product as the two locations’ similarity, and solve the graph partitioning problem that maximizes the average similarity within each group. We solve this problem using the k-means algorithm [4] with k = 20, which merges the 89 precincts into 20 groups so as to maximize the average similarity within each group. The k-means algorithm is randomized, and solutions can vary from run to run. We solve the problem 1000 times, and group precincts into stations if they are classified in the same group in > 90% of the solutions. We leave all other precincts as isolated to prevent overfitting. This procedure results in 18 merged groups containing 55 precincts, and 34 isolated precincts, for a total of 52 “new” stations. The resulting 52 × 52 matrix of matches appears in Fig. 6. This merging process increases the intra-location matching proportion from 0.551 to 0.832 for nonevidence, and from 0.595 to 0.875 for evidence. If we let $a_k$ and $d_k$ denote the number of arrivals from station k and the number of database records from station k, then

$$r_0 = \frac{\sum_{k=1}^{55} a_k d_k}{\left(\sum_{l=1}^{55} a_l\right)\left(\sum_{l=1}^{55} d_l\right)} = 0.101$$

and r1 = 0.899, where the subscript j = 0 corresponds to intra-station and j = 1 corresponds to interstation.

Of the 697 matches, 107 were from nonevidence arrivals and 590 were from evidence arrivals. Of the 107 nonevidence arrivals generating matches, 89 were intra-station, giving p00 = 89/107 = 0.832 and p01 = 0.168. Similarly, we have p10 = 516/590 = 0.875 and p11 = 0.125.

2.2 Estimating the Historical Threshold $t_h$ and the PMF $g_i(n)$

The observed PMF $g_i^*(m)$ (Fig. 7) allows us to estimate the historical threshold $t_h$ for a constant threshold policy that generates an average candidate list size of 10, by assuming that the mean number of true positives is much less than the database size (i.e., $\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\,g_i^*(n) \ll N$), yielding

$$\sum_{i=0}^{1} q_i \sum_{n=1}^{\infty} n\,g_i^*(n) + N\,[1 - G(t_h)] = 10. \tag{3}$$

Setting N = 11,350, which is the average value of the database size during 2006-2008, in (3) gives $t_h$ = 442.6.

In the remainder of this subsection, we estimate $g_i(n)$, which is the PMF of true matches, from three quantities: $g_i^*(m)$, which is the observed PMF of detected matches, the historical threshold $t_h$ under a constant threshold system, and the intra-gun similarity score CDF F(x). Let $P_{mn}$ be the probability that m matches are detected from an arrival in the Israeli database, given that n matches to this arrival exist and the arrival has evidence status i (the argument i in $P_{mn}$ is suppressed for ease of presentation). It follows that $g_i^*(m) = \sum_{n=m}^{\infty} P_{mn}\,g_i(n)$. However, for practical purposes, we truncate this system of equations at m = n = 14 (i.e., we set $g_i(n) = 0$ for n ≥ 15) because $g_i^*(m)$ is only nonzero for m ≤ 14 in our data set. This truncated system of equations can be expressed as

$$\begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1,14} \\ 0 & P_{22} & \cdots & P_{2,14} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_{14,14} \end{pmatrix} \begin{pmatrix} g_i(1) \\ g_i(2) \\ \vdots \\ g_i(14) \end{pmatrix} = \begin{pmatrix} g_i^*(1) \\ g_i^*(2) \\ \vdots \\ g_i^*(14) \end{pmatrix}. \tag{4}$$

Because $g_i^*(m)$ is known and the system of equations in (4) is invertible, our estimation problem for $g_i(n)$ reduces to determining the probabilities $P_{mn}$, which are a function of m, n and the known false negative probability $F(t_h)$ under the constant threshold system, which is denoted by p.

Conditioned on n matches existing in the database, the probability $P_{mn}$ that a new arrival detects m of them depends on two things.
First, it depends on the current knowledge about how these n entries are related, which can range from not realizing that any of them are matched to each other (i.e., there are n singletons) to realizing that they are all matched to each other (i.e., they are in a single group of n matches). Second, it depends on the matching groups, if any, to which the new arrival is detected to belong. Hence, to derive $P_{mn}$, we need to construct a detailed dynamic model that tracks the evolution of the n true matches as they sequentially arrive to the system and undergo a matching process (with false negative probability $p = F(t_h)$) with the prior arrivals. This model is a Hidden Markov Model (HMM) because we cannot observe the state (or transition probabilities among states) [5]. Specifically, the state of the HMM is defined by the current matched groupings of the true matches that have already arrived, where the groupings are given in ascending order of the size of the groups; e.g., state (1, 1, 2) means that there are a total of four true matches currently in the database, two of them are connected (i.e., have been correctly detected to be a match) and the remaining two are isolated (i.e., have not been correctly matched with any of the other three true matches).

We use the notation P(A|B) to represent the probability that, conditioned on the current matchings being in state B, after a new arrival undergoes the matching process with the prior arrivals, the new state is A (where, by construction, the sum of the numbers in A is always one greater than the sum of the numbers in B). For illustrative purposes, we show how to use the HMM to compute $P_{m3}$ for m = 0, 1, 2, 3, and then we provide a broad description of a general algorithm for any value of n. The initial state of the HMM is (1), which refers to the first of the n true matches having already entered the database.
For this example where n = 3, we need to track the HMM through n more arrivals (i.e., until after the fourth arrival) because the fourth arrival has n = 3 true matches in the database. These dynamics are described by the following transitions.

Transitions caused by the second arrival:

$$P(1,1 \mid 1) = p, \quad P(2 \mid 1) = 1 - p.$$

Transitions caused by the third arrival:

$$P(1,1,1 \mid 1,1) = p^2, \quad P(1,2 \mid 1,1) = 2p(1-p), \quad P(3 \mid 1,1) = (1-p)^2,$$
$$P(1,2 \mid 2) = p^2, \quad P(3 \mid 2) = 1 - p^2.$$

Transitions caused by the fourth arrival:

$$P(1,1,1,1 \mid 1,1,1) = p^3, \quad P(1,1,2 \mid 1,1,1) = 3p^2(1-p), \quad P(1,3 \mid 1,1,1) = 3(1-p)^2 p, \quad P(4 \mid 1,1,1) = (1-p)^3,$$
$$P(1,1,2 \mid 1,2) = p^3, \quad P(1,3 \mid 1,2) = p(1-p^2), \quad P(2,2 \mid 1,2) = p^2(1-p), \quad P(4 \mid 1,2) = (1-p)(1-p^2),$$
$$P(1,3 \mid 3) = p^3, \quad P(4 \mid 3) = 1 - p^3.$$

Hence, the hidden states after the second arrival are (1, 1) and (2), the hidden states after the third arrival are (1, 1, 1), (1, 2) and (3), and the hidden states after the fourth arrival are (1, 1, 1, 1), (1, 1, 2), (1, 3), (2, 2) and (4). The HMM transition probabilities derived above can be written as the following stochastic matrices, denoted by $M_k$:

$$M_1 = \begin{pmatrix} p & 1-p \end{pmatrix}, \qquad M_2 = \begin{pmatrix} p^2 & 2p(1-p) & (1-p)^2 \\ 0 & p^2 & 1-p^2 \end{pmatrix},$$

$$M_3 = \begin{pmatrix} p^3 & 3p^2(1-p) & 3(1-p)^2 p & 0 & (1-p)^3 \\ 0 & p^3 & p(1-p^2) & p^2(1-p) & (1-p^2)(1-p) \\ 0 & 0 & p^3 & 0 & 1-p^3 \end{pmatrix}.$$

Note that the product $M_1 M_2$ equals the probability of arriving at the various states that the fourth true match will see upon arrival: $P(1,1,1) = p^3$, $P(1,2) = 3p^2(1-p)$, $P(3) = (1-p)^2(1+2p)$. Finally, we group all the transitions from the fourth arrival according to how many matches are actually found. For example, the transition from (1, 1, 1) to (1, 3) means that two matches are found.
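The transition rule illustrated above can be sketched in a few lines of Python. The sketch assumes, consistent with the listed transition probabilities, that a new arrival misses each prior true match independently with probability p and joins an existing group whenever it detects at least one of its members; the function names `step` and `match_count_pmf` are illustrative and are not taken from the paper or from any library.

```python
from collections import defaultdict
from itertools import product


def step(state, p):
    """Yield (next_state, matches_found, probability) for one new arrival.

    state is a tuple of group sizes in ascending order; p is the probability
    that a single pairwise comparison misses a true match.
    """
    for linked in product((False, True), repeat=len(state)):
        prob = 1.0
        merged = 1  # the new arrival itself
        untouched = []
        for size, hit in zip(state, linked):
            if hit:
                prob *= 1.0 - p ** size  # at least one member of the group is detected
                merged += size
            else:
                prob *= p ** size        # every member of the group is missed
                untouched.append(size)
        yield tuple(sorted(untouched + [merged])), merged - 1, prob


def match_count_pmf(n, p):
    """Return [P_0n, P_1n, ..., P_nn] for an arrival with n true matches."""
    dist = {(1,): 1.0}  # initial state: the first true match is in the database
    for _ in range(n - 1):  # arrivals 2, ..., n
        nxt = defaultdict(float)
        for state, pr in dist.items():
            for new_state, _, q in step(state, p):
                nxt[new_state] += pr * q
        dist = nxt
    pmf = [0.0] * (n + 1)
    for state, pr in dist.items():  # the (n + 1)st arrival
        for _, m, q in step(state, p):
            pmf[m] += pr * q
    return pmf


print(match_count_pmf(3, 0.3))
```

Enumerating subsets of groups is exponential in the number of groups, which is acceptable for the small values of n considered here but would need a smarter recursion for much larger n.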
Using the law of total probability (i.e., conditioning on all possible states of the current matching and then summing the joint probabilities; see p. 6 of [6]) yields our final result:

$$P_{03} = P(1,1,1,1 \mid 1,1,1)\,P(1,1,1) + P(1,1,2 \mid 1,2)\,P(1,2) + P(1,3 \mid 3)\,P(3) = p^3,$$
$$P_{13} = P(1,1,2 \mid 1,1,1)\,P(1,1,1) + P(2,2 \mid 1,2)\,P(1,2) = 3p^4(1-p),$$
$$P_{23} = P(1,3 \mid 1,1,1)\,P(1,1,1) + P(1,3 \mid 1,2)\,P(1,2) = 3p^3(1-p)^2(1+2p),$$
$$P_{33} = P(4 \mid 1,1,1)\,P(1,1,1) + P(4 \mid 1,2)\,P(1,2) + P(4 \mid 3)\,P(3) = (1-p)^3(6p^3 + 6p^2 + 3p + 1).$$

For a general value of n, the calculation of $P_{mn}$ for m = 0, 1, . . . , n can be carried out by the following algorithm.

1 - Construct the HMM through the (n + 1)st arrival, and derive the transition matrices $M_k$ for k = 1, . . . , n.
2 - Compute $\prod_{k=1}^{n-1} M_k$, which gives the probability distribution for the various hidden states that the (n + 1)st true match sees upon arrival.
3 - In the transition matrix $M_n$, find the number of matches m detected for each possible transition A → B. Using the law of total probability, multiply the transition probability from A to B in $M_n$ by the probability of observing state A, which is given by $\prod_{k=1}^{n-1} M_k$, and add these products over all possible transitions to get $P_{mn}$.

Running this algorithm for n = 1, . . . , 14 results in the $g_i(n)$ PMF plotted in Fig. 7.

2.3 Estimating the Age Distributions

Of the 1364 matching pairs during 2006-2008, 1227 (or 90.0%) of them have database entries that occurred during 2006-2008, in which case we know the exact age in days. For the remaining 137 (or 10.0%) matching pairs, we only know the year when the database entry was acquired, and so the age is in the interval $[t_j, t_j + 365]$ for j = 1, . . . , 137. To estimate the PDF $h_m(a)$, we solve a maximum likelihood estimation problem with uncertain data. Let the known ages be denoted by the 1227-dimensional vector X, and the age of records with uncertain age be given by the 137-dimensional vector Z.
If we denote the lognormal parameters by (µa, σa), then the likelihood function is

$$\prod_{i=1}^{1227} f(X_i) \prod_{j=1}^{137} \int_{t_j}^{t_j+365} f(Z_j)\,dZ_j.$$

Choosing (µa, σa) to maximize the log-likelihood function, which is

$$\sum_{i=1}^{1227} \log[f(X_i)] + \sum_{j=1}^{137} \log[F(t_j + 365) - F(t_j)],$$

yields µa = 4.904 and σa = 1.687. The frequency distribution of these 1364 ages and the resulting lognormal are plotted in Fig. 8.

To estimate the age PDF h(a) for all database entries, we assume that the age of an entry is December 31, 2010 minus the acquisition date of the entry. However, we only know the year in which each entry in the Israeli database was acquired. We estimate h(a) by fitting a piecewise cubic Hermite interpolating polynomial [7] to the yearly aggregates to estimate an increasing smooth CDF (Fig. 9), and then numerically differentiating it to yield a PDF (Fig. 10).

3 Comparison of a Constant Threshold System and a Rank-based System

In this section, we argue that for a simplified setting in which there is never more than one match in the database for an arrival, the rank-based system and the constant threshold system perform very similarly. Equation (1) gives the detection probability for a rank-based system that generates a candidate list of size 10 when the database has N entries. In approximating the detection probability for the constant threshold system that generates an average candidate list size of 10, we make use of the fact that the average number of true positives in the Israeli data is 0.214: 90.24% of the candidate lists have no true matches, and the average number of true matches in the other 9.76% of lists is 2.19. Substituting 0.214 for the first term in equation (3) and solving for the threshold gives $t_h \approx G^{-1}\left(1 - \frac{9.786}{N}\right)$. The detection probability for the constant threshold system in this simplified setting is $1 - F(t_h)$, which can be approximated by

$$1 - F\left(G^{-1}\left(1 - \frac{9.786}{N}\right)\right). \tag{5}$$

Using the cartridge casings lognormal distributions derived in §1 and plotting the detection probabilities in equations (1) and (5) as functions of the database size N reveal that the performance curves are almost identical (Fig. 11).

4 The False Positive Constraint vs. the List Size Constraint

Here we explore the difference between constraining the expected number of false positives, as in equation (3) in the main text, and constraining the expected list size, which is much more difficult analytically. Our concern is that our choice of the false positive constraint may favor the optimal policy because the solution to (1)–(4) in the main text may raise the average number of true matches in the candidate list. In this case, restricting only the average number of false positives would increase the average list size beyond 10. Calculations in this section suggest that the effect of our choice of constraint on the results is extremely small.

For Israeli cartridge casings, we know that the probability that there is at least one true match in the candidate list (taking into account arrivals that have no true matches in the database) is 0.0976 for the constant threshold policy and $\frac{0.987}{0.931}(0.0976) = 0.1035$ for the optimal policy (using the detection probabilities for both policies). Although we know that the mean number of true matches in the candidate lists with at least one true match is 2.19 for the constant threshold policy, we do not know this quantity for the optimal policy. By assuming that this value is 5.0, which should be a significant overestimate, the mean number of true matches in the candidate list increases to 5 × 0.1035 = 0.5175. To approximate the new detection probability under a list size constraint, we should re-solve (1)–(4) in the main text with the right side of the constraint reduced by 0.5175 − 0.214. To be conservative, we reduce the right side of the constraint by 0.5175.
The new detection probability is 0.9870, compared to 0.9872 under the original constraint.

References

[1] Beauchamp A, Roberge D. Model of the behavior of the IBIS correlation scores in a large database of cartridge scores. Unpublished manuscript, 2005. Accessed at www.forensictechnology.com/Default.aspx?app=LeadgenDownload&shortpath=docs/LargeDatabaseFinal.pdf on September 2, 2011.
[2] De Kinder J, Tulleners F, Thiebaut H. Reference ballistic imaging database performance. Forensic Science International 2004;140:207-215.
[3] Roberge D, Beauchamp A. The use of BulletTRAX-3D in a study of consecutively manufactured barrels. AFTE Journal 2006;38:166-172.
[4] Hespanha J. grPartition - a MATLAB function for graph partitioning. 2004. Accessed at www.ece.ecsb.edu/ hespanha/software/grPartition.html on September 8, 2011.
[5] Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge U. Press, Cambridge, UK, 1998.
[6] Karlin S, Taylor HM. A first course in stochastic processes, second edition. Academic Press, New York, 1975.
[7] Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM J. Numerical Analysis 1980;17:238-246.

Figure 2: Actual points (x) on the performance curve for casings from data in [1] vs. the performance curve generated by the best-fit lognormal distribution.

Figure 3: The lognormal similarity score PDFs, intra-gun f(x) and inter-gun g(y), for casings.

Figure 4: Actual (based on BulletTrax-3D data in [3]) vs. lognormal inter-gun similarity score PDF for bullets.

Figure 5: The lognormal similarity score PDFs, intra-gun f(x) and inter-gun g(y), for bullets.

Figure 6: All non-zero entries in the matching matrix for the 52 merged stations. Each entry is the number of matches during 2006-2008 between each pair of merged stations. Stations A-R are the merged stations. Yellow squares are the intra-station matchings.
Figure 7: The PMFs for the true ($g_i(n)$ in blue) and detected ($g_i^*(m)$ in red) matches for arrivals of evidence status i (i = 0 is nonevidence (solid line) and i = 1 is evidence (dashed line)).

Figure 8: Frequency distribution of ages of the 1364 matching pairs in the Israeli database, containing 1227 exact ages and 137 ages that are randomly sampled from the correct year, along with the best-fit lognormal PDF, $h_m(a)$.

Figure 9: Age CDF fit to raw Israeli data.

Figure 10: Age PDF h(a) for all database records.

Figure 11: Comparison of the performance of the rank-based system and the constant threshold system.
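The historical threshold obtained from equation (3) of the supporting material can be reproduced with the Python standard library. The sketch below assumes the casings lognormal parameters fitted in §1, the average of 0.214 true positives per candidate list quoted in §3, and the average database size N = 11,350; the function names are illustrative and are not part of the paper.

```python
from math import exp, log
from statistics import NormalDist

# Lognormal parameters for casings from Section 1 (natural-log scale).
MU_F, SIGMA_F = 6.7912, 0.5927  # intra-gun (true match) scores
MU_G, SIGMA_G = 4.5207, 0.5016  # inter-gun (nonmatch) scores


def historical_threshold(n_db, list_size=10.0, mean_true_pos=0.214):
    """Solve N[1 - G(t)] = list_size - mean_true_pos for t, with G lognormal."""
    tail = (list_size - mean_true_pos) / n_db  # required value of 1 - G(t)
    z = NormalDist().inv_cdf(1.0 - tail)       # standard normal quantile
    return exp(MU_G + SIGMA_G * z)


def false_negative_prob(t):
    """p = F(t): probability that a true match scores below the threshold t."""
    return NormalDist(MU_F, SIGMA_F).cdf(log(t))


t_h = historical_threshold(11_350)
p = false_negative_prob(t_h)
print(f"t_h = {t_h:.1f}, p = F(t_h) = {p:.3f}")
```

The resulting threshold should be close to the value of 442.6 reported in §2.2; p is the single-comparison false negative probability that enters the hidden Markov model, not the arrival-level detection probability of 0.931.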