Scale of conclusions for the value of evidence

Law, Probability and Risk (2012) 11, 1−24
Advance Access publication on October 24, 2011
doi:10.1093/lpr/mgr020
ANDERS NORDGAARD†
The Swedish National Laboratory of Forensic Science (SKL), SE-58194 Linköping, Sweden and
Department of Computer and Information Science, Linköping University,
SE-58183 Linköping, Sweden
RICKY ANSELL
The Swedish National Laboratory of Forensic Science (SKL), SE-58194 Linköping, Sweden and
Department of Physics, Chemistry and Biology, Linköping University,
SE-58183 Linköping, Sweden
AND
WEINE DROTZ AND LARS JAEGER
The Swedish National Laboratory of Forensic Science (SKL), SE-58194 Linköping, Sweden
[Received on 8 April 2011; revised on 12 September 2011; accepted on 13 September 2011]
Scales of conclusion in forensic interpretation play an important role in the interface between
scientific work at a forensic laboratory and different bodies of the jurisdictional system of a country.
Of particular importance is the use of a unified scale that allows interpretation of different kinds of
evidence in one common framework. The logical approach to forensic interpretation comprises the
use of the likelihood ratio as a measure of evidentiary strength. While fully understood by forensic
scientists, the likelihood ratio may be hard to interpret for a person not trained in natural sciences or
mathematics. Translation of likelihood ratios to an ordinal scale including verbal counterparts of the
levels is therefore a necessary procedure for communicating evidence values to the police and in the
courtroom. In this paper, we present a method to develop an ordinal scale for the value of evidence
that can be applied to any type of forensic findings. The method is built on probabilistic reasoning
about the interpretation of findings and the number of scale levels chosen is a compromise between
a pragmatic limit and mathematically well-defined distances between levels. The application of the
unified scale is illustrated by a number of case studies.
Keywords: evidence value; ordinal scales; likelihood ratio; logical approach.
1. Introduction
Forensic science is science used for the purpose of law (Caddy and Cobb, 2004). The demand for
forensic science expertise has grown steadily throughout the last decades. To some extent this is due
to the increased demand for the prosecution to present physical evidence, such as forensic evidence
in court, as well as developments of novel techniques and introduction of forensic databases. Some
disciplines such as DNA and IT forensics have experienced an exponential growth, whereas other
disciplines such as handwriting have at the same time experienced a diminishing demand.
† Email: [email protected]
© The Author 2011. Published by Oxford University Press. All rights reserved.
A. NORDGAARD ET AL.
In order to be used accurately, the results of the forensic scientific analyses of traces
and evidence material first have to be interpreted for evidential value, which then has to be converted
into phrasings that are standardized and understandable in the court process. One way to bridge different scientific findings and make their weight transparent and comparable in the legal process is to
report their weight by using a verbal scale.
2. Forensic casework
Forensic casework in Sweden today is to some extent served by county police laboratories, though
the bulk is served by one centralized national forensic science laboratory, SKL (Swedish National
Laboratory of Forensic Science) positioned as an independent laboratory within the police. In Sweden, the disciplines of forensic medicine, forensic psychology, forensic toxicology and paternity
testing are positioned outside the police forming a separate authority serving those specific demands.
The police laboratories perform a limited array of forensic investigations, whereas the duties of the
national laboratory cover most forensic disciplines and also include research and development of old
and novel forensic skills and techniques.
Thus, the vast majority of the forensic investigations requested at SKL originate from the police and they cover investigation of any kind of evidence seized at a crime scene investigation or from
individuals involved. The investigations requested vary considerably in complexity, depending
on the specific case in question. A large part of the investigations performed at the laboratory are
limited to one specific or a few forensic disciplines, whereas other cases such as murder and armed
robbery cases cover a broad span of expertise needed to fulfil the investigations requested.
The laboratory digital case management system (LCMS) in use is modern and there is a high
degree of electronic information passing to and from the police and within the different laboratory
functions. More or less digitally connected to the LCMS are a multitude of different expert systems
and databases that support different expertise in use. The laboratory information management system
(LIMS) created for the analyses and information flow for biological samples and DNA, in particular
the high throughput chain handling reference DNA samples from sampling to DNA databasing and
hit reporting, is highly developed (Hedman et al., 2008).
As the laboratory delivers expert witness statements from multiple forensic disciplines to serve
as part of the crime investigation report, sometimes reported in one multidisciplinary statement, a
unified way to report the findings is needed. In addition, the combination of evidence from results
originating from different disciplines must be achievable in a balanced way. The endpoint of any
forensic crime investigation is the presentation of the forensic evidence in court. One or several
forensic reports covering different areas of expertise will be part of the written evidence presented
by the prosecution in the court proceedings. The forensic expert, or reporting officer, may also be
required to attend court as an expert witness giving oral evidence, although this is generally not the
case in Sweden. A vital part of the testimony is to present the robustness and evidential weight of
the findings.
3. A unified scale of conclusions at the Swedish National Laboratory of Forensic Science
The forensic interpretation at SKL has almost always been made according to a scale of conclusions.
The wording of the scale has generally spanned from the inconclusive to some type of support for an assumption, with words (intensifiers) used to pinpoint rising strength. During the 1990s,
a number of predominant intensifiers used became the skeleton of some type of a common scale of
conclusions, although scales with other wordings or other levels still existed in some of the disciplines. Moreover, there was no consistency in the way the questions from the commissioner were
answered or in when and how an interpretation was made. The differences between identifying something, interpreting what could be the source of something and interpreting some type of activity/event
were also not distinct. The propositions forwarded often concerned a mix of those three issues, which
led to diffuse statements, and it was frequently unclear how the outcomes of different methods interfered in the interpretation according to propositions at different levels. Should the intrinsic features
of a trace and the location, number and distribution of findings have an input on the interpretation of
a source or of an activity? Then, which evidence could be combined and in what way?
A project was initiated during the late 1990s with the aim of deciding which principle of evidence
interpretation to use, standardizing a common scale of conclusions and increasing the overall knowledge about the interpretation of evidence. The scale of conclusions used at SKL today (first edition in 2005) is designed in such a way that the interpretation of the findings is made with respect to a
pair of propositions, and at the same time the probabilities of the propositions themselves are not interpreted. It has also been designed to fit a logical approach using likelihood ratios (Buckleton, 2005).
In our training of forensic experts in interpretation, it has been emphasized that the proposition
addressed must be clear and transparent, and that it must also be stated what is included in the alternative
proposition. It has further been thoroughly discussed what shall and can be used when interpreting
at a source level of propositions compared to an activity level. Today we often reformulate the issue
of the commissioner to distinct propositions, and try to pre-assess the value of the investigation, even
before the investigation is commenced.
Fig. 1 shows the current edition of the scale of conclusions at SKL, translated into English.
The scale has nine levels expressed as consecutive integers ranging from −4 to +4. The positive
numbers are used when the findings are such that they are more consistent with the proposition
forwarded by the commissioner (or more specifically the proposition that is the reformulation of
the commissioner’s question) than with the alternative proposition. The negative numbers are used
for the opposite situation. Level 0 is used when the findings are (in principle) equally consistent
with both propositions. To exemplify: If the commissioner asks whether a trace G originates from a
potential source S, this question is typically reformulated as the proposition ‘S is the source of G’
and an alternative proposition is typically formulated as ‘Some other source is the source of G’. If
the findings are more consistent with the former proposition, a positive level is used, and if they are
more consistent with the latter proposition then a negative level is used.
The question from the commissioner does not, however, always lead to a formulated proposition that is
incriminating. As an example, the commissioner may ask: Is this passport authentic? The reformulated proposition would be ‘The passport is authentic’, but the suspicion behind the question is of
course that the passport is a forgery which also becomes the alternative proposition. Thus, if the
findings are consistent with a forgery, they will be reported with a negative level in the scale, which
in turn shall be interpreted as support for a criminal activity.
Each level of the scale comes with its number, a verbal equivalent to this number and an explanatory text (in italics in Fig. 1). The explanatory text is for most levels formulated in such a way that
it should be clear that the logical approach (Buckleton, 2005) is used for evaluation. Exceptions are
the levels −1, 0 and 1 where a simpler explanation is used, still not jeopardizing the meaning
of the level. Note that in the scale the word ‘hypothesis’ is used instead of ‘proposition’. Generally
we consider these two to be synonymous expressions in the framework of forensic interpretation,
FIG. 1. Scale of conclusions used at The Swedish National Laboratory of Forensic Science (Statens Kriminaltekniska
Laboratorium).
but in the scale we have refrained from changing the wording, as the word ‘hypothesis’ is basically the
same in Swedish. In the following, we shall merely use the word ‘proposition’ with some exceptions
(e.g. when relating to general statistical theory).
4. Ordinal scales in forensic interpretation
4.1 Scales in general
Many features in our daily environment are valued against scales of different kinds. One of the most
common scales is the temperature scale. No person without knowledge about physics would be able
to give an objective and transparent definition of what is meant by stating that the temperature is 10 °C
or 60 °F, but most people would have a personal opinion about what such a statement means to
her or him. If we could follow the interpretation within a person’s mind we would probably find a
less accurate scale with levels like ‘extremely cold’, ‘very cold’, ‘cold’, ‘chilly’, ‘moderately chilly’,
‘moderately warm’, ‘warm’, ‘hot’, ‘very hot’, ‘extremely hot’. There are several levels used with this
scale, but still far from the accuracy that characterizes e.g. a thermometer. While the thermometer
has what is referred to as an ‘interval’ scale, the verbal levels described above constitute an ‘ordinal’
scale. The difference between these two is about the distances between levels. For the interval scale,
the distance is the same between two consecutive levels no matter where in the scale the measuring
is done. For the ordinal scale, the corresponding distances (may) vary. Ordinal scales may be easier
to use since the levels are limited in number, but the interpretation often suffers from subjectivity.
This is particularly true for different types of grading scales, such as course grades, DNA quality
grades, wine tasting grades etc. Hence, to make an ordinal scale useful, there is a need for careful
instructions about how to select levels in a particular situation.
Ordinal scales are frequent in the area of jurisdiction and forensic interpretation of evidence.
For instance, the (Swedish) Police use a scale to express in what degree of suspicion an individual
is arrested or remanded in custody for being the perpetrator of a crime, with the levels ‘identification’, ‘reasonable suspicion’ (lower degrees) and ‘probable cause’, ‘objectively grounded suspicion’
(higher degrees). At the other end of the legal process, the court statements would be dichotomous in
the sense that they would either convict ‘beyond reasonable doubt’ or acquit ‘with respect to insufficient evidence’. One would expect that in a particular case, the level of such a scale is selected with
almost no element of subjectivity, but there is yet no guarantee that two cases that are in principle
identical would result in identical levels. The science of law is not exact and the use of scales therein
cannot be compared with the selection of scale levels within e.g. physics or chemistry.
The ordinal and interval scales are the most frequent ones for evaluation purposes, i.e. for judging
whether a particular value should be considered low or high (or anything in-between). For classification purposes, there is also the ‘nominal’ scale, which completely lacks numerical relationships
between the levels (e.g. nuances of paint). Scientific measurements are most often given on a ‘ratio’
scale in which there is a well-defined zero value allowing us to compare values on a relative basis.
These two scales, however, are not in focus for the evaluation of evidence and will therefore not be
considered in the current paper.
4.2 Scales of conclusions
Forensic science takes an intermediate position in that it comprises several parts from various disciplines, more or less exact. Ordinal scales of conclusions are used within forensic interpretation
with the aim of presenting the findings in a form that is free from statements and terminology requiring knowledge within such different scientific areas. In particular, statements regarding forensic
evidence will for natural reasons contain probabilistic reasoning, but probabilities themselves may
be very difficult to communicate outside the field of expertise. Therefore, the selection of wording
in an ordinal scale of conclusions for forensic evidence evaluation becomes a challenge. We shall
illustrate this through an example.
At a forensic laboratory, one might have come to the conclusion that a particular piece of evidence is consistent with a statement forwarded by the police, for instance that a glove was worn by a
particular suspect. Let us assume that the police bring the glove and some biological samples (hairs,
blood etc.) from the suspect to the forensic laboratory along with the question ‘Was this glove worn
by the suspect?’. Without careful thinking, it might be tempting to express the findings as ‘There is
a high probability that the glove was worn by the suspect’, since this is in fact a direct answer to
the question. However, such a statement does not express the value of the evidence; it is a statement about the
case, which can be made only once the forensic findings have been applied to the prior beliefs about whether the suspect wore
the glove or not (cf. Aitken and Taroni, 2004). It is not up to the forensic examiner to give statements
about the case since he or she should not possess any prior beliefs.
Now let us alter the statement so it reads, ‘Our findings give that the probability that the glove
was worn by the suspect is high’. A person inexperienced with probability reasoning might find the
meaning of the latter statement different from the meaning of the former, but they of course mean
the same, and it is only the wording that is different. To change the meaning, the word probability
(or a synonymous term or expression) must refer to the findings and not to the case. Consider the
following two expressions:
(i) ‘Our findings are highly probable if the glove was worn by the suspect’.
(ii) ‘There is a much higher probability of obtaining our findings if the glove was worn by the suspect
than if it was not’.
What is the main difference between these two statements? Both statements are consistent with
the way evidence evaluation is pursued as they both express probabilities for the findings and not
for the statements forwarded by the police (or by a prosecutor). However, (i) is not complete since it
only concerns the probability of the findings if the suspect actually wore the glove. There is no value
in this statement useful to the court because the probability of the findings might be equally high
if the suspect did not wear the glove. Statement (ii) concerns two probabilities for the findings, one
given the glove was worn by the suspect and one given it was not. Thus, the findings have a value
that can be related to prior beliefs about the case.
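The contrast between statements (i) and (ii) can be sketched numerically. The probabilities below are invented purely for illustration and do not come from any case; the function name is likewise hypothetical. In both scenarios statement (i) holds equally well, yet the ratio underlying statement (ii) differs by three orders of magnitude.

```python
def likelihood_ratio(p_findings_if_worn: float, p_findings_if_not_worn: float) -> float:
    """Ratio of the two probabilities that statement (ii) compares."""
    return p_findings_if_worn / p_findings_if_not_worn


# Hypothetical numbers: the findings are highly probable if the suspect
# wore the glove, so statement (i) is true in both scenarios.
p_if_worn = 0.9

# Scenario A: the findings are just as probable if the suspect did not wear it.
ratio_a = likelihood_ratio(p_if_worn, 0.9)

# Scenario B: the findings are very improbable if the suspect did not wear it.
ratio_b = likelihood_ratio(p_if_worn, 0.0009)

print(ratio_a)  # a ratio of one: the findings do not discriminate
print(ratio_b)  # a ratio of about one thousand: the findings discriminate strongly
```

Only the second probability distinguishes the two scenarios, which is why a statement built on a single probability is of no use to the court.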
The example shows that care must be taken over what the word probability refers to and, in addition, that
the statement cannot be built on a single probability. This in turn implies that the probability scale
cannot be used directly for evidence values, which might sound like a drawback keeping in mind that
the inclusion of technical evidence in a case requires probabilistic reasoning. However, it is not until
the evidence has been applied to the prior beliefs that statements may be forwarded with a degree of
(un)certainty; in other words, that the final statement is true up to a probability.
As we shall discuss below, statement (ii) is built up from the ‘ratio’ of two probabilities and such
a ratio has another scale than has the probability scale. It is seldom easy to interpret a statement
telling how much higher is the probability of the findings given that the glove was worn by the
suspect compared with the case where it was not. For instance, what would we mean by stating that
‘the findings are 10 000 times more probable if the glove was worn by the suspect than if it wasn’t’?
To make the statement more interpretable, we must translate it from wording with probabilities
to wording with more ordinary expressions. The statement (ii) may for example be alternatively
expressed as
(ii*) ‘Our findings strongly support that the glove was worn by the suspect’.
Note that this wording might be interpreted as if we were referring to the case and not the evidence. The expression is then confused with something like ‘Our findings give with high probability
that the glove was worn by the suspect’. Nevertheless, there is a distinction between the expressions ‘strongly supports’ and ‘gives with high probability’. The former means that looking at the
evidence from different perspectives (i.e. wore or wore not the glove), we consider the findings to be
(much) more expected if the glove was worn by the suspect and thus makes that proposition the best
explanation for the findings. The latter means that we have come to a conclusion about the state of
the proposition with a high probability.
Once we have selected a system for rephrasing statements originally involving probabilities, as
is the case when translating (ii) to (ii*), we can construct an ordinal scale within this system. A
statement representing a lower value than does (ii*) can be ‘Our findings support that the glove was
worn by the suspect’ and with an even lower value it can be ‘Our findings support to some extent
that. . . ’. Corresponding expressions amplifying the degree of support will give higher levels on the
scale. The resulting scale, i.e. the collection of levels, becomes ordinal, since there is no possibility to
select expressions that have numerically meaningful distances (that would be normalized) unless the
expressions are just verbal counterparts to real numbers. The number of levels that should be defined
on such a scale depends to a great extent on how the scale is adopted among jurors, prosecutors,
defence attorneys and the police. Usually there must be a multistage procedure of developing a
scale, including several proposals, feedback and revision before the final scale is settled.
4.3 Issues of interpretation
Interpretation of verbal expressions for the value of evidence has previously been studied. Sjerps and
Biesheuvel (1999) made an experiment in which they studied jurists’ opinions about two different
scales. The first scale was the one currently used at that time within The Netherlands Forensic Institute. This scale used expressions that from a probabilistic point of view related more to the case and
less to the evidence. The second scale was a suggested alternative in which the formulations were in
line with the expressions (ii) and (ii*) above, i.e. the scale levels related to how probable the findings
were under the assumption that one proposition was true compared with how probable they were
under other assumptions. The results nevertheless showed that jurists in general had difficulty
understanding the necessity of the second scale and preferred the first one. Broeders (1999) addresses
the same type of problem, but by studying in which way forensic scientists report and understand
their reporting of their findings. Recently, SKL has conducted a survey among different actors in the
Swedish judicial process (Nordgaard et al., 2010) about their perceived strength of different phrases
(in Swedish) used as ‘amplifiers’ of the support. The results from this survey show that there is substantial variation among the actors in perceived strength but also in their final opinion about a particular case when a certain phrase has been used in the evaluation statement for the technical evidence.
For instance, wording that from a linguistic perspective should mean that the support in practice
implies certainty about the case was by many respondents interpreted as just moderately strong.
So far we have not included probabilistic formalism into the discussion. This is partly due to the
fact that mathematics is problematic to use in the interface between the forensic laboratory, which
to a large extent uses scientific methods originating in natural and mathematical sciences, and the
judicial community (police, prosecutors, lawyers and judges), where such scientific methods may
be very hard to understand. Later, we will briefly review and use results from probability theory to
explain how findings can be translated to an ordinal scale, but for the moment we may say that the
development of ordinal scales for forensic interpretation is thoroughly built on correct probabilistic
reasoning and in addition on how relations between probabilities can be interpreted. This is a work
that needs to be done by forensic scientists, with sufficient knowledge about probabilistic reasoning
and its applicability in forensic biology, chemistry, informatics etc., but at the same time integration
with the judicial community is by all means necessary.
5. Reporting forensic findings on the scale of conclusions
5.1 The likelihood ratio
The state of the art in forensic interpretation is to evaluate forensic evidence with the use of a likelihood ratio (cf. Aitken and Taroni, 2004). The likelihood ratio expresses the relative strength of the
evidence in the comparison of one proposition, often referred to as the ‘prosecutor’s hypothesis’,
against another referred to as the ‘defence’s hypothesis’. The former will hereafter be denoted HP
and the latter HD . It should be mentioned that the terms ‘prosecutor’ and ‘defence’ should not always
be literally interpreted, even if this is the case when a particular piece of evidence is handled within
the courtroom. One objective of the current paper is to show how a scale of conclusions can be developed within a forensic laboratory primarily working with evidence material from crime (scene)
investigations. However, the body of commission is usually the police and most of the evidence evaluation addresses source level propositions. The prosecutor’s hypothesis may therefore very well be
a reformulation of the question (task) forwarded by the police into a proposition (cf. Section 3) and
the defence’s hypothesis is the (natural) alternative to that proposition. In other areas of evidence
evaluation, these two hypotheses are just two disjoint alternatives where the task is to decide upon
which of these is the one most probable.
The likelihood ratio used in evidence evaluation is a particular component in the theory of
Bayesian hypothesis testing, in which the hypotheses are evaluated by the so-called ‘Bayes factor’ (cf. Berger, 1985). For a pair of mutually exclusive and exhaustive hypotheses (HP , HD ) the
prior ‘odds’ of HP to HD is the ratio Pr(HP )/Pr(HD ) and the posterior odds of HP to HD given
the observed data E is the ratio Pr(HP |E )/Pr(HD |E ). The Bayes factor, B, is defined as the ratio
of the posterior odds to the prior odds, i.e.
\[ B = \frac{\Pr(H_P \mid E)/\Pr(H_D \mid E)}{\Pr(H_P)/\Pr(H_D)}. \tag{5.1} \]
If both hypotheses are simple, i.e. each hypothesis depicts a fixed scenario for which a prior probability can be assumed, the Bayes factor simplifies to a likelihood ratio:
\[ \frac{L(H_P; E)}{L(H_D; E)} = V, \tag{5.2} \]
where the likelihoods L(HP ; E) and L(HD ; E) measure how likely the observed data are (relative
to other data) under the assumptions that HP and HD , respectively, are true. For evidence evaluation,
E stands for the evidence itself, or the findings from a case. Depending on the scale of the measured
data of these findings, the likelihood is either a distinct probability or a value from a probability
density function. The former case is the one mostly appearing in forensic literature and the likelihood
ratio can then be written
\[ V = \frac{\Pr(E \mid H_P)}{\Pr(E \mid H_D)}. \tag{5.3} \]
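The step from the Bayes factor in (5.1) to this ratio can be written out for the discrete case with simple hypotheses. Applying Bayes' theorem to each posterior probability, Pr(H | E) = Pr(E | H) Pr(H)/Pr(E), both the factor Pr(E) and the prior odds cancel:

```latex
\begin{align*}
B &= \frac{\Pr(H_P \mid E)/\Pr(H_D \mid E)}{\Pr(H_P)/\Pr(H_D)} \\
  &= \frac{\dfrac{\Pr(E \mid H_P)\,\Pr(H_P)}{\Pr(E)} \Big/ \dfrac{\Pr(E \mid H_D)\,\Pr(H_D)}{\Pr(E)}}{\Pr(H_P)/\Pr(H_D)} \\
  &= \frac{\Pr(E \mid H_P)}{\Pr(E \mid H_D)} = V .
\end{align*}
```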
As a simple example of the latter, consider a case where the piece of evidence is a digital
image of a disguised person and the question is about whether it is a man or a woman. The forensic
method used is estimating the length of the person from the image. Let us say the length
estimate is 177 cm. This scale of measurement is not of a kind for which we may assign distinct
probabilities to each value. Instead, the probability densities of lengths for males and
females must be used. Denoting the former f_M(x) and the latter f_F(x), the applicable likelihood
ratio is V = L(HP; E)/L(HD; E) = f_M(177)/f_F(177).
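A minimal sketch of this density-based likelihood ratio follows. It assumes, purely for illustration, that lengths are normally distributed within each sex; the means and standard deviations are hypothetical placeholders, not population estimates from the paper.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density at x of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical population parameters in cm (assumptions for illustration only).
MALE_MEAN, MALE_SD = 180.0, 7.0
FEMALE_MEAN, FEMALE_SD = 166.0, 6.0


def length_likelihood_ratio(x_cm: float) -> float:
    """V = f_M(x) / f_F(x), with H_P: 'the person is a man' and H_D: 'the person is a woman'."""
    return normal_pdf(x_cm, MALE_MEAN, MALE_SD) / normal_pdf(x_cm, FEMALE_MEAN, FEMALE_SD)


v = length_likelihood_ratio(177.0)
print(f"V = {v:.2f}")
```

Under these assumed parameters the ratio at 177 cm is greater than one, i.e. the estimate supports the proposition that the person is a man, while a much shorter estimate would give a ratio below one.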
Equating the Bayes factor with the likelihood ratio, (5.1) can be rewritten as
\[ \frac{\Pr(H_P \mid E)}{\Pr(H_D \mid E)} = V \cdot \frac{\Pr(H_P)}{\Pr(H_D)} \tag{5.4} \]
(also known as Bayes’ theorem in odds form). A likelihood ratio greater than one will thus make
the posterior odds higher than the prior odds, while the opposite holds for a likelihood ratio less
than one.
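The updating in (5.4) amounts to a single multiplication, as the following sketch shows. The prior odds used here are hypothetical; in practice they are a matter for the court, not for the forensic examiner. The function names are introduced only for this illustration.

```python
def posterior_odds(prior_odds: float, v: float) -> float:
    """Bayes' theorem in odds form: posterior odds = V * prior odds."""
    return v * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds of H_P to H_D into the probability of H_P."""
    return odds / (1.0 + odds)


# Hypothetical prior odds of 1 to 1000 for H_P against H_D.
prior = 1.0 / 1000.0

# A likelihood ratio greater than one raises the odds ...
raised = posterior_odds(prior, 10_000.0)
# ... and a likelihood ratio less than one lowers them.
lowered = posterior_odds(prior, 0.1)

print(raised, odds_to_probability(raised))
print(lowered, odds_to_probability(lowered))
```

The same value of V leads to very different posterior odds for different prior odds, which is why the likelihood ratio alone is reported as the value of evidence.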
In a forensic case where V is of the kind (5.3) and greater than one the test result can be expressed as ‘the probability of the findings when HP is true is V times higher than the probability
of the findings when HD is true’, but this expression is less proper when V is a ratio of evaluated
densities. A more general expression would be ‘the findings increase the prior odds of HP to HD by
a factor V’ and such an expression also moves the focus from discussing probabilities in court to the
discussion of supporting evidentiary strength. Moreover, the last expression is also consistent with
the general case of Bayesian hypothesis testing, i.e. without substituting the likelihood ratio for the
Bayes factor. One example where the Bayes factor cannot be separated from the prior odds (as is
the case when it is replaced by the likelihood ratio), is where HP states that the donor of a blood
stain is X and HD states that the donor is either the brother of X or another person not related to
X. The evidence is a match between the DNA profile obtained from the stain and the DNA profile
of X. Here HD is the union of two mutually exclusive propositions with different likelihoods and
possibly different prior probabilities, and therefore the likelihood ratio (as expressed in (5.2)) is not
the equivalent of the Bayes factor (cf. Buckleton et al., 2006).
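The dependence on prior probabilities in this example can be made explicit. As a sketch, let H_{D1} (‘the donor is the brother of X’) and H_{D2} (‘the donor is a person not related to X’) denote the two components of HD; these labels are introduced here for illustration only. By the law of total probability:

```latex
\begin{align*}
\Pr(E \mid H_D) &= \Pr(E \mid H_{D1})\,\Pr(H_{D1} \mid H_D) + \Pr(E \mid H_{D2})\,\Pr(H_{D2} \mid H_D), \\
\Pr(H_{Di} \mid H_D) &= \frac{\Pr(H_{Di})}{\Pr(H_{D1}) + \Pr(H_{D2})}, \qquad i = 1, 2, \\
B &= \frac{\Pr(E \mid H_P)}{\Pr(E \mid H_D)} .
\end{align*}
```

The denominator is a weighted average of two different likelihoods, with weights given by the prior probabilities of the two alternatives, so B cannot be computed from the findings alone.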
The more general view of interpreting the likelihood ratio (or the Bayes factor) as how much
the prior odds is increased (or decreased) is supportive of the way evidence evaluation should be
reported to the commissioner (e.g. in court) when the likelihood ratio cannot be computed explicitly,
but only estimated to an order of magnitude. Instead of interpreting the likelihood ratio as how much more (or
less) probable are the findings given one proposition is true than given the other is true, we can say
that the findings support one of the propositions and thus imply an amplification (or attenuation) of
the prior odds to get the posterior odds. In the absence of an explicit numerical likelihood ratio, the
degree of support can be given on an ordinal scale.
5.2 From a likelihood ratio to a scale level
Once the likelihood ratio (or the Bayes factor when they are not equivalents) has been obtained, it
should be interpreted on the scale of conclusions used. Evett et al. (2000) suggested a scale where
a likelihood ratio between 1 and 10 would be reported verbally as ‘Limited evidence to support’,
while a likelihood ratio above 10 000 would be reported as ‘Very strong evidence to support’. DNA
evidence is probably still the area of application where a likelihood ratio can be reported on almost
a continuous scale, and it is well known that the vast majority of calculated likelihood ratios
exceed the value of 10 000. The Netherlands Forensic Institute uses a scale (NFI, 2008), in which
the findings (analogously to Evett et al., 2000) are reported to support with different degrees the
statement in hand. The scale currently used at SKL (Fig. 1) is analogous to the ones in Evett et al.
(2000) and NFI (2008), with the exception that the statement for which the level is reported is
always the proposition advanced in the forensic examination. This means that likelihood ratios above
and below one are differently interpreted on the scale. For the other scales mentioned above only
likelihood ratios greater than one are interpreted and the proposition supported may vary from case
to case.
Consider the scale of Fig. 1. To simplify, we will hereafter refer to this scale as the SKL scale.
There are nine levels used but the levels −1, −2, −3 and −4 can be seen as ‘mirrors’ of the Levels
+1, +2, +3 and +4 since interchanging the propositions would change a level −L to a level +L and vice
versa (L = 1, 2, 3, 4). Rules for translation of obtained likelihood ratios to the scale can therefore be
developed for the Levels 0, +1, +2, +3 and +4 and be applied either directly to the likelihood ratio
or to its inverse. One might ask why the scale is not constructed with only five levels, but the current
reporting system at the laboratory is such that findings consistent with what could be the prosecutor’s
hypothesis should be reported with positive levels, and findings not consistent with that proposition
should be reported with negative levels.
Translation now means dividing the range from one to infinity of possible likelihood ratio values
into five intervals corresponding with the Levels 0, +1, +2, +3 and +4. It might be argued that Level
0 should be used solely for the case where the likelihood ratio is exactly one, but here and in
what follows we must remember that a likelihood ratio obtained in a particular case is most often
a point estimate with the amount of uncertainty it may contain (i.e. the likelihood ratio used is an
approximation). Thus, an interval for Level 0 is also motivated and the lower limit of this interval is
naturally one (for likelihood ratios greater than or equal to one).
Each individual may have their own opinions about how intervals should be chosen to be consistent with the verbal expressions used in the SKL scale, and there is no unique mathematical rule for
doing it. In a forensic laboratory, the choice of a unified translation system is therefore a question of
compromising between the different opinions within and outside that laboratory. We will however
suggest a statistically assisted choice based on Bayes’ theorem and some common opinion about
uncertainty.
Assume that one piece of evidence is the component that should decide whether a suspect should
be convicted or acquitted. The evidence value is then combined with the prior odds of guilt to obtain the posterior odds of guilt. What is the lowest posterior odds that can lead to conviction? This
question has naturally no distinct answer, but reasoning about it is important for the construction of
intervals for the likelihood ratio. Ceci and Friedman (2000) discuss this topic in terms of the social
cost of conviction. One conclusion is that the celebrated statement by the British judge Sir William
Blackstone (1723–1780) that ‘it is better that ten guilty persons escape, than that one innocent suffer’
should be updated, and if numbers are to be set out it would be more consistent with today’s practice
in conviction ‘beyond reasonable doubt’ to substitute 99 for 10 in Blackstone’s statement. Adopting
this, a posterior odds of 99:1 or equivalently a posterior probability of 0.99 is considered to be high
enough to say that there is support for adopting the corresponding proposition. The same argument
was used by Thompson et al. (2003) when discussing how the probability of false positives affects
the posterior probabilities in DNA cases. Following these arguments, we suggest that Level +2 of
the SKL scale (‘support’ without attributes) should be reached when the likelihood ratio is greater
than or equal to a value that ‘on the average’ would give a posterior probability of at least 0.99 for
the proposition forwarded.
To explain this more thoroughly, consider again Bayes’ theorem on odds form as presented in
(5.4). If we assume HP and HD to be complementary propositions, the probability ratios of that
expression are true odds and we may rewrite the relationship as
$$\frac{\Pr(H_P \mid E)}{1 - \Pr(H_P \mid E)} = V \cdot \frac{\Pr(H_P)}{1 - \Pr(H_P)} = V \cdot \mathrm{Odds}(H_P), \qquad (5.5)$$
where Odds(HP) stands for the ‘prior’ odds of HP. From (5.5) we deduce the posterior probability
of HP as
$$\Pr(H_P \mid E) = \frac{V \cdot \mathrm{Odds}(H_P)}{V \cdot \mathrm{Odds}(H_P) + 1}. \qquad (5.6)$$
‘On the average’ may now be represented by a prior odds equal to one, which changes (5.6) into
$$\Pr(H_P \mid E) = \frac{V}{V + 1}. \qquad (5.7)$$
Another argument for temporarily setting the prior odds to 1 is that it is the case where the posterior
odds are solely determined by the value of evidence. It therefore seems natural to ‘normalize’ the
scale at this point.
For the posterior probability to be at least 0.99, the likelihood ratio V must be at least 99. This
is however an inconvenient number and the choice of 100 is easier to communicate. Hence, the lower limit
of the interval interpreted as Level +2 on the SKL scale is settled. The interval corresponding with
Level +4 is for natural reasons open-ended towards infinity. The practice up to now for
reporting DNA evidence values on the scale has been to require the random match probability to be
less than one in a million to reach Level +4. This choice depends on the sizes of the populations of
potential perpetrators involved in Swedish crimes. We have no reason today to question this choice
and therefore the lower limit for the interval interpreted as Level +4 of the SKL scale is set to one
million. The corresponding lowest posterior probability with a prior odds of one is approximately 0.999999, which
is far beyond any debate about the uncertainty.
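The relation between the value of evidence, the prior odds and the posterior probability in (5.6) and (5.7) can be checked numerically. The following is a minimal Python sketch; the function name `posterior_probability` is chosen here for illustration only and does not refer to any software used at the laboratory.

```python
def posterior_probability(v, prior_odds=1.0):
    """Posterior probability of HP according to (5.6).

    v          -- value of evidence (likelihood ratio) V
    prior_odds -- prior odds of HP; the default 1.0 gives (5.7)
    """
    posterior_odds = v * prior_odds
    return posterior_odds / (posterior_odds + 1.0)


# Anchor values discussed in the text, all with prior odds equal to one.
for v in (99, 100, 1_000_000):
    print(v, posterior_probability(v))
```

With prior odds equal to one, V = 99 gives a posterior probability of 0.99, and the more easily communicated limit V = 100 gives a slightly higher value.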
This anchoring of the lower interval limits for the scale Levels +2 and +4 will now be used to
construct the complete interval division of the range one to infinity of likelihood ratio values. The
first two interval limits, i.e. 1 and 100, are exactly the scale levels exponentiated with base 10. It is
then tempting to think of the scale as a logarithmic one even if the third interval limit (i.e. one million)
does not fit into this. In another perspective, we would strive for coming above Level 0 for evidence
that clearly puts one proposition in favour of another even if we consider the evidence value not
to be strong. This is particularly important for evidence primarily analysed for intelligence work.
As a consequence, the interval lengths must increase with increased level. They certainly do with a
full Briggsian logarithmic scale, i.e. logarithms with base 10 (often referred to as Briggsian logarithms after the British 17th century mathematician Henry Briggs, 1561–1630), but using all consecutive numbers in
such a scale would require seven levels before the interval limit one million is reached. To avoid that
many levels, we suggest the following: The increase in Briggsian logarithm of two consecutive lower
interval limits should be (approximately) proportional to the level corresponding with the higher of
the two limits. We can write the intervals and corresponding scale levels as
$$\begin{array}{lcl}
1 \le V < R_1 & : & 0 \\
R_1 \le V < R_2 & : & +1 \\
R_2 \le V < R_3 & : & +2 \\
R_3 \le V < R_4 & : & +3 \\
R_4 \le V < \infty & : & +4,
\end{array} \qquad (5.8)$$
where we have already fixed R2 to 100 and R4 to 10^6. The lower limit 1 for the interval corresponding
with Level 0 is as natural as is the upper undefined limit (∞) for the interval corresponding with
Level +4. Thus, the first and the last interval are left out of this construction, and we require
$$\log_{10} R_i - \log_{10} R_{i-1} \approx k \cdot i; \qquad i = 2, 3, 4. \qquad (5.9)$$
With R2 = 100 and R4 = 10^6, a solution to (5.9) is found in which k ≈ 0.5, R1 ≈ 5.625 and
R3 ≈ 5625. Now what do these values imply for the posterior probabilities? With prior odds equal
to one (as before), we obtain the lowest posterior probabilities for Levels 1 and 3 as 0.8491 and
0.9998, respectively. The former should be consistent with the statement that the findings support HP ‘to some
extent’ according to the SKL scale (Fig. 1). Any probability above 0.5 would be consistent with
that expression depending on how ‘some extent’ is interpreted, but of course 0.8491 would not be
a debatable value in that sense. The latter probability, i.e. 0.9998, is quite high but still difficult to
interpret from general probability reasoning. Although it is not appropriate to compare posterior
probabilities with P values from traditional frequentist approaches to hypothesis testing, we may
reflect on common discussions about such quantities. In general a P value below 0.05 is interpreted
as significant evidence and a P value below 0.01 as strong significant evidence. In medical studies
on effects of drug therapies, P-values below 0.001 are considered to give very strong statistical
evidence of effects. In the view of this, a posterior probability of at least 0.9998 would be consistent
with the expression ‘strong support’ as given in the SKL scale (Fig. 1).
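The numbers quoted for the interval limits can be examined with a short calculation. The sketch below is only an illustration: it takes the limits R1 to R4 as stated in the text, computes the increases in base-10 logarithm that enter (5.9), and evaluates the lowest posterior probabilities from (5.7). The names `limits`, `increases` and `lowest_posterior` are introduced here for the example.

```python
import math

# Lower interval limits R1..R4 as quoted in the text; R2 and R4 are the anchors.
limits = {1: 5.625, 2: 100.0, 3: 5625.0, 4: 1e6}

# Increase in base-10 logarithm between consecutive lower limits, for i = 2, 3, 4.
increases = {i: math.log10(limits[i]) - math.log10(limits[i - 1]) for i in (2, 3, 4)}


def lowest_posterior(v):
    """Posterior probability from (5.7), i.e. with prior odds equal to one."""
    return v / (v + 1)


for i, inc in increases.items():
    print(f"i = {i}: increase = {inc:.2f}, increase / i = {inc / i:.2f}")
for level in (1, 3):
    print(f"Level +{level}: lowest posterior = {lowest_posterior(limits[level]):.4f}")
```

With these limits the increases come out at about 1.25, 1.75 and 2.25, i.e. they grow by roughly 0.5 per level, and the ratio of each increase to its level number lies between about 0.56 and 0.63. This illustrates the ‘approximately’ in (5.9).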
It may be argued that the derived lower limits (1), 5.625, 100, 5625 and 10^6 are impractical from
a numerical point of view. Rounding the numbers upwards, i.e. substituting 6 for 5.625 and 6000 for
5625, gives a more convenient representation. The resulting minimum posterior probabilities with
prior odds equal to one will then be 0.8571 and 0.9998 (the latter is slightly changed although not
visible until the fifth decimal). The increase in logarithms of lower limits between two consecutive
levels will no longer be proportional to the level corresponding with the higher of the two limits, but fairly close. The resulting
translation table between values of V and scale levels (both sides of 0) is given in Table 1. Figure 2
shows the lowest posterior probabilities plotted against prior odds for the five scale levels 0 to +4 (equating
each level as the lowest possible likelihood ratio).
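The translation of a likelihood ratio to a scale level according to Table 1 can be sketched as a small function. This is an illustrative Python sketch only; the function name `scale_level` and the way the negative side is handled by applying the same limits to 1/V are choices made for the example, following the ‘mirror’ argument given earlier.

```python
def scale_level(v):
    """Translate a likelihood ratio V > 0 to a scale level (-4..+4) per Table 1."""
    if v <= 0:
        raise ValueError("the likelihood ratio must be positive")
    # Lower interval limits for Levels +4, +3, +2 and +1 (rounded values).
    lower_limits = ((1e6, 4), (6000, 3), (100, 2), (6, 1))
    # Positive side: compare V with the lower limits, highest level first.
    for limit, level in lower_limits:
        if v >= limit:
            return level
    # Negative side: the mirrored limits 1/limit act as upper bounds.
    for limit, level in lower_limits:
        if v <= 1 / limit:
            return -level
    # Everything strictly between 1/6 and 6 is reported as Level 0.
    return 0


for v in (0.5, 6, 265, 1 / 6, 0.005, 2e6):
    print(v, scale_level(v))
```

Note that the boundaries are treated asymmetrically in the same way as in Table 1: a lower limit belongs to the positive level it opens, and its inverse belongs to the corresponding negative level.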
TABLE 1 Intervals of V (likelihood ratios) and corresponding scale levels
Interval                  Scale level
V ≤ 10^−6                 −4
10^−6 < V ≤ 1/6000        −3
1/6000 < V ≤ 1/100        −2
1/100 < V ≤ 1/6           −1
1/6 < V < 6               0
6 ≤ V < 100               +1
100 ≤ V < 6000            +2
6000 ≤ V < 10^6           +3
10^6 ≤ V                  +4
FIG. 2. Posterior probabilities plotted versus prior odds for each of the Levels 0, +1, +2, +3 and +4 of the SKL scale of
Fig. 1. Calculation of posterior probabilities was done with a likelihood ratio equal to the lower limit for the likelihood ratios
for each level.
5.3 Reporting on the scale without reference data
Forensic scientists may be sceptical of the logical approach to evidence evaluation. This is so because
a majority of forensic cases are still such that no explicit reference data exist with which estimations
of likelihoods can be done. It is however important to realize that the lack of explicit reference data
does not disqualify the use of the logical approach, not even the use of a likelihood ratio. The latter
always exists for simple propositions and the problem lies in its estimation. By ‘explicit reference
data’, we mean an objectively compiled database of observations of the same kind as our findings for
the current piece of evidence, and with enough background data to make it possible to classify these
observations with respect to the propositions in question. However, reference data may still exist
although not formally stored in a database. An eyewitness may state that the person he saw was a
man, implicitly saying that it was not a woman. How is it possible to give such a statement? There are
several possibilities, but one may be the case of exclusion: ‘If that person was a woman I wouldn’t
have expected her to have the skull shaved, that’s why I think it was a man’. Note that this way of
thinking is the same as what is promoted with the logical approach. His findings consist of the shaved
skull and these findings are in his opinion more probable if it was a man than if it was a woman, although
he merely uses the conditional probability of his findings given it was a man (i.e. the numerator of the
likelihood ratio). To be fair, however, his thinking could have been the opposite: ‘The person’s skull
was shaved, thus I think it was a man’. The latter is a statement corresponding with a conditional
probability that the person was a man given his findings and would be the numerator of the posterior
odds.
The important issue of the previous example is that reasoning with probabilities or likelihoods
can be done without any explicit databases. The forensic scientist has, similar to the eyewitness,
an experience that allows her to make likelihood statements with a precision depending on this
experience. The statements will to some extent be subjective, yes, but so are the statements of the
eyewitness. The difference is that while the eyewitness bases his conclusions upon what may be
referred to as common experience, the expert witness bases her conclusions on specific experience.
The logical approach still works no matter if the conclusions are based on an objective database or
on specific experience as long as the evidence value is in the form of a likelihood ratio. The forensic
scientist should consider the two competing propositions, and based on her experience both from
historical cases and from the field of expertise itself she should find out the degrees of expectation of
the findings under each of the propositions. The proposition under which the degree of expectation
is maximized is the proposition that is supported by the findings. The degree of support is based
on how much more expected the findings are under that proposition than under the other
proposition. This judgement is the evidence value; it can be reported on the scale and is at the same
time the best estimate at hand of the underlying likelihood ratio.
It is possible that neither of the propositions is particularly supported by the findings, or that
both propositions are approximately equally supported. The evidence value should then be reported
with 0 on the scale. This case is probably the simplest to handle at the forensic laboratory since
reference to general uncertainty may serve as an argument for the stated degree of support. What
about all cases where support has been concluded? As soon as the findings give more support to
one of the propositions, and this cannot be treated as just having occurred by chance, the forensic
scientist should report the evidence value at least at Level +1 or Level −1 of the scale depending on
which of the two propositions gets the more support. For the sake of simplicity, we will from now on
only discuss reporting on the positive side of the scale keeping in mind that the negative side can just
be thought of as a mirror of the positive side.
Equating the Level +1 to a likelihood ratio of at least 6 according to Table 1 means that we
state that the findings support the forwarded proposition at least six times more than the alternative
proposition. It might be argued that ‘6 times more’ is too strong to be consistent with the argumentation above stating that +1 should be used whenever the indicated support is not just a matter of
chance. However, consider the following simple example: A 12-sided dice is suspected to be false
in the sense that it gives the result ‘twelve’ on average every second roll (i.e. with probability 1/2), while the other 11 results
are equally likely (i.e. each comes with probability 1/22). The alternative proposition says the dice
is balanced. We roll the dice once and obtain twelve. The likelihood for the proposition of a false
dice is then 1/2 and the likelihood for the proposition of a balanced dice is 1/12. Thus, the likelihood
ratio becomes exactly 6 and our findings support the proposition of a false dice six times more than
they support the alternative proposition. However, if our result had been e.g. ‘five’, the likelihood
ratio would have been (1/22)/(1/12) ≈ 0.55, which supports the proposition of a balanced dice about 1.8 times more
than the proposition of a false dice. This example is of course not comparable with a typical forensic
case but still illustrates that a likelihood ratio of 6 is not at all too large to harmonize with the scale
Level +1. No jury or court would come to the verdict that the dice is false just on the basis of one
roll, but we cannot ignore that the only findings in the case, i.e. the observed twelve, give support to
some extent for that proposition. The findings may have occurred by chance but that is not the only
explanation.
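The arithmetic of the dice example can be reproduced exactly with rational numbers. The following Python sketch is illustrative; the function names are introduced here for the example, and the probabilities are those stated in the text.

```python
from fractions import Fraction


def prob_false(face):
    """Probability of a face if the dice is false: 1/2 for twelve, 1/22 otherwise."""
    return Fraction(1, 2) if face == 12 else Fraction(1, 22)


def prob_balanced(face):
    """Probability of any face if the 12-sided dice is balanced."""
    return Fraction(1, 12)


def likelihood_ratio(face):
    """Likelihood ratio for 'false dice' versus 'balanced dice' after one roll."""
    return prob_false(face) / prob_balanced(face)


print(likelihood_ratio(12))     # exact ratio when 'twelve' is observed
print(likelihood_ratio(5))      # exact ratio when 'five' is observed
print(1 / likelihood_ratio(5))  # support for the balanced dice in that case
```

Observing ‘twelve’ gives a ratio of exactly 6, which is the lower limit for Level +1, while observing ‘five’ gives 6/11, which stays within the interval for Level 0.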
The highest level of the scale, i.e. +4, might at first sight seem impossible to choose
without explicit reference data. This is however not true, and this level may on the contrary be quite
natural under the right conditions. The forensic scientist still has two propositions (HP and HD) to
consider. In a case where she believes that the findings cannot by any means be obtained given HD is
true, while they very well can if HP is true, the likelihood of the former proposition is very close to
zero, and the resulting likelihood ratio must be so large that scale level +4 is reached (in support of
HP ). From a philosophical point of view, this situation resembles the one where it can be stated as a
fact that one of the propositions is false, i.e. the so-called leap-of-faith has disappeared. The question
whether a scale of conclusions should be used in such cases is of course interesting but is out of the
scope of the current paper. Depending on how certain the forensic scientist is about the impossibility
of obtaining the findings under one of the propositions, the leap-of-faith is more or less pronounced.
The two remaining levels, i.e. +2 and +3 are the ones most difficult to decide upon in a particular
case (without explicit reference data). Besides long experience within her own field of expertise, the
forensic scientist needs to ‘calibrate’ her opinions against casework from other areas. There are a
lot of similar situations in the various fields of forensic interpretation and a laboratory
should benefit from this. At SKL, we are practising a system of ‘calibration meetings’, where forensic
scientists from different fields discuss each other’s cases. A case is prepared in such a way that you
clarify what are your propositions, what are your findings, what are your conclusions and why. This
should be done in such a way that e.g. a person trained in forensic biology could understand a case
from forensic informatics and vice versa. By presenting your own case, assimilating questions from
your colleague and then doing the opposite procedure the general likelihood reasoning about your
own cases is enhanced.
6. Examples from casework at SKL
We will here present three typical cases from daily work at the laboratory to illustrate the use of
the scale of conclusions. The first case is about forensic interpretation of glass fragments and shows
how the scale is used when the likelihood ratio can be estimated on the basis of reference data. The
second case is about comparison of paints and illustrates how evidence evaluation is made and how
the scale is used without the access to reference data. The third case comes from DNA analysis
and shows a combination of evidence evaluation with and without reference data. We would like to
make clear that these examples illustrate the current practice at the laboratory, and our objective is
not to communicate recommendations about how particular cases should be handled with a scale of
conclusions, merely to show how this handling may work.
6.1 A glass case
6.1.1 Background. A side window of a car has been broken in a smash-and-grab incident. The
police later seized two people as suspects of the incident. Control glass from the broken window was
collected and the suspects’ jackets were sent for investigation.
6.1.2 Investigation and results. From the jacket of Suspect 1 about 20 glass fragments were recovered. The glass fragments were green-tinted tempered float glass with a thickness of 3.16 mm.
Four of the fragments were examined by means of refractive index (RI) with a GRIM instrument
(GRIM) before (RIbefore ) and after (RIafter ) annealing. The following data were obtained: (RIbefore )
1.52084, (RIafter ) 1.52252. The elemental composition of one of the recovered glass fragments was
also examined by means of scanning electron microscope equipped with an energy dispersive X-ray
detector (SEM/EDX). From the jacket of Suspect 2 ten glass fragments were recovered. Four of these
fragments were examined by means of RI before annealing, and the following data were obtained:
(RIbefore ) 1.52088. The recovered glass fragments were too small for further examinations.
The control glass from the broken side window was green-tinted tempered float glass with a
thickness of 3.16 mm. This glass was examined analogously to above with the following data
obtained: (RIbefore ) 1.52081 (RIafter ) 1.52248. The control glass was also examined regarding the
elemental composition in SEM/EDX.
6.1.3 Evaluation. The findings of the recovered glass fragments from Suspects 1 and 2 were
compared with the findings of the control glass. The examined glass fragments from the jackets of
Suspects 1 and 2 could not be distinguished from the control glass from the side window by means
of the above-mentioned techniques. It is important to notice that this step cannot by itself grade
the conclusion. The results of the glass examination need to be evaluated against two propositions,
the forwarded proposition (HP ), stemming from the commissioner’s question, and the alternative
proposition (HD ):
Proposition HP: ‘The examined glass fragments from the jacket of Suspect 1 originate from the
broken side window of the car.’
Proposition HD: ‘The examined glass fragments from the jacket of Suspect 1 originate from some
other glass source.’
The same propositions are used for the recovered glass fragments from Suspect 2.
The probability of obtaining matching results if the proposition HP is true is considered to be
approximately 1. This is so because we consider the potential sources of errors that may lead to
a non-match in either RI or elemental composition if this proposition is true to be negligible. The
probability of obtaining matching results if the alternative proposition HD is true depends on the
probability of obtaining matching results even though the glass fragments originate from another
glass source. To estimate the magnitude of this probability, a database is used. The approach here is
simplified in the sense that the findings will be interpreted as belonging to intervals for which probabilities are estimated. The practice at the laboratory is moving towards the use of probability densities
(cf. Lindley, 1977; Aitken et al., 2007), but a framework for this has not yet been implemented.
The glass database contains about 3000 control glasses. The control glasses have been collected
from casework over the years. Parameters that can be searched in the database are for example type
of glass, float glass or not, colour, thickness, refractive index before and after annealing. A more
relevant database to use would be the one with data from glass findings in clothes, but there is no
such database available at SKL at present. The search results of the parameters that were possible to
examine in the recovered glass fragments are shown in Tables 2 and 3. When searching in the glass
database, different search intervals are used for the parameters. For RIbefore the interval (RI average
±1 × 10^−4) is used and for RIafter the interval (RI average ±2 × 10^−4) is used. The use of different
intervals is based on studies made at SKL on the variety of RI within glass windows. Moreover,
based on studies at SKL, a glass is considered as tempered if the difference between RIbefore and
RIafter is more than 1 × 10^−3. In the database search on thickness, the interval (thickness average
±0.2 mm) is used if the glass is thinner than 6 mm. If the average thickness is more than 6 mm an
interval of ±0.3 mm is used. The use of different intervals here is based on information, provided by
glass producer, about the variety of thickness in production of glass windows.
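The search rules described above can be written down as a few small functions. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the rounding to five and two decimals is chosen only to give tidy output, and the treatment of a thickness of exactly 6 mm is not specified in the text.

```python
def ri_search_interval(ri_average, half_width):
    """Return (lower, upper) = RI average -/+ half_width, rounded to 5 decimals.

    half_width is 1e-4 for RI before annealing and 2e-4 for RI after annealing.
    """
    return (round(ri_average - half_width, 5), round(ri_average + half_width, 5))


def thickness_search_interval(thickness_mm):
    """Return the thickness search interval in mm, rounded to 2 decimals."""
    # +/-0.2 mm for glass thinner than 6 mm, +/-0.3 mm for thicker glass.
    # Assumption: exactly 6 mm is treated like thicker glass (not stated in the text).
    half_width = 0.2 if thickness_mm < 6 else 0.3
    return (round(thickness_mm - half_width, 2), round(thickness_mm + half_width, 2))


def is_tempered(ri_before, ri_after):
    """A glass is considered tempered if RI(after) - RI(before) exceeds 1e-3."""
    return (ri_after - ri_before) > 1e-3


# Values for the fragments from the jacket of Suspect 1.
print(ri_search_interval(1.52084, 1e-4))
print(thickness_search_interval(3.16))
print(is_tempered(1.52084, 1.52252))
```

For the fragments from Suspect 1 this gives the RI search interval 1.52074 to 1.52094 before annealing and the thickness interval 2.96 to 3.36 mm, and the difference in RI of 0.00168 classifies the glass as tempered.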
TABLE 2 Search results of glass fragments recovered from the jacket of Suspect 1. (Note that the frequency column contains successively lower frequencies the more specified the search becomes.)
Parameter          Search interval      Frequency in database
RIbefore           1.52074–1.52094      71/2920
+Tempered glass    Yes                  54/2920
+RIafter           1.52230–1.52270      51/2920
+Colour            Green tinted         25/2920
+Thickness         2.96–3.36 mm         11/2920
+Float glass       Yes                  11/2920
TABLE 3 Search results of glass fragments recovered from the jacket of Suspect 2
Parameter          Search interval      Frequency in database
RIbefore           1.52078–1.52098      75/2920
At the moment the comparison of the elemental composition of glasses by SEM/EDX is mainly
used for the possibility to distinguish glasses. There are plans for implementing the elemental composition of control glasses in the database as well, but these ideas have not yet been realized.
6.1.4 Conclusions of glass fragments recovered from the jacket of Suspect 1. The probability of
obtaining matching results if the proposition HP is true is considered to be approximately 1. The
probability of obtaining matching results if proposition HD is true is estimated as 11/2920, which
gives a likelihood ratio of approximately 265, and corresponds to Level +2 on the scale. The results
thus support that the examined glass fragments recovered from the jacket of Suspect 1 originate from
the broken side window of the car.
6.1.5 Conclusions of glass fragments recovered from the jacket of Suspect 2. The probability of
obtaining matching results if the proposition HP is true is considered to be approximately 1. The
probability of obtaining matching results if proposition HD is true is estimated as 75/2920, which
gives a likelihood ratio of approximately 40, and corresponds to Level +1 on the scale. The results
thus support to some extent that the examined glass fragments recovered from the jacket of Suspect
2 originate from the broken side window of the car.
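The two conclusions can be traced with a short calculation. The sketch below is illustrative only: it assumes, as in the text, that the probability of a match given HP is 1 and that the database frequency estimates the probability of a match given HD. The function names are introduced here for the example, and only the positive side of Table 1 is implemented.

```python
def likelihood_ratio(matches, database_size, prob_given_hp=1.0):
    """Likelihood ratio with Pr(match | HD) estimated by a database frequency."""
    return prob_given_hp / (matches / database_size)


def positive_scale_level(v):
    """Scale level for V >= 1 using the lower limits 6, 100, 6000 and 10^6 of Table 1."""
    for limit, level in ((1e6, 4), (6000, 3), (100, 2), (6, 1)):
        if v >= limit:
            return level
    return 0


# Frequencies from Table 2 (Suspect 1) and Table 3 (Suspect 2).
for suspect, matches in (("Suspect 1", 11), ("Suspect 2", 75)):
    v = likelihood_ratio(matches, 2920)
    print(suspect, round(v), positive_scale_level(v))
```

The likelihood ratios come out at about 265 and about 39 (reported in the text as approximately 40), which fall in the intervals for Level +2 and Level +1, respectively.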
6.1.6 Discussion. There are several factors that affect the possibility of obtaining matching results even though the glass fragments originate from another glass source. In the example, a lower
frequency was obtained when more analytical methods could be used. However, it does not necessarily have to be this way. A very common glass might not lower the frequency even though several
different methods can be used on the glass. On the other hand, a small glass fragment with a very
rare RI can present a very low frequency even though only one method is possible to use due to the
size of the fragment. In this approach, a search was made in the database to end up with a hyperrectangle, the probability of which was estimated by calculating its empirical frequency in the database.
If it can be argued that several of the measurements of the parameters in Table 2 are statistically
independent, the evidentiary strength may be increased by evaluating the final likelihood ratio as a
product of the individual likelihood ratios from the measurements of those parameters. In particular,
the three parameters RIbefore, RIafter and tempered glass could be transformed to the two parameters RIbefore and
ΔRI = RIafter − RIbefore , the latter taking into account the issue of whether the glass is tempered
or not but on a continuous scale (cf. Zadora, 2009). However, the issue of independence has not yet
been investigated at the laboratory but will be part of a more general framework for glass analysis in
the future.
6.2 A paint case
6.2.1 Background. A red Volvo ran into a blue Saab and the Volvo left the scene before the
police arrived. The damaged front left wing area of the blue Saab was investigated and a red paint
flake was recovered. Later the police found a red Volvo in a parking place with scratches on the right
front door. Comparison paint from the damaged area of the door was collected.
6.2.2 Investigation and results. The red paint flake recovered from the blue Saab consisted of
five layers: a primer, a primer surfacer, a basecoat, a clearcoat and one layer of repair paint. The paint
flake was examined layer by layer by means of a light microscope equipped with reflected visible
light, transmitted visible light, ultraviolet light and polarized light. The layers were also examined
with Fourier Transform Infrared Spectroscopy (FTIR) and SEM/EDX. The comparison paint from
the red Volvo, which consisted of four industrial painted layers and one layer of repair paint, was
examined the same way as the recovered paint flake from the blue Saab.
6.2.3 Evaluation. The results from the analysis of the recovered paint flake were compared layer
by layer with the comparison paint flake. The layers of the recovered paint flake could not be distinguished from the corresponding layers of the comparison paint flake by means of the above-mentioned techniques. As for the glass case in Section 6.1, this step cannot by itself grade the
conclusion. Similar to the glass investigations, the results from the paint investigation are evaluated against two non-overlapping propositions, the forwarded proposition (HP ) and an alternative
proposition (HD ).
Proposition HP : ‘The red multilayered paint flake recovered from the blue Saab originates from
the red Volvo’.
Proposition HD : ‘The red multilayered paint flake recovered from the blue Saab originates from
another car’.
How an alternative proposition is chosen depends on the information of the questioned material.
For instance, if it were obvious that the questioned material is from a car, then a suitable alternative
proposition would be that the questioned material originates from another car. On the other hand,
if the recovered material is just a paint flake that could originate from other items, not just a car,
a suitable alternative proposition could be that the recovered paint flake originates from something
else (another painted object). The propositions above are chosen for illustrative purposes.
In contrast to a glass investigation, where a database is available for the evaluation of
the findings, a paint evaluation is made without an explicit database. The probability of obtaining
matching results if the proposition HP is true is considered to be close to 1. The second step in the
evaluation process is to consider how likely it is to obtain matching results given that the alternative
proposition is true, i.e. the probability of obtaining matching results even though the paint originates
from another source. It is generally agreed by the paint examiners at the laboratory that a match
between two samples of industrial car paint in four layers is quite rare. It would occur in less than 1
of 100 cases if HD is true, which gives a level of at least +2 on the scale. When there are a couple of
additional layers to the industrial paint layers and/or a cross-transfer of paint layers, the probability
of obtaining matching results if HD is true may be considered to be so small that the strongest level
of support can be used, i.e. Level +4.
6.2.4 Conclusions of paint flake recovered from the blue Saab. The likelihood ratio has not been
estimated numerically. The conclusion is based upon logical reasoning. The probability of obtaining
matching results if the proposition HP is true is considered very high. Owing to the repaint layer, the probability of obtaining
matching results if the alternative proposition HD is true is considered to be smaller than if the
paint flake had consisted of only industrial coating. Although the match is
unexpected, it could not be classified as almost unique (see further the discussion below).
The results therefore strongly support (Level +3) that the recovered red multilayered paint flake
originates from the red Volvo, from which the comparison paint is collected.
6.2.5 Discussion. The judgement leading forward to the conclusion that four matching layers of
industrial paint would correspond with Level +2 on the scale is built on the common experience
collected by the paint examiners at the laboratory and collected information about how cars are
painted and the variety of paint colours. In Table 4, the appreciated levels of knowledge of commonness about car paint among paint examiners at SKL are listed. As can be seen from the table, the
knowledge is concentrated to the number of layers and the colour.
At the moment SKL does not have a reference database on car paints. However, there is a
database (EUCAP), containing information about industrial paint layers (among other things), that
is used for example in Germany and Belgium. One problem with EUCAP from a Swedish point of
view is that it is not known how well it reflects the Swedish car market. One possibility to reach
a more precise estimate of the rarity of four particular layers of industrial paint is to make a
detailed review of historical case files at the laboratory. The figure used (less than 1/100) is to
a large extent the result of an informal review of such cases, but there has not yet been any
attempt to formalize it.
TABLE 4 Appreciated levels of knowledge of commonness about car paint
among paint examiners at SKL

Parameter           Level of commonness knowledge
Number of layers    High
Colour              Moderate
IR                  Low
SEM/EDX             Low
Fluorescence        Low
Polarization        Low
In cases where there are layers of repair paint, repaint layers and/or cross-transfer of paint in
addition to the industrial paint layers, a cognitive process of evaluation can be accepted. There
should be consensus that when there are a couple of additional layers of paint on top of the
industrial paint layers and/or cross-transfer of paint layers, a match could in principle not be
found anywhere else than in this specific case. It cannot be stated as a fact that the Volvo was the
source of the paint on the Saab (and vice versa for the cross-transfer), but the leap of faith is so
small that Level +4 may be the most natural choice in that case. When there is no cross-transfer of
paint layers but still at least one extra layer (beyond the four industrial paint layers), the match
is still unexpected, but a conservative choice of level would be +3.
6.3 A DNA case
6.3.1 Background. A burglary was reported in a residential housing estate. Small bloodstains
were found adjacent to a rummaged wardrobe and later recovered by the police. A couple of days
later the police seized one individual as a suspect of the incident. Subsequently, the suspect was
swabbed for DNA. Blood from the crime scene and the reference DNA sample were sent for
investigation.
6.3.2 Investigation and results. The stain swabbed at the crime scene tested positive for blood
with the traditional presumptive leucomalachite reagent. DNA amplification revealed a partial DNA
profile from a single male donor, with results in only six of the autosomal STR markers amplified.
The DNA was also typed for Y-chromosomal DNA, giving a haplotype in all 17 STR markers analysed.
6.3.3 Evaluation. The DNA profile from the suspect was compared against the partial DNA profile,
revealing a match. DNA from the suspect was also typed for Y-chromosomal DNA and the haplotype
obtained was the same. The results are to be evaluated against two propositions, the forwarded
proposition (HP ), stemming from the commissioner’s question, and the alternative proposition (HD ):
Proposition HP : ‘The blood found on the crime scene originates from the suspect’.
Proposition HD : ‘The blood found on the crime scene originates from another (non-related)
individual than the suspect’.
The random match probability for the autosomal DNA result was calculated as 1 in 743 000, using a
Swedish population database and taking into consideration a 1% FST correction (Wright, 1951) and a
lowest allele frequency value of 2%.
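How a figure of this kind can arise is sketched below. The text does not state which formula was applied for the FST correction; the sketch assumes the commonly used Balding-Nichols genotype probabilities. The function names and the six allele-frequency pairs are hypothetical placeholders, not values from the Swedish database, so the printed number is illustrative only.

```python
from math import prod

THETA = 0.01     # 1% FST correction, as stated in the text
MIN_FREQ = 0.02  # lowest allele frequency value of 2%, as stated in the text


def genotype_probability(p, q=None, theta=THETA, min_freq=MIN_FREQ):
    """Match probability for one marker (assumed Balding-Nichols form).

    p is the frequency of the first allele; q is the frequency of the
    second allele, or None for a homozygous genotype.
    """
    denom = (1 + theta) * (1 + 2 * theta)
    p = max(p, min_freq)  # apply the minimum allele frequency
    if q is None:
        return (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p) / denom
    q = max(q, min_freq)
    return 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q) / denom


def random_match_probability(genotypes, theta=THETA, min_freq=MIN_FREQ):
    """Multiply the per-marker probabilities (markers assumed independent)."""
    return prod(genotype_probability(p, q, theta, min_freq) for p, q in genotypes)


# Hypothetical allele frequencies for a six-marker partial profile.
profile = [(0.10, 0.20), (0.15, None), (0.08, 0.12),
           (0.25, 0.05), (0.01, 0.30), (0.18, 0.22)]
rmp = random_match_probability(profile)
print(f"Random match probability: 1 in {1 / rmp:,.0f}")
```

With theta set to 0 the two expressions reduce to the familiar p squared and 2pq, which is a convenient sanity check on the sketch.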
The rarity of the crime stain Y-chromosomal haplotype was assessed with the YHRD database
(YHRD.ORG 3.0, 2011). The haplotype in question has previously not been reported but belongs to
the most common haplogroup in Sweden, I1a* (Karlsson et al., 2006).
6.3.4 Conclusions for blood stains recovered from the crime scene matching the suspect. The
probability of obtaining matching results if proposition HP is true is considered to be approximately
1. The probability of obtaining matching results if proposition HD is true is calculated to 1/743 000
for the autosomal DNA, which gives a likelihood ratio of 743 000, corresponding to Level +3 on
the scale. Thus, according to the scale, the results strongly support that the bloodstain found on the
crime scene of the burglary originates from the suspect.
The Y-chromosomal haplotype of the crime stain matches the suspect’s haplotype. The haplotype
has not previously been reported to the database. As a general approach, a matching 17 STR
haplotype alone would, despite a match in a database and even without a given number, render a
Level +1 conclusion using the scale. The haplotype frequency for the most common haplotype found in
Sweden is 5.8% (Holmlund et al., 2006), which corresponds to 1 in 17. This figure can be used to
approximate the value of the haplotype obtained, giving a moderate likelihood ratio of 17.
By combining the objective value for the autosomal STRs with the approximate value for the
Y-chromosomal STRs simply using the product rule, a combined approximate likelihood ratio of
12 600 000 (743 000 × 17) is obtained. The combined result corresponds to a Level +4 conclusion
on the scale. However, it is not necessary to force an explicit approximate likelihood ratio based on
haplotype data to attain Level +4 for the combined values of evidence. The match itself renders a
Level +1 on the scale and it suffices to use the lower limit 6 (Table 1) as a likelihood ratio proxy
since 743 000 × 6 = 4 458 000, which exceeds the lower limit for Level +4.
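The arithmetic of this paragraph can be sketched in a few lines. Only the lower limit 6 for Level +1 is quoted in this section; the other limits in `LOWER_LIMITS` are assumed to be those of Table 1 and should be checked against it. The function name `scale_level` is hypothetical, and the product rule presupposes that the autosomal and Y-chromosomal results are treated as independent.

```python
from bisect import bisect_right

# Lower limits of the likelihood ratio for Levels +1 to +4.
# Assumption: values taken to match Table 1; only the 6 is quoted in this section.
LOWER_LIMITS = (6, 100, 6_000, 1_000_000)


def scale_level(lr):
    """Return the scale level (0 to +4) for a likelihood ratio of at least 1."""
    if lr < 1:
        raise ValueError("this sketch only covers likelihood ratios >= 1")
    # Number of lower limits that lr reaches or exceeds.
    return bisect_right(LOWER_LIMITS, lr)


lr_autosomal = 743_000           # 1 / (1/743 000)
lr_y_approx = round(1 / 0.058)   # most common haplotype frequency 5.8%, about 1 in 17
lr_y_proxy = LOWER_LIMITS[0]     # lower limit of Level +1 used as a proxy

for label, lr_y in (("approximate", lr_y_approx), ("proxy", lr_y_proxy)):
    combined = lr_autosomal * lr_y  # product rule
    print(f"{label}: {lr_autosomal} x {lr_y} = {combined}, Level +{scale_level(combined)}")
```

Both routes end at the same level, which is the point made in the text: the conclusion does not depend on forcing an explicit haplotype frequency.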
6.3.5 Discussion. Lineage markers are at their strongest in kinship and identification cases. In
crime cases, Y-chromosomal DNA analysis is often connected to sexual crime, where DNA mixtures tend
to show the culprit’s DNA as a minor component and the victim’s DNA as the overwhelming part, and
where there is a need to add evidential strength or to exclude a suspect. A partial single donor
profile, as used in this example, can also be of interest to extend with Y STRs. Depending on the
case investigation as a whole, a Level +3 conclusion for a crime scene stain might not be strong
enough evidence for a court conviction.
Interpreting the power of the evidence for lineage markers like Y STRs is challenging or even
problematic (Palo et al., 2007). Despite increasing knowledge and growing databases, haplotype
occurrences are still not well known, and in addition the traits of heritage complicate it all.
What weight can be assigned and how should it be reported? Are the haplotype results obtained
supported by any autosomal DNA results?
The scale of conclusions used at SKL together with the experience of the forensic scientist are
important underlying factors for combining results, even if background data clarifying a finding’s
value is meagre. The combination of autosomal and Y-chromosomal DNA has been performed at
SKL since 2005. A general and moderate approximation of the value of the obtained haplotypes has
been used. Approaches using joint match probabilities have been proposed by, for example, Walsh
et al. (2008). Such a combination is used in paternity investigations (Gjertson et al., 2007), but
it will obviously add no value if male relatives sharing the same haplotype are part of the
alternative hypothesis. This should be the case for crime stains as well. The approach of combining
autosomal and lineage marker results is generally not adopted or even accepted throughout the
forensic DNA community (cf. de Knijff, 2003; Amorim, 2008). According to Buckleton et al. (2011),
combining autosomal and lineage markers is mainly a matter of communication, not calculation.
Supporting a reported autosomal DNA match in court with a database-generated haplotype frequency
is prone to result in an overestimation of the evidential value. Reporting an unweighted haplotype
match leaves it entirely to the court to decide. Having both autosomal and Y-chromosomal DNA
results for the same stain will, with or without theoretical and statistical support (in the
statement or by testifying in court), add to the difficulties for the court to assess the
evidential value. If the scientist cannot interpret or approximate the value of a scientific result
in a written statement, then how could it be possible to do so by testifying in court or, in the
worst case, to just leave it to the court alone to assess? In the end, whatever route is taken, the
court will somehow sum up the DNA results and all other evidence presented. There is an obvious
risk of random over- or underestimation between different courts and different cases.
The overall knowledge on haplotype occurrence and distribution at, for instance, local and regional
levels is still limited. One moderate approach, obviously not exact in detail, is to assess the
value using the frequency of the most frequent haplotype reported in the specific country or
region of interest.
7. Concluding remarks
Reporting the value of evidence on a scale is not controversial in itself; it is merely a way of
simplifying the interpretation of numbers. However, this study has shown that the construction of a
scale is far from trivial and this is particularly true when it comes to choosing the levels of the scale
with respect to the underlying likelihood ratio. Apart from the obvious baseline of the scale representing the inconclusive state, our work has been built on the limitation to four levels. Although it
might be agreed that the distances between successive levels should increase with level (it would
not even be possible to have equal distances as the likelihood ratio has no upper limit), there are no
judicial or common knowledge grounds for how to select the distances. Our choice is mathematically based on the anchoring of three levels and the anchoring itself is a mixture between what is
commonly accepted as reasonable posterior odds for conviction and what the limits in a Swedish
population are.
We think that there should be no more than one level between the inconclusive baseline and the
level where we consider the evidence to be clearly supportive of one of the forwarded propositions.
However, the number of levels above the latter may depend on general sizes of the populations
involved when obtaining the likelihood ratio. The interpretation of DNA evidence can be helpful
here; the estimated size of the population of possible donors of a stain is a good base for the lower
limit of the likelihood ratio for the highest level. If that limit is very high, there is need for more than
four levels above the baseline, but the mathematical construction of interval lengths still applies.
Once a numerical likelihood ratio has been obtained, its reporting on the scale is trivial. Notwithstanding, the scale can be used even if the likelihood ratio cannot be numerically estimated. This is
not a novelty within forensic science but has been part of evidence evaluation long before the logical
approach with likelihood ratios was established. Part of the current study has aimed to show more
formally how a non-numerical value of evidence can be incorporated into the logical framework of
evidence evaluation. A unified scale of conclusions within a laboratory will make this procedure
easier to carry out, especially when the scale is used continuously in in-house calibration and
training.
An extra feature with the type of scale presented in this paper is the possibility to use Table 1 (the
translation of intervals of likelihood ratios to scale levels) backwards. Once a scale level has been
decided for some particular findings, it is possible (though maybe conservative) to find a lower limit
for the likelihood ratio, which in turn may be used if these findings are to be combined with other
findings (conditionally independent of the former) of the same criminal case. Instead of leaving the
issue of combination to the court, the forensic laboratory may thus investigate the total evidentiary
strength of the findings addressed at activity level propositions.
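The backward use of Table 1 can be illustrated with a minimal sketch. The lower limits below are assumed to match Table 1 (only the value 6 for Level +1 is quoted in the DNA example), the function names are hypothetical, and the sketch covers only supportive levels +1 to +4 for findings that are conditionally independent.

```python
from math import prod

# Assumed lower limits of the likelihood ratio per level; check against Table 1.
LOWER_LIMITS = {1: 6, 2: 100, 3: 6_000, 4: 1_000_000}


def conservative_combined_lr(levels):
    """Lower bound on the combined likelihood ratio of independent findings.

    Each finding is represented only by its reported level, so the lower
    limit of that level stands in for its (unknown) likelihood ratio.
    """
    return prod(LOWER_LIMITS[level] for level in levels)


def level_from_lr(lr):
    """Highest level whose lower limit is reached; 0 if none is reached."""
    return max((lvl for lvl, limit in LOWER_LIMITS.items() if lr >= limit), default=0)


# Two findings reported at Level +3 and Level +1 without numerical values.
bound = conservative_combined_lr([3, 1])
print(bound, level_from_lr(bound))
```

Because every finding is replaced by the lowest likelihood ratio compatible with its level, the combined level obtained this way can only understate, never overstate, the total strength.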
Acknowledgements
The authors would like to thank colleagues at the Swedish National Laboratory of Forensic Sciences,
in particular members of the evidentiary value project group: Inger Wistedt, Jenny Elmqvist, Tobias
Höglund, Mirja Lenz Torbjörnsson, Jane Palmborg, Siw Sullivan and Ing-Marie Wigilius, for valuable inputs, Anna Emanuelson for discussions regarding the glass and paint examples and Staffan
Jansson for discussions regarding the DNA examples. The authors would also like to thank Professor
Christophe Champod, University of Lausanne for his comments and two anonymous reviewers for
valuable inputs.
REFERENCES
AITKEN, C. G. G. AND TARONI, F. (2004). Statistics and the Evaluation of Evidence for Forensic Scientists.
2nd ed. Wiley, Chichester.
AITKEN, C. G. G., ZADORA, G. AND LUCY, D. (2007). A two-level model for evidence evaluation. Journal
of Forensic Sciences 52, 412–419.
AMORIM, A. (2008). A cautionary note on the evaluation of genetic evidence from uniparentally transmitted
markers. Forensic Science International: Genetics 2, 376–378.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer-Verlag, New York.
BROEDERS, A. P. A. (1999). Some observations on the use of probability scales in forensic identification.
Forensic Linguistics 6, 228–241.
BUCKLETON, J. (2005). A framework for interpreting evidence. In: Forensic DNA Evidence Interpretation (J.
Buckleton, C.M. Triggs, S.J. Walsh eds.), 27–63. CRC Press, Boca Raton, FL.
BUCKLETON, J. S., TRIGGS, C. M. AND CHAMPOD, C. (2006). An extended likelihood ratio framework for
interpreting evidence. Science & Justice 46, 69–78.
BUCKLETON, J. S., KRAWCZAK, M. AND WEIR, B. S. (2011). The interpretation of lineage markers in
forensic DNA testing. Forensic Science International: Genetics 5, 78–83.
CADDY, B. AND COBB, P. (2004). Forensic science. In: Crime Scene to Court. The Essentials of Forensic
Science. 2nd ed. (P.C. White ed.), 1–20. The Royal Society of Chemistry, Cambridge.
CECI, S. J. AND FRIEDMAN, R. D. (2000). The suggestibility of children: scientific research and legal
implications. Cornell Law Review 86, 33–108.
EVETT, I. W., JACKSON, G., LAMBERT, J. A. AND MCCROSSAN, S. (2000). The impact of the principles of
evidence interpretation on the structure and content of statements. Science & Justice 40, 233–239.
GJERTSON, D. W., BRENNER, C. H., BAUR, M. P., CARRACEDO, A., GUIDET, F., LUQUE, J. A., LESSIG,
R., MAYR, W. R., PASCALI, V. L., PRINZ, M., SCHNEIDER, P. M. AND MORLING, N. (2007). ISFG:
Recommendations on biostatistics in paternity testing. Forensic Science International: Genetics 1,
223–231.
HEDMAN, J., ALBINSSON, L., ANSELL, C., TAPPER, H., HANSSON, O., HOLGERSSON, S. AND ANSELL,
R. (2008). A fast analysis system for forensic DNA reference samples. Forensic Science International:
Genetics 2, 184–189.
HOLMLUND, G., NILSSON, H., KARLSSON, A. AND LINDBLOM, B. (2006). Y-chromosome STR haplotypes
in Sweden. Forensic Science International 160, 66–79.
DE KNIJFF, P. (2003). Son, give up your gun: Presenting Y-STR results in court. Profiles in DNA 6(2), 3–6.
KARLSSON, A. O., WALLERSTRÖM, T., GÖTHERSTRÖM, A. AND HOLMLUND, G. (2006). Y-chromosome
diversity in Sweden - a long time perspective. European Journal of Human Genetics 14, 963–970.
LINDLEY, D. (1977). A problem in forensic science. Biometrika 64, 207–213.
NFI (2008). Vakbijlage Reeks waarschijnlijkheidstermen - versie 2.0. The Netherlands Forensic Institute.
NORDGAARD, A., WISTEDT, I., DROTZ, W., ELMQVIST, J., HÖGLUND, T., JAEGER, L., TORBJÖRNSSON,
M. L., PALMBORG, J., SULLIVAN, S. AND WIGILIUS, I. (2010). Uppfattning av värdeord i
sakkunnigutlåtanden - En studie genomförd bland olika aktörer i rättsprocessen i Sverige. SKL Rapport
2010:01. Swedish National Laboratory of Forensic Sciences.
PALO, J. U., HEDMAN, M., ULMANEN, I., LUKKA, M. AND SAJANTILA, A. (2007). High degree of
Y-chromosomal divergence within Finland - forensic aspects. Forensic Science International: Genetics 1,
120–124.
SJERPS, M. AND BIESHEUVEL, D. (1999). The interpretation of conventional and ‘Bayesian’ verbal scales for
expressing expert opinion: a small experiment among jurists. Forensic Linguistics 6, 214–227.
THOMPSON, W. C., TARONI, F. AND AITKEN, C. G. G. (2003). How the probability of a false positive affects
the value of DNA evidence. Journal of Forensic Sciences 48, 47–54.
WALSH, B., REDD, A. J. AND HAMMER, M. F. (2008). Joint match probabilities for Y chromosomal and
autosomal markers. Forensic Science International 174, 234–238.
WRIGHT, S. (1951). The genetical structure of populations. Annals of Eugenics 15, 323–354.
YHRD.ORG 3.0 (2011). Y-STR Haplotype Reference Database. www.yhrd.org. Date-of-visit 2011-03-28.
ZADORA, G. (2009). Evaluation of evidence value of glass fragments by likelihood ratio and Bayesian Network
approaches. Analytica Chimica Acta 642, 279–290.