Y-STR Workhshop – JBallantyne

Current Status and Future of
Y-STR Analysis
AFDAA Workshop June 2012
Jack Ballantyne, PhD
National Center for Forensic Science
University of Central Florida
UCF
Schedule
Part 1. Biology and Forensic Uses
• Y Chromosome Biology
• Forensic Use of Y-STRs
– Development and Validation of Y-STR
Multiplexes
– Extended Interval Post-Coital Samples
– Improved Detection of Male DNA
• Future Developments
Part 2. Y-STR Match Statistics
• Basic Y-STR statistics
• Use of the National US YSTR Database
Y Chromosome Biology
Function of Y Chromosome?
• Aneuploidies for the X and Y
– 47, XXY (Klinefelter synd.)  males
• 48, XXXY; 49, XXXXY; 50, XXXXXY 
males
– 45, XO (Turner synd.)  females
– 47, XXX (triple-X karyotype) ‘normal’
female
– 47, XYY karyotype ’normal’ male
• Sex Reversed Humans
– XY  female (Y minus TDF)
– XX  male (X plus TDF)
5
6
Classic View of Y-Chromosome
TDF master gene
patrilineal inheritance
no recombination in NRY
recombination in PAR
junk-rich, gene poor
7
SDA: sex determining allele
PAR: psedo-autosomal region
SRY: sex determining region on the Y
9
Y Chromosome NRY is a Mosaic of
Discrete Sequence Classes
Euchromatin = 23 Mb
10
11
Mutation rates:
Y Filer: ~ 3 x 10 -3
RM: ~2 x 10-2
12
13
Forensic Uses of YChromosome STRs
Reasons Y?
• Males
– 80% of all violent crime
– 95% of all sex offenses
– (Criminal Victimization in United States, BJS 2001)
• When trying to determine the genetic profile of the
male donor in a male/female DNA admixture (when
F/M > 20, often >1000) and autosomal STR analysis
fails (is not informative) or not possible
• Determination of number of semen donors
• Missing persons (MP)
– haplotype of MP determined by typing male relative
• son, brother, father, uncle, nephew
Reasons Y?
• Additional statistical discrimination
– mixture/partial profiles
• Familial Searching
– Reduce number of potential male relatives obtained by low
stringency match of sample profile to offender database
• RM Y-STRs may differentiate father/son/grandfather
etc
White et al:
Y-PLEX-6TM
Y-PLEX-12TM YFilerTM
released
novel YSTRs
released
released
(ABI)
(Reliagene)
(Reliagene)
Ayub et al:
Kayser et al:
SWGDAM
novel YSTRs
novel YSTRs
core
1997
1998
Kayser et al:
‘minimal
haplotype’
1999
2000
2001
Prinz et al:
forensic
validation
(pentaplex)
UCF: begin Ychromosome
project
2002
2003
2004
Y-PLEX-5TM
released
PowerPlex® Y
released
(Reliagene)
(Promega)
Y-STR Development Timeline
Hanson et. al:
Y-STR
physical map
2005
2006
US YSTR
Database
released
K. Ballantyne et al:
Rapidly mutating
YSTRs
PowerPlex®
Y23
(RM loci)
(Promega)
(online)
2007
2008
2009
2010
2011
2012
“Minimal Haplotype” Loci
(Kayser et. al. 1997):
 DYS 19
DYS 391
DYS 389I DYS 389II DYS 390,
DYS 392 DYS 393
DYS 385 I/II
• First used in Europe
• Formed the basis for multiplex development in
the U.S.
SWGDAM Core Loci
• Recommended in 2003
• DYS19 DYS389I DYS389II DYS390,
DYS391 DYS392 DYS393 DYS385 I/II
• Also: DYS438, DYS439
Commercially Available Y-STR Multiplex Kits
• Reliagene: Y-PLEX-5, -6, and -12
– No longer available
• ABI: YfilerTM
– 17 loci in a single reaction
• Promega: PowerPlex YTM
– 12 loci in a single reaction
Novel
Markers?
DYS388
DYS463
DYS522
DYS588
DYS688
DYS425
DYS464
DYS525
DYS590
DYS698
DYS385
DYS426
DYS468
DYS527
DYS591
DYS703
DYS389I/II
DYS434
DYS476
DYS531
DYS593
DYS707
DYS435
DYS481
DYS533
DYS594
Y-GATA-A7.1
DYS436
DYS484
DYS534
DYS596
Y-GATA-A7.2
DYS391
DYS441
DYS485
DYS535
DYS598
Y-GATA-A10
DYS392
DYS442
DYS488
DYS540
DYS607
YAP
DYS443
DYS490
DYS543
DYS617
DYS444
DYS494
DYS549
DYS622
DYS437
DYS445
DYS495
DYS552
DYS627
DYS438
DYS446
DYS497
DYS556
DYS630
DYS447
DYS499
DYS557
DYS631
DYS449
DYS503
DYS561
DYS634
DYS448
DYS450
DYS505
DYS562
DYS636
DYS456
DYS452
DYS508
DYS570
DYS638
DYS453
DYS510
DYS572
DYS641
DYS454
DYS511
DYS575
DYS643
Y-GATA-C4
DYS455
DYS513
DYS576
DYS637
Y-GATA-H4
DYS459
DYS520
DYS578
DYS660
DYS462
DYS521
DYS587
DYS685
DYS19
DYS390
DYS393
DYS439
DYS458
NCFS LOCI
SWGDAM core
+ commercial
kits
NCFS Developed Multiplexes
•
•
•
•
•
•
•
•
•
•
Multiplex I (9 loci)
Multiplex II (10 loci)
Multiplex III (8 loci, 1 InDel)
Multiplex IV (19 loci)
Multiplex V (13 loci)
Multiplex VI (14 loci)
Multiplex VII (12 loci)
Multiplex VIII (11 loci)
Multiplex IX (7 loci)
Multiplex X (6 loci)
= 109
Markers
*Additional 25 Loci screened and rejected 
34% of known loci evaluated
Sexual Assault Investigations
SEXUAL ASSAULT INVESTIGATION
• Assume that a woman (victim, V) is raped by an
individual (perpetrator, P) but shortly before this
she has consensual intercourse with her
husband/boyfriend (B)
• Before consideration of genetic marker results:
– semen may only have come from P
– semen may only have come from B
– semen may comprise a mixture of semen from P and
B
– semen may not be present
• Absence of semen from P does not mean that P
did not rape V
– semen from P does not prove he raped V (could be
consensual)
Occasional Problem
•
•
•
some rape victims provide vaginal samples > 2436 hours after the incident
ability to obtain an autosomal STR profile of the
semen donor from the living victim diminishes
rapidly as the post-coital interval is extended (>
24-48h difficult, >72 h not normally possible)
from classical forensic serology and reproductive
biology studies
– sperm persist in the post-coital vaginal canal up to 3
days after intercourse
– from the medical literature sperm may be detectable
up to 7 days after intercourse in the cervix
•
sperm are few in number after these extended intervals
Sperm Loss Over Time
1.
2.
3.
vaginal lavage
vaginal drainage
normal intra-cervicovaginal sperm degradative changes
–
4.
5.
sperm become damaged and fragile
below analytical sensitivity of the test
during the differential extraction process within the laboratory
–
–
multiple manipulations of the sample
few remaining sperm lyse into female fraction
» kinetics of the PCR process
»
»
»
majority component in an admixture will titrate out critical PCR
reagents
female/male DNA ratio > 300/1
failure to type the male component
Potential Solutions to Problem?
• use Y-STRs
– theoretically can detect male in female/male
admixtures
• ‘ignores’ female component
– no differential extraction
• avoids unnecessary sample loss
• low copy number approach
– add large quantities of DNA (300-450 ng)
• to permit adequate sampling of small number of still
persisting sperm
– increased PCR cycle number (34-36 cycles)
• ensure cervicovaginal sampling
– low to mid-vaginal sampling may not recover
sperm after extended post coital interval
Not All Y-STRs Are Made Equal
• Y chromosome retains high level of sequence
homology with the X
• Different Y-STR primers will possess different degrees
of homology with the X chromosome
Early Work Conclusions
• it is possible to obtain the genetic profile of
the semen donor in postcoital cervicovaginal
samples recovered up to 4 d after intercourse
– use of carefully selected Y-STR markers
• chosen for their ability to detect the male donor in an
overwhelming quantity of background female DNA
–
–
–
–
no differential extraction
300-450 ng of input template DNA
increased cycle number (34-35 cycles)
cervicovaginal sampling
• such strategies may significantly impact the
recovery and processing of rape evidence
More recent work
• Number of DNA profile enhancement strategies
employed
– Cervical brushing
– Differential versus non-differential DNA extraction
– Post-PCR purification
• Standard manufacturers’ cycling conditions used
• Y-STR profiles
– Full profiles routinely 3-5 days after intercourse
– Profiles > 6 days after intercourse but mainly partial
– Use of post PCR purification significantly improved
ability to obtain profile, especially from the 5-6 day
samples
– Better profiles with differential lysis (sperm fraction)
– 8 locus Y-STR profile from 7 day post-coital sample
• Approaching limit for sperm detection in cervix
Enhanced DNA Profiling for
Detection of the Male Donor in
Trace DNA Samples
Nested-PCR Strategy
MinElute purification to
remove excess primers
Nested PCR Pre-Amplification Multiplex
Yfiler Amplification kit (ABI)
Improvement in allele recovery and signal intensity with prior pre-amplification
5pg Input Male DNA (Single Source)
Without Pre-Amp
With Pre-Amp
LCM – 1 epithelial cell
1 cell
Without pre-amplification
(1/17 alleles)
1 cell
With pre-amplification
(11/17 alleles)
Door Handle
Without pre-amplification
(0/17 alleles)
With pre-amplification
(17/17 alleles)
7 days after intercourse
Without pre-amplification
(5/17 alleles)
With pre-amplification
(16/17 alleles)
FUTURE?
• Improved Discrimination
Additional Y-STR Markers (new kits)
• Incorporation into CODIS or other local
investigative Databases
• Ability to distinguish males from same
lineage
• Surname calling card
Novel Y-STR Markers
• Promega: PowerPlex Y23 TM
– 23 loci in a single reaction
– Not yet available
• RM loci
• Work done at NCFS
PowerPlex® Y23
• 17 Y-STR loci currently contained in commercially
available Y-STR kit (YfilerTM)
• Six new highly discriminating Y-STR loci:
– DYS481, DYS533, DYS549, DYS570, DYS576,
DYS643
• Adds more evidential power to a forensic Y-STR
analysis
• Lebanese population (n = 502)
– Y Filer 431 unique haplotypes, 86% unique
– Y-23 478 unique haplotypes, 95% unique
RM loci
• With current Y-STRs, often a failure to distinguish
between males of the same lineage
• Possible to utilize highly mutating Y-STR loci
– Differences between males in same lineage
• 13 Y-STRs with the highest mutability recommended
– “RM Y-STR” loci
– Ballantyne, K. et al. Am. J. Hum Genet 2010
• Likely to have significant impact on Y-STR analysis
– Additional discriminatory power to currently used loci
– In some cases differentiate male lineage relatives
Example (RM Loci)
Son (Twin 1)
Father
Son (Twin 2)
Statistics
“A detailed
understanding of
the influence of all
factors on the
evolution of profile
proportions
requires a lifetime
of study, and
more”
'Some people believe
football is a matter
of life and death.
I'm very disappointed
with that attitude.
I can assure you it is
much, much more
important than that.‘
Balding 2005
Bill Shankly, 1960s
“Those who cannot learn
from history are doomed
to repeat it”
George Santayana
– The Life of Reason (1905-1906)
Tasteful Statistics Jokes
• A statistician is a professional who diligently
collects facts and data and then carefully draws
confusions about them
• You can lie with statistics but even better without
• Statistics means you never have to say you’re
certain (wrong)
DNA Mixture Interpretation
Presentation Title
47
Seriously True Precepts
• All models are wrong. Some are useful.
– George Box
• There are no facts, only interpretations.
– Frederick Nietzsche
DNA Mixture Interpretation
Presentation Title
48
Basic Y-STR Interpretation Guidelines

Similar general issues to autosomal STRs
• Thresholds for detection and interpretation
• Stutter
• Probability of a match (STATISTICS!)
• Mixtures
Profile and Match Probabilities
• estimating the rarity of a Y DNA profile is performed
differently than for autosomal DNA markers
• because of linkage, each haplotype is treated as an allele
and the total number of possible haplotypes comprise
the alleles of a single locus
– Composite multi-locus profile is treated as a single locus or
haplotype
• no evidence for recombination across the majority of the
Y-chromosome
• cannot employ the product rule to estimate the rarity of
the Y types in a profile
Current Approach: Counting Method
• The counting method is very simple
• A Y-haplotype (evidence sample) is compared to a reference
database(s) of unrelated individuals
• The number of times the Y-haplotype is observed in a database
– The size of a database can be and is often limited
– With databases (e.g., n = 100 to 20000), many possible haplotypes
will not be observed and there will be sampling error
• A confidence interval can be placed on the observation (ie
statistical sampling correction)
– Can convey with a high degree of confidence that the rarity of the
evidence Y-haplotype among unrelated individuals in a given
population(s) is less than the upper bound of the estimate
Estimation of haplotype frequencies
• If haplotype A is seen x times in a sample of n
~
haplotypes, pA = x / n
• Tendency to
– regard this quantity as being normally distributed
– Acknowledge sampling variation with a 100(1-a) %
confidence interval in the form of:
For 95% confidence, a = 0.05 so that Z(1-a/2) = 1.96 and Z(1-a) = 1.645
Estimation of Haplotype Frequencies
• The actual sampling distribution is binomial
• This is approximately normal ONLY for
*
moderately large
values of n pA (>5)
Estimation of haplotype frequencies
• An exact confidence interval based on the binomial
distribution was described by Clopper and Pearson
• An upper 100(1-a) % one-sided limit po is found by
solving:
– If the sample does not contain any copies of A, then po =
1 –a1/n
– For a 95% confidence limit, this is approximately equal to
3/n once n is 100 or more
Clopper-Pearson Exact Confidence Interval
Clopper, C.J. and E.S. Pearson, The use of confidence or fiducial intervals illustrated in
the case of the binomial. Biometrika (1934). 26: p. 64-69.
n k
nk



 k  p0 (1  p0 )  0.05
k 0  
x
The formula is the cumulative binomial distribution for all values from 0 matches to k
matches given a database size of N and a frequency of p. Since N and k are fixed
after a search, the goal is to determine the p at which 95% of the observations are
expected to be more than k, and 5% of the observations (the 0.05 in the formula) are
expected to be between 0 and k. This program increases p by small increments until
this balance point (~95% of the possible comparisons expected to give you >k
matches, and ~5% expected to give you k or fewer matches) has been reached. By
finding what amounts to a left-hand 1-sided 95% confidence interval (i.e., the lower
limit) for the distribution of possible matches given the frequency p (all as it relates to
the k and N observed from the search), this then also provides an 95% upper limit for
p. Beyond that point, it is considered too unlikely that a haplotype with a more
common frequency would give so few matches.
55
If the haplotype has not been observed
in the database, then:
The upper (95%) bound of the CI is
1-(0.05)1/N
Or the ‘rule of 3’ = 3/N
Exact vs. Normal Confidence Intervals
n
X
P
HP
(1-tail)
HP
(2-tail)
CP
100
1
0.01
0.026
0.029
0.047
2
0.02
0.043
0.047
0.062
10
0.10
0.149
0.159
0.164
1
0.001
0.0026
0.0029
0.0047
2
0.002
0.0043
0.0048
0.0063
10
0.010
0.0152
0.0162
0.0169
1
0.0001
0.0003
0.0003
0.0005
2
0.0002
0.0004
0.0005
0.0006
10
0.0010
0.0015
0.0016
0.0017
1,000
10,000
57
National U.S. Y-STR Database
Y-STR Database Goals
• To compile and consolidate Y-STR data from all
available ‘legitimate’ sources
• To create a Y-STR Consortium comprised of stake
holders and data contributors from the forensic
community
• Expand data
– Type additional samples using core loci
• Provide custodial and managerial responsibility
• Develop quality indicators for data inclusion and
submission
– ‘Proficiency testing’ for labs who wish to contribute data
– Screen data and remove duplicate & related samples
• Ensure allele-call consistency among different primer
sets
• Provide accessibility and statistical data to the forensic
community via the Internet
Y-STR Consortium Members
•
NCFS
– Jack Ballantyne
– Lyn Fatolitis
•
•
University of Arizona
– Mike Hammer
•
NIST
– John Butler
•
•
MN Dept of Public Health
– Ann Marie Gross
University of North Texas
– Arthur Eisenberg
•
FBI
– Bruce Budowle
NYC OCME
– Mecki Prinz
Applied Biosystems
– Lisa Calandro
•
•
Promega
– Curtis Knox
•
ReliaGene
– Sudhir Sinha
•
Orchid Cellmark
– Cassie Johnson
•
NIJ
– John Paul Jones
The Y-STR Consortium was formed at the 2006 AAFS Meeting in Seattle,
WA to assist in sample consolidation and the design and development of the
database.
Y-STR Interpretation Guidelines
Scientific Working Group on DNA Analysis Methods (SWGDAM)
FSC January 2009
5. Statistical Interpretation
5.1. Y-STR loci are located on the nonrecombining part of the Ychromosome and, therefore, should be considered linked as a single
locus. A Y-STR database must consist of haplotype frequencies
rather than allele frequencies. The source of the population
database(s) used should be documented. Relevant population(s) for
which the frequency will be estimated should be identified. A
consolidated U.S. Y-STR database (http://usystrdatabase.org) has
been established and should be used for population frequency
estimation. A number of other Y-STR haplotype frequency databases
exist online. (See available listing on the NIST [National Institute of
Standards and Technology] STRBase Web site at
http://www.cstl.nist.gov/biotech/strbase/y_strs.htm.)
Database Home Page: www.usystrdatabase.org
62
Release 2.6 – Current Version
 Was made available on January 3, 2012
 Comprised of 18,719 haplotypes
– An additional 61 Yfiler haplotypes were uploaded
• 35 Caucasian
• 3 African American
• 7 Asian
• 16 Hispanic
– San Diego Sheriff’s Crime Laboratory (39 samples)
– Santa Clara County Crime Laboratory (22 samples)
63





6301 African American (All, Undefined, or Select by State)
1008 Asian (All, Asian, Chinese, Filipino, Oriental, S. Indian, Vietnamese, Jordanian)
6998 Caucasian (All, US, Canada, Europe, Undefined)
3429 Hispanic (All, Undefined, or Select by State)
983 Native American (All, Apache, Navajo, Shoshone, Sioux)
64
NCFS -2,440
ReliaGene -3,037
Promega -3,800
Applied Biosystems - 6,159
University of Arizona -2,462
The Illinois State Police - 398
Orange County CA Coroner - 30 samples.
Santa Clara County -143
CA DOJ Sacramento - 32
Marshall University -113
WSP Vancouver - 40
Richland County Sheriff SC-7
San Diego Sheriff - 39
65



Release 2.6 contains 18,719 samples with a complete 11-marker SWGDAM core haplotype
15,395 samples have a complete 12-marker PowerPlex Y haplotype
8,548 samples have a complete 17-marker Yfiler haplotype
66


Release 2.6 contains 15,395 complete PowerPlex Y (12-locus) haplotypes. Of these, 6194 haplotypes
are unique (i.e., seen only once in the database) while 9201 haplotypes are seen more than once,
giving a DP of 40.2%
The Database contains 8548 complete Y-filer (17-locus) haplotypes. Of these, 6934 haplotypes are
unique while 1614 haplotypes are seen more than once, giving a DP of 81.1%.
67
 Since the release of the US Y-STR Database in January 2008, over 86,200
database search queries have been performed. Up until January of 2011,
the database had an average of approximately 800 searches per month.
Between January and August 2011 the average use increased to >8,000
searches per month. Since this time, the average is over 1200 searches
per month.
68
69
Court Support
Frye Hearings
 State of California v Miszkewycz (county of Placer)
– “..contends that the statistical analysis applied in this case is faulty because
of an unreliable or unknown data base used to formulate the statistics”
– “Court is satisfied that accepted scientific procedures and principals (sic)
were properly used in this case” (After “Ms Caser’s” testimony)
 State of Kansas v Gonzalez
– Judge Pokorny, Seventh Judicial District for the District of Kansas, Douglas
County, KS
– Challenge that over a period of six months the frequency of the
evidence/suspect haplotype changed from 1/2717 to 1 in 1786
– 4 October 2010: found that Y-STR database is fit for purpose and motion to
deny/exclude Y-STR haplotype evidence denied
Peer Reviewed Journal Article
 US forensic Y-chromosome short tandem repeats database.
Ge,Budowle, Planz, Eisenberg, Ballantyne, Chakraborty. Legal
Medicine 12 289-295 (2010)
70
Database Expansion / Data Solicitation
 Created Sample Submission section on database website
– Created quality control competency testing procedure using liquid blood
samples donated by UNT for this purpose
– Certificate of Participation is issued to qualifying laboratories
– Sample submission template and information available on website
 Solicitation for data was posted on our behalf by Lynne Burley of Santa
Clara County Crime Lab on the Yahoo group, forens-DNA, a technical
discussion group of forensic DNA technology
 Updated U.S. Department of Justice and NCFS websites to solicit for
samples and / or data
 Routinely make appeals for samples and data at all meetings,
presentations, workshops, etc.
71
Certificate No.: 00010
DuPage County Forensic Science Center
501 North County Farm Road
Wheaton, IL 60187
USA
The National Center for Forensic Science
University of Central Florida
12354 Research Parkway, Suite 225
Orlando, FL 32816-2367
USA
US Y-STR Database
Certificate of Participation
DuPage County
Forensic Science Center
has participated in the Y-STR Haplotyping
Quality Assurance Exercise
The alleles at all loci tested have been typed correctly
according to the published nomenclature and the
ISFG guidelines for Y-STR Analysis
(Int J Legal Med 114 (2001) 305-309)
Granted: March 30, 2010
Jack Ballantyne, Ph.D.
Associate Director (Research)
Lyn Fatolitis
US Y-STR Database Manager
72
QC Participants
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
IL State Police Crime Laboratory
Jan Bashinski DNA Crime Laboratory, CA Department of Justice
Orange County CA Sheriff – Coroner, Forensic Science Services
State of Connecticut Forensic Science Services Laboratory
County of Santa Clara CA Crime Laboratory, Office of the DA
CA Department of Justice, Sacramento Crime Laboratory
WA State Patrol Crime Laboratory – Vancouver
Marshall University Forensic Science Center
AZ Department of Public Safety Central Regional Crime Laboratory
DuPage County IL Forensic Science Center
Richland County Sheriff’s DNA/Trace Department – SC
Miami-Dade Police Department Forensic Services Bureau
Michigan State Police Biology Unit – Lansing
San Diego Sheriff’s Regional Crime Laboratory
FBI Lab
UNT
73
Users’ Feedback – Database Improvements
 Several changes and improvements have been made based upon
recommendations and suggestions from users in the field
– Followed suggestions from the Santa Clara County DA Crime
Laboratory and the Centre of Forensic Sciences in Ontario to alter
some of the verbiage in the displayed results
– Added a “News” section to the database homepage to allow for
announcements and updates to keep users informed
– Created and validated an automatic haplotype upload interface,
allowing users to simultaneously upload multiple haplotypes directly
from Genotyper® and GeneMapper® text files for database
searches, modeled after Applied Biosystems’ Yfiler Database
interface
– Adjusted the > (greater than) and < (less than) queries of the
database. Rather than returning just exact matches, the query now
returns all alleles greater than or less than the entry and calculates
these haplotypes into the statistic statements.
74
Court Support
Frye Hearings
 State of California v Miszkewycz (county of Placer)
– “..contends that the statistical analysis applied in this case is faulty because
of an unreliable or unknown data base used to formulate the statistics”
– “Court is satisfied that accepted scientific procedures and principals (sic)
were properly used in this case” (After “Ms Caser’s” testimony)
 State of Kansas v Gonzalez
– Judge Pokorny, Seventh Judicial District for the District of Kansas, Douglas
County, KS
– Challenge that over a period of six months the frequency of the
evidence/suspect haplotype changed from 1/2717 to 1 in 1786
– 4 October 2010: found that Y-STR database is fit for purpose and motion to
deny/exclude Y-STR haplotype evidence denied
Peer Reviewed Journal Article
 US forensic Y-chromosome short tandem repeats database.
Ge,Budowle, Planz, Eisenberg, Ballantyne, Chakraborty. Legal
Medicine 12 289-295 (2010)
75
Release 2.7 – Next Version
•
•
•
•
FBI: 1691
Marshall: 249
SanDiego: 15
UNT: 951 (+ Y23)
• N increased by 2906
• NIST: 642 new Y23 markers
Future Goals
 To continue to solicit data and / or samples from
forensic laboratories
– to expand continuously the number of individuals (N) for
each ancestral group and geographical location
– plan updates approximately every 6 months if samples are
available
 To continuously incorporate the suggestions and
recommendations received from users
– to improve the design and functionality of the database to
better serve the needs of the forensic community
77
Population Substructure Issues
• Correction for population structure may be
necessary although depends upon no. of loci typed

Effective population size ¼ of autosomal loci
Substructure effects less in US than ancestral
populations

Use when reference database considered not
representative

Haplotype Frequencies
• Sample frequency ( p~A )
• Frequency in the sub-population ( pA* )
• Frequency over all sub-populations (pA)
Autosomal Markers
• With random sampling of individuals
– And if the population in question is in HardyWeinberg equilibrium
• The number of A alleles in a sample of n
genotypes has a binomial distribution with
parameters 2n and p*A, so that the variance of
~
pA is:
Autosomal markers
• Population genetic theories aimed at sensible match
probabilities
– Must address another level of sampling:
• That inherent to the underlying evolutionary process
•
For a very wide class of evolutionary models, the
variance of the actual allele frequency p*A over
populations is
~
q = Fst
Autosomal Markers
• If HW equilibrium within the population of interest
– Probability that a randomly drawn profile is AA
P*AA = (p*A)2
– Expected value over populations =
– The ‘match probability’ that two randomly and
independently drawn individuals in the same population
of interest are both AA is (p*A)4
• Probability that an unknown person has AA given that
one instance of AA has been seen in the population
Pr(AA AA) = Pr (AA, AA)/Pr(AA) = (p*A)2
• Match probability is the same as the profile probability
in this case
Matching with lineage markers
• Y chromosome haplotype (type A)
– Match and profile probabilities for a specific subpopulation are both equal to p*A
• Which is the profile frequency in that sub-population
• With population structure
– Match and profile probabilities are no longer the same
– Expected value of (p*A)2 is needed
– the match probability
*Note: match probability within a particular sub-population is
also greater than the haplotype frequency in the whole
population since q + (1 – q)pA >pA
Pr (A, B) = Pr(A B) x Pr (B)
Pr (A,A) = Pr(A A) x Pr(A)
Pr (A A) =
Pr(A,A)
Pr(A)
= pA2 + pA(1-pA)q
pA
= pA + (1-pA)q

pA + q – pAq = q + pA(1-q)
Bayesian (Frequency Surveying) Approach
• Beta posterior distribution when a sample of size
n contains x copies of A
• Expected value of this distribution
~ ]/[(1-q)/n+q]
– [(1-q)pA/n + qp
A
~
• Weighted mean of pA and pA
– with decreasing weight on the prior as the sample
size increases
• Bayesian analog of a confidence interval is the
credible interval based on 100(1-a) % of the
posterior distribution
Estimating q
• Autosomal markers
– Usual to assign a value in the range 0.01 – 0.05
• Non-Bayesian framework
– Necessary to have data from at least two subpopulations in order to estimate q
– Because q describes the normalized variance of allele
frequencies over sub-populations
• If a sample was available from the relevant subpopulation
– Sample haplotype frequency from those data could be
used directly for calculating match probabilities without
having to invoke q
Estimating q cont’d
• Otherwise, any estimation of q would have to use
data from a set of samples from sub-populations
that were considered relevant
• Average haplotype frequency over these samples
could be used as an estimate of pA
– Along with q estimated from the between-subpopulation variation of the ~
pA values.
Ewen’s sampling theory
• Number, k, of distinct alleles (or haplotypes) in a
sample is sufficient for estimating parameter y
– where 1/(1+y) is the probability that two alleles
(haplotypes) drawn randomly from a population are the
same
• Latter probability is also a definition of q
– For haploid population of size N with an infinite-alleles
mutation rate of m
y = 2Nm so that q = 1/(1+2Nm)
• Note:
q decreases as the mutation rate increases
haplotype mutation rates increase as the number of
constituent markers increase
Ewen’s sampling theory
• Maximum likelihood estimate of y (or q)
– Found by setting the observed value of k equal to its
expected value
– Numerical methods are needed to solve this equation
for q
– Note:
• Frequencies of particular alleles are not used here
• If every haolotype in a sample is unique
– k=n
– Provides an estimate of zero for q corresponding to
(indefinitely) large mutation rates
Brenner’s approach
k
• Proportion of haplotypes seen only once in a
sample of size (n-1) augmented by the crimescene haplotype
• Approximated the match probability between an
innocent suspect and a previously unseen trace
haplotype as (1-k)/n
• In most cases, will lead to a likelihood ratio of
n/(1-k)
– rather than n that would result if the match probability
was set to 1/n
• 1/(1-k) referred to as “inflation factor”
Brenner’s approach
• If trace haplotype had been seen (x-1) times in
a database of size (n-1)
– Likelihood ratio modified to n/[x(1-k)]
• Note:
– If all haplotypes are seen only once in a database
• k=1
• Match probability is 0
– Same quantification of evidential strength is
applied to all types of a lineage marker
• Expected value of k is (from Ewens and
Brenner)
Combination of lineage and autosomal
markers
• It can be done
• e.g. Walsh, Redd and Kammer. Joint match
probabilities for y chromosomal and autosomal
markers. For Sci Int 174 (2008) 234-238
• SWGDAM, Budowle, Weir and Buckleton
Another Approach: Estimate of q
q=
~
q=
1
1+4Nm
1
1+4Nmm
Autosomal
N = effective population size (~105)
m = mutation rate
Y-chromosome
N = effective population size
m = number of loci
m = mutation rate
Statistics and genetic sampling?
‘among
subpopulation’
component
‘within
subpopulation’
component
g
2 person Mixed Y-STR profile
a
b
c
d
e
f
h
i
No. of possible haplotypes = 2n, where n = no of (non 385) loci
exhibiting two alleles (as opposed to one) = 24 = 16
No of haplotypes = 2n x (k(k + 1)/2) where k = no of 385 peaks
a database search (N = 5000) reveals that 9 of the 16 haplotypes
have been observed at least once:
acegh
1
bcegh
0
DNA Mixture Interpretation
acegi
1
bcegi
2
acfgh
0
bcfgh
1
acfgi
0
bcfgi
1
adegh
0
bdegh
0
Presentation Title
adegi
0
bdegi
0
adfgh
1
bdfgh
1
adfgi
1
bdfgi
1
98
LR
V = acegh
S = bdfgi = 0.0034
RMNE/PE
RMNE/PE:
1. Using only haplotypes OBSERVED in the database
Count how many times each of the possible haplotypes are
observed in the database, sum them, determine an overall
frequency and that constitutes the PI/RMNE.
Thus 10/5000 = 1/500 = 0.002 which, with the binomial 95% CI, is =
0.00338 (1/296)
Then 1-PI = PE = 1-0.0038 = 0.9962 (ie 99.6 % can be excluded)
2. Using all haplotypes whether or not they have been observed in
the database
However, what about the 7 haplotypes that could be
components of the mixture but whose presence has not been
accounted for?
DNA Mixture Interpretation
Presentation Title
99
•
The 7 haplotypes could each occur with a
frequency of 3/N
−
•
The PI to take into account all possible
contributors to the mixture is 0.00338 + 0.0042 =
0.00758 (1/132)
−
•
Thus 7 x 3/N = 21/N = 21/5000 = 0.0042
PE = 1- 0.00758 = 0.99242 = 99.2%
PE = 99.6% (observed possible haplotypes)
versus 99.2% (all possible haplotypes)
Hp
= Prosecution hypothesis
= Mixture comprises (1 known + 1 unknown) OR (2
known) individuals
In this case assume = victim + suspect DNA comprises the
mixture, thus Pr (E|Hp) = 1
Hd
= Defense hypothesis
= the mixture comprises DNA from 2 random
unrelated males
haplotype
alleles
count
Pr (Hi) (with binomial sampling
correction)
H1 = victim
acegh
1
0.0034
H2
acegi
1
0.0034
H3
acfgh
0
0.0006
H4
acfgi
0
0.0006
H5
adegh
0
0.0006
H6
adegi
0
0.0006
H7
adfgh
1
0.0034
H8
adfgi
1
0.0034
H9
bcegh
0
0.0006
H10
bcegi
2
0.0068
H11
bcfgh
1
0.0034
H12
bcfgi
1
0.0034
H13
bdegh
0
0.0006
H14
bdegi
0
0.0006
H15
bdfgh
1
0.0034
H16 = suspect
bdfgi
1
0.0034
There are sixteen combinations of haplotypes that can
produce the evidence haplotype (e.g. H1 + H16)
Combinations of haplotypes that could
comprise the 2 person mixture
(1
• (2
• (3
• (4
• (5
• (6
• (7
• (8
•
+ 16) = (16 + 1)
+ 15) = (15 + 2)
+ 14) = (14 + 3)
+ 13) = (13 + 4)
+ 12) = (12 + 5)
+ 11) = (11 + 6)
+ 10) = (10 + 7)
+ 9) = (9 + 8)
Pr = 0.0034 x 0.0034 = 0.00001156
Pr = 0.0034 x 0.0034 = 0.00001156
Pr = 0.0006 x 0.0006 = 0.00000036
Pr = 0.0006 x 0.0006 = 0.00000036
Pr = 0.0006 x 0.0034 = 0.00000204
Pr = 0.0006 x 0.0034 = 0.00000204
Pr = 0.0034 x 0.0068 = 0.00002312
Pr = 0.0034 x 0.0006 = 0.00000204
∑ = 0.00005308
LR Calculation:
= 1/(2 x 0.00005308)
= 1/0.00010616
= 9419
Thus the DNA profiling results were 9419 times
more likely if the mixture comprised DNA from the
victim and the suspect than if it came from two
random unrelated individuals
-is this true? (random individuals chosen
from the database or might be expected to be
present in the database?)
RMNE versus LR and Unresolved Questions
•
•
•
•
RMNE = 1 in 296 or 1 in 132 (taking into account
unobserved haplotypes) males are included as potential
donors to the mixture
LR = DNA results are 9400 times more likely if the suspect
is admixed with the victim than if DNA from two random
males’ DNA is present (cf LR of 1/0.0034 = 294 for the
suspects haplotype if single source)
Population substructure correction added instead of or in
addition to, binomial sampling correction?
Denominator of LR
− Instead of haplotype frequencies some suggest use
frequency of pair wise haplotypes from database that
can explain the mixture versus all other pairs of
haplotypes that are possible
•
•
(8)/(1/2 x 5000)(4999) = 8/12497500 = 0.00000064 (without
binomial correction)
LR = 1/0.00000064 ≈ 1,562,000
•
•
Y-STR Mixture Tools were added to the database website on
June 20, 2010
Users can select the tool that best suits their needs and the tool
opens in a new window.
 Y-STR Mixture Tools were added to the database website on June
20, 2010
 Users can select the tool that best suits their needs and the tool
opens in a new window.
107
CA DOJ Y-STR Mixture Tool
Harris County MEO Y-STR Mixture Tool
Summary: Statistical Issues
• correct for population substructure (q)?
• how to measure q?
• correct for both genetic and statistical sampling?
• do not take into account the specific profile but
use purely statistical approach (Brenner)?
• mixtures?
We shall never
cease from
exploration
And the end of
all our exploring
Will be to arrive
where we
started
And know the
place for the
first time.
T. S. Eliot
Thank you
for your
attention!