Variant Finding - UC Davis Bioinformatics Core

Variant Finding
UCD Genome Center Bioinformatics Core
Tuesday 15 September 2015
Types of Variants
Adapted from Alkan et al, Nature Reviews Genetics 2011
UC Davis Genome Center | Bioinformatics Core | M Britton
Variants 2015-09-15
Why Look For Variants?
● Genotyping
● Correlation with Traits
● Breeding (Agriculture)
● Disease Susceptibility
● Disease Progression
● Population Structure
● Identification of changes to protein sequences
Variant Calling Tools
A few of the many SNP/Indel calling tools include:
GATK (www.broadinstitute.org/gatk/)
● A suite of tools including local realigner, quality score
recalibrator, and SNP/INDEL caller.
Samtools (www.htslib.org)
● For working with SNPs and short INDELs
Freebayes (github.com/ekg/freebayes)
● Finds SNPs, indels, MNPs (multi-nucleotide
polymorphisms), and complex events (composite
insertions and substitutions)
Variant Calling Tools
Different software is needed for larger scale
variants, with fewer choices, including:
Breakdancer (github.com/genome/breakdancer)
● predicts insertions, deletions, inversions, inter- and
intra-chromosomal translocations.
Delly2 (github.com/tobiasrausch/delly)
● discovers and genotypes deletions, tandem
duplications, inversions and translocations
● includes visualization software Delly-maze and Delly-suave
A Comparison of Tools
Figure: Venn diagrams showing the number of identified variants for tested (A) germline, (B) somatic, (C) CNV, and (D) exome CNV tools.
Stephan Pabinger et al. Brief Bioinform 2014;15:256-278
© The Author 2013.
Variant Call Format (VCF)
The general specifications for most of today’s file
formats are at github.com/samtools/hts-specs
The specs tend to be minimum requirements.
Different software tools can produce different file
versions that may (or may not) completely follow
the spec, and often add tool-specific info.
This can lead to compatibility issues between tools
in a workflow.
Variant Call Format (VCF)
A good tutorial (with examples) can be found at faculty.washington.edu/browning/beagle/intro-to-vcf.html
http://vcftools.sourceforge.net/specs.html ... VCF poster
Variant Call Format (VCF)
##fileformat=VCFv4.1
##fileDate=20130825
##source=freeBayes v9.9.2-9-gfbf46fc-dirty
##reference=../results/8/8.fa
##phasing=none
##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --min-mapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --use-mapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam"
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  8
8_PB1  26  .  TGTTACGCG  GCTTTTGC,TGTTTCTAC  27.2619  .  AO=1,2;RO=0;TYPE=complex,complex  GT:DP:RO:QR:AO:QA:GL  2:3:0:0:1,2:31,70:-4.46,-1.65,0
8_PB1  38  .  TCA  ACG,TA,AGA  0.0495692  .  AO=1,1,1;RO=3;TYPE=complex,del,mnp  GT:DP:RO:QR:AO:QA:GL  2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28
8_PB1  42  .  G  A  3.94171e-14  .  AO=8;RO=128;TYPE=snp  GT:DP:RO:QR:AO:QA:GL
Variant Call Format (VCF)
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  8
8_PB2  407  .  A  G  3935.83  .  AO=149;RO=21;TYPE=snp  GT:DP:RO:QR:AO:QA:GL  1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
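The field-by-field breakdown above can be reproduced in a few lines of Python. This is a minimal sketch rather than a full VCF parser: the function name parse_vcf_record is hypothetical, the record is assumed to be tab-separated as the VCF spec requires, and the alternate-allele fraction at the end is an illustration, not a value reported by freebayes.

```python
# Fixed column names from the VCF spec; sample columns follow FORMAT.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line, sample_names):
    """Split one tab-separated VCF data line into fixed fields and per-sample dicts."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:9]))

    # INFO is a semicolon-separated list of key=value pairs (flags have no value).
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    record["INFO"] = info

    # FORMAT names the colon-separated keys used in every sample column.
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return record


line = ("8_PB2\t407\t.\tA\tG\t3935.83\t.\tAO=149;RO=21;TYPE=snp\t"
        "GT:DP:RO:QR:AO:QA:GL\t1:170:21:788:149:5579:-5,0")
rec = parse_vcf_record(line, ["8"])
sample = rec["samples"]["8"]

# Fraction of reads supporting the alternate allele: AO / DP = 149 / 170.
alt_fraction = int(sample["AO"]) / int(sample["DP"])
print(rec["CHROM"], rec["POS"], sample["GT"], round(alt_fraction, 3))
```

All values are kept as strings on purpose; converting them needs the Type and Number declared in the header for each ID.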
Variant Call Format (VCF)
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  8
8_PB2  407  .  A  G  3935.83  .  AO=149;RO=21;TYPE=snp  GT:DP:RO:QR:AO:QA:GL  1:170:21:788:149:5579:-5,0
Each key in the FORMAT column is defined in the header. For example, the second sample value (DP = 170) is defined by:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
Variant Call Format (VCF)
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Variant Call Format (VCF)
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
Variant Call Format (VCF)
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
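The ##INFO and ##FORMAT header lines shown on the last few slides are what tell a program how to interpret each key in a record. The sketch below shows one way to read them into dictionaries. The function name parse_meta_line and the two regular expressions are hypothetical, and it assumes each definition sits on one physical line, as in a real VCF file.

```python
import re

# Matches the outer "##INFO=<...>" or "##FORMAT=<...>" wrapper.
META_PATTERN = re.compile(r'^##(INFO|FORMAT)=<(.*)>$')
# Matches key=value pairs; a quoted value may itself contain commas.
PAIR_PATTERN = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return a dict describing one ##INFO/##FORMAT line, or None for other lines."""
    match = META_PATTERN.match(line.strip())
    if match is None:
        return None
    kind, body = match.groups()
    entry = {"kind": kind}
    for key, value in PAIR_PATTERN.findall(body):
        entry[key] = value.strip('"')
    return entry


header = [
    '##fileformat=VCFv4.1',
    '##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, '
    'with partial observations recorded fractionally">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">',
]
definitions = {}
for text in header:
    entry = parse_meta_line(text)
    if entry is not None:
        definitions[(entry["kind"], entry["ID"])] = entry

print(definitions[("FORMAT", "DP")]["Description"])
```

Keying on (kind, ID) matters here because the same ID (RO, AO) is declared once under INFO and once under FORMAT.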
Variant Effect Prediction Tools
snpEff (snpeff.sourceforge.net/)
Variant Effect Predictor - EMBL (www.ensembl.org/info/docs/tools/vep/)
SIFT (sift.jcvi.org)
VCF after Effect Prediction
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  8
8_PB2  407  .  A  G  3935.83  .  AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)  GT:DP:RO:QR:AO:QA:GL  1:170:21:788:149:5579:-5,0
CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0
VCF after Effect Prediction
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )' ">
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)
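The EFF value packs several sub-fields into one string, in the order given by the EFF header description. A minimal sketch of unpacking it follows. The function name parse_eff is hypothetical, the field list is copied from the header shown here, and the optional ERRORS / WARNINGS entries are simply dropped in this sketch.

```python
import re

# Sub-field names, in the order listed in the ##INFO=<ID=EFF,...> description.
EFF_FIELDS = [
    "Effect_Impact", "Functional_Class", "Codon_Change", "Amino_Acid_change",
    "Amino_Acid_length", "Gene_Name", "Transcript_BioType", "Gene_Coding",
    "Transcript_ID", "Exon", "GenotypeNum",
]
# One annotation looks like EFFECT(field|field|...).
EFF_PATTERN = re.compile(r"^(\w+)\((.*)\)$")


def parse_eff(value):
    """Parse an EFF value (one or more comma-separated annotations) into dicts."""
    effects = []
    for entry in value.split(","):
        match = EFF_PATTERN.match(entry.strip())
        if match is None:
            continue  # skip anything that does not look like EFFECT(...)
        parsed = {"Effect": match.group(1)}
        # zip stops at the shorter list, so trailing optional fields are ignored.
        parsed.update(zip(EFF_FIELDS, match.group(2).split("|")))
        effects.append(parsed)
    return effects


eff = parse_eff("SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)")
first = eff[0]
print(first["Effect"], first["Effect_Impact"], first["Gene_Name"], first["Amino_Acid_change"])
```

For the record above this reads as a low-impact synonymous change (gaA to gaG, E123) in gene PB2; the empty Transcript_BioType field is preserved as an empty string.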
Why Duplicates Are Bad
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-1-Map_and_Dedup.pdf
The Need for Indel Realignment
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf
Information Used For Indel Realignment
Known sites (dbSNP, 1000 Genomes)
Indels present in original alignments (in CIGARs)
Sites where evidence suggests a hidden indel
After Local Realignment - One Indel Remains
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf
Base Quality Score Recalibration
● Critical for downstream analysis
● Scores assigned by sequencers are inaccurate and biased
● Recalibration information is obtained by analyzing covariation among several features of a base, including:
○ Reported quality score
○ Position within the read (machine cycle)
○ Preceding and current nucleotide (sequencing chemistry effect)
● Known variants are used to discount most of the real genetic variation present in the sample
○ All other differences from the reference are assumed to be sequencing errors
○ Indel realignment first reduces noise from misalignments
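The idea behind the bullets above can be sketched as a counting exercise: group bases by their covariates, skip known variant sites, treat every remaining mismatch as an error, and convert the observed error rate to a Phred score with Q = -10 log10(p). This is a simplified illustration, not GATK's implementation. The function names, the tuple layout of an observation, and the (errors + 1) / (total + 2) smoothing are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)


def empirical_qualities(observations, known_sites):
    """Estimate an empirical quality per (reported_q, cycle, context) bin.

    observations: iterable of (position, reported_q, cycle, context, is_mismatch)
    known_sites:  set of positions of known variants, which are skipped so that
                  real variation is not counted as sequencing error
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for position, reported_q, cycle, context, is_mismatch in observations:
        if position in known_sites:
            continue
        key = (reported_q, cycle, context)
        counts[key][0] += int(is_mismatch)
        counts[key][1] += 1
    # Assumed smoothing so that bins with zero mismatches do not give infinite quality.
    return {key: phred((mm + 1) / (total + 2)) for key, (mm, total) in counts.items()}


# Made-up toy data: 998 bases reported as Q30 at cycle 5 after context "AC",
# 9 of which mismatch the reference, plus one mismatch at a known variant site.
observations = [(1000 + i, 30, 5, "AC", i < 9) for i in range(998)]
observations.append((42, 30, 5, "AC", True))
table = empirical_qualities(observations, known_sites={42})
print(round(table[(30, 5, "AC")], 1))
```

In this toy bin the sequencer reported Q30, but the observed error rate of about 1 in 100 corresponds to roughly Q20, which is the kind of discrepancy recalibration is meant to correct.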
Base Quality Score Recalibration
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-3-Base_recalibration.pdf
Read Compression
Discard redundant information
Only keep the essential information for variant
calling
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-4-Compression.pdf
Read Compression: Full vs. Reduced BAM
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-4-Compression.pdf
Haplotype Caller - Initial Variant Calling
● Calls SNPs, indels, and some structural variants simultaneously by performing local de novo assembly
● Distinguishes genetic variants from random machine noise
● Uses “active regions” for variant calling, chosen where there is significant evidence for variation
● Determines likelihoods of the haplotypes given the read data
● Assigns sample genotypes based on Bayesian likelihoods
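The last step above, going from likelihoods to a genotype call, can be illustrated with Bayes' rule. The sketch below is not HaplotypeCaller's actual model: the function name is hypothetical, a flat prior is assumed unless one is supplied, and the input values are borrowed from the GL field (-5,0) of the haploid freebayes record shown earlier in these slides.

```python
def genotype_posteriors(log10_likelihoods, priors=None):
    """Turn log10 genotype likelihoods into posterior probabilities.

    posterior(g) is proportional to P(data | g) * prior(g); a flat prior is
    assumed when none is given.
    """
    n = len(log10_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    # Subtract the maximum before exponentiating to avoid numeric underflow.
    best = max(log10_likelihoods)
    weights = [10 ** (gl - best) * p for gl, p in zip(log10_likelihoods, priors)]
    total = sum(weights)
    return [w / total for w in weights]


# GL = -5,0 for a haploid sample: genotype 0 (reference) and genotype 1 (alternate).
posteriors = genotype_posteriors([-5.0, 0.0])
called = posteriors.index(max(posteriors))
print(called, [round(p, 6) for p in posteriors])
```

With these values the alternate genotype takes nearly all of the posterior mass, which agrees with GT = 1 in that record.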
Variant Quality Score Recalibration (VQSR)
● An alternative to hard filtering
● Initial variant calling yields a very large call set that contains many false positives
● Hand-tuned (hard) filtering requires time and expertise
● A statistical model can be used to recalibrate variant quality instead
● Each variant has a set of associated statistics, called variant annotations
● Real variants tend to cluster together in the space of these annotations
● SNPs and indels must be recalibrated separately
● Training resources:
○ SNP (HapMap, Omni, 1000G, dbSNP)
○ INDEL (Mills)
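The clustering idea can be shown with a deliberately small model. GATK's VQSR fits Gaussian mixture models over several annotations; the sketch below instead fits a single one-dimensional Gaussian to a "trusted" set and another to a "likely bad" set, then scores a new variant by the log10 ratio of the two densities. All annotation values here are made up for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def log10_density(value, dist):
    """log10 of the normal density, computed in log space to avoid underflow."""
    z = (value - dist.mean) / dist.stdev
    natural_log = -0.5 * z * z - math.log(dist.stdev * math.sqrt(2.0 * math.pi))
    return natural_log / math.log(10.0)


def log_odds(value, good_model, bad_model):
    """Positive when the value looks more like the trusted variants."""
    return log10_density(value, good_model) - log10_density(value, bad_model)


# Hypothetical values of one annotation at trusted training sites and at
# sites presumed to be artifacts.
good_model = NormalDist.from_samples([18.0, 20.5, 22.0, 19.5, 21.0])
bad_model = NormalDist.from_samples([1.0, 2.5, 0.5, 3.0, 1.5])

for candidate in (20.0, 2.0):
    print(candidate, log_odds(candidate, good_model, bad_model) > 0)
```

A variant whose annotations fall inside the trusted cluster gets a positive score and one near the artifact cluster gets a negative score; a cutoff on that score then replaces the hand-tuned thresholds.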
Variant Quality Score Recalibration (VQSR)
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-6-Variant_recalibration.pdf
Variant Filtering
● Based on some criteria relevant to your research
● Useful for:
○ Small data sets, in terms of either number of samples or size of targeted regions
○ Cases where no database of high-confidence known variants is available
● Example:
For SNPs:
QD < 2.0
MQ < 40.0
FS > 60.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
For indels:
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0
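The example thresholds above translate directly into a lookup of per-annotation tests. This is a minimal sketch with hypothetical names (SNP_FILTERS, INDEL_FILTERS, failed_filters). It assumes annotations have already been parsed into a dict of floats, and it treats a missing annotation as "not failed", which is a choice made for this sketch.

```python
# Thresholds copied from the example above. A variant FAILS when the test is true.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,
    "MQ": lambda v: v < 40.0,
    "FS": lambda v: v > 60.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}
INDEL_FILTERS = {
    "QD": lambda v: v < 2.0,
    "ReadPosRankSum": lambda v: v < -20.0,
    "InbreedingCoeff": lambda v: v < -0.8,
    "FS": lambda v: v > 200.0,
}


def failed_filters(annotations, filters):
    """Return the names of the filters this variant fails (empty list = pass)."""
    return [
        name
        for name, fails in filters.items()
        if name in annotations and fails(annotations[name])
    ]


# Made-up annotation values for one SNP and one indel.
snp = {"QD": 1.5, "MQ": 58.0, "FS": 3.2, "MQRankSum": 0.4}
indel = {"QD": 10.0, "FS": 250.0}
print(failed_filters(snp, SNP_FILTERS))
print(failed_filters(indel, INDEL_FILTERS))
```

Returning the list of failed filter names, rather than a single pass/fail flag, mirrors how the VCF FILTER column records which filters a site did not pass.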
Genotype gVCF Files
● VCF files with information for every position in the
genome regardless of variant calls
● Used by GATK to perform variant discovery in a way
that enables joint analysis of multiple samples, but
decoupled from the initial individual variant calling step.
I.e. you don't have to call variants on all your samples
together to perform a joint analysis
● Drastically reduces run time and allows for easy
incorporation of additional samples into the pipeline
● Part of GATK 3.0, but NOT in our Galaxy AMI
because wrappers have not been written yet