Variant Finding
UCD Genome Center Bioinformatics Core
Tuesday 15 September 2015

Types of Variants
Adapted from Alkan et al., Nature Reviews Genetics 2011

UC Davis Genome Center | Bioinformatics Core | M Britton Variants 2015-09-15

Why Look For Variants?
● Genotyping
● Correlation with Traits
● Breeding (Agriculture)
● Disease Susceptibility
● Disease Progression
● Population Structure
● Identification of changes to protein sequences

Variant Calling Tools
A few of the many SNP/indel calling tools include:
GATK (www.broadinstitute.org/gatk/)
● A suite of tools including a local realigner, a quality score recalibrator, and a SNP/indel caller
Samtools (www.htslib.org)
● For working with SNPs and short indels
Freebayes (github.com/ekg/freebayes)
● Finds SNPs, indels, MNPs (multi-nucleotide polymorphisms), and complex events (composite insertions and substitutions)

Variant Calling Tools
Different software is needed for larger-scale variants, with fewer choices, including:
Breakdancer (github.com/genome/breakdancer)
● Predicts insertions, deletions, inversions, and inter- and intra-chromosomal translocations
Delly2 (github.com/tobiasrausch/delly)
● Discovers and genotypes deletions, tandem duplications, inversions, and translocations
● Includes visualization software Delly-maze and Delly-suave

A Comparison of Tools
Venn diagrams showing the number of identified variants for tested (A) germline, (B) somatic, (C) CNV, and (D) exome CNV tools.
Stephan Pabinger et al. Brief Bioinform 2014;15:256-278. © The Author 2013.

Variant Call Format (VCF)
The general specifications for most of today’s file formats are at github.com/samtools/hts-specs
The specs tend to be minimum requirements.
Different software tools can produce different file versions that may (or may not) completely follow the spec, and they often add tool-specific info. This can lead to compatibility issues between tools in a workflow.

Variant Call Format (VCF)
A good tutorial (with examples) can be found at faculty.washington.edu/browning/beagle/intro-to-vcf.html
http://vcftools.sourceforge.net/specs.html ... VCF poster

Variant Call Format (VCF)
##fileformat=VCFv4.1
##fileDate=20130825
##source=freeBayes v9.9.2-9-gfbf46fc-dirty
##reference=../results/8/8.fa
##phasing=none
##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --min-mapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --use-mapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam"
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB1 26 . TGTTACGCG GCTTTTGC,TGTTTCTAC 27.2619 . AO=1,2;RO=0;TYPE=complex,complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0
8_PB1 38 . TCA ACG,TA,AGA 0.0495692 . AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28
8_PB1 42 . G A 3.94171e-14 . AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA:GL

Variant Call Format (VCF)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0

CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

Variant Call Format (VCF)
Each value in the sample column is read against the FORMAT key in the same position and that key's header definition. For example, DP = 170 in the record above is defined by:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

Variant Call Format (VCF)
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">

Variant Call Format (VCF)
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

Variant Call Format (VCF)
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">

Variant Effect Prediction Tools
snpEff (snpeff.sourceforge.net/)
Variant Effect Predictor - EMBL (www.ensembl.org/info/docs/tools/vep/)
SIFT (sift.jcvi.org)

VCF after Effect Prediction
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8
8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1) GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0

CHROM = 8_PB2
POS = 407
ID = .
REF = A
ALT = G
QUAL = 3935.83
FILTER = .
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)
FORMAT = GT:DP:RO:QR:AO:QA:GL
8 = 1:170:21:788:149:5579:-5,0

VCF after Effect Prediction
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )' ">
INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1)

Why Duplicates Are Bad
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-1-Map_and_Dedup.pdf

The Need for Indel Realignment
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf

Information Used For Indel Realignment
Known sites (dbSNP, 1000 Genomes)
Indels present in original alignments (in CIGARs)
Sites where evidence suggests a hidden indel

After Local Realignment - One Indel Remains
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-2-Realignment.pdf

Base Quality Score Recalibration
● Critical for downstream analysis
● Scores assigned by sequencers are inaccurate and biased
● Recalibration information is obtained by analyzing covariation among several features of a base, including:
○ Reported quality score
○ Position within the read (machine cycle)
○ Preceding and current nucleotide (sequencing chemistry effect)
● Known variants are used to discount most of the real genetic variation present in the sample
● All other differences from the reference are assumed to be sequencing errors
● Indel realignment, run first, reduces noise from misalignments

Base Quality Score Recalibration
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-3-Base_recalibration.pdf

Read Compression
Discard redundant information
Only keep the essential information for variant calling
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-4-Compression.pdf

Read Compression: Full vs. Reduced BAM
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-4-Compression.pdf

Haplotype Caller - Initial Variant Calling
● Calls SNPs, indels, and some structural variants simultaneously by performing a local de novo assembly
● Distinguishes genetic variants from random machine noise
● Uses “active regions” for variant calling, based on significant evidence for variation
● Determines likelihoods of the haplotypes given the read data
● Assigns sample genotypes based on Bayesian likelihoods

Variant Quality Score Recalibration (VQSR)
● Initial variant calling produces a very large call set that contains many false positives
● Hand-tuned filtering requires time and expertise
● A statistical model can be used instead to recalibrate variants
● Each variant has a set of statistics associated with it, called variant annotations
● Real variants tend to cluster together via these statistics
● SNPs and indels must be recalibrated separately
● Training resources:
○ SNP (HapMap, Omni, 1000G, dbSNP)
○ INDEL (Mills)

Variant Quality Score Recalibration (VQSR)
https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-6-Variant_recalibration.pdf

Variant Filtering
● Also called hard filtering
● Based on some criteria relevant to your research
● Useful for:
○ Small data sets, in terms of either number of samples or size of targeted regions
○ Cases where no database of high-confidence known variants is available
● Example:
For SNPs:
QD < 2.0
MQ < 40.0
FS > 60.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
For indels:
QD < 2.0
ReadPosRankSum < -20.0
InbreedingCoeff < -0.8
FS > 200.0

Genotype gVCF Files
● VCF files with information for every position in the genome, regardless of variant calls
● Used by GATK to perform variant discovery in a way that enables joint analysis of multiple samples but is decoupled from the initial per-sample variant calling step, i.e. you don't have to call variants on all your samples together to perform a joint analysis
● Drastically reduces run time and allows for easy incorporation of additional samples into the pipeline
● Part of GATK 3.0, but NOT in our Galaxy AMI because wrappers have not been written yet
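
The example thresholds on the Variant Filtering slide can be written as a small predicate. What follows is a minimal Python sketch, not GATK's implementation. The function name `fails_hard_filter` and the dictionary-of-annotations input are hypothetical, chosen only for illustration. It assumes that tripping any single threshold flags the variant, and that an annotation missing from a record is skipped rather than treated as a failure.

```python
import operator

# Thresholds copied from the Variant Filtering slide.
# Assumption: a variant is flagged when ANY listed annotation crosses its threshold.
SNP_FILTERS = {
    "QD": (operator.lt, 2.0),
    "MQ": (operator.lt, 40.0),
    "FS": (operator.gt, 60.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}
INDEL_FILTERS = {
    "QD": (operator.lt, 2.0),
    "ReadPosRankSum": (operator.lt, -20.0),
    "InbreedingCoeff": (operator.lt, -0.8),
    "FS": (operator.gt, 200.0),
}


def fails_hard_filter(annotations, variant_type):
    """Return the sorted names of annotations that trip a threshold.

    annotations  -- dict of annotation name -> numeric value (hypothetical input shape)
    variant_type -- "snp" selects the SNP thresholds; anything else selects the indel ones
    """
    filters = SNP_FILTERS if variant_type == "snp" else INDEL_FILTERS
    failed = []
    for name, (compare, threshold) in filters.items():
        value = annotations.get(name)
        # Annotations absent from the record are skipped (an assumption of this sketch).
        if value is not None and compare(value, threshold):
            failed.append(name)
    return sorted(failed)


# Made-up annotation values, for illustration only.
snp_failures = fails_hard_filter({"QD": 1.5, "MQ": 55.0, "FS": 10.0}, "snp")
indel_failures = fails_hard_filter({"QD": 10.0, "FS": 150.0}, "indel")
```

An empty list means the variant passes every threshold that could be evaluated. Note that the same FS value of 150.0 passes as an indel but would fail as a SNP, which is why the two variant types are kept in separate tables.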
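
The field-by-field breakdown of the 8_PB2 record shown on the Variant Call Format (VCF) slides can be reproduced with a few string splits. This is a minimal sketch using only the Python standard library. `parse_vcf_record` is a hypothetical helper, not part of any VCF library. It assumes whitespace-separated columns as they appear on the slide (real VCF files are tab-delimited) and keeps INFO and sample values as strings rather than converting them according to the header's Type declarations.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into its fixed columns, INFO keys, and per-sample values.

    Hypothetical helper for illustration; values are kept as strings.
    """
    fields = line.split()
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is a semicolon-separated list of KEY=VALUE pairs (flags have no value).
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value

    # Each sample column is colon-separated, in the same order as the FORMAT keys.
    format_keys = fmt.split(":")
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        samples[name] = dict(zip(format_keys, column.split(":")))

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": float(qual),
        "FILTER": filt,
        "INFO": info_dict,
        "samples": samples,
    }


# The record from the slide; "8" is the sample column name in that file.
record = parse_vcf_record(
    "8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp "
    "GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0",
    ["8"],
)
```

Looking up `record["samples"]["8"]["DP"]` then returns the read depth for sample 8, matching the DP definition in the header.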