Detecting nucleotide differences and annotating SNP calls

Variant Detection Using Partek Genomics Suite™ 6.6
Overview
Partek Genomics Suite™ software can detect variations at single-base resolution in next
generation sequencing data. The single nucleotide variation (SNV) detection tools are
available for all of Partek Genomics Suite’s next generation sequencing (NGS) workflows,
including RNA-Seq, ChIP-Seq, and DNA-seq.
This tutorial will illustrate how to:
• Import NGS data sets
• Perform QA/QC
• Detect SNVs
• Filter the detected SNVs
• Create a custom annotation database
• Annotate detected SNVs with known SNPs
• Annotate detected SNVs with functional effects
• Visualize the SNVs
Note: The workflow described below is enabled in Partek Genomics Suite software version
6.6. Please contact the licensing team at [email protected] to request this version. The
screenshots shown below may vary across platforms and across different versions of Partek
Genomics Suite software.
Description of the Tutorial Data
In this tutorial, you will perform SNV analysis of an exome sequencing dataset using
Partek Genomics Suite’s DNA-Seq workflow. The data used in this tutorial was downloaded
from the NCBI SRA website (SRA Study ID: SRP007386). It comes from the first
exome sequencing study aimed at understanding the genetic characteristics of
well-differentiated papillary mesothelioma of the peritoneum (WDPMP). That study includes
three samples: WDPMP tissue and blood from the same patient, and the WDPMP-derived cell line.
Two samples were chosen for this tutorial: WDPMP tissue (SRR305173) and peripheral blood
(SRR305174) harvested from the same patient. The data was downloaded from the
NCBI SRA website in .sra format and subsequently converted to .fastq files. The reads were
then aligned to hg19 using Partek® Flow™ with BWA as the aligner (default
settings). For the purpose of this tutorial, we have extracted only the reads aligned to
chromosome 20, but the principles discussed here apply to the whole genome
as well.
Partek User’s Guide: Variant Detection
Note: Partek Genomics Suite software can import only aligned .bam/.sam files. If you have
data in .fasta or .fastq format, you need to align your files first. You can contact us
([email protected]) for further information.
Importing NGS Data Sets
To start variant detection, use the workflow selector to invoke the DNA-Seq workflow, which
provides a step-by-step guide through the analysis (Figure 1).
Figure 1: Overview of the DNA-Seq workflow in Partek Genomics Suite
To import the aligned reads in .bam/.sam format, select Import and manage samples
under the Import section of the workflow (Figure 2).
Figure 2: Sequence Import dialog: browse to the files you would like to import
1. Files of type: select BAM Files (*.bam) from the drop-down list.
2. Browse… to the folder where you have stored the .bam files. Select the files to
import by checking the box to the left of the data files. For this tutorial, select
chr20-SRR305173.bam and chr20-SRR305174.bam.
3. Select OK and the next Sequence Import window will open as shown in Figure 3.
Figure 3: Sequence Import dialog: selecting Output file and directory, Species, and the
genome build (Genome/Transcriptome reference used to align the reads)
4. Configure the Sequence Import dialog as follows:
a. Output File – provide a name for the output spreadsheet file; the default is to
name the spreadsheet after the data being imported. Use the Browse… button if
you want to change the output directory.
b. Species – Since the tutorial data is human, select Homo sapiens.
c. Genome Build – Select the genome build that your data is aligned to. For this
tutorial data, choose hg19.
d. Select OK – this will open the Bam Sample Manager dialog (Figure 4)
Figure 4: Bam Sample Manager dialog
The Bam Sample Manager window shows the files that are to be imported. In this tutorial
the individual file names are SRA IDs; assigning shorter, informative names
will lead to clearer labels and legends later in the workflow.
5. To change the names, select Manage samples to invoke the Assign files to samples
dialog (Figure 5). The path to the file is shown and the Sample ID is the filename by
default.
6. Change the first sample chr20-SRR305173 to Tumour and chr20-SRR305174 to
Normal.
Figure 5: Assign files to samples dialog
If you have data from one experimental condition, which is split into two or more .bam
files, you can use Manage samples to assign the files to the same sample. Additionally, one
can add or remove files as needed. If the files are being imported for the first time, they
have to be sorted in order to enable quick visualization and data analysis.
7. Select OK to go back to the Bam Sample Manager dialog.
8. Select Close to close the Bam Sample Manager dialog and import the data.
The imported data will appear in a spreadsheet (Figure 6). Each imported sample is listed
in a row, with the number of aligned reads displayed for that run.
Figure 6: Imported data in a spreadsheet
Sample attributes can be easily added for grouping if the data set has replicates or sample
groups. To do so, select Add sample attributes from the workflow. Please refer to other
online tutorials (such as the Down Syndrome Study tutorial) for more information on Add
sample attributes.
The Sample ID column is particularly important when integration of data from different
workflows (i.e. different assays) is desired. The Sample ID column should contain a unique
identifier that will serve as a bridge between the multiple datasets, and it is important that
each identifier is identical in every spreadsheet.
In this tutorial, choosing a Sample ID column is optional because we do not perform any
genomic integration analysis. Nonetheless, choosing a Sample ID column is also useful in
telling the software that the entry in the Sample ID column will be used as a unique
identifier for each sample.
9. From the workflow, select Choose sample ID column.
10. Use the drop-down list to select the column 1. Sample ID (Figure 7).
Figure 7: Choose Sample ID column dialog
Performing QA/QC
For quality assessment, select Alignments per read from the QA/QC section of the
workflow. Alignments per read reports the number of reads that are unaligned
(aligned to 0 locations), aligned to 1 location, 2 locations, etc., depending on the options
selected during the alignment. It also tells you
whether your data is single-end or paired-end.
Figure 8: Alignment Counts spreadsheet
From Figure 8, we can see that the data is paired-end, as identified by the column
headers. For a paired-end data set, we would expect to see most of the reads in the
category 2 Paired End Alignments Per Read. If you observe high counts in 0 Paired
End Alignments Per Read and 1 Paired End Alignments Per Read, you should
double-check your data at the alignment stage. If your data set is single-end, the columns will be
labeled as follows: 0 Single End Alignments Per Read, 1 Single End
Alignments Per Read, etc.
Detecting SNVs
The Detect single nucleotide variations option can be found in the Allele-Specific
Analysis section of the workflow. Partek Genomics Suite supports two approaches for
SNV detection, Detect SNVs among samples and Detect SNVs against the reference
sequence (Figure 9).
Figure 9: Detect Single Nucleotide Variations (SNVs) dialog
SNVs are identified with a genotype likelihood test. This test estimates the
likelihood that a particular genomic location differs from the reference sequence or among
the samples. For further details on the method, please refer to the Detect SNVs in
Sequencing Data using Partek Genomic Suite software white paper, which is available from
Help > On-line Tutorials: select the White Papers tab and look for the RNA-Seq
section.
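To give a feel for what a genotype likelihood and a log-odds score are, here is a minimal Python sketch. It is a textbook-style diploid model, not the method implemented in Partek Genomics Suite (see the white paper for that). It assumes a single uniform sequencing error rate, ignores base and mapping qualities, and uses hypothetical function names.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_log_likelihoods(counts, error_rate=0.01):
    """Return {genotype: log10 likelihood} for all 10 diploid genotypes.

    counts: dict mapping base -> number of reads showing that base.
    error_rate: assumed uniform per-base error probability (an assumption).
    """
    result = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        loglik = 0.0
        for base in BASES:
            # Each read is drawn from one of the two alleles with probability 0.5.
            p = sum(0.5 * ((1 - error_rate) if base == allele else error_rate / 3)
                    for allele in (a1, a2))
            loglik += counts.get(base, 0) * math.log10(p)
        result[a1 + a2] = loglik
    return result


def log_odds_vs_reference(counts, ref_base, error_rate=0.01):
    """Best non-reference genotype and its log10 odds against homozygous reference."""
    ll = genotype_log_likelihoods(counts, error_rate)
    ref_genotype = ref_base + ref_base
    best = max((g for g in ll if g != ref_genotype), key=ll.get)
    return best, ll[best] - ll[ref_genotype]


# Made-up pileup: 10 reads show A and 10 show G at a position whose reference is A.
genotype, score = log_odds_vs_reference({"A": 10, "G": 10}, "A")
```

In this toy model a positive score favours a non-reference genotype and a negative score favours the reference, which mirrors how the log-odds threshold in the dialogs below is used to decide which positions are reported.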
Detect SNVs Among Samples
This method will compare each genomic location directly across the different samples in
the spreadsheet (but not to the reference genome). That strategy is useful in detecting
somatic mutations, for instance.
To detect SNVs among samples, select Detect single nucleotide
variations under the Allele-Specific Analysis section and then Detect SNVs among
samples.
Figure 10: Detecting SNVs Across Samples dialog
The Detect Differential SNPs Across Samples dialog (Figure 10) provides options to set the
log-odds ratio threshold and the resulting file name. A high log-odds score for a reported
SNP indicates a strong chance that at least one of the samples has a different base call at
that position. For this tutorial, use the default settings, and press OK to proceed.
A .2bit file containing the reference genome is needed for this step. Partek Genomics Suite
will automatically download the file if one is not already specified. Depending on the
speed of your internet connection, this download may take some time, but only needs to be
done once. When the SNP detection is completed, a new spreadsheet will appear as shown
in Figure 11.
Figure 11: Spreadsheet resulting from the “Detect SNVs among samples” tool
(SNVsAcrossSamples)
The resulting spreadsheet (Figure 11) has the following columns:
1 : position: the genomic location of the detected SNV.
2 : log-odds ratio of different genotypes: the score given to the detected
SNV. A higher score indicates a strong discrepancy in base composition
across the samples.
3 : reference base: the base call of the reference genome (for example:
hg19). If no reference genome is specified, N will be displayed.
4 to (X+3) : “Sample genotype call”: the most likely genotype call for each of the
samples at that location (X is the number of samples).
remaining columns : number of A, C, G, T and N calls for each of the samples at the given
location. N refers to ambiguous (or unknown) base calls.
Detect SNVs against the reference sequence
This method will compare each genomic location against the reference sequence (such as
hg19), independently for all the samples. This is the starting point for using the subtraction
method to select only the SNVs that appear in one particular sample, but not in the other
sample. Moreover, the subsequent analysis in this tutorial will be based on the output from
this SNV detection method.
To detect SNVs against the reference sequence, go to the Allele-Specific Analysis
section of the workflow, select Detect single nucleotide variations and
then select Detect SNVs against the reference sequence.
Figure 12: Detect nucleotides that are different from the reference dialog
The Detect nucleotides that are different from the reference dialog (Figure 12) provides
options to set the log-odds ratio threshold and the resulting file name. A high log-odds
score for a reported SNV indicates a strong chance that the sample has a nucleotide that is
different from the reference sequence at that particular position. For this tutorial, leave
the parameters at their default values and select OK to run the SNV detection. When the
detection is complete, the results appear as shown in Figure 13.
Figure 13: Spreadsheet resulting from “Detect SNVs against the reference sequence” tool
The results spreadsheet (Figure 13) has the following columns:
1 : position: the genomic location of the detected SNV.
2 : log-odds ratio of SNP against reference: the score given to the reported
SNP. A higher score indicates a strong discrepancy in base composition
compared to the reference sequence.
3 : sample ID: the sample that differed from the reference. If more than one
sample differed from the reference at the same location, the samples will be
displayed on separate lines.
4 : reference base: the base call of the reference genome.
5 : genotype call: the most likely genotype call of the sample listed in the
sample ID column at that position.
6 : total non-reference bases: total number of bases from the sample that do not
match the reference (not including no-calls).
7 : total coverage at locus: the total number of reads covering this position.
8 : non-reference average base qualities: average Phred-scaled base quality
score of bases different from the reference.
9 : reference base qualities: average Phred-scaled base quality score of bases
that match the reference.
10 : non-reference average mapping qualities: average mapping quality score of
reads containing the variant at the locus (0 = poor, 254 = good, 255 =
unknown).
11 : reference average mapping qualities: average mapping quality score of
reads containing the reference call at the locus.
12 – 15 : number of A, C, G, and T calls for the sample in the sample ID column at
this position.
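When reading the quality columns, it helps to recall that a Phred-scaled score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means a 1 in 100 chance that the base call is wrong. The short Python sketch below shows the conversion in both directions; the helper names are illustrative and are not part of Partek Genomics Suite.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Error probabilities for a few common quality scores.
table = {q: phred_to_error_probability(q) for q in (10, 20, 30)}
```

This is why a minimum quality of 20 is a common cut-off: it caps the expected error rate of the supporting base calls at about 1%.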
Filtering the detected SNVs
After running Detect SNVs against the reference sequence, we can
proceed to create a list of SNVs of interest. There are multiple ways to do this; the
right one depends on which SNVs we want to keep.
In this tutorial, the following filtering strategy will be applied:
1. Create a list of SNVs in the normal sample as well as the tumour sample.
2. Use a Venn diagram to overlap the SNVs from the normal with the tumour sample, and
obtain the SNVs present in tumour only.
3. Filter the tumour SNVs based on:
a. Coverage (≥50);
b. Non-reference average base quality (≥20);
c. Non-reference average mapping quality (≥20).
4. Filter out off-target SNVs.
5. Filter out SNVs that are present in a database of known SNPs, such as dbSNP.
Creating the List of SNVs in Tumour Sample Only
In order to create a list of SNVs in the tumour sample only, we first have to create a list of
SNVs in each sample and subsequently use the Venn Diagram to overlap them and then
select the section which belongs to the tumour sample only. We can use the Create region
list function under the Allele-Specific Analysis section.
1. Starting with the reference-snps spreadsheet (i.e. the result of Detect SNVs against the
reference sequence), select Create region list and you will see the List Creator
dialog (Figure 14).
Figure 14: List Creator dialog
2. Select Specify New Criteria to open the Configure Criteria dialog (Figure 15).
Figure 15: Configure Criteria dialog
Select the sample ID column. You should see two bars, which represent Normal and
Tumour. Right-click on the Normal bar to select the normal SNVs, set Name to
Normal sample (Figure 16) and press OK. Then repeat step 2, but this time
right-click on the Tumour bar to select the tumour SNVs, set Name to Tumour
sample (Figure 17) and press OK.
Figure 16: Selecting SNVs in the normal sample
Figure 17: Select SNVs in the tumour sample
3. Go back to the List Creator and select both lists by holding the Ctrl-key. Select Venn
Diagram to overlap the two lists of SNVs (Figure 18).
Figure 18: Overlapping two lists
4. In the Venn Diagram dialog, the currently selected section is marked with “*”. Click
on the overlap region to deselect it and select the Tumour sample section to keep only
the SNVs that belong to the tumour sample (Figure 19).
Figure 19: Selecting Tumour sample only SNVs from Venn Diagram. The arrow indicates
that the 5827 SNVs in Tumour sample only have been selected (as shown by the *)
5. Press OK and you will be asked to specify a name for this list of SNVs; label it as
Tumour-only SNVs (Figure 20).
Figure 20: Naming a list in the List Creator dialog
6. Subsequently Tumour-only SNVs will appear in the List Creator dialog. Select the
Save button to save the list. You will be asked which spreadsheet to save. Ensure that
the Tumour-only SNVs spreadsheet is checked. Select OK to save it (Figure 21).
Figure 21: Selecting the criteria that will be saved as lists
7. Select Close to close the List Creator dialog box.
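The Venn diagram step above is, in effect, a set operation on the two SNV lists. The Python sketch below shows the idea. It assumes that an SNV is identified by a (chromosome, position) pair, which is an assumption made here for illustration and may differ from how the software matches regions; the function name and the example positions are made up.

```python
def venn_partition(normal_snvs, tumour_snvs):
    """Split two SNV collections into normal-only, shared and tumour-only sets.

    Each SNV is represented as a (chromosome, position) tuple (an assumption).
    """
    normal, tumour = set(normal_snvs), set(tumour_snvs)
    return {
        "normal_only": normal - tumour,
        "shared": normal & tumour,
        "tumour_only": tumour - normal,
    }


# Made-up positions, for illustration only.
normal_calls = [("chr20", 1000), ("chr20", 2000)]
tumour_calls = [("chr20", 2000), ("chr20", 3000)]
sections = venn_partition(normal_calls, tumour_calls)
```

Here `sections["tumour_only"]` plays the role of the Tumour-only SNVs list: variants shared with the matched normal sample are treated as germline and dropped.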
Filter the tumour-only SNVs based on several criteria
At this stage, we have the list of tumour-only SNVs. Several technical filters can be
applied to the list, and this tutorial covers how to apply the following three:
a. Coverage (≥50)
b. Non-reference average base quality (≥20)
c. Non-reference average mapping quality (≥20)
To quickly filter the list, we can use the Interactive Filter function, which is
available on the tool bar.
1. Select the Tumour-only SNVs spreadsheet and click on the Interactive Filter icon.
Select the column 7. total coverage at locus. Then set Min to 50 and press Enter
(Figure 22). Tip: the black bar at the right indicates that the spreadsheet is currently
filtered. To clear the filter, mouse over the black bar, right-click
and choose Clear Filter. Alternatively, go to Filter > Filter Rows > Clear Row
Filters.
Figure 22: Filtering based on total coverage at locus. The interactive filter bar is
highlighted
2. Repeat the filtering as described above on the columns 8. Non-reference average
base quality (set Min to 20) and 10. non-reference average mapping quality (set
Min to 20). Remember that whenever you specify a filter value, you have to press
Enter to ensure that the filter is applied. After this, only 16
SNVs should remain, as shown in Figure 23.
Figure 23: Filtered Tumour-only SNVs, containing 16 SNVs (see Rows: in the bottom left
corner)
3. Save this list to replace (!) the original Tumour-only SNVs spreadsheet by pressing the
Save icon and answering Yes when asked whether to save the filtered list only.
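The three interactive filters amount to a conjunction of minimum thresholds. A minimal Python sketch of that logic follows; the `SnvCall` record, its field names and the example values are hypothetical and only stand in for columns 7, 8 and 10 of the spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    position: str                  # made-up label such as "chr20:100"
    coverage: int                  # column 7: total coverage at locus
    nonref_base_quality: float     # column 8: non-reference average base quality
    nonref_mapping_quality: float  # column 10: non-reference average mapping quality


def passes_filters(call, min_coverage=50, min_base_quality=20.0,
                   min_mapping_quality=20.0):
    """Return True only if the call meets all three minimum thresholds."""
    return (call.coverage >= min_coverage
            and call.nonref_base_quality >= min_base_quality
            and call.nonref_mapping_quality >= min_mapping_quality)


# Made-up calls: only the first one satisfies every threshold.
calls = [
    SnvCall("chr20:100", 80, 31.0, 37.0),
    SnvCall("chr20:200", 12, 33.0, 37.0),   # fails coverage
    SnvCall("chr20:300", 65, 14.0, 37.0),   # fails base quality
    SnvCall("chr20:400", 70, 30.0, 5.0),    # fails mapping quality
]
kept = [call for call in calls if passes_filters(call)]
```

Because the filters are combined with a logical AND, the order in which they are applied in the Interactive Filter does not change the final list.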
Creating a Custom Annotation Database
The sample exomes for this tutorial were captured using Agilent SureSelect Human All
Exon Kit v1.01. If you have the .bed file provided by a vendor, you can use it to create a
custom annotation using Partek Genomics Suite software. As we do not have Agilent’s
coordinates in .bed format, we have instead downloaded all CCDS exon coordinates from
the University of California Santa Cruz site. The resulting hg19-ccds-exon.bed file is
included in the data set provided for this exercise. To proceed, we first need to create a
custom annotation file from that .bed file.
1. Go to menu Tools > Annotation Manager > Create Annotation. Then select BED
file (.bed). Refer to Figure 24 below.
Figure 24: Annotation Manager dialog
2. Browse… to point to the file hg19-ccds-exon.bed that comes together with this
tutorial data and set the Species to Homo sapiens and the Genome build to hg19
(Figure 25).
Figure 25: Create Annotation dialog
3. Select OK to create the annotation file and select Close to close the Annotation
Manager dialog.
4. From the DNA-Seq workflow, under the Allele-Specific Analysis section, select
Annotate with known SNPs. Then select the Custom dropdown list and point to the
hg19-ccds-exon.bed (Figure 26).
Figure 26: Annotating with a custom file, hg19-ccds-exon.bed
5. Select OK to start the annotation and close the dialog.
6. The Tumour-only SNVs spreadsheet should now have two additional columns at the
far right: Known SNPs and # Known SNPs. If an SNV overlaps a region in the .bed
file, the # Known SNPs column will contain a value greater than 0; that value is
the number of regions that overlap the SNV (Figure
27).
Figure 27 shows that only 1 SNV overlaps a region in hg19-ccds-exon.bed,
meaning that the remaining 15 SNVs may lie in off-target regions.
Figure 27: After annotating with hg19-ccds-exon.bed
7. To keep only the on-target SNV, use the Interactive Filter again and filter
on the # Known SNPs column, setting Min to 1
(Figure 28).
Figure 28: Using the column “# Known SNPs” to keep only on-target SNVs
8. Select the Save button to save the Tumour-only SNVs spreadsheet. Select Yes to save
only the filtered spreadsheet; it now contains the single remaining SNV.
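The on-target test used in this section boils down to counting how many .bed regions contain a given SNV position. The Python sketch below illustrates that count. It relies on the BED convention of 0-based, half-open intervals and assumes that SNV positions are 1-based, which is an assumption about the spreadsheet rather than something stated in this tutorial. The function names and the example regions are made up.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping header lines."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def count_overlapping_regions(chrom, position, regions):
    """Count regions containing a 1-based position.

    BED start is 0-based and end is exclusive, so a 1-based position p
    lies inside a region when start < p <= end.
    """
    return sum(1 for c, start, end in regions
               if c == chrom and start < position <= end)


# Made-up regions, for illustration only.
regions = parse_bed(["chr20\t99\t200", "chr20\t150\t300", "chr1\t99\t200"])
on_target = count_overlapping_regions("chr20", 180, regions)
```

A count of 0 corresponds to an off-target SNV, and filtering with a minimum of 1 keeps only SNVs that fall inside at least one region.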
Annotate detected SNVs with known SNPs
We now have a single selected SNV and want to know whether it is already present in
the public database dbSNP. To annotate the SNV with dbSNP, select
Annotate with known SNPs again and choose dbSNP 135 as the dbSNP annotation
database.
The final result shows that this SNV is not present in dbSNP, as
shown in Figure 29: the two right-most columns, Known SNPs and # Known SNPs, contain
None and 0, respectively.
Figure 29: Annotating SNVs with dbSNP135. Columns #18 and #19 show that this SNV
does not overlap with any of the SNPs in the dbSNP135 database
Since this is a novel SNV, no additional filtering will be done. We can also
annotate it with the COSMIC database. Select Annotate with known SNPs again; the
COSMIC database is available under Genomic Variants Database. Choose the
COSMIC database and you will see that this SNV is not described there.
This step is left for you to explore on your own.
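The two annotation columns described in this section can be thought of as the result of a simple lookup. The Python sketch below mimics that behaviour with a dictionary keyed by (chromosome, position). The function name, the key format and the identifiers are placeholders invented for this example, not real dbSNP records or Partek internals.

```python
def annotate_known_snps(snv_keys, known_snps):
    """Return (key, known SNPs text, number of known SNPs) for each SNV.

    known_snps: dict mapping (chromosome, position) -> list of identifiers
    (a hypothetical in-memory stand-in for a database such as dbSNP).
    """
    rows = []
    for key in snv_keys:
        ids = known_snps.get(key, [])
        rows.append((key, ", ".join(ids) if ids else "None", len(ids)))
    return rows


# Placeholder identifiers, not real database entries.
database = {("chr20", 2000): ["id_example_1"]}
rows = annotate_known_snps([("chr20", 2000), ("chr20", 3000)], database)
```

An SNV that returns None and 0, like the second row here, is a candidate novel variant with respect to the database that was queried.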
Annotate detected SNVs with functional effects
After we have come to the final list of detected SNVs (Tumour-only SNVs spreadsheet), we
would like to know where this SNV is located within the gene, and whether there is any
potentially deleterious functional effect associated with this SNV.
Select the Tumour-only SNVs spreadsheet and go to Annotate functional effects under the
Allele-Specific Analysis section of the workflow. You will then be prompted to decide on
the transcript annotation database for the annotation. For this tutorial, choose RefSeq
Transcripts – 2013-09-03. The resulting output file will be named annotate-snvs. We
can leave the other parameters at their default values (Figure 30).
Figure 30: Annotate SNVs with transcripts dialog
Your resultant spreadsheet should have 12 columns (Figure 31).
Figure 31: Annotate-SNVs spreadsheet
The columns are:
1 : Chromosome: the chromosome on which the SNV is located.
2 : Position: position of the SNV on the chromosome.
3 : Reference: the reference genome call for that position.
4 : Alt: the alternative base observed at this position based on the Tumour-only
SNVs spreadsheet.
5 : Sample ID: the sample that carries this SNV.
6 : Gene Symbol: the gene in which this SNV is located.
7 : Transcript: the ID of the transcript in which this SNV is located.
8 : Strand: the strand of the transcript described in column #7.
9 : Gene Section: the section of the gene where this SNV is located. It can have one
of the following values: Exon, Intron, Promoter, 5’-Splice Site, 3’-Splice
Site.
10 : Functional Effect: the effect that this SNV might cause. It can have one of
the following values: Intronic, Missense, Nonsense, Synonymous, Splicing
Site, ncRNA, Promoter, 5’-UTR, 3’-UTR.
11 : Nucleotide Change: the nucleotide change with respect to the transcript. If
the transcript is on the reverse (-) strand, then the nucleotides will be the
reverse complement of the Reference and Alt bases. The position is based on
the cDNA nomenclature.
12 : Amino acid change: the amino acid change that could result from the
base change. This is only applicable if the functional effect is Missense or
Nonsense. The position is based on the protein position.
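Columns 10 to 12 can be made more concrete with a small Python sketch. It shows how Reference and Alt bases are reverse-complemented for a transcript on the reverse strand, and how a codon change is classified with the standard genetic code. This is an illustration under simplifying assumptions (coding changes only, with no handling of splice sites, UTRs or stop-loss changes), not the algorithm used by Partek Genomics Suite; all function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, with codons enumerated in TCAG order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def change_on_transcript(ref, alt, strand):
    """Express a genomic Reference/Alt pair relative to the transcript strand."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref.upper(), alt.upper()


def classify_codon_change(ref_codon, alt_codon):
    """Classify a coding change as Synonymous, Nonsense or Missense.

    Simplification: any other amino acid change, including loss of a stop
    codon, is reported as Missense.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Nonsense"
    return "Missense"


# A genomic A>G change reads as T>C on a reverse-strand transcript.
transcript_change = change_on_transcript("A", "G", "-")
```

For example, `classify_codon_change("TGG", "TGA")` reports a Nonsense change because the tryptophan codon becomes a stop codon.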
Visualize the SNVs
In any of the SNV spreadsheets, such as reference-snps, SNVsAcrossSamples, or
Tumour-only SNVs, right-click on a row header and select
Browse to Location (Figure 32). The resulting chromosome view is shown in Figure 33.
For more detail, refer to the Chromosome Viewer User Guide, which is available from
Help > Online Tutorials, under the User Guides section.
Figure 32: Using context menu to Browse to Location
Figure 33: Chromosome View; tracks (from the top): chosen transcript model, SNP
Proportion track (the box represents an SNV call, while the colors represent relative
proportions of the short sequencing reads with the reference and the alternative base call),
Bam profile track (one track per sample, showing short sequencing reads), reference
genome.
End of Tutorial
For additional assistance, contact our technical support staff by phone at +1-314-878-2329
or by email [email protected].
Last revision: February 2014
Copyright © 2014 by Partek Incorporated. All Rights Reserved. Reproduction of this material without express written
consent from Partek Incorporated is strictly prohibited.