HIVE dna-hexagon Tutorial
The purpose of this tutorial is to guide the user through the process of a single alignment using the HIVE
dna-hexagon tool. All other variations on alignments using the tool employ this same basic process with
modified inputs and parameters.
TABLE OF CONTENTS
Introduction
1. Selecting Inputs
1.1 Query Sequences (reads)
1.2 Reference Genome
2. Input Parameters
2.1 Alignment Algorithm
2.2 Algorithmic Parameters
3. Job Processing
4. Alignment Results
5. What next?
INTRODUCTION
The HIVE dna-hexagon sequence alignment tool allows the user to align reads from a high throughput
experiment to a reference genome. Both reads and references may be selected from those provided by
HIVE or supplied by the user. (This tutorial assumes the query sequences are already in the system. For
help loading new data into HIVE, please see Tutorial for DMDownloader). Once query sequences are
selected, a number of summary and quality-check visualization tools are available to view. Numerous
parameters allow customization of the alignment to suit the user’s needs. When the alignment is
finished, the tool produces a hits table and pie chart showing hits of the query sequences to genes on
the reference genome in addition to several other visualizations, all of which can be downloaded. The
user may then compute SNP profiles or move on to a number of other analysis modules which can be
stacked end-to-end with the dna-hexagon aligner.
For easier understanding, all text for HIVE system options, parameters or buttons is displayed in bolded
blue. Any text the user should input is displayed in the Courier font. Equations are offset and
italicized.
1. SELECTING INPUTS
The only two inputs required for selection are the query sequences (reads) and the reference genome
(genomes). While further customization is available via specification of parameters, all other inputs
have set defaults such that the alignment will proceed without the user entering any additional values.
1.1 Query Sequences (reads)
Once you are logged into HIVE and on the user home page, the second menu from the left in the header
region reads HIVE-Portal (Figure 1). Hovering the mouse over this option opens a menu below it. Upon
clicking the Sequence Alignment on Genome tool, you will be redirected to the HIVE alignment
dna-hexagon portal.
Figure 1. HIVE Home Directory for a Logged-In User
Once in the portal (Figure 2) you should see four input boxes: Reference Genomes, Short Reads,
Alignment Algorithm and Algorithmic Parameters. The Short Reads box should be visible in the
upper-right of the window. To view all the short reads available to you, click on the expansion icon
in the top left corner of this box. This will reveal the list (Figure 3) of all the reads you have
uploaded in addition to Demo Reads and other files shared with you by the HIVE team under the
user name Biological Data Handler, or by another user.
Figure 2. dna-hexagon View
Figure 3. List of Short Reads Accessed from dna-Hexagon Inputs Page
Note that the above representation is in a list view. At the bottom of the sequence reads window
there is an option to toggle the organization between the hierarchy and list formats by clicking
the associated icons.
Each item in the sequence reads menu is preceded by the checkbox icon. To choose a file or
series of files for alignment check the box(es) next to the desired file(s). To choose all files in the
directory, check the icon at the very top of the chart next to the ID heading.
If at any point you find yourself without the desired sequence files or in the wrong place
altogether, clicking the Home link at the top left of the page will always return you to your user
home directory. For tutorial purposes, please click Home at this time. Any information loaded by or
shared with you in HIVE will be found in this directory.
For streamlined alignment access, you may also select short reads or genomes directly from your
home directory. All short reads available to you can be seen in the top page section labeled Files and
Sequences under the reads tab. Here you should see a list of files identical to that accessed from the
dna-hexagon page. To proceed with the tutorial, please check the two Influenza paired reads
shown in Figure 4 below.
Figure 4. Demo Input Sequences - Influenza Paired Reads
Notice the data in this case is displayed in a pair - these data come from paired-end read
experiments such that one is the forward read and the other the reverse read of the template.
Only by checking the boxes will you be able to select data for alignment. Simply clicking on a file
name (selecting a file) will highlight the selected file and allow you to view details about it, but will
not give you the option to perform any action (such as alignment) on the file. This preview
information can be viewed in the box to the right in the preview and details tabs. For your
convenience, the help tab containing topical help can also be accessed here.
Once you check the boxes next to this pair you should have operational icons appear in your toolbar
at the top of this list (Figure 5). Clicking on the dna-hexagon image in this toolbar will direct you
again to the alignment tool. You should see your short read sequences have already been selected
and are listed in the short reads input box.
(NOTE: If the user wishes to upload new experimental sequence information into the system, click
the add tab on the home page in the Files and Sequences box. This will redirect the user to the
DMDownloader utility. Please see the DMDownloader tutorial for further details.)
With your query sequences selected, you are now ready to choose a reference genome.
Figure 5. HIVE Toolbar
1.2 Reference Genome
A reference genome is a comprehensive sequence of the entire genome of a given species,
assembled by scientists to be representative of that species. Reference genomes for several species
have been made available to the users, shared by the HIVE team under the user name Biological
Data Handler.
Selection of a reference genome should depend greatly on the experimental data being aligned. For
example, if you think you have isolated a new strain of influenza and you want to see how it
compares to other influenzas, you will want to select an influenza reference genome. Alternatively,
if you are looking for SNPs in a disease-associated human protein, you will need to select a human
reference genome. Perhaps you are looking for horizontal gene transfer between two species of
bacteria, in which case you may want to align your sequences from one species to the reference
genome of the other.
We have selected sequence data for Influenza. There may be several applicable genomes. In
addition to picking the right species genome, you may need further information to choose the best
possible genome for alignment. Continuing with the tutorial, we will now choose the influenza
genome. Click on the expansion icon found in the upper left corner of Reference Genomes box.
Please check the box next to ID 5153, file named Influenza_Segments.fa, to select it as our
reference genome.
2. INPUT PARAMETERS
Below the Input Data box are two collapsible/expandable sections, Alignment Algorithm and Algorithmic
Parameters, which allow for complete customization of many aspects of the alignment. All parameters
are populated with default values, so a user is not required to specify any values in order to run
the alignment; they need only click the Align button after selecting inputs. To expand any closed section,
click the icon to the left of the section header. To hide an open section, click the icon to the left of
the section header of the newly expanded window.
2.1 Alignment Algorithm
Here you may select which alignment tool you would like to use to align your data. Access the menu
by clicking the expansion icon in the top left of the Alignment Algorithm box. Current options
include: HIVE’s dna-hexagon, NCBI-BLAST, Bowtie, TopHat, Ace View Magic and BWA. dna-hexagon
is the native, default alignment tool developed and optimized for use within high-performance cloud
computing environments and therefore is best adapted to efficiently use the HIVE infrastructure to
enhance analysis performance. The external tools have all been adapted to the parallel HIVE
environment, so you should see increased performance of most of those tools when comparing
their usage within HIVE to their usage as standalone tools. We recommend dna-hexagon for best
results, so this tutorial will be using dna-hexagon.
2.2 Algorithmic Parameters
The parameters shown by default on the portal page are those which are most often user-modified,
including the ability to specify a name for the resultant output file.
Name: Specify a name for the alignment output file.
Minimum Match Length: Alignments produced will likely vary in length. This option allows the user
to choose a preferred length of alignment by discarding all alignments shorter than the specified
value. The default minimum length is set to 75.
Matches to Keep: Alignment results can be filtered by choosing one of three options. The reported
matches/alignments can be returned as the Best Match or First Match, or as the list of All Matches
by selecting the preferred option from the drop-down menu. The First Match option can be useful
when the user is simply interested in finding the genome to which a query sequence belongs,
whereas the Best Match option ranks alignments according to alignment scores. Default is set to
Best Match.
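The three Matches to Keep options can be pictured as a small selection step over the candidate alignments found for one read. The following Python sketch is illustrative only and is not HIVE code: the Hit record, the select_matches function and the example hits are hypothetical, and "First Match" is assumed to mean the first hit in the order the hits were found.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    reference: str   # name of the reference sequence that was hit
    start: int       # start position of the hit on the reference
    score: int       # alignment score (higher is better)

def select_matches(hits: List[Hit], mode: str = "best") -> List[Hit]:
    """Return the hits to report for one read (illustrative only)."""
    if not hits:
        return []
    if mode == "first":
        # Assumption: hits are listed in the order they were discovered.
        return [hits[0]]
    if mode == "best":
        return [max(hits, key=lambda h: h.score)]
    if mode == "all":
        return list(hits)
    raise ValueError(f"unknown mode: {mode}")

# Made-up example hits for a single read.
hits = [Hit("NS", 10, 310), Hit("PB2", 42, 355), Hit("HA", 7, 290)]
print(select_matches(hits, "first"))
print(select_matches(hits, "best"))
```

Here "first" stops at the earliest hit, which is enough to tell which genome a read belongs to, while "best" has to compare the scores of every candidate.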
Percent Mismatches Allowed: A mismatch in an alignment occurs when the query and reference
align at most points but differ in nucleotides at corresponding positions in the same region. An
alignment is not disregarded due to mismatches, but the alignment score reflects a penalty
associated with mismatches. The mismatch percentage is equal to the total number of mismatches
divided by the total number of positions.
Mismatch % = (total # mismatches / total # positions) × 100
This filter will remove alignments with a mismatch % greater than the specified value. Default value
is set to 15 for 15%.
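As a rough illustration of how the Minimum Match Length and Percent Mismatches Allowed filters combine, the sketch below applies both to a single alignment using the default values of 75 and 15. The function names are hypothetical, and the number of positions is assumed to be the alignment length; HIVE may count positions differently.

```python
def mismatch_percent(mismatches: int, positions: int) -> float:
    """Mismatches as a percentage of aligned positions."""
    return 100.0 * mismatches / positions

def passes_filters(length: int, mismatches: int,
                   min_match_length: int = 75,
                   max_mismatch_percent: float = 15.0) -> bool:
    """Apply the length filter, then the mismatch filter (assumed order)."""
    if length < min_match_length:
        return False
    return mismatch_percent(mismatches, length) <= max_mismatch_percent

print(passes_filters(length=100, mismatches=12))  # 12% mismatches, long enough
print(passes_filters(length=100, mismatches=16))  # 16% exceeds the 15% limit
print(passes_filters(length=60, mismatches=0))    # shorter than 75
```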
The remaining, less frequently used and therefore hidden parameters are located in the Algorithmic Parameters
box, accessed by clicking the expansion icon in the top left corner of the box.
Advanced Parameters:
Alignment Directionality: This parameter allows the user to choose whether sequences will be
aligned in a Forward or Bi-directional (forward and reverse) manner by selection from the
drop-down menu. Default is Bi-directional.
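Bi-directional alignment can be thought of as trying each read both as given and as its reverse complement. The minimal Python sketch below shows that idea; the function names are hypothetical and the code says nothing about how dna-hexagon implements directionality internally.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def orientations(read: str, bidirectional: bool = True):
    """Yield (direction, sequence) pairs to try against the reference."""
    yield "+", read
    if bidirectional:
        yield "-", reverse_complement(read)

for direction, seq in orientations("AAC"):
    print(direction, seq)
```

The "+" and "-" labels mirror the Direction column described for the alignments tab in Section 4.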
Alignment Tail Extension: Extension can be performed in Conservative Compact or Aggressive
Extended mode according to the selection from the drop-down menu. The Aggressive
Extended option facilitates continued alignment of sequences until the threshold of mismatches
encountered is reached, resulting in lower sensitivity but higher coverage. The Conservative
Compact option terminates an alignment as soon as a higher score is obtained, providing higher
sensitivity but lower coverage. The Conservative Compact mode is a good choice when working
with small genomes that have a higher mutation rate. Default is set to Conservative Compact.
Allowed number of gaps during extension: The user can specify the maximum number of
insertions and deletions allowed during the seed extension phase of alignment. Default is set to
3.
Intelligent Diagonal: HIVE allows the ability to compute only the region surrounding the
diagonal of alignment scoring matrix under the assumption that the best score should lie near
the diagonal. If a user is uncertain, they can specify to compute the Complete matrix. Default is
set to Diagonal of a given size.
Intelligent K-mer Jump in Read: Increasing the seed size in large genomes improves the
alignment speed significantly at a potential cost of loss of sensitivity. The value 1 means no loss
in sensitivity. The size of the seed is the maximum recommended value. If auto is chosen it will
be set to 1 for small genomes and 2 for the large genomes. Default is auto.
Intelligent K-mer Jump in Reference: The same logic as for Intelligent K-mer Jump in Read
above applies, but with respect to the reference genome instead of the short read. Default is
auto.
K-mer Extension Minimal Length Percent: The minimal length of the extension must be equal to
this percentage of the seed length. Default is set to 66 for 66%.
K-mer Extension Mismatch Allowance Percent: This parameter allows the user to define the
percentage of mismatches allowed during extension of alignments. If 0 is entered, all positions
are computed using the Smith-Waterman algorithm. Default is set to 25 for 25%.
Optimal Alignment Search: This option allows the user to choose whether to find the optimal
alignment using the Smith-Waterman algorithm (default setting) or quickly look up highly
scoring hits by selecting Only Identities from the drop-down menu. The Smith-
Waterman algorithm is a dynamic programming algorithm that employs a substitution matrix to
yield the highest scoring local alignment. The time required by the algorithm to return the
optimal alignment greatly increases with an increase in sequence length. If the user is not
interested in optimal alignments and merely wants to identify high scoring potential hits, the
Only Identities option will produce results at a faster rate.
Over-represented K-mer Suppression: Some K-mers appear in data much more than is to be
expected by random occurrence. This parameter allows the elimination of K-mer hits which are
this percentage more abundant than expected. Default set to 20 for 20%.
Use Read Self Similarity: Selecting the Use self similarity optimization (default) option allows
the aligner to index and count identical reads instead of aligning each identical read separately.
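The idea behind the self similarity optimization can be sketched in a few lines: identical reads are collapsed into one entry with a count, so each distinct sequence only needs to be aligned once. This is a simplified illustration with a hypothetical function name and made-up reads, not the indexing scheme HIVE actually uses.

```python
from collections import Counter

def collapse_identical_reads(reads):
    """Map each distinct read sequence to the number of times it occurs."""
    return Counter(reads)

# Made-up reads: three copies of one sequence and one copy of another.
reads = ["ACGTAC", "ACGTAC", "TTGACA", "ACGTAC"]
unique = collapse_identical_reads(reads)

# Each distinct sequence would be aligned once and reported with its count.
for seq, count in unique.items():
    print(seq, count)
```

The counts correspond conceptually to the Repeats column described for the alignments tab in Section 4.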
Width of Intelligent Diagonal: As mentioned in the Intelligent Diagonal description above, HIVE
allows computation of a defined region around the diagonal of the matrix as opposed to
computation of the entire matrix, including regions which are known to have no potential of
containing the highest scoring path. Setting the parameter to auto will specify a width equal to
the length of the seed.
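To make the diagonal idea concrete, here is a minimal sketch of a banded local alignment score in Python. It fills in only the cells of the scoring matrix that lie within a given width of the main diagonal. It is a simplification under stated assumptions: the function name is hypothetical, a single linear gap penalty is used instead of separate gap opening and continuation costs, and cells outside the band are treated as zero.

```python
def banded_local_score(query, ref, band, match=5, mismatch=-4, gap=-12):
    """Best local alignment score using only cells with |i - j| <= band."""
    prev = {}   # column index -> score in the previous row
    best = 0
    for i in range(1, len(query) + 1):
        curr = {}
        lo = max(1, i - band)
        hi = min(len(ref), i + band)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            diag = prev.get(j - 1, 0) + sub   # extend along the diagonal
            up = prev.get(j, 0) + gap         # gap in the reference
            left = curr.get(j - 1, 0) + gap   # gap in the query
            score = max(0, diag, up, left)    # local alignment floor of 0
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best

# The read matches the reference two positions off the main diagonal.
print(banded_local_score("ACGT", "TTACGT", band=0))  # band too narrow
print(banded_local_score("ACGT", "TTACGT", band=2))  # band wide enough
```

A band that is too narrow misses an alignment that sits away from the diagonal, which is why computing the Complete matrix remains available as a fallback when the user is uncertain.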
Alignment Algorithm: Automatically populated by selection of algorithm mentioned in Section 2.1
above.
Alignment Filters:
Repeat and Transposition Discovery: This parameter allows the user to perform more detailed
multi-pass lookups in order to identify repeats and transpositions, or repeats only, by selection
of the preferred option from the drop-down menu. Inclusion of this search comes at the
expense of a longer run-time, so the default is set to exclude this search.
Score Filter: Each alignment produces a score based on matches, mismatches, insertions,
deletions and gaps. Usually, a higher score is correlated with a better alignment. This option
allows the user to eliminate alignment scores below a specified threshold. The default setting of
None will report all alignments and discard nothing.
Alignment Parameters:
Alignment Scope: Alignment of sequences across the reference genome can be done locally or
globally by choosing the appropriate value from the drop-down menu. A Local alignment tries to
find the best match in particular regions of the genome whereas a Global alignment tries
matching the entire sequence length to the reference genome (Figure 6). Global alignments are
good for similarity and identity searches between the sequences. Local alignments are better
suited for identifying conserved residues or domains.
Figure 6. Local vs. Global Alignments
Alignment Costs: Alignments are ranked by comparison of alignment scores, composed by the
assignment of cost values to all matches, mismatches and gaps. Matches are rewarded,
mismatches and gaps are penalized.
In a biological scenario, single base insertions or deletions are less probable than larger
insertions or deletions. For this reason, a gap opening is assigned a higher penalty than
continuation of a gap while computing the alignment score. If gaps continue over multiple
positions after the gap has already been opened, they are more likely the result of a true
insertion or deletion with potential evolutionary significance. Thus, the default Gap open
penalty has a more negative value than the default Gap next penalty.
Default values for Match Benefit, Mismatch Penalty, Gap Opening Cost, Gap Continuation Cost
and Mismatch Continuation Penalty are 5, -4, -12, -4 and -6, respectively.
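The effect of separate gap opening and gap continuation costs can be seen by scoring two already-aligned rows. The Python sketch below uses the default Match Benefit, Mismatch Penalty, Gap Opening Cost and Gap Continuation Cost quoted above. It rests on assumptions: the function name is hypothetical, the first position of a gap is charged the opening cost and each further position the continuation cost, and the Mismatch Continuation Penalty is left out because this tutorial does not define how it is applied.

```python
def alignment_score(ref_row, query_row, match=5, mismatch=-4,
                    gap_open=-12, gap_next=-4):
    """Score two aligned rows of equal length, with '-' marking gaps."""
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for r, q in zip(ref_row, query_row):
        if r == "-" or q == "-":
            # First gap position pays the opening cost, later ones less.
            score += gap_next if in_gap else gap_open
            in_gap = True
        else:
            score += match if r == q else mismatch
            in_gap = False
    return score

# Same number of matches and gap positions, arranged differently.
print(alignment_score("ACGTTTAC", "ACG---AC"))  # one gap of length 3
print(alignment_score("ACGTTTAC", "A-G-T-AC"))  # three gaps of length 1
```

With these defaults the single three-base gap scores 25 - 12 - 4 - 4 = 5, while three separate one-base gaps score 25 - 36 = -11, so the contiguous gap is preferred.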
Seed K-mer: A seed is a short fragment of the query sequence used to search for nucleotide
similarity. Seed length determines the size of the K-mer to look up in a hash table. Higher seed
values make the alignment faster but have a potential of decreasing the sensitivity of the
method. The default seed length is set at 11 letters, corresponding to the seed length employed
by the BLAST algorithm. Alternative seed lengths pre-populate the drop-down menu.
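The seed lookup described above can be sketched as a hash table that maps every K-mer of the reference to the positions where it occurs, followed by a lookup of each K-mer of the read. This is an illustrative Python sketch with hypothetical function names and a made-up reference; dna-hexagon's real index is not described in this tutorial and is certainly more elaborate.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int = 11):
    """Map each k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_hits(read: str, index, k: int = 11):
    """Return (read_offset, reference_position) pairs for matching seeds."""
    hits = []
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            hits.append((offset, ref_pos))
    return hits

reference = "GGGACGTACGTTAGCCATGAAA"   # made-up reference sequence
read = reference[3:18]                 # a read taken from position 3
index = build_kmer_index(reference, k=11)
hits = seed_hits(read, index, k=11)
print(hits)
```

A longer seed produces fewer table entries and fewer chance hits, which speeds up the search, but a read with a mismatch inside every seed-sized window would then find no seed at all, which is the sensitivity cost mentioned above.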
Shorter Terminus Alignment: Alignment coverage near the end (terminus) of a reference
segment is often less than ideal due to the Minimum Match Length threshold. For example, if
the Minimum Match Length parameter is set to 30 but you have a read that exactly matches 20
base pairs up to the end of segment, this exact match over the area that matters will be
excluded because it is shorter than the specified threshold. This parameter essentially allows
specification of a different match length threshold in the region of the terminus. The default of 0
means that the Minimum Match Length value is used.
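The example above (a Minimum Match Length of 30 and a 20 base pair exact match at the end of a segment) can be expressed as a small threshold check. The sketch is only one possible reading: the function and parameter names are hypothetical, coordinates are assumed to be half-open, and the "terminus" is assumed to mean an alignment that touches either end of the reference segment.

```python
def passes_length_threshold(aln_start: int, aln_end: int, ref_length: int,
                            min_match_length: int = 75,
                            terminus_length: int = 0) -> bool:
    """Length check with an optional shorter threshold at segment ends."""
    length = aln_end - aln_start          # half-open interval [start, end)
    touches_terminus = aln_start == 0 or aln_end == ref_length
    if touches_terminus and terminus_length > 0:
        return length >= terminus_length
    # A terminus_length of 0 falls back to the ordinary minimum.
    return length >= min_match_length

# 20 bp match running up to the end of a 1000 bp segment.
print(passes_length_threshold(980, 1000, 1000, min_match_length=30))
print(passes_length_threshold(980, 1000, 1000, min_match_length=30,
                              terminus_length=20))
```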
Alignment Test Run Slice: This allows a user to conduct a test-run of parameters by specifying the
size of a subset of sequences to run per computational thread. The default of all indicates to run all
sequences in the alignment. To use the test-run feature, the user must specify a number of
sequences to run on each thread.
Low complexity filter: Some subjects are well known to have regions or entire sequences of low
complexity and a high degree of repeats. With these subjects in mind, HIVE allows the user to
specify masking such that regions of both reads and references can be excluded from the alignment
by defining the minimal entropy threshold and a window size of the sequence to be ignored.
Reference masking:
Minimal Shannon Entropy: The dropdown menu allows the user to define the Minimal
Shannon Entropy of a region of the reference genome allowed to be included in an
alignment. Default is 0 – permissive, which allows all regions to be included.
Window Size: The dropdown menu allows the user to specify the Window Size of the
reference genome to ignore if the calculated entropy drops below the specified Minimal
Shannon Entropy. Default of do not mask low complexity regions allows inclusion of all
regions regardless of calculated entropy.
Short read filtration:
Minimal Shannon Entropy: The dropdown menu allows the user to define the Minimal
Shannon Entropy of a region of any short read allowed to be included in an alignment.
Default is 0 – permissive, which allows all regions to be included.
Window Size: The dropdown menu allows the user to specify the Window Size of all short
reads to ignore if the calculated entropy drops below the specified Minimal Shannon
Entropy. Default of do not mask low complexity regions allows inclusion of all regions
regardless of calculated entropy.
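The entropy and window size settings can be illustrated with a short Python sketch that computes the Shannon entropy of each window of a sequence and flags the windows that fall below a threshold. The function names are hypothetical, entropy is measured in bits per base (0 for a single repeated base, 2 for four equally frequent bases), and the scale HIVE uses for its menu values may differ.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Shannon entropy of the base composition of a window, in bits."""
    total = len(window)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(window).values())

def low_complexity_windows(seq: str, window_size: int, min_entropy: float):
    """Return start offsets of windows whose entropy is below min_entropy."""
    return [start
            for start in range(len(seq) - window_size + 1)
            if shannon_entropy(seq[start:start + window_size]) < min_entropy]

seq = "ACGT" + "A" * 8 + "ACGT"   # made-up sequence with a poly-A stretch
print(shannon_entropy("AAAAAAAA"), shannon_entropy("ACGTACGT"))
print(low_complexity_windows(seq, window_size=8, min_entropy=1.0))
print(low_complexity_windows(seq, window_size=8, min_entropy=0.0))
```

With a threshold of 0 no window can fall below it, so nothing is flagged, which matches the permissive default described above.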
Number of Computational Subjects per Single Thread: This parameter allows the user to specify the
maximum number of sequences per thread to be aligned by a single compute node.
Query Data: Automatically populated by object IDs of selected input files
Reference Genomes: Automatically populated by selected reference genome
Reference genome serial number: Automatically populated by selected reference genome, if
applicable.
For the purposes of our tutorial, we will leave all parameters set to the default values, but we will
rename the output file “Demo_Align_1”. Make sure both short read sequences are selected as well
as the proper reference genome. Your page should now look identical to Figure 7.
Figure 7. dna-hexagon Portal with Specified Inputs.
3. JOB PROCESSING
To start the job, click on the ALIGN button. This will refresh your page and present you with two new
boxes: the first titled HIVE-hexagon Alignment and the second titled Results.
The HIVE-hexagon Alignment box tracks the progress of your alignment. By clicking on the expand node
icon found in the top left side of this box, you can view the progress of the subcomponents of this
task.
The Results box will become populated as the alignment is carried out. The process status will change
from Waiting to Running to Done. The whole alignment is finished when all statuses read Done and the
progress bar shows 100% completion. The time elapsed clock will stop, but the run-time will
continue to be displayed. At this time, click the refresh button on the left side of this box to ensure
you have the complete results.
4. ALIGNMENT RESULTS
The Results area has two main sections: on the left is a directory of all the aligned reads and on the right
using the tabs you can view various information about each read. Your Results section should now look
like Figure 8.
Figure 8. Default Alignment Results View
By default your results will be presented in a list format. To view them in a pie chart format
instead, click on the piechart tab. You should now see a pie chart representation of your results (see
Figure 9).
In order to view detailed alignment information, you must select a particular reference genome. While
some genomes may consist of a single segment or sequence, the influenza genome has 8 segments that
are each presented within the genome file as a separate reference sequence. Continuing with the
tutorial, select (highlight by clicking) the first aligned genome with the id of 5 and the name NS.fa.
Figure 9. Pie Chart View of Alignment Hits
The tabular views to the right include the following:
alignments: This tab shows all alignments for the selected reference sequence in a triplet format
such that the top line contains information about the reference or consensus sequence, the bottom
line contains information about the query sequence and the middle line shows matches and
mismatches with the use of symbols (See Figure 10). The # column corresponds to the ids supplied
to the reference/consensus sequence or genome and to the read id supplied to the specific read
from the selected query sequence data. Similarly, Sequence contains the names given to the
reference sequence file and the read sequence file. The Repeats column displays the counts of any
exactly repeated reads. In the Direction column, (+) indicates alignment in the forward direction and
(-) indicates alignment in reverse. The Start column displays the numeric position on each sequence
where an alignment starts. In the middle line of the Alignment column, a pipe symbol | indicates a
match between the sequences whereas a dot . indicates a deletion, a – indicates an insertion and a space
indicates a mismatch. The End column displays the numeric position of each sequence where an
alignment ends.
Figure 10. alignments View
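The symbols in the middle line of the triplet can be reproduced from two aligned rows. The Python sketch below is illustrative only: the function name is hypothetical, gaps in the rows are written as ASCII hyphens, a gap in the query row is assumed to represent a deletion and a gap in the reference row an insertion.

```python
def middle_line(ref_row: str, query_row: str) -> str:
    """Build the match/mismatch line shown between reference and query."""
    symbols = []
    for r, q in zip(ref_row, query_row):
        if q == "-":
            symbols.append(".")   # deletion: reference base absent in read
        elif r == "-":
            symbols.append("-")   # insertion: extra base in the read
        elif r == q:
            symbols.append("|")   # match
        else:
            symbols.append(" ")   # mismatch
    return "".join(symbols)

ref_row = "ACGTTA-C"     # made-up aligned reference row
query_row = "ACCT-AGC"   # made-up aligned query row
print(ref_row)
print(middle_line(ref_row, query_row))
print(query_row)
```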
stack: This view is similar to the alignments view but only highlights the differences between the
reference and query sequence. Dots . represent matches, dashes – represent deletions and single
nucleotide polymorphisms are represented by the letter code (A,C,T or G) of the base found in the
read (See Figure 11). The mutation bias graph shows the distribution of base calls for a given
position along query reads.
Figure 11. stack Results View
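The stack notation can be sketched the same way: each read row is rewritten so that only differences from the reference remain visible. The function name and example rows below are hypothetical, and the sketch assumes the reference row contains no gaps, since the handling of insertions in the stack view is not described here.

```python
def stack_row(ref_row: str, query_row: str) -> str:
    """Show a read relative to the reference: '.' match, '-' deletion,
    and the read's own base letter where it differs from the reference."""
    out = []
    for r, q in zip(ref_row, query_row):
        if q == "-":
            out.append("-")   # deletion in the read
        elif r == q:
            out.append(".")   # same base as the reference
        else:
            out.append(q)     # differing base found in the read
    return "".join(out)

reference = "ACGTTAC"                    # made-up reference row
reads = ["ACCT-AC", "ACGTTAC", "ACCTTAC"]  # made-up aligned reads
print(reference)
for read in reads:
    print(stack_row(reference, read))
```

Printed under one another, the rows make a recurring difference (here the C in the third column) stand out as a vertical run of the same letter.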
hit table: Opening the hit table provides yet another way to visualize the alignment results with a
focus on scores and positional information. The table has thirteen columns, most of which have
already been discussed in the alignments or stack sections above. Columns not yet covered include:
Alignment number – assigns a unique id to each alignment; and Score – calculated as discussed
under the Alignment Costs parameter in Section 2.2 above.
downloads: Clicking the downloads tab will allow you to download all results views and tables as
a .csv file in addition to alignments in SAM format. Once a download is initiated by selecting the
desired data, processes will proceed as determined by your specific browser.
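A downloaded SAM file can be inspected outside HIVE with a few lines of Python. The sketch below counts aligned reads per reference sequence using only the standard SAM layout (tab-separated fields, header lines starting with @, reference name in the third column, unmapped flag 0x4). The function name and the example records are made up, and nothing is assumed about the optional tags HIVE may write.

```python
from collections import Counter

def count_hits_per_reference(sam_lines):
    """Count mapped alignment records per reference name in SAM text."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue                      # skip unmapped reads
        counts[rname] += 1
    return counts

# Made-up SAM records for illustration.
example = [
    "@SQ\tSN:NS\tLN:890",
    "read1\t0\tNS\t10\t255\t75M\t*\t0\t0\t*\t*",
    "read2\t16\tNS\t40\t255\t75M\t*\t0\t0\t*\t*",
    "read3\t0\tHA\t7\t255\t75M\t*\t0\t0\t*\t*",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = count_hits_per_reference(example)
print(counts)
```

To run this on a real download, pass an open file handle instead of the example list, for instance the result of open() on the saved file; the per-reference counts are the same kind of summary that the pie chart view presents.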
5. WHAT NEXT?
HIVE is built in a very modular way to allow stacking of different analytic tools end to end in a wide array
of configurations. Thus, following alignment via dna-hexagon, you have a few options of the next
analysis to perform on your data. On the top right of the Results box you should see text that says
what can you do next (see Figure 12).
If you are unhappy with your results you can click on Modify and Resubmit to edit your
parameters or inputs and try alignment again. To proceed to other analyses, you can hover over
Profiling Tools and select the option that is best for your research workflow. Currently available
downstream applications include Sequence Profiling, Reference Recombination and Population
Analysis.
Figure 12. Access to Downstream Tools
This concludes the HIVE dna-hexagon tutorial. Please see the other tutorials or the tutorial videos
available on HIVE main pages for further information.