HIVE dna-hexagon Tutorial

The purpose of this tutorial is to guide the user through the process of a single alignment using the HIVE dna-hexagon tool. All other variations on alignments using the tool employ this same basic process with modified inputs and parameters.

TABLE OF CONTENTS

Introduction
1. Selecting Inputs
1.1 Query Sequences (reads)
1.2 Reference Genome
2. Input Parameters
2.1 Alignment Algorithm
2.2 Algorithmic Parameters
3. Job Processing
4. Alignment Results
5. What next?

INTRODUCTION

The HIVE dna-hexagon sequence alignment tool allows the user to align reads from a high-throughput experiment to a reference genome. Both reads and references may be selected from those provided by HIVE or supplied by the user. (This tutorial assumes the query sequences are already in the system. For help loading new data into HIVE, please see the Tutorial for DMDownloader.) Once query sequences are selected, a number of summary and quality-check visualization tools are available. Numerous parameters allow customization of the alignment to suit the user’s needs. When the alignment is finished, the tool produces a hits table and a pie chart showing hits of the query sequences to genes on the reference genome, in addition to several other visualizations, all of which can be downloaded. The user may then compute SNP profiles or move on to a number of other analysis modules which can be stacked end-to-end with the dna-hexagon aligner.

For easier understanding, all text for HIVE system options, parameters or buttons is displayed in bolded blue. Any text the user should input is displayed in the Courier font. Equations are offset and italicized.

1. SELECTING INPUTS

The only two required inputs are the query sequences (reads) and the reference genome (genomes). While further customization is available via specification of parameters, all other inputs have defaults, so the alignment will proceed without the user entering any additional values.
1.1 Query Sequences (reads)

Once you are logged into HIVE and on the user home page, the second menu to the left in the header region reads HIVE-Portal (Figure 1). Hovering the mouse over this option opens a menu below it. Clicking the Sequence Alignment on Genome tool redirects you to the HIVE dna-hexagon alignment portal.

Figure 1. HIVE Home Directory for a Logged-In User

Once in the portal (Figure 2) you should see four input boxes: Reference Genomes, Short Reads, Alignment Algorithm and Algorithmic Parameters. The Short Reads box should be visible in the upper right of the window. To view all the short reads available to you, click on the expansion icon in the top left corner of this box. This will reveal the list (Figure 3) of all the reads you have uploaded, in addition to Demo Reads and other files shared with you by the HIVE team under the user name Biological Data Handler, or by another user.

Figure 2. dna-hexagon View

Figure 3. List of Short Reads Accessed from dna-hexagon Inputs Page

Note that the above representation is in a list view. At the bottom of the sequence reads window there is an option to toggle the organization between the hierarchy and list formats by clicking the associated icons. Each item in the sequence reads menu is preceded by a checkbox. To choose a file or series of files for alignment, check the box(es) next to the desired file(s). To choose all files in the directory, check the box at the very top of the chart next to the ID heading.

If at any point you find yourself without the desired sequence files or in the wrong place altogether, clicking the Home link at the top left of the page will always return you to your user home directory. For tutorial purposes, please click Home at this time. Any information loaded by or shared with you in HIVE will be found in this directory. For streamlined alignment access, you may also select short reads or genomes directly from your home directory.
All short reads available to you can be seen in the top page section labeled Files and Sequences, under the reads tab. Here you should see a list of files identical to that accessed from the dna-hexagon page. To proceed with the tutorial, please check the two Influenza paired reads shown in Figure 4 below.

Figure 4. Demo Input Sequences - Influenza Paired Reads

Notice that the data in this case are displayed as a pair. These data come from paired-end read experiments, such that one is the forward read and the other the reverse read of the template. Data can be selected for alignment only by checking the boxes. Simply clicking on a file name (selecting a file) will highlight the file and allow you to view details about it, but will not give you the option to perform any action (such as alignment) on it. This preview information can be viewed in the box to the right, in the preview and details tabs. For your convenience, the help tab containing topical help can also be accessed here.

Once you check the boxes next to this pair, operational icons should appear in the toolbar at the top of this list (Figure 5). Clicking on the dna-hexagon image in this toolbar will direct you again to the alignment tool. You should see that your short read sequences have already been selected and are listed in the Short Reads input box.

(NOTE: If the user wishes to upload new experimental sequence information into the system, click the add tab on the home page in the Files and Sequences box. This will redirect the user to the DMDownloader utility. Please see the DMDownloader tutorial for further details.)

With your query sequences selected, you are now ready to choose a reference genome.

Figure 5. HIVE Toolbar

1.2 Reference Genome

A reference genome is a comprehensive sequence of the entire genome of a given species, assembled by scientists to be representative of that species.
Reference genomes for several species have been made available to users, shared by the HIVE team under the user name Biological Data Handler. Selection of a reference genome should depend greatly on the experimental data being aligned. For example, if you think you have isolated a new strain of influenza and you want to see how it compares to other influenzas, you will want to select an influenza reference genome. Alternatively, if you are looking for SNPs in a disease-associated human protein, you will need to select a human reference genome. Perhaps you are looking for horizontal gene transfer between two species of bacteria, in which case you may want to align your sequences from one species to the reference genome of the other.

We have selected sequence data for Influenza. There may be several applicable genomes. In addition to picking the right species genome, you may need further information to choose the best possible genome for alignment. Continuing with the tutorial, we will now choose the influenza genome. Click on the expansion icon found in the upper left corner of the Reference Genomes box. Please check the box next to ID 5153, the file named Influenza_Segments.fa, to select it as our reference genome.

2. INPUT PARAMETERS

Below the Input Data box are two collapsible/expandable sections, Alignment Algorithm and Algorithmic Parameters, which allow for complete customization of many aspects of the alignment. All parameters are populated with default values, so a user is not required to specify any values in order to run the alignment; they need only click the Align button after selecting inputs. To expand any closed section, click the icon to the left of the section header. To hide an open section, click the icon to the left of the header of the expanded section.

2.1 Alignment Algorithm

Here you may select which alignment tool you would like to use to align your data.
Access the menu by clicking the expansion icon in the top left of the Alignment Algorithm box. Current options include: HIVE’s dna-hexagon, NCBI-BLAST, Bowtie, TopHat, Ace View Magic and BWA. dna-hexagon is the native, default alignment tool, developed and optimized for use within high-performance cloud computing environments, and is therefore best adapted to use the HIVE infrastructure efficiently to enhance analysis performance. The external tools have all been adapted to the parallel HIVE environment, so you should see increased performance of most of those tools when comparing their usage within HIVE to their usage as standalone tools. We recommend dna-hexagon for best results, so this tutorial uses dna-hexagon.

2.2 Algorithmic Parameters

The parameters shown by default on the portal page are those most often modified by users, including the name of the resulting output file.

Name: Specify a name for the alignment output file.

Minimum Match Length: Alignments produced will likely vary in length. This option allows the user to choose a preferred length of alignment by discarding all alignments shorter than the specified value. The default minimum length is set to 75.

Matches to Keep: Alignment results can be filtered by choosing one of three options. The reported matches/alignments can be returned as the Best Match or First Match, or as the list of All Matches, by selecting the preferred option from the drop-down menu. The First Match option can be useful when the user is simply interested in finding the genome to which a query sequence belongs, whereas the Best Match option ranks alignments according to alignment scores. Default is set to Best Match.

Percent Mismatches Allowed: A mismatch in an alignment occurs when the query and reference align at most points but differ in nucleotides at corresponding positions in the same region.
An alignment is not disregarded due to mismatches, but the alignment score reflects a penalty associated with mismatches. The mismatch percentage is equal to the total number of mismatches divided by the total number of positions, expressed as a percentage.

    Mismatch % = (total # mismatches / total # positions) × 100

This filter will remove alignments with a mismatch % greater than the specified value. Default value is set to 15, for 15%.

The remaining, less frequently used and therefore hidden parameters are located in the Algorithmic Parameters box, accessed by clicking the expansion icon in the top left corner of the box.

Advanced Parameters:

Alignment Directionality: This parameter allows the user to choose whether sequences will be aligned in a Forward or Bi-directional (forward and reverse) manner by selection from the drop-down menu. Default is Bi-directional.

Alignment Tail Extension: Extension can be performed in Conservative Compact or Aggressive Extended mode according to the selection from the drop-down menu. The Aggressive Extended option facilitates continued alignment of sequences until the threshold of mismatches encountered is reached, resulting in lower sensitivity but higher coverage. The Conservative Compact option terminates an alignment as soon as a higher score is obtained, providing higher sensitivity but lower coverage. The Conservative Compact mode is a good choice when working with small genomes that have a higher mutation rate. Default is set to Conservative Compact.

Allowed number of gaps during extension: The user can specify the maximum number of insertions and deletions allowed during the seed extension phase of alignment. Default is set to 3.

Intelligent Diagonal: HIVE can compute only the region surrounding the diagonal of the alignment scoring matrix, under the assumption that the best score should lie near the diagonal. If a user is uncertain, they can choose to compute the Complete matrix. Default is set to Diagonal of a given size.
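
To make the Intelligent Diagonal idea concrete, the following Python sketch computes a local alignment score either over the complete scoring matrix or only within a band around the diagonal. It is an illustration under stated assumptions, not HIVE's implementation: the function name, the single linear gap penalty and the scoring values are all chosen for the example.

```python
def local_score(query, ref, match=5, mismatch=-4, gap=-12, band=None):
    """Return (best local score, number of matrix cells computed).

    Hypothetical helper for illustration only. If band is None the
    complete matrix is filled; otherwise only cells with |i - j| <= band
    are computed and every other cell is treated as 0.
    """
    cols = len(ref) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    cells = 0
    for i in range(1, len(query) + 1):
        cur = [0] * cols
        lo = 1 if band is None else max(1, i - band)
        hi = cols - 1 if band is None else min(cols - 1, i + band)
        for j in range(lo, hi + 1):
            step = match if query[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0,                  # start a new local alignment
                         prev[j - 1] + step, # match or mismatch
                         prev[j] + gap,      # gap in the reference
                         cur[j - 1] + gap)   # gap in the query
            best = max(best, cur[j])
            cells += 1
        prev = cur
    return best, cells


full = local_score("ACGTACGTAC", "ACGTACGTAC")
banded = local_score("ACGTACGTAC", "ACGTACGTAC", band=2)
print("complete matrix:", full)
print("diagonal band:  ", banded)
```

In this toy case the banded run reaches the same best score while filling fewer cells. A band that is too narrow can miss an alignment lying far from the diagonal, which is the situation the Complete matrix option covers.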
Intelligent K-mer Jump in Read: Increasing the seed size in large genomes improves the alignment speed significantly at a potential cost of loss of sensitivity. The value 1 means no loss in sensitivity. The size of the seed is the maximum recommended value. If auto is chosen, it will be set to 1 for small genomes and 2 for large genomes. Default is auto.

Intelligent K-mer Jump in Reference: The same logic as for Intelligent K-mer Jump in Read above applies, but with respect to the reference genome instead of the short read. Default is auto.

K-mer Extension Minimal Length Percent: The minimal length of the extension must be equal to this percentage of the seed length. Default is set to 66, for 66%.

K-mer Extension Mismatch Allowance Percent: This parameter allows the user to define the percentage of mismatches allowed during extension of alignments. If 0 is entered, all positions are computed using the Smith-Waterman algorithm. Default is set to 25, for 25%.

Optimal Alignment Search: This option allows the user to choose whether to find the optimal alignment using the Smith-Waterman algorithm (default setting) or to quickly look up high-scoring hits by selecting Only Identities from the drop-down menu. The Smith-Waterman algorithm is a dynamic programming algorithm that employs a substitution matrix to yield the highest scoring local alignment. The time required by the algorithm to return the optimal alignment greatly increases with an increase in sequence length. If the user is not interested in optimal alignments and merely wants to identify high-scoring potential hits, the Only Identities option will produce results at a faster rate.

Over-represented K-mer Suppression: Some K-mers appear in data much more often than would be expected by random occurrence. This parameter allows the elimination of K-mer hits which are this percentage more abundant than expected. Default is set to 20, for 20%.
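
The effect of the Intelligent K-mer Jump parameters described above can be pictured with a short Python sketch. It assumes that the jump is the step between the start positions of consecutive seeds taken from a sequence; that reading, the function names and the example seed size are assumptions made for illustration and are not taken from the HIVE source.

```python
def seed_starts(sequence_length, seed_size, jump=1):
    """Start offsets of the seeds looked up for a given jump.

    Hypothetical helper: jump=1 takes every possible seed, larger
    values skip positions between consecutive seeds.
    """
    if seed_size < 1 or jump < 1:
        raise ValueError("seed_size and jump must be positive")
    return list(range(0, sequence_length - seed_size + 1, jump))


def seeds(sequence, seed_size, jump=1):
    """Extract the seed strings at the offsets chosen by seed_starts."""
    return [sequence[s:s + seed_size]
            for s in seed_starts(len(sequence), seed_size, jump)]


# Number of seed lookups for a 100-base read with an 11-base seed.
for jump in (1, 2, 11):
    print(jump, len(seed_starts(100, 11, jump)))
```

Under this assumption a jump of 1 looks up every seed, so nothing is skipped, while a jump equal to the seed size still covers the sequence with non-overlapping seeds, which fits the statement that the seed size is the maximum recommended value.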
Use Read Self Similarity: Selecting the Use self similarity optimization (default) option allows the aligner to index and count identical reads instead of aligning each identical read separately.

Width of Intelligent Diagonal: As mentioned in the Intelligent Diagonal description above, HIVE allows computation of a defined region around the diagonal of the matrix as opposed to computation of the entire matrix, including regions which are known to have no potential of containing the highest scoring path. Setting the parameter to auto will specify a width equal to the length of the seed.

Alignment Algorithm: Automatically populated by selection of the algorithm mentioned in Section 2.1 above.

Alignment Filters:

Repeat and Transposition Discovery: This parameter allows the user to perform more detailed multi-pass lookups in order to identify repeats and transpositions, or repeats only, by selection of the preferred option from the drop-down menu. Inclusion of this search comes at the expense of a longer run-time, so the default is set to exclude this search.

Score Filter: Each alignment produces a score based on matches, mismatches, insertions, deletions and gaps. Usually, a higher score is correlated with a better alignment. This option allows the user to eliminate alignments with scores below a specified threshold. The default setting of None will report all alignments and discard nothing.

Alignment Parameters:

Alignment Scope: Alignment of sequences across the reference genome can be done locally or globally by choosing the appropriate value from the drop-down menu. A Local alignment tries to find the best match in particular regions of the genome, whereas a Global alignment tries matching the entire sequence length to the reference genome (Figure 6). Global alignments are good for similarity and identity searches between the sequences. Local alignments are better suited for identifying conserved residues or domains.

Figure 6. Local vs. Global Alignments

Alignment Costs: Alignments are ranked by comparison of alignment scores, composed by the assignment of cost values to all matches, mismatches and gaps. Matches are rewarded; mismatches and gaps are penalized. In a biological scenario, single base insertions or deletions are less probable than larger insertions or deletions. For this reason, a gap opening is assigned a higher penalty than continuation of a gap while computing the alignment score. If gaps continue over multiple positions after the gap has already been opened, they are more likely the result of a true insertion or deletion with potential evolutionary significance. Thus, the default Gap open penalty has a more negative value than the default Gap next penalty. Default values for Match Benefit, Mismatch Penalty, Gap Opening Cost, Gap Continuation Cost and Mismatch Continuation Penalty are 5, -4, -12, -4 and -6, respectively.

Seed K-mer: A seed is a short fragment of the query sequence used to search for nucleotide similarity. Seed length determines the size of the K-mer to look up in a hash table. Higher seed values make the alignment faster but can decrease the sensitivity of the method. The default seed length is set at 11 letters, corresponding to the seed length employed by the BLAST algorithm. Alternative seed lengths pre-populate the drop-down menu.

Shorter Terminus Alignment: Alignment coverage near the end (terminus) of a reference segment is often less than ideal due to the Minimum Match Length threshold. For example, if the Minimum Match Length parameter is set to 30 but you have a read that exactly matches 20 base pairs up to the end of the segment, this exact match over the area that matters will be excluded because it is shorter than the specified threshold. This parameter essentially allows specification of a different match length threshold in the region of the terminus. The default of 0 means that the Minimum Match Length is used at the terminus as well.
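
The Shorter Terminus Alignment example above can be expressed as a small decision rule. The Python sketch below is a hedged illustration: the function name, the 0-based half-open coordinates and the choice to treat a match touching either end of the segment as a terminus match are assumptions, not details confirmed by HIVE.

```python
def keep_alignment(start, end, segment_length,
                   min_match_length=75, terminus_match_length=0):
    """Decide whether an alignment survives the length filters.

    Hypothetical helper. start and end are 0-based, half-open positions
    on the reference segment. terminus_match_length=0 means no separate
    threshold applies at the terminus.
    """
    length = end - start
    if length >= min_match_length:
        return True                      # long enough everywhere
    if terminus_match_length <= 0:
        return False                     # no relaxed terminus threshold
    touches_terminus = start == 0 or end == segment_length
    return touches_terminus and length >= terminus_match_length


# The case from the text: Minimum Match Length 30, a 20-base exact match
# running up to the end of a 1000-base segment.
print(keep_alignment(980, 1000, 1000, min_match_length=30))
print(keep_alignment(980, 1000, 1000, min_match_length=30,
                     terminus_match_length=15))
```

With the default of 0 the 20-base match is discarded, and with a terminus threshold of 15 it is kept, while an equally short match in the interior of the segment is still removed.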
Alignment Test Run Slice: This allows a user to conduct a test-run of parameters by specifying the size of a subset of sequences to run per computational thread. The default of all indicates that all sequences in the alignment will be run. To use the test-run feature, the user must specify a number of sequences to run on each thread.

Low complexity filter: Some subjects are well known to have regions or entire sequences of low complexity and a high degree of repeats. With these subjects in mind, HIVE allows the user to specify masking such that regions of both reads and references can be excluded from the alignment by defining the minimal entropy threshold and a window size of the sequence to be ignored.

Reference masking:

Minimal Shannon Entropy: The drop-down menu allows the user to define the Minimal Shannon Entropy of a region of the reference genome allowed to be included in an alignment. Default is 0 – permissive, which allows all regions to be included.

Window Size: The drop-down menu allows the user to specify the Window Size of the reference genome to ignore if the calculated entropy drops below the specified Minimal Shannon Entropy. Default of do not mask low complexity regions allows inclusion of all regions regardless of calculated entropy.

Short read filtration:

Minimal Shannon Entropy: The drop-down menu allows the user to define the Minimal Shannon Entropy of a region of any short read allowed to be included in an alignment. Default is 0 – permissive, which allows all regions to be included.

Window Size: The drop-down menu allows the user to specify the Window Size of all short reads to ignore if the calculated entropy drops below the specified Minimal Shannon Entropy. Default of do not mask low complexity regions allows inclusion of all regions regardless of calculated entropy.

Number of Computational Subjects per Single Thread: This parameter allows the user to specify the maximum number of sequences per thread to be aligned by a single compute node.
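
As an illustration of the low complexity filter described above, the following Python sketch computes the Shannon entropy of each window of a sequence and reports the windows that fall below a threshold. The function names, the use of single-base frequencies and the entropy unit (bits) are assumptions made for the example; the source does not specify how HIVE computes entropy.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy in bits of the base composition of seq."""
    total = len(seq)
    counts = Counter(seq)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, window, min_entropy):
    """Start offsets of windows whose entropy is below min_entropy.

    Hypothetical helper: slides a window of the given size one base at
    a time. A threshold of 0 flags nothing, matching a permissive setting.
    """
    return [start
            for start in range(0, len(seq) - window + 1)
            if shannon_entropy(seq[start:start + window]) < min_entropy]


sequence = "ACGTACGT" + "A" * 8 + "ACGT"
print(shannon_entropy("ACGT"))       # evenly mixed bases
print(shannon_entropy("AAAAAAAA"))   # a single repeated base
print(low_complexity_windows(sequence, window=8, min_entropy=1.0))
```

A run of a single base has entropy 0 and is flagged first, whereas a window with all four bases in equal proportion reaches the maximum of 2 bits.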
Query Data: Automatically populated by object IDs of selected input files.

Reference Genomes: Automatically populated by selected reference genome.

Reference genome serial number: Automatically populated by selected reference genome, if applicable.

For the purposes of our tutorial, we will leave all parameters set to the default values, but we will rename the output file “Demo_Align_1”. Make sure both short read sequences are selected as well as the proper reference genome. Your page should now look identical to Figure 7.

Figure 7. dna-hexagon Portal with Specified Inputs

3. JOB PROCESSING

To start the job, click on the ALIGN button. This will refresh your page and present you with two new boxes: the first is titled HIVE-hexagon Alignment and the second Results. The HIVE-hexagon Alignment box tracks the progress of your alignment. By clicking on the expand node icon found in the top left side of this box, you can view the progress of the subcomponents of this task. The Results box will become populated as the alignment is carried out. The process status will change from Waiting to Running to Done. The whole alignment is finished when all statuses read Done and the progress bar shows 100% completion. The time elapsed clock will stop, but the run-time will continue to be displayed. At this time, click the refresh button on the left side of this box to ensure you have the complete results.

4. ALIGNMENT RESULTS

The Results area has two main sections: on the left is a directory of all the aligned reads, and on the right you can use the tabs to view various information about each read. Your Results section should now look like Figure 8.

Figure 8. Default Alignment Results View

By default your results will be presented in a list format. To view them as a pie chart instead, click on the piechart tab. You should now see a pie chart representation of your results (See Figure 9).
In order to view detailed alignment information, you must select a particular reference genome. While some genomes may consist of a single segment or sequence, the influenza genome has 8 segments that are each presented within the genome file as a separate reference sequence. Continuing with the tutorial, select (highlight by clicking) the first aligned genome with the id of 5 and the name NS.fa.

Figure 9. Pie Chart View of Alignment Hits

The tabular views to the right include the following:

alignments: This tab shows all alignments for the selected reference sequence in a triplet format such that the top line contains information about the reference or consensus sequence, the bottom line contains information about the query sequence and the middle line shows matches and mismatches with the use of symbols (See Figure 10). The # column corresponds to the ids supplied to the reference/consensus sequence or genome and to the read id supplied to the specific read from the selected query sequence data. Similarly, Sequence contains the names given to the reference sequence file and the read sequence file. The Repeats column displays the counts of any exactly repeated reads. In the Direction column, (+) indicates alignment in the forward direction and (-) indicates alignment in reverse. The Start column displays the numeric position on each sequence where an alignment starts. In the middle line of the Alignment column, a pipe symbol | indicates a match between sequences, whereas a dot . indicates a deletion, a dash – indicates an insertion and a space indicates a mismatch. The End column displays the numeric position on each sequence where an alignment ends.

Figure 10. alignments View

stack: This view is similar to the alignments view but only highlights the differences between the reference and query sequence. Dots . represent matches, dashes – represent deletions and single nucleotide polymorphisms are represented by the letter code (A, C, T or G) of the base found in the read (See Figure 11). The mutation bias graph shows the distribution of base calls for a given position along query reads.

Figure 11. stack Results View

hit table: Opening the hit table provides yet another way to visualize the alignment results, with a focus on scores and positional information. The table has thirteen columns, most of which have already been discussed in the alignments or stack sections above. Columns not yet covered include Alignment number, which assigns a unique id to each alignment, and Score, which is calculated as discussed under Alignment Costs above.

downloads: Clicking the downloads tab will allow you to download all results views and tables as a .csv file, in addition to alignments in SAM format. Once a download is initiated by selecting the desired data, it will proceed as determined by your specific browser.

5. WHAT NEXT?

HIVE is built in a very modular way to allow stacking of different analytic tools end to end in a wide array of configurations. Thus, following alignment via dna-hexagon, you have a few options for the next analysis to perform on your data. On the top right of the Results box you should see text that says what can you do next (See Figure 12). If you are unhappy with your results, you can click on Modify and Resubmit to edit your parameters or inputs and try the alignment again. To proceed to other analyses, you can hover over Profiling Tools and select the option that is best for your research workflow. Currently available downstream applications include Sequence Profiling, Reference Recombination and Population Analysis.

Figure 12. Access to Downstream Tools

This concludes the HIVE dna-hexagon tutorial. Please see the other tutorials or the tutorial videos available on the HIVE main pages for further information.