view as pdf - High-performance Integrated Virtual Environment (HIVE)

HIVE Algorithm for Next-Generation Sequencing
Analysis
E
V
H I
Alin Voskanian1, Luis Santana-Quintero1, Raja Mazumder2, Jean et Danielle Thierry-Mieg3,
Hayley Dingerdissen1,2, Vahan Simonyan1
1Food
and Drug Administration, Center for Biologics Evaluation and Research, Rockville, MD, 20852
2Department of Biochemistry and Molecular Biology, The George Washington University, Washington, DC, 20037
3National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894
Short reads
Reference sequences
Short read sequence (size S)
Reference sequence size R
(C) serialization
(D) bloom/hash table compilation
(E)K-mer lookup
(A) non-redundification
1
0
1
0
1
1
0
1
1
1
1
Bloom
seed occurrence position buckets
1
1
0
0=AAA . . . AA
1=AAA . . . AC
2=AAA . . . AT
12,23,45,...
24,322,3432,...
1
1
n-1=CAA . . . AA
n=TTT . . . TT
356,3434,23328,...
123,3782,12332,...
(D) Bloom and hash tables
ATGCtCGAGACGTAT-CAG
(F) fuzzy extension
ATGCaCGAG.CGTATGCAG
||||.||||-|||||-|||
ATGCaCGAGACGTAT-CAG
ATGCaCGAG.CGTATGCAG
(C) Score along alignment
position
(B) Squared matrix
(A) Rectangular matrix
(G) optimal
alignment search
K-mer
Optimal alignment search optimization schema
Workflow for HIVE-hexagon alignment utility
Overall alignment schema for dna-hexagon: short reads are non-redundified
(A) and parallel portions (B) are sent to distributed cloud for computation.
Reference genomes are then split into smaller pieces (C) and compiled
into bloom/hash table (D). Every parallel execution thread performs a
K-mer lookup against every reference sequence (E) then extends
matches via inexact alignments (F) and performs a subsequent
optimal alignment search on remaining candidates (G).
(A) Dynamic programming matrix Needleman-Wunsch or Smith-Waterman algorithms
use a two dimensional rectangular matrix where cells represent the cumulative
score of the alignment between short read (horizontal) and reference
genome(vertical). (B,C) The fuzzy extension algorithm allows accurate
definition of the alignment frame and squeezes the window where the
high scoring alignments might be discovered. (D) dna-hexagon
maintains a bloom lookup table where each K-mer is represented
only by a single bit signifying the presence or absence of that
K-mer and 2-na hash table where a sequence’s
Root
binary numeric representation is used as an index.
A
A
G
Short read sequence
(D) Armenian Cheese Diagonal
(E) Final Dynamic Matrix
Dynamic programing matrix linearization
HIVE-hexagon implements a floating diagonal approach where
the diagonal of the computation is maintained along the two
sides of the current highest scoring path of the matrix. (A) We
assume the optimal alignment will be along the diagonal. (B)
Multiple insertions or deletions can result in the optimal path
traversing outside the defined diagonal belt. (C) By defining
a constant width to the diagonal and pre-computing cell
scores line by line, the limits of the remaining diagonal
matrix can float along with the likely optimal
scoring path. (D) Furthermore, pre-computation
of cells in the diagonal line by line allows the
exclusion of regions which cannot possibly
Pool of optimal alignment candidates
contain the highest scoring path. (E)
In this example, the first sequence CATAGTGGACACTG has
The resultant minimized, dynamic
generated 5 possible hit candidate positions (blue arrows), but
diagonal matrix is
linearized to further simplify the only three of those have failed to extend or align long enough (red
“X”). The next sequence, CATAGTGGATAATA, having a significant
process and minimize the
length of prefix similar to the prior one, does not need to consider all
memory footprint
candidate positions, specifically those where the extension attempt
required.
failed with a wide margin of error.
AAAAAA . . . AAA
AAAAAA . . . AAC
Candidate positions found by seeds for the first sequence
(blue) and next sequence (red)
CATAGTGGACACTG
CATAGTGGATAATA
.
. .
1.
2.
3.
4.
5.
6.
7.
AGTAC
TAGC
CCGGA
AGAC
AGACT
GTAGA
GTCTCA
A
C
P=4
L=4
T
P=5
L=5
T
A
C
P=1
L=5
G
P=3
L=5
T
G
T
A P=6
G L=5
A
P=6
L=2
T P=2
A L=4
G
C
C P=7
T L=6
C
A
Sorted order = {4, 5, 1, 3, 6, 7, 2}
Non-redundification: removal of identical reads
This is a representation of 6 sequences in a tree. The tree is
composed by linking all sequences’ tails (suffixes) to the parent
nodes which represent the longest prefixes common to each
branch. It contains all the information of the 6 sequences,
including position and length, at each node. Once the tree is
completed, we can traverse the tree in a left maze order
(always taking a left path) to obtain the sorted list of elements
as: S = {4,5,1,3,6,7,2}, which refers to the sorted list AGAC,
AGACT, AGTAC, CCGGA, GTAGA, GTCTCA, TAGC.
Motivation:
Modern next-generation sequencing platforms generate a
large number of short reads mapped to the same genomic
position with low error rate, resulting in a high degree of
redundancy.
Approach:
Innovative usage of redundancy identification and self-similarity between
reads allows minimization of the memory footprint by removing repeats and by
allowing a higher rate of vertical compression. Novel prefix-tree algorithms used to
discover self-similarity decrease the number of individual reads in the alignment thereby
accelerating the alignment process. Modifications to the optimal alignment searching
Smith-Waterman allow the diagonal to float together with the running high scoring path, thus
taking full advantage of the dynamic matrices diagonalization method. A refined approach to
conventional heuristic seeding involving high levels of parallelization minimizes data transfer and
removes most of the I/O bottlenecks.
Accuracy and Reproducibility:
TTTTTT . . . TTG
TTTTTT . . . TTT
X
X
X
dna-hexagon mapping is up to 99% accurate and is
capable of reproducing results on par with other
industry-standard tools.
Speed:
Reference
sequence
Similarity measure
shown in gray
C
C
G
G
A
Pos= 1
Len = 2
Sequences in
the order of
appearance:
NON-REDUNDIFICATION
(A)Diagonal optimization (B) Multiple In-Del alignment problem (C) Floating Diagonal
OPTIMAL ALIGNMENT
Reference sequence
X
The High-performance Integrated Virtual
Environment (HIVE) system, a cloud-based
environment optimized for the storage and
analysis of extra-large data, presents a novel
algorithmic solution in the form of the dnahexagon DNA-seq aligner. dna-hexagon
implements a number of novel algorithmic
paradigms to take full advantage of the
nature of the underlying sequence data and
to exploit CPU, RAM and I/O architecture to
quickly compute accurate alignments in a
massively parallel execution environment.
C
Seed hit extension sizes shown as white boxes on a
reference frame
HIVE speed is competitive with other industry-standard
systems. The dna-hexagon aligner can align 31GB of
Human mRNA_seq reads to a 45MB genome in 23
minutes.
ACKNOWLEDGEMENTS
Konstantin Chumakov, PhD, Associate Director for Research, Office of Vaccines Research and Review, FDA CBER
Carolyn A. Wilson, PhD, Associate Director for Research, Office of the Center Director, FDA CBER
Implementation: blueHIVE technology group [email protected] (Mazumder and Simonyan groups)

Download Report

view as pdf - High-performance Integrated Virtual Environment (HIVE)

Paperzz.com

Your Paperzz