Background
Three Foundations of Modern Cloud Services
● Document Stores
● Key-value Stores
● Multi-attribute NoSQL
Important disadvantage of existing data stores
● While existing data stores provide efficient abstractions for storing and retrieving
data using primary keys, interactive queries on values (i.e., secondary attributes)
remain a challenge.
Background
Weaknesses of Existing Data Stores in Supporting Queries on Secondary
Attributes
➢ Scan Data
➢ e.g. column-oriented stores
○ High latency for large data sizes
○ Limited throughput, since queries typically touch all machines
➢ Index Data
➢ Two Advantages Compared to Scan Data
○ Stored in memory, so it is faster
○ High throughput
○ Main Disadvantage
■ High memory footprint
Background
Examples of Existing Scan Data & Index Data
● Scan Data
○ Store a bit-vector A[1..n]
○ Answer partial-sum queries:
■ RANK(k): asks for ∑_{i=1}^{k} A[i]
■ SELECT(k): asks for the index of the k-th one in the array
This means we store a summary of every block of t bits, and a query spends time
proportional to t. Thus this is a Scan Data method: it uses less space, but it is
slow.
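The summary-per-block idea above can be sketched in a few lines (the block size t and all names here are illustrative, not from the paper):

```python
# Sketch: rank with one summary counter per t-bit block.
# Space overhead is n/t counters; a query scans at most t bits.

T = 8  # block size t (assumed toy value)

def build_summaries(bits, t=T):
    """Store the number of ones strictly before each t-bit block."""
    sums, total = [], 0
    for i in range(0, len(bits), t):
        sums.append(total)
        total += sum(bits[i:i + t])
    return sums

def rank(bits, sums, k, t=T):
    """Ones in bits[0..k-1]: one summary lookup + a scan of <= t bits."""
    block = k // t
    return sums[block] + sum(bits[block * t:k])

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
sums = build_summaries(bits)
assert rank(bits, sums, 9) == sum(bits[:9])
```

Smaller t means faster queries but more summary space; that linear trade-off is exactly the imperfection discussed later.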
Background
Examples of Existing Scan Data & Index Data (ctd)
● Index Data
○ Want to represent an array A[1..n] of “trits” (A[i] ∈ {0, 1, 2})
○ Support fast access to a single element A[i]
○ We can encode the entire array as a number in {0, . . . , 3^n − 1}
○ However, we cannot decode one trit without decoding the whole array
This means the Index Data method gains speed, but takes more space.
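The decoding problem above can be seen in a tiny sketch of the arithmetic encoding (illustrative code, not the paper's implementation):

```python
# Sketch: encode a trit array as one number in {0, ..., 3^n - 1}.
# Space is optimal, but reading any single A[i] forces a full decode.

def encode(trits):
    num = 0
    for t in reversed(trits):   # A[0] becomes the least-significant digit
        num = num * 3 + t
    return num

def decode(num, n):
    out = []
    for _ in range(n):          # must peel off every digit to reach any one
        num, r = divmod(num, 3)
        out.append(r)
    return out

a = [1, 0, 2, 2, 1, 0]
assert decode(encode(a), len(a)) == a
```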
Background
Why Succinct Data Structure?
● Imperfections of the Previous Examples
○ The problem behind these examples is the linear trade-off between redundancy
and query time
○ Existing data stores either resort to complex memory-management
techniques for identifying and caching “hot” data (Index Data)
○ Or simply execute queries off SSD (i.e., from flash rather than memory)
○ Thus, we want a new data structure that gains efficiency and uses less space
Introduction
Succinct Data Structure
● Advantages Compared to Scan and Index Data
○ Achieves memory efficiency close to Data Scans
■ Does that mean lower throughput than indexes?
○ No, because of the low memory footprint
■ It can store more data in memory
○ So it avoids both high latency and low throughput
Introduction
Two Key Ideas of Succinct
● Entropy-compressed representation
○ Allows random access
■ Enabling efficient storage and retrieval of data
■ Supports count, search, range and wildcard queries without storing
indexes
● Executes queries directly on the compressed representation
○ No Data Scans needed
○ No decompression needed
Introduction
● Definition: A Succinct Data Structure for given data is a representation of the
underlying combinatorial object that uses an amount of space “close” to the
information-theoretic lower bound, together with efficient algorithms for
navigation, search, insertion and deletion operations.
● Goal: Construct data structures that use space equal to the information theoretic
minimum plus some redundancy.
Introduction
Small Space
● Most Succinct Data Structures are static; few are dynamic
● Goal: to get as close to the information-theoretic optimum as possible
Three Senses of Small Space
● Implicit Data Structure
○ Space = information-theoretic OPT + O(1) bits
○ The additive constant is essential
■ If OPT is fractional, round up, using O(1) extra bits
■ Essentially all we can do is store some permutation of the data
Introduction
● Examples:
○ Heap --- Implicit Dynamic Data Structure
○ Sorted Array --- Implicit Static Data Structure
● Succinct Data Structure
○ Space = OPT + o(OPT) bits
■ Key: the leading constant in front of OPT is 1
○ The most common type of space-efficient Data Structure
Introduction
Compact Data Structure
● Space = O(OPT) bits
○ Note: some “linear space” data structures are NOT actually compact
■ They use O(w·OPT) bits, where w is the machine word size
○ So a Compact Data Structure saves at least a factor of O(w) over such
“linear space” structures
■ e.g. suffix trees:
● Use O(n) words of space
● The information-theoretic lower bound is O(n) bits
Introduction
● Conclusion: the second notion, the Succinct Data Structure, is our usual goal,
because “implicit” is very hard to achieve, and “compact” is a warm-up stage on
the way to a Succinct Data Structure
● More examples of these three different Data Structures
Introduction
● Implicit Dynamic Search Tree
○ Static search tree: can be stored in an array with lg n time per search
■ Inserts and deletes make it tricky
■ Old result: lg² n time
● Using pointers and permutations of bits
○ Implicit Data Structure
■ Franceschini and Grossi, 2003
■ Supports inserts, deletes and predecessor queries
■ O(log n) worst-case time
Introduction
● Succinct Binary Tries
○ Motivation: fit the Oxford dictionary onto a CD
○ The number of possible binary tries with n nodes is the n-th Catalan number
■ This can be derived from a recursion based on the sizes of the
left and right subtrees of the root
■ 2n + o(n) bits of space --- Jimmy’s part
■ Able to find the left child, the right child and the parent in O(1) time
■ Subtree size --- can be supported by keeping track of the rank of the node we are at
Succinct Binary Tries
● Level Order Representation of Binary Tries
Succinct Binary Tries
● Theorem 1. In the external node formulation, the left and right children
of the ith internal node are at positions 2i and 2i+1
Succinct Binary Tries
● Rank and Select
○ rank(i) = number of 1’s at or before position i
○ select(j) = position of jth one
○ left-child(i) = 2rank(i)
○ right-child(i) = 2rank(i) + 1
○ parent(i) = select(⌊i/2⌋)
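The navigation formulas can be sketched on a toy level-order bitmap (the parent formula select(⌊i/2⌋) is the standard one for this representation and is assumed here; rank/select are naive O(n) scans, where the real structure answers them in O(1)):

```python
# Toy level-order bitmap: 1 = internal node, 0 = external node,
# positions 1-indexed. Example tree assumed: internal nodes at 1, 2, 3, 5.

bits = [1, 1, 1, 0, 1, 0, 0, 0, 0]

def rank(i):
    """Number of 1's at or before position i."""
    return sum(bits[:i])

def select(j):
    """Position of the j-th one."""
    count = 0
    for pos, b in enumerate(bits, start=1):
        count += b
        if count == j:
            return pos

def left_child(i):  return 2 * rank(i)
def right_child(i): return 2 * rank(i) + 1
def parent(i):      return select(i // 2)   # assumed: select(floor(i/2))

assert (left_child(1), right_child(1)) == (2, 3)
assert parent(9) == 5
```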
Succinct Binary Tries
● Rank
○ step 1: build a lookup table for bit strings of length ≤ ½ lg n
○ step 2: split the n-bit string into chunks of lg² n bits
○ step 3: split each chunk into sub-chunks of ½ lg n bits
○ step 4: Rank = rank of chunk + relative rank of sub-chunk within chunk +
relative rank of element within sub-chunk
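The four steps above can be sketched with toy parameters (the theory uses chunks of lg² n bits and sub-chunks of ½ lg n bits; the values of C and S here, and the assumption that n divides evenly, are simplifications for illustration):

```python
# Sketch of three-level rank: chunk summaries + sub-chunk summaries +
# a lookup table over all S-bit strings. n is assumed a multiple of C and S.

C, S = 8, 4  # toy chunk / sub-chunk sizes

def build(bits):
    chunk_rank, sub_rank = [], []
    total = 0
    for c in range(0, len(bits), C):
        chunk_rank.append(total)        # ones before this chunk
        rel = 0
        for s in range(c, c + C, S):
            sub_rank.append(rel)        # ones before this sub-chunk, in-chunk
            rel += sum(bits[s:s + S])
        total += rel
    # lookup table: popcount of every prefix of every S-bit string
    table = {}
    for v in range(2 ** S):
        word = tuple((v >> (S - 1 - k)) & 1 for k in range(S))
        for p in range(S + 1):
            table[(word, p)] = sum(word[:p])
    return chunk_rank, sub_rank, table

def rank(bits, chunk_rank, sub_rank, table, i):
    """Ones in bits[0..i-1] = chunk summary + sub-chunk summary + table."""
    sub = i // S
    word = tuple(bits[sub * S:sub * S + S])
    return chunk_rank[i // C] + sub_rank[sub] + table[(word, i % S)]

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]
chunk_rank, sub_rank, table = build(bits)
assert rank(bits, chunk_rank, sub_rank, table, 11) == 7
```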
Succinct Binary Tries
● Select
○ step 1: pick every (lg n · lg lg n)-th 1 to be a special one
○ step 2: restrict to a single chunk
○ step 3: repeat steps 1 and 2 within the chunk
○ step 4: use a lookup table for bit strings of length ≤ ½ lg n
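For contrast with the O(1) sampling scheme in the steps above, here is a naive select by binary search over rank (illustrative only; this is O(lg n) per query, and the special-ones sampling is what brings it to O(1) with o(n) extra space):

```python
# Sketch: select answered by binary search over a naive rank.

def select(bits, j):
    """Smallest 1-indexed position p whose prefix contains j ones."""
    lo, hi = 1, len(bits)
    while lo < hi:
        mid = (lo + hi) // 2
        if sum(bits[:mid]) < j:   # rank(mid), computed naively here
            lo = mid + 1
        else:
            hi = mid
    return lo

bits = [0, 1, 0, 0, 1, 1, 0, 1]
assert select(bits, 3) == 6
```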
Applications and Implementations
● Storing Trits (ternary values).
● Assembling Large Genomes
● Enabling Queries on Compressed Data
● Succinct Representation of Binary Trees
● ...
Storing Trits
Decimal   Binary   Ternary
1         1        1
2         10       2
3         11       10
4         100      11
5         101      12
6         110      20
7         111      21
8         1000     22
9         1001     100
Trits = [102210120201012102000210111000202210]
● Three possible methods
Storing Trits
● Three Effective Methods:
1. Naïve method: store each trit using two bits. Fast, but uses more space.
2. Arithmetic method: space is reduced, but it is time-consuming.
3. Succinct Data Structure: n·log₂(3) bits of space and O(t) random access time,
where t is the depth of the data structure.
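A rough back-of-the-envelope comparison of the space of methods 1 and 2 (illustrative numbers only; the succinct structure pays a small redundancy on top of the optimum to buy O(t) random access):

```python
# Space in bits for n trits under the naive and the optimal encodings.
import math

def naive_bits(n):
    return 2 * n                        # two bits per trit

def optimal_bits(n):
    return math.ceil(n * math.log2(3))  # information-theoretic minimum

n = 1_000_000
# naive stores 2n bits; the optimum n*log2(3) is about 21% smaller
print(naive_bits(n), optimal_bits(n))
```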
Implementation
Raw Trits → Encoding → Binary File → Decoding → Recovered Trits
Encoding
1. To encode, we start with a fixed-size array
of trits with pseudorandomly
assigned values.
2. Given the size and base, the headers of
all levels are constructible.
3. Encode recursively, starting from the
bottom level 0 and continuing until
the level equals LEVELS.
Input Trits Array → Construct Headers → Recursively Encode
Encoding
● Construct Headers:
1. Succincter breaks entries into
blocks; each block is treated as
one large number, base K.
2. K = 3 at Level 0.
3. Find an n to serve as the chunksize
for the new base; M = 32.
Recursively Encoding
1. Take each block with chunksize n as
one large number.
2. Divide by 2^M, store the remainder.
3. Pass the quotient to a new array for
the next level.
4. Repeat for each block at this level.
5. For the last level, figure out how
many bits are needed to store a number
base K, and store each number using
that many bits.
6. Binary File created!
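One step of the recursion above can be sketched as follows (K, the chunksize, and M are toy values here; the slides use M = 32):

```python
# Sketch of one recursive encoding level: each block becomes one large
# base-K number, split into an M-bit remainder (stored) and a quotient
# (one digit of the next level's array).

M = 8   # remainder width in bits (toy value; slides use 32)

def encode_level(digits, K, chunk):
    remainders, quotients = [], []
    for b in range(0, len(digits), chunk):
        num = 0
        for d in digits[b:b + chunk]:   # block as one large base-K number
            num = num * K + d
        q, r = divmod(num, 2 ** M)      # divide by 2^M, keep the remainder
        remainders.append(r)
        quotients.append(q)             # passed up as the next level's array
    return remainders, quotients

rem, quo = encode_level([1, 0, 2, 2, 1, 0, 2, 1, 0, 1, 2, 0], K=3, chunk=6)
assert rem == [62, 70] and quo == [1, 2]
```

With K = 3 and chunk = 6, each block is a number below 3^6 = 729, so the quotients fit in the new base ⌈729/2^M⌉ for the next level.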
Binary File
● [SIZE, LEVELS, M]
● Headers (Chunksize, K, Size) for each level.
● Remainders for each level.
● Numbers for the last level, each stored in a fixed number of bits.
Decoding (Reverse Encoding)
Decode All:
● Extract data from Binary File:
Read [SIZE, LEVELS, M]; Headers (Chunksize, K, Size);
Remainders for each level; Numbers for last level.
● Calculate the number I’ = I·2^M + Remainder
● Divide I’ by the base K, and place the remainders as the new
array for the level below.
● Print the recovered array at Level 0.
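The Decode All step can be sketched as the inverse of a toy encoder (M = 8 here instead of the slides' 32; K and the chunksize are illustrative):

```python
# Sketch of one decoding level: rebuild each block number as I' = I*2^M + r,
# then peel base-K digits back out of it.

M = 8

def decode_level(remainders, quotients, K, chunk):
    digits = []
    for q, r in zip(quotients, remainders):
        num = q * 2 ** M + r           # I' = I * 2^M + Remainder
        block = []
        for _ in range(chunk):
            num, d = divmod(num, K)
            block.append(d)            # digits emerge least-significant first
        digits.extend(reversed(block))
    return digits

# inverts the matching toy encoder: remainders [62, 70], quotients [1, 2]
assert decode_level([62, 70], [1, 2], K=3, chunk=6) == [1, 0, 2, 2, 1, 0, 2, 1, 0, 1, 2, 0]
```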
Decoding (Reverse Encoding)
Decode One:
● Extract data from Binary File:
● Build the path based on the index.
● Same as Decode All.
Storing Trits Demo
● Encoding:
○ Generate a random ‘trits.txt’.
○ Store the trits in the binary file ‘encoded.bin’.
● Decoding All:
○ Recover the trits ‘decoded.txt’.
● Decoding One:
○ Give an index in original trits.
○ Print the element in that position in trits.
Results and Analysis
● Levels vs Encoding time, Decoding time and Encoding Size
Results and Analysis
● Comparison of three methods
Assembling Large Genomes
● Motivation:
○ Second-generation sequencing makes it feasible for researchers
to obtain enough sequence reads to attempt genome
assembly.
● Two Concerns: Computational Complexity and In-practice Memory.
● Approach:
○ Use a Succinct Data Structure to create the de Bruijn Graph (DBG),
which requires at least a factor of 10 less storage.
Genome Assembling
● What is Genome Assembling?
Sample sequence showing how a sequence assembler would take
fragments and match by overlaps.
de Bruijn Graph (DBG) Background
Successor and Predecessor:
Genome de Bruijn Assembly Graph
An example of a genome sequence and its de Bruijn Assembly Graph.
How do we store this huge graph?
Approaches
● Simply represent the nodes as ordinary records (C/C++ structs),
and the edges as pointers between them:
○ For the human genome with k=25, there are about 4.8 billion nodes; the
graph requires 250GB of storage!
● Hashtable:
○ Avoids pointers, but the load factor is crucial (no further discussion)
● Bitmap:
○ No need to store nodes explicitly, since they are readily inferred
from the edges; error edges can be removed.
Approach Bitmap
● Create a bitmap with one bit per possible edge in the de Bruijn graph, and set the
bits to 1 for the edges that occur in the assembly graph.
● The scheme depends on being able to enumerate the k-mers (i.e., the
edges).
● This is done by enumerating the bases: A=0, C=1, G=2, T=3.
Approach Bitmap
Given such a bitmap, we can determine the successor set of a given node by
probing the positions corresponding to the four edges that could proceed from the
node. For a node corresponding to a k-mer n, the four positions in the bitmap are
4n, 4n+1, 4n+2 and 4n+3.
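A small sketch of this probing scheme (toy k = 3; the read, names, and bitmap layout are illustrative):

```python
# Sketch: k-mers encoded as base-4 integers (A=0, C=1, G=2, T=3);
# successors of a node found by probing bitmap positions 4n .. 4n+3.

BASES = "ACGT"

def kmer_to_int(kmer):
    n = 0
    for b in kmer:
        n = n * 4 + BASES.index(b)
    return n

def successors(bitmap, node, k):
    """Successor nodes of `node`, probing 4n, 4n+1, 4n+2, 4n+3."""
    out = []
    for c in range(4):
        edge = 4 * node + c
        if bitmap[edge]:
            out.append(edge % 4 ** (k - 1))   # shift in base c, drop oldest
    return out

k = 3
bitmap = [0] * 4 ** k
for kmer in ("ACG", "CGT"):                   # edges from the read "ACGT"
    bitmap[kmer_to_int(kmer)] = 1

assert successors(bitmap, kmer_to_int("AC"), k) == [kmer_to_int("CG")]
```

Setting a bit for every observed k-mer is all the graph storage needed: 4^k bits in total, with no pointers and no explicit node records.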
Results
References:
https://people.csail.mit.edu/mip/papers/succinct/succinct.pdf
https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-agarwal.pdf
https://www.wisdomjobs.com/e-university/data-structures-tutorial-290/succinctdata-structure-7117.html
https://courses.csail.mit.edu/6.851/spring12/scribe/L17.pdf
https://arxiv.org/pdf/1008.2555.pdf
Q&A
Thank you!