
GENOMIC DATA COMPRESSION AND PROCESSING:
THEORY, MODELS, ALGORITHMS, AND EXPERIMENTS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Idoia Ochoa-Alvarez
August 2016
© 2016 by Idoia Ochoa-Alvarez. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/st247bt3117
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Tsachy Weissman, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrea Goldsmith
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
David Tse
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Abstract
Recently, there has been growing interest in genome sequencing, driven by advancements
in sequencing technology. Although early sequencing technologies required several
years to capture a 3 billion nucleotide genome, genomes as large as 22 billion nucleotides
are now being sequenced within days using next-generation sequencing technologies. Further, the cost of sequencing a whole human genome has dropped from billions of dollars to
merely $1000 within the past 15 years.
These developments in efficiency and affordability have allowed many to envision
whole-genome sequencing as an invaluable tool to be used in both personalized medical
care and public health. As a result, increasingly large and ubiquitous genomic datasets are
being generated. These datasets need to be stored, transmitted, and analyzed, which poses
significant challenges.
In the first part of the thesis, we investigate methods and algorithms to ease the storage
and distribution of these data sets. In particular, we present lossless compression schemes
tailored to the raw sequencing data, which significantly decrease the size of the files, allowing for both storage and transmission savings. In addition, we show that lossy compression
can be applied to some of the genomic data, boosting the compression performance beyond the lossless limit while maintaining similar – and sometimes superior – performance
in downstream analyses. These results are possible due to the inherent noise present in
the genomic data. However, lossy compressors are not explicitly designed to reduce the
noise present in the data. With that in mind, we introduce a denoising scheme tailored to
these data, and demonstrate that it can result in better inference. Moreover, we show that
reducing the noise leads to smaller entropy, and thus a significant boost in compression is
also achieved.
In the second part of the thesis, we investigate methods to facilitate the access to genomic data on databases. Specifically, we study the problem of compressing a database
so that similarity queries can still be performed in the compressed domain. Compressing
the database allows it to be replicated in several locations, thus providing easier and faster
access to the data, and reducing the time needed to execute a query.
Acknowledgments
My life at Stanford during these years wouldn’t have been the same without several remarkable people who have accompanied me along the way.
First and foremost, I am deeply grateful to my advisor Tsachy Weissman, for believing
in me and letting me pursue research under his guidance. He has been all I could hope for
and more, both from an academic and personal perspective. It has been a privilege being
his student, and he will always be a role model for me.
I am very thankful to Andrea Goldsmith for her support, her energy for teaching, and for
serving as a reader of my thesis and a committee member of my oral defense. I would also
like to thank David Tse, for many interesting discussions on genomic related problems,
and for serving as a reader of my thesis and a committee member of my oral defense.
Special thanks to Ayfer Özgür, for her kindness and interesting discussions throughout
the years. I am also thankful to Olivier Gevaert, for fruitful collaborations, guidance, and
for serving as a committee member of my oral defense. I would also like to thank Euan
Ashley, for helpful discussions and collaborations. Thanks also to Golan Yona, for being
my co-advisor and sharing his knowledge in genomics, and to Anshul Kundaje, for serving
as the chair in my oral defense. I would also like to dedicate some lines to the late Tom
Cover, a one-of-a-kind professor that I was lucky to interact with for more than a year. I
still remember with nostalgia his weekly group meetings where we would solve numerous
math puzzles. Last but not least, I would like to thank my advisor from Spain, Pedro Crespo, who initiated me into the world of research, and who gave me the initial idea and motivation to pursue a PhD in the States.
I am happy to acknowledge my collaborators, as well as the fellow students with whom I have
enjoyed many interesting interactions: Mikel, Alexandros, Kartik, Mainak, Bobbie, Albert,
Kedar, Heyji, Gowtham, Himanshu, Amir, Rachel, Shirin, Dinesh, Greg, Milind, Khartik,
Tom, Yair, Dmitri, Irena, Vinith, Jiantao, Yanjun, Peng, and Ritesh.
I would also like to thank Amy, Teresa, Meo, and especially Doug, for their amazing administrative work and for making my life at Stanford easier. Also to Angelica, for her
happiness.
I gratefully acknowledge financial support from Fundación La Caixa, which supported
me during my first two years of PhD, the Basque Government, the Stanford Graduate Fellowships Program in Science and Engineering, and the Center for Science of Information
(CSoI).
On a more personal note, I would like to thank Alexandros, Kartik, and Bobbie, with
whom I have shared innumerable unforgettable moments; Mainak and Nima, who have
always let me into their office, even if it was to distract them; the donostiarrak Ane and
Ivan, for making me feel closer to home; Nadine, Carlos, Adrian, Felix, Yiannis, Sotiria,
Borja, Gemma, Christina, and Dani, for memorable moments and for being my family
abroad.
I am deeply grateful to my parents and my sister, for their love throughout all my life,
and for making this possible.
I cannot end without thanking Mikel, to whom I owe everything, including a beautiful
daughter. This thesis is dedicated to them.
To Mikel and Naroa
Contents

Abstract                                                                  iv

Acknowledgments                                                           vi

1 Introduction                                                             1
  1.1 Easing the storage and distribution of genomic data . . . . . . . .  3
  1.2 Facilitating access to the genomic data . . . . . . . . . . . . . . 11
  1.3 Outline and Contributions . . . . . . . . . . . . . . . . . . . . . 11
  1.4 Previously Published Material . . . . . . . . . . . . . . . . . . . 14

2 Compression of assembled genomes                                        16
  2.1 Survey of compression schemes for assembled genomes . . . . . . . . 16
  2.2 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . 19
      2.2.1 iDoComp . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
      2.2.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .  25
      2.2.3 Machine specifications . . . . . . . . . . . . . . . . . . .  26
  2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
  2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  30
  2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  31

3 Compression of aligned reads                                            34
  3.1 Survey of lossless compressors for aligned data . . . . . . . . . . 35
      3.1.1 The lossless data compression problem . . . . . . . . . . . . 37
      3.1.2 Data modeling in aligned data compressors . . . . . . . . . . 38
  3.2 Proposed Compression Scheme . . . . . . . . . . . . . . . . . . . . 40
      3.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .  44
      3.2.2 Machine specifications . . . . . . . . . . . . . . . . . . .  46
  3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
      3.3.1 Low coverage data sets . . . . . . . . . . . . . . . . . . .  47
      3.3.2 High coverage data sets . . . . . . . . . . . . . . . . . . . 49
  3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  50
      3.4.1 Low Coverage Data Sets . . . . . . . . . . . . . . . . . . .  50
      3.4.2 High Coverage Data Sets . . . . . . . . . . . . . . . . . . . 51
  3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  51

4 Lossy compression of quality scores                                     54
  4.1 Survey of lossy compressors for quality scores . . . . . . . . . .  55
  4.2 Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . .  56
      4.2.1 QualComp . . . . . . . . . . . . . . . . . . . . . . . . . .  57
      4.2.2 QVZ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
  4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
      4.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .  71
      4.3.2 Machine Specifications . . . . . . . . . . . . . . . . . . .  72
      4.3.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .  72
  4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  76
  4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5 Effect of lossy compression of quality scores                           78
  5.1 Methodology for variant calling . . . . . . . . . . . . . . . . . . 78
      5.1.1 SNP calling . . . . . . . . . . . . . . . . . . . . . . . . . 79
      5.1.2 INDEL detection . . . . . . . . . . . . . . . . . . . . . . . 79
      5.1.3 Datasets for SNP calling . . . . . . . . . . . . . . . . . .  80
      5.1.4 Datasets for INDEL detection . . . . . . . . . . . . . . . .  80
      5.1.5 Performance metrics . . . . . . . . . . . . . . . . . . . . . 81
  5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
      5.2.1 SNP calling . . . . . . . . . . . . . . . . . . . . . . . . . 84
      5.2.2 INDEL detection . . . . . . . . . . . . . . . . . . . . . . . 93
  5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  94
  5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  95

6 Denoising of quality scores                                             98
  6.1 Proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . 98
      6.1.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . 99
      6.1.2 Denoising Scheme . . . . . . . . . . . . . . . . . . . . . .  99
      6.1.3 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . 102
  6.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . .  103
      6.2.1 SNP calling . . . . . . . . . . . . . . . . . . . . . . . . . 104
      6.2.2 INDEL detection . . . . . . . . . . . . . . . . . . . . . . . 106
  6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  107

7 Compression schemes for similarity queries                              108
  7.1 Problem Formulation and Fundamental Limits . . . . . . . . . . . .  110
      7.1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . 110
      7.1.2 Fundamental limits . . . . . . . . . . . . . . . . . . . . .  111
  7.2 Proposed schemes . . . . . . . . . . . . . . . . . . . . . . . . .  112
      7.2.1 The LC scheme . . . . . . . . . . . . . . . . . . . . . . . . 112
      7.2.2 The TC scheme . . . . . . . . . . . . . . . . . . . . . . . . 114
  7.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . .  117
      7.3.1 Binary symmetric sources and Hamming distortion . . . . . . . 118
      7.3.2 General binary sources and Hamming distortion . . . . . . . . 119
      7.3.3 q-ary sources and Hamming distortion . . . . . . . . . . . .  121
  7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . .  122

8 Conclusions                                                             124

Bibliography                                                              127
List of Tables

2.1 Genomic sequence datasets used for the pair-wise compression evaluation. Each row
    specifies the species, the number of chromosomes they contain and the target and the
    reference assemblies with the corresponding locations from which they were retrieved.
    T. and R. stand for the Target and the Reference, respectively. . . . . . . . . . . . 27

2.2 Compression results for the pairwise compression. C. time and D. time stand for
    compression and decompression time [seconds], respectively. The results in bold
    correspond to the best compression performance among the different algorithms. We use
    the International System of Units for the prefixes, that is, 1MB and 1KB stand for
    10^6 and 10^3 Bytes, respectively.
    * denotes the cases where GRS outperformed GReEn. In these cases, i.e., L. pneumophila
    and S. cerevisiae, the compression achieved by GReEn is 495KB and 304.2KB, respectively.
    † denotes the cases where GDC-advanced outperformed GDC-normal. . . . . . . . . . . . 33

3.1 Data sets used for the assessment of the proposed algorithm. The alignment program
    used to generate the SAM files is Bowtie2. The data sets are divided in two ensembles,
    low coverage data sets and high coverage data sets. The size corresponds to the SAM file.
    a M stands for millions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  45

3.2 Compression results for the high coverage ensemble. The results in bold show the
    compression gain obtained by the proposed method with respect to SamComp. We use the
    International System of Units for the prefixes, that is, 1 MB stands for 10^6 Bytes.
    C.T. stands for compression time.
    a Raw size refers solely to the size of the mapped reads (1 Byte per base pair). . .  53

4.1 Illumina's proposed 8 level mapping. . . . . . . . . . . . . . . . . . . . . . . . .  56

4.2 Lossless results of the different algorithms for the NA12878 data set. . . . . . . .  75

5.1 Sensitivity, precision, f-score and compression ratio for the 30× and 15× coverage
    datasets for the GATK pipeline, using the NIST ground truth. . . . . . . . . . . . .  90

5.2 Sensitivity for INDEL detection by the Dindel pipeline with various compression
    approaches for 4 simulated datasets. . . . . . . . . . . . . . . . . . . . . . . . .  97
List of Figures

1.1 Overview of the drop in the sequencing cost. As can be observed, the rate of this price
    drop is surpassing Moore's law. Source: www.genome.gov/sequencingcostsdata/. . . . . .  2

1.2 Next Generation Sequencing technologies require a library preparation of the DNA
    sample, which includes cutting the DNA into small fragments. These are then used as
    input into the sequencing machine, which performs the sequencing in parallel. The
    output data is stored in a FASTQ file. . . . . . . . . . . . . . . . . . . . . . . . .  3

1.3 The "reads" output by the sequencing machine correspond to fragments of the genome.
    The coverage indicates the number of reads on average that were generated from a
    random location of the genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . .  4

1.4 Example of a FASTQ file entry corresponding to a read of length 28 and quality scores
    in the scale Phred + 33. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5

1.5 Example of the information contained in a SAM file for a single read. The information
    in blue represents the data contained in the FASTQ file (i.e., the read identifier,
    the read itself, and the quality scores), and that in orange the one pertaining to the
    alignment. In this example, the read maps to chromosome 1, at position 50, with no
    mismatches. For more details on the extra and optional fields, we refer the reader
    to [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  6

1.6 Example of two variants contained in a VCF file. For example, the first one corresponds
    to a SNP in chromosome 20, position 150. The reference sequence contains a "T" at that
    location, whereas the sequenced genome contains a "C". . . . . . . . . . . . . . . . .  7

1.7 Typical pipeline of genome sequencing, with the corresponding generated files at each
    step. Sizes are approximate, and correspond to sequencing a human genome at 200
    coverage (200×). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  8

2.1 Diagram of the main steps of the proposed algorithm iDoComp. . . . . . . . . . . . . . 20

2.2 Flowchart of the post-processing of the sequence of matches M to generate the set S. . 24

3.1 Performance of the proposed method and the previously proposed algorithms when
    compressing aligned reads of low coverage data sets. . . . . . . . . . . . . . . . . . 48

4.1 Example of the boundary points and reconstruction points found by a Lloyd-Max
    quantizer, for M = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2 Our temporal Markov model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.3 Rate-Distortion performance of QualComp for MSE distortion, and the M. Musculus
    dataset. cX stands for QualComp run with X number of clusters. . . . . . . . . . . . . 73

4.4 Rate-Distortion curves of PBlock, RBlock, QualComp and QVZ, for MSE, L1 and Lorentzian
    distortions. In QVZ, c1, c3 and c5 denote 1, 3 and 5 clusters (when using k-means),
    respectively. QualComp was run with 3 clusters. . . . . . . . . . . . . . . . . . . . . 74

4.5 Rate-Distortion curves of QVZ for the H. Sapiens dataset, when the clustering step is
    performed with k-means (c3 and c10), and with the Mixture of Markov Model approach
    (K3 and K10). In both cases we used 3 and 10 clusters. . . . . . . . . . . . . . . . . 75

5.1 Difference between the GIAB NIST "ground truth" and the one from Illumina, for
    (a) chromosome 11 and (b) chromosome 20. . . . . . . . . . . . . . . . . . . . . . . . 81

5.2 Average sensitivity, precision and f-score of the four considered datasets using the
    NIST ground truth. Different colors represent different pipelines, and different points
    within an algorithm represent different rates. Q40 denotes the case of setting all the
    quality scores to 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3 Average sensitivity, precision and f-score of the four considered datasets using the
    Illumina ground truth. Different colors represent different pipelines, and different
    points within an algorithm represent different rates. Q40 denotes the case of setting
    all the quality scores to 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4 Comparison of the average sensitivity, precision and f-score of the four considered
    datasets and two ground truths, when QVZ is used with 3 clusters computed with k-means
    (QVZ-Mc3) and Mixture of Markov Models (QVZ-MMM3). Different colors represent different
    pipelines, and different points within an algorithm represent different rates. . . . . 87

5.5 Box plot of f-score differences between the lossless case and six lossy compression
    algorithms for 24 simulations (4 datasets, 3 pipelines and 2 ground truths). The x-axis
    shows the compression rate achieved by the algorithm. The three left-most boxes
    correspond to QVZ-Mc3 with parameters 0.9, 0.8 and 0.6, while the three right-most
    boxes correspond to RBlock with parameters 30, 20 and 10. The blue line indicates the
    mean value, and the red one the median. . . . . . . . . . . . . . . . . . . . . . . . . 88

5.6 ROC curve of chromosome 11 (ERR262996) with the NIST ground truth and the GATK pipeline
    with the VQSR filter. The ROC curve was generated with respect to the VQSLOD field. The
    results are for the original quality scores (uncompressed), and those generated by
    QVZ-Mc3 (MSE distortion and 3 clusters), PBlock (p = 8) and RBlock (r = 25). . . . . . 92

5.7 Average (of four simulated datasets) sensitivity, precision and f-score for INDEL
    detection pipelines. Different colors represent different pipelines, and different
    points within an algorithm represent different rates. . . . . . . . . . . . . . . . . . 93

6.1 Outline of the proposed denoising scheme. . . . . . . . . . . . . . . . . . . . . . . . 100

6.2 Reduction in size achieved by the denoiser when compared to the original data (when
    losslessly compressed). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.3 Denoiser performance on the GATK pipeline (30x dataset, chr. 20). Different points of
    the same color correspond to running the lossy compressor with different parameters. . 105

6.4 Denoiser performance on the GATK and hstlib.org pipelines (15x dataset, chr. 11). . .  106

6.5 Improvement achieved by applying the post-processing operation. The x-axis represents
    the performance in sensitivity, precision and f-score achieved by solely applying lossy
    compression, and the y-axis represents the same but when the post-processing operation
    is applied after the lossy compressor. The grey line corresponds to x = y, and thus all
    the points above it correspond to an improved performance. . . . . . . . . . . . . . . 107

7.1 Answering queries from signatures: a user first makes a query to the compressed
    database, and upon receiving the indexes of the sequences that may possibly be similar,
    discards the false positives by retrieving the sequences from the original database. . 109

7.2 Signature assignment of the LC scheme for each sequence x in the database. . . . . . . 113

7.3 Binary sources and Hamming distortion: if PX = PY = Bern(0.5), R_ID^LC(D) = R_ID^TC(D)
    = R_ID(D), whereas if PX = PY = Bern(0.7), R_ID^LC(D) > R_ID^TC(D) = R_ID(D). . . . . . 115

7.4 Binary symmetric sequences and similarity threshold D = 0.2: (a) performance of the
    proposed architecture with quantized distortion; (b) comparison with LSH for rate
    R = 0.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.5 Performance of the proposed schemes for sequences of length n = 512, similarity
    thresholds D = {0.05, 0.1} and PX = PY = Bern(0.7) and Bern(0.8). . . . . . . . . . . . 120

7.6 Performance of the LC scheme for D = {0.1, 0.2} applied to two databases composed of
    4-ary sequences: one generated uniformly i.i.d. and the other comprised of real DNA
    sequences from [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Chapter 1
Introduction
In the year 2000, US president Bill Clinton declared the success of the Human Genome
Project [3], calling it “the most important scientific discovery of the 20th century” (although
it wasn’t until 2003 that the human genome assembly was finally completed). It was the
end of a project that took almost 13 years to complete and cost 3 billion dollars (around $1
per base pair).
Fortunately, sequencing cost has drastically decreased in recent years. While in 2004
the cost of sequencing a whole human genome was around $20 million, in 2008 it dropped
to a million, and in 2015 to a mere $1000 (see Fig. 1.1). As a result of this decrease
in sequencing cost, as well as advancements in sequencing technology, massive amounts
of genomic data are being generated. At the current rate of growth (sequencing data is
doubling approximately every seven months), more than an exabyte of sequencing data
per year will be produced, approaching the zettabytes by 2025 [4]. As an example, the
sequencing data generated by the 1000 Genomes Project1 in the first 6 months exceeded
the sequence data accumulated during 21 years in the NCBI GenBank database [5].
Often, these data are unique, in that the samples, obtained from organisms and ever-changing ecosystems, are not available for re-sequencing. Moreover, the tools that are used to process and analyze the data improve over time, and thus it will likely be beneficial to revisit
and re-analyze the data in the future. For these, among other reasons, long-term storage
with convenient retrieval is required. In addition, the acquisition of the data is highly
1 www.1000genoms.org
Figure 1.1: Overview of the drop in the sequencing cost. As can be observed, the rate of
this price drop is surpassing Moore’s law.
Source: www.genome.gov/sequencingcostsdata/.
distributed, which demands a large bandwidth to transmit and access these large quantities of information through the network. This situation calls for state-of-the-art, efficient
compressed representations of massive biological datasets, that can not only alleviate the
storage requirements, but also facilitate the exchange and dissemination of these data. This
undertaking is of paramount importance, as the storage and acquisition of the data are
becoming the major bottleneck, as evidenced by the recent flourishing of cloud-based solutions enabling processing the data directly on the cloud. For example, companies such as
DNAnexus2 , GenoSpace3 , Genome Cloud4 , and Google Genomics5 , to name a few, offer
solutions to perform genome analysis in the cloud. In addition, the ultimate goal of genome
sequencing is to analyze the data so as to advance the understanding of development, behavior, evolution, and disease. Consequently, substantial effort is put into designing accurate
methods for this analysis. In particular, the field of personalized medicine is rapidly developing, enabling the design of individualized paths to help patients mitigate risks, prevent
disease and treat it effectively when it occurs.
2 http://dnanexus.com
3 http://www.genospace.com
4 http://www.genome-cloud.com
5 https://cloud.google.com/genomics/
The main objective of this thesis is to develop new methods and algorithms to ease the
storage and distribution of the genomic data, and to facilitate its access and analysis.
1.1 Easing the storage and distribution of genomic data
Most of the genomic data being stored and analyzed to date are comprised of sequencing
data produced by Next Generation Sequencing (NGS) technologies. Unfortunately, these
technologies are not capable of providing the whole genome sequence. As a reference,
the human genome sequence contains 3 billion base-pairs. The reason is that all currently
available NGS sequencing platforms require some level of DNA pre-processing into a library suitable for sequencing. In general, these steps involve fragmenting the DNA into an
appropriate platform-specific size range, and ligating specialized adapters to both fragment
ends. The different sequence platforms have devised different strategies to prepare the sequence libraries into suitable templates as well as to detect the signal and ultimately read
the DNA sequence (see [6] for a detailed review).6 The method employed by the different
NGS technologies for the readout of the sequencing signal is known as base calling. Fig.
1.2 provides an overview of the sequencing process.
Figure 1.2: Next Generation Sequencing technologies require a library preparation of the
DNA sample, which includes cutting the DNA into small fragments. These are then used as
input into the sequencing machine, which performs the sequencing in parallel. The output
data is stored in a FASTQ file.
6 To increase the signal-to-noise ratio on the reading of each base-pair, some technologies perform a local clonal amplification of the prepared fragment.
As a consequence of this process, the NGS technologies produce, instead of the whole
genome sequence, a collection of millions of small fragments (corresponding to those generated during the library preparation). These fragments are called “reads”, and can be
thought of as strings randomly sampled from the genome that was sequenced. For most
technologies, the length of these reads is of a few hundreds base-pairs, which is generally
significantly smaller than the genome itself (recall that a human genome is composed of
around 3 ◊ 109 base-pairs). Fig. 1.3 illustrates this concept.
Figure 1.3: The “reads” output by the sequencing machine correspond to fragments of the
genome. The coverage indicates the number of reads on average that were generated from
a random location of the genome.
The base calling process may be affected by various factors, which may lead to a wrong
readout of the sequencing signal. In order to assess the probability of base calling mistakes,
the sequencers generate, in addition to the reads (i.e., the nucleotide sequences), quality
scores that reflect the level of confidence in the readout of each base. Thus each of the
reads is accompanied by a sequence of quality scores, of the same length, that indicate the
level of confidence of each of the nucleotides of the read. The higher the quality score,
the higher the reliability of the corresponding base. Specifically, the quality score Q is the
integer mapping of P (the probability that the corresponding base call is incorrect) and it
is usually represented in one of the following scales/standards:
• Sanger or Phred scale [7]: Q = -10 log10 P.
• Solexa scale: Q = -10 log10 (P / (1 - P)).
Different NGS technologies use different scales, Phred + 33, Phred + 64 and Solexa + 64
being the most common ones. For example, Phred + 33 corresponds to values of Q in the
range [33 : 73]. Note that the size of the alphabet of the quality scores is around 40.
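For concreteness, the conversions between error probability and quality score, and the decoding of a Phred + 33 ASCII character back to an integer score, can be written down directly. The short Python sketch below is purely illustrative (the function names are hypothetical, not part of any tool described in this thesis).

import math

def phred_from_prob(p):
    """Phred/Sanger scale: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def solexa_from_prob(p):
    """Solexa scale: Q = -10 * log10(P / (1 - P))."""
    return -10 * math.log10(p / (1 - p))

def decode_phred33(ascii_char):
    """Map a Phred+33 ASCII character back to its integer quality score."""
    return ord(ascii_char) - 33

# An error probability of 0.001 corresponds to Phred Q = 30, and the ASCII
# character 'I' encodes Q = 40 under Phred+33.
print(round(phred_from_prob(0.001)))  # 30
print(decode_phred33('I'))            # 40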
Quality scores are important and very useful in many “downstream applications” (i.e.,
applications that operate on the sequencing data) such as trimming (used to remove untrusted regions) [8, 9], alignment [10, 11, 12, 13] or Single Nucleotide Polymorphism
(SNP) detection [14, 15], among others.
The raw sequencing data is therefore mainly composed of the reads and the quality
scores. This information is stored in the FASTQ format, which is widely accepted as a
standard for storing sequencing data. FASTQ files consist of separate entries for each read,
each consisting of four lines. The first one is for the header line, which begins with the ‘@’
character and is followed by a sequence identifier and an optional description. The second
one contains the nucleotide sequence. The third one starts with the ‘+’ character and can be
followed by the same information stored in the first line. Finally, the fourth line is for the
quality scores associated with the bases that appear in the nucleotide sequence of line two
(both lines must contain the same number of symbols). The quality scores are represented
with ASCII characters in the FASTQ file. Fig. 1.4 shows an example of a FASTQ entry.
@SRR0626347 13976/1
TGGAATCAGATGGAATCATCGAATGGTC
+
IIIHIIHABBBAA=2))!!!(!!!((!!
Figure 1.4: Example of a FASTQ file entry corresponding to a read of length 28 and quality
scores in the scale Phred + 33.
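To make the four-line layout concrete, the following minimal Python sketch (illustrative only, not part of any compressor described in this thesis) reads one FASTQ entry at a time and checks that the read and its quality string have the same length.

def read_fastq_entries(path):
    """Yield (identifier, read, quality) tuples from a FASTQ file.

    Minimal sketch: assumes a well-formed, uncompressed FASTQ file with one
    entry per four lines, as described above.
    """
    with open(path) as f:
        while True:
            header = f.readline().rstrip("\n")
            if not header:
                break  # end of file
            read = f.readline().rstrip("\n")
            plus = f.readline().rstrip("\n")   # '+' line, possibly repeating the header
            quals = f.readline().rstrip("\n")
            assert header.startswith("@") and plus.startswith("+")
            assert len(read) == len(quals)     # one quality score per base
            yield header[1:], read, quals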
The number of reads present in the raw sequencing data depends on the coverage (i.e.,
the expected number of times a specific nucleotide of the genome is sequenced). As an
example, sequencing a human genome with 200 coverage (Illumina’s current technology7 )
will generate around 6 billion reads (assuming a typical read length of 100). Thus the
resulting FASTQ files are very large (typically on the order of hundreds of gigabytes or larger).
7 http://www.illumina.com/platinumgenomes/
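These figures follow from a simple back-of-the-envelope calculation; the sketch below (illustrative only, with rounded constants) reproduces the 6 billion read estimate and a rough uncompressed size.

genome_length = 3e9   # base pairs in a human genome
coverage = 200        # expected number of times each base is sequenced
read_length = 100     # typical read length

num_reads = coverage * genome_length / read_length
print(f"{num_reads:.0e} reads")                      # 6e+09 reads

# Each base contributes roughly two bytes (the nucleotide and its quality
# score), ignoring identifiers and separator lines.
approx_bytes = num_reads * read_length * 2
print(f"~{approx_bytes / 1e12:.1f} TB uncompressed")  # ~1.2 TB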
Once the raw sequencing data is generated, the next step is typically to align the reads
to a reference sequence. In brief, the alignment process consists of determining, for each of
the reads contained in the FASTQ file, the corresponding location in the reference sequence
from which the read was generated (or that no such region exists). This is achieved by
comparing the sequence of the read to that of the reference sequence. A mapping algorithm
will try to locate a (hopefully unique) location in the reference sequence that matches the
read, while tolerating a certain amount of mismatches. Recall that the reads were generated
from a genome that most likely differs from that used as a reference sequence, and thus
variations between the reads and the reference sequence are to be expected. For each read,
the alignment program provides the location where the read is mapped in the reference,
as well as the mismatching information, if any, together with some extra fields. This is
denoted as the alignment information. This information is stored in the standard SAM
format [1], which also contains the original reads and the quality scores. These files, which
are heavily used by most downstream applications, are extremely large (typically in the
terabytes or larger). Fig. 1.5 shows an example of a SAM file entry for a single read.
Figure 1.5: Example of the information contained in a SAM file for a single read. The
information in blue represents the data contained in the FASTQ file (i.e., the read identifier,
the read itself, and the quality scores), and that in orange the one pertaining to the alignment. In this example, the read maps to chromosome 1, at position 50, with no mismatches.
For more details on the extra and optional fields, we refer the reader to [1].
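As an illustration of the tab-separated layout of a SAM record, the following sketch extracts the fields mentioned above; the field names follow the SAM specification [1], but the example record itself is made up.

def parse_sam_line(line):
    """Extract a few of the 11 mandatory tab-separated SAM fields (see [1])."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],       # read identifier (as in the FASTQ header)
        "rname": fields[2],       # reference sequence name, e.g., a chromosome
        "pos":   int(fields[3]),  # 1-based leftmost mapping position
        "cigar": fields[5],       # describes how the read aligns (matches, indels)
        "seq":   fields[9],       # the read itself
        "qual":  fields[10],      # the quality scores
    }

# Hypothetical record: a 28-base read mapping to chromosome 1 at position 50
# with no mismatches (CIGAR "28M").
example = ("read1\t0\tchr1\t50\t60\t28M\t*\t0\t0\t"
           "TGGAATCAGATGGAATCATCGAATGGTC\t"
           "IIIHIIHABBBAA=2))!!!(!!!((!!")
print(parse_sam_line(example)["pos"])  # 50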
Once the alignment concludes and the SAM file is generated, the next step consists of
analyzing the discrepancies between the reads and the reference sequence used for alignment. This process is called variant calling, and it is the main downstream application
in practice. The variants (discrepancies) between the original genome and the reference
sequence are normally due to Single Nucleotide Polymorphisms (SNPs) (i.e., a single nucleotide variation) and insertions and deletions (denoted by INDELS). The variant calling
algorithm analyzes the aligned data, contained in the SAM file, and finds the biologically
relevant variants corresponding to the sequenced genome. The set of called variants, together with some extra information such as the quality of the call, are stored in a VCF file
[16] (see Fig. 1.6). For human genomes, a VCF file may contain around 3 million variants,
most of them due to SNPs (note that two human genomes are about 99.9% identical). The
size of this file is in the order of a gigabyte. The variants contained in the VCF file can
be later used towards reconstructing the original genome (i.e., the genome from which the
reads were generated).
#CHROM  POS  REF  ALT  <Extra Fields>
20      150  T    C    –
20      175  A    T    –
Figure 1.6: Example of two variants contained in a VCF file. For example, the first one corresponds to a SNP in chromosome 20, position 150. The reference sequence contains a “T” at that location, whereas the sequenced genome contains a “C”.
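To connect the example above with the file layout, the following toy sketch (which ignores the VCF header and the extra fields) classifies a record as a SNP, insertion, or deletion.

def classify_variant(chrom, pos, ref, alt):
    """Roughly classify one VCF record (header lines and extra fields ignored)."""
    if len(ref) == 1 and len(alt) == 1:
        kind = "SNP"        # single nucleotide change, e.g., T -> C
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "deletion"
    return f"{kind} at {chrom}:{pos} ({ref} -> {alt})"

print(classify_variant("20", 150, "T", "C"))  # SNP at 20:150 (T -> C)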
Fig. 1.7 illustrates the typical pipeline after sequencing a genome, which is the one
that we focus on in this thesis. Summarizing, the sequencing process generates a FASTQ
file that contains millions of reads that are generated from the genome. This FASTQ file
is then used by an alignment program, which generates a SAM file containing the mapping/alignment information of each of the reads to a reference genome. Finally, a variant
caller analyzes the information on the SAM file and finds the variants between the original genome and the one used as reference. The called variants are stored in a VCF file.
As mentioned above, with the information contained in the VCF file the original genome
can be potentially reconstructed (this process is called “assembly”), yielding the assembled genome. Thus an assembled genome contains the nucleotide sequences of each of the
chromosomes that comprise the genome.
Ideally, after all the steps of the pipeline are completed, one would need to store just
the VCF file, as it contains all the relevant information regarding the genome that was
sequenced. This would eliminate the need for storing the intermediate FASTQ and SAM
files, which are generally prohibitively large. Unfortunately, this is not yet the case. As
Figure 1.7: Typical pipeline of genome sequencing, with the corresponding generated files at each step. Sizes are approximate, and correspond to sequencing a human genome at 200 coverage (200×).
already outlined above, there are several reasons why the FASTQ and SAM files need to
be stored. For example, the alignment tools and variant calling programs keep improving
over time, and thus re-analyzing the raw data can yield new variants that were initially
undetected. Thus there is a pressing need for compression of these files, that can ease the
storage and transmission of the data.
Although there exist general-purpose compression schemes like gzip, bzip2 and 7zip
(www.gzip.org, www.bzip.org and www.7zip.org, respectively) that can be directly applied
to any type of genomic data, they do not exploit the particularities of these data, yielding
relatively low compression gains [17, 18, 19]. With this gap in compression ratios in mind, several specialized compression algorithms have been proposed in the last few years.
These algorithms can be classified into two main categories: i) Compression of assembled
genomes and ii) Compression of raw NGS data (namely FASTQ and SAM files).
Note that even though the main bottleneck in the storage and transmission of genomic
data is due to the raw sequencing data (FASTQ, SAM), compression of assembled genomes
is also important. For example, whereas an uncompressed human genome occupies around
3 GB, its equivalent compressed form is in general smaller than 10 MB, thus easing the
transfer and download of genomes. This means that a human genome could simply be
attached to an email. Moreover, as we advance towards personalized medicine, increasing
amounts of assembled genomes are expected to be generated.
Compression of the raw sequencing data, that is, FASTQ and SAM files, pertains
mainly to the compression of the identifiers, the reads, and the quality scores. However,
there has been more interest in the compression of the reads and the quality scores, as they
take most of the space, and carry the most relevant information.
Much effort has been put into designing compression schemes for the reads, both for the
raw reads and the aligned reads. Better compression ratios can be achieved when considering aligned reads, as in that case only the differences with respect to the sub-sequence of the reference to which they aligned need to be stored (see [17]). Note that those differences,
together with the reference sequence, suffice to reconstruct the read exactly.
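The following toy sketch (not the encoder used in later chapters; substitutions only) illustrates why the mapping position and the mismatch list suffice: given the reference, the read is reconstructed exactly.

def reconstruct_read(reference, pos, length, mismatches):
    """Rebuild an aligned read from its mapping position and mismatch list.

    `mismatches` is a list of (offset, base) pairs giving the positions,
    relative to the start of the read, where it differs from the reference.
    Toy illustration: substitutions only, no insertions or deletions.
    """
    read = list(reference[pos:pos + length])
    for offset, base in mismatches:
        read[offset] = base
    return "".join(read)

reference = "ACGTACGTACGT"
# A read of length 6 mapped at position 2 that differs from the reference
# only at its fourth base (offset 3): the reference has 'C', the read has 'T'.
print(reconstruct_read(reference, 2, 6, [(3, "T")]))  # GTATGT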
Quality scores, on the other hand, have been proven to be more difficult to compress
than the reads, due in part to their higher entropy and larger alphabet. When losslessly
compressed, quality scores typically comprise more than 70% of the compressed file [17].
In addition, there is evidence that quality scores are inherently noisy, and downstream applications that use them do so in varying heuristic manners. As a result, lossy compression
of quality scores (as opposed to lossless) has emerged as a candidate for boosting compression performance, at a cost of introducing some distortion (i.e., the reconstructed quality
scores may differ from the original ones).
Traditionally, lossy compressors have been analyzed in terms of their rate-distortion
performance. Such analysis provides a yardstick for comparison of lossy compressors of
quality scores that is oblivious to the multitude of downstream applications, which use
the quality scores in different ways. However, the data compressed is used for biological
inference. Researchers are thus interested in understanding the effect that the distortion
introduced in the quality scores has on the subsequent analysis performed on the data,
rather than a more generic measure of distortion such as rate-distortion.
To date, there is no standard practice on how this analysis should be performed. Proof
of this is the variety of analyses presented in the literature when a new lossy compressor
for quality scores is introduced (see [20, 21, 22, 23] and references therein). Moreover, it
is not yet well understood how lossy compression of quality scores affects the downstream
analysis performed on the data. This can be understood not only by the lack of a standard
practice, but also by the variety of applications that exist and the different manners in which
quality scores are used. In addition, the fact that lossy compressors can work at different
rates and be optimized for several distortion metrics makes the analysis more challenging.
However, such an analysis is important if lossy compression is to become a viable mode
for coping with the surging requirements of genomic data storage.
With that in mind, there has been recent effort and interest in obtaining a methodology to analyze how lossy compression of quality scores affects the output of one of the
most widely used downstream applications: variant calling, which comprises Single
Nucleotide Polymorphism (SNP) and Insertion and Deletion (INDEL) calling. Not surprisingly, recent studies have shown that lossy compression can significantly alleviate storage
requirements while maintaining variant-calling performance comparable – and sometimes
superior – to the performance achieved using the uncompressed data (see [24] and references therein). This phenomenon can be explained by the fact that the data is noisy, and
the current variant callers do not use the data in an optimal manner.
These results suggest that the quality scores can be denoised, i.e., structured noise can
be removed to improve the quality of the data. While the proposed lossy compressors for
the quality scores address the problem of storage, they are not explicitly designed to also
reduce the noise present in the quality scores. Reducing the noise is of importance since
perturbations in the data may significantly degrade the subsequent analysis performed on it.
Moreover, reducing the noise of the quality scores leads to quality scores with smaller entropy, and consequently to higher compression ratios than those obtained with the original
file. Thus denoising schemes for the quality scores can reduce the noise of the genomic data
while easing its storage and dissemination, which can significantly advance the processing
of genomic data.
1.2 Facilitating access to the genomic data
The generation of new databases and the amount of data contained in existing ones is
growing exponentially. For example, the Sequence Read Archive (SRA) contains more
than 3.6 petabases of genomic data8, the biological database GenBank contains almost 2000
million DNA sequences9 , and the database BIOZON contains well over 100 million records
[2]. As a result, executing queries on them is becoming a time-consuming and challenging
task. To facilitate this effort, there has been recent interest in compressing sequences in
a database such that similarity queries can still be performed on the compressed database.
Compressing the database allows it to be replicated in several locations, thus providing easier
and faster access, and potentially reducing the time needed to execute a query.
Given a database consisting of many sequences, similarity queries refer to queries of the
form: “which sequences in the database are similar to a given sequence y?”. This kind
of query is of practical interest in genomics, such as in molecular phylogenetics, where
relationships among species are established by the similarity between their respective DNA
sequences. However, note that these queries are of practical interest not only in genomics,
where it is important to find sequences that are similar, but in many other applications as
well. For example, in forensics, fingerprints are run through various local, state, and national fingerprint databases for potential matches. Thus such schemes will soon become
necessary in other fields where databases are rapidly growing in size.
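To fix ideas, a similarity query under normalized Hamming distortion can be expressed in a few lines. The sketch below operates on the uncompressed database and is therefore only illustrative of the query semantics, not of the compressed-domain schemes studied in Chapter 7.

def hamming_similarity_query(database, y, D):
    """Return the indices of sequences within normalized Hamming distance D of y.

    Illustrative only: this scans the raw database, whereas the schemes in
    Chapter 7 answer such queries from compressed signatures.
    """
    hits = []
    for idx, x in enumerate(database):
        dist = sum(a != b for a, b in zip(x, y)) / len(y)
        if dist <= D:
            hits.append(idx)
    return hits

db = ["ACGTAC", "ACGTAT", "GGGTAC"]
print(hamming_similarity_query(db, "ACGTAC", D=0.2))  # [0, 1]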
The fundamental limits of this problem characterize the tradeoff between compression
rate and the reliability of the queries performed on the compressed data. While those
asymptotic limits have been studied and characterized in past work (see for example [25,
26, 27]), how to approach these limits in practice has remained largely unexplored.
1.3 Outline and Contributions
In the first part of the thesis, we investigate and develop methods to ease the distribution
and storage of the genomic data. Specifically, we first provide an overview of the state-of-the-art
8 http://www.ncbi.nlm.nih.gov/sra
9 http://www.ncbi.nlm.nih.gov/genbank
compression schemes for the genomic data, and then describe new compression algorithms
tailored to assembled genomes and the sequencing data (e.g., the reads and the quality
scores). In addition, we define a methodology to analyze the effect that lossy compression of the quality scores has on variant calling, and we use the proposed methodology to
evaluate the performance of existing lossy compressors, which constitutes the first in-depth
comparison available in the literature. Finally, we introduce the first denoising scheme for
the quality scores, and demonstrate improved inference using the denoised data.
In the second part of the thesis, we investigate methods to facilitate the access to genomic data. Specifically, we study the problem of compressing sequences in a database
so that similarity queries can still be performed on the compressed database. We propose
practical schemes for this task, and show that their performance is close to the fundamental
limits under statistical models of relevance.
Our contributions per chapter are as follows:
• In Chapter 2 we focus on compression of assembled genomes. We describe a method
that assumes the existence of a reference genome for compression, which is a realistic
assumption in practice. Having such a reference boosts the compression performance
due to the high similarity between genomes of the same species. The presented algorithm is one of the most competitive to date, both in running time and compression
performance. The main underlying insight is that we can efficiently compute the
mismatches between the genome and the reference, using suffix arrays, and process
these mismatches to reduce their entropy, and thus their compressed representation.
• Chapter 3 describes a new method for compression of aligned reads. The proposed
method achieves a considerable improvement over the best previously achieved compression ratios (particularly for high coverage datasets). These results broke the previously perceived experimental Pareto-optimal barrier between compression rate and speed [17]. Furthermore, the proposed algorithm is amenable to conducting operations in the compressed domain, which can speed up the running time of downstream
applications.
• In Chapter 4 we focus on the quality scores, proposing two different lossy compression methods that are able to significantly boost the compression performance. The
first method, QualComp, uses the Singular Value Decomposition (SVD) to transform the quality scores into Gaussian distributed values. This approach allows us
to use theory from rate distortion to allocate the bits in an optimal manner. The
second method, QVZ, assumes that the quality scores are generated by an order-1
Markov source. The algorithm computes the empirical distribution and uses it to
design the optimal quantizers in order to achieve a specific rate (number of bits per
quality score), while optimizing for a given distortion. Moreover, QVZ can also perform lossless compression. We also show that clustering the quality scores prior to
compression, using a Markov mixture model, can improve the performance. In addition, we provide a complete rate-distortion analysis that includes previously proposed
methods.
• In Chapter 5 we present a methodology to analyze the effect of lossy compression of
quality scores on variant calling. Having a defined methodology for comparison is
crucial in this multidisciplinary area, where researchers working on lossy compression are not familiar with the variant calling pipelines used in practice. In addition,
we use the proposed methodology to evaluate the performance of existing lossy compressors, which constitutes the first in-depth comparison available in the literature.
The results presented in this chapter show that lossy compression can significantly
alleviate storage requirements while maintaining variant-calling performance comparable – and sometimes superior – to the performance achieved using the uncompressed data.
• In Chapter 6 we build upon the results presented in the previous chapter, which
suggest that the quality scores can be denoised. In particular, we propose the first
denoiser for quality scores, and demonstrate improved inference on variant calling
using the denoised data. Moreover, a consequence of the denoiser is that the entropy
of the produced quality scores is smaller, which leads to higher compression ratios
than those obtained with the original file. Reducing the noise of genomic data while
easing its storage and dissemination can significantly advance the processing of genomic data, and the results presented in this chapter provide a promising baseline for
future research in this direction.
• In Chapter 7 we study the problem of compressing sequences in a database so that
similarity queries can still be performed on the compressed database. We propose
two practical schemes for this task, and show that their performance is close to the
fundamental limits under statistical models of relevance. Moreover, we apply the
aforementioned schemes to a database containing genomic sequences, as well as to a
database containing simulated data, and show that it is possible to achieve significant
compression while still being able to perform queries of the form described above.
• To conclude, Chapter 8 contains final remarks and outlines several future research
directions.
More background on these topics and previous work in these areas is provided in each
individual chapter.
1.4 Previously Published Material
Part of this dissertation has appeared in the following manuscripts:
• Publication [28]: Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman, “iDoComp: a
compression scheme for assembled genomes”, Bioinformatics, btu698, 2014.
• Publication [29]: Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman, “Aligned
genomic data compression via improved modeling”, Journal of bioinformatics and
computational biology, Vol. 12, no. 06, 2014.
• Publication [20]: Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman, and Golan Yona, “QualComp: a new lossy compressor for
quality scores based on rate distortion theory”, BMC bioinformatics, Vol. 14, no. 1,
2013.
• Publication [22]: Greg Malysa, Mikel Hernaez, Idoia Ochoa, Milind Rao, Karthik
Ganesan, and Tsachy Weissman, “QVZ: lossy compression of quality values”, Bioinformatics, btv330, 2015.
• Publication [30]: Mikel Hernaez, Idoia Ochoa, Rachel Goldfeder, Tsachy Weissman,
and Euan Ashley, “A cluster-based approach to compression of Quality Scores”, Data
Compression Conference (DCC), 2015.
• Publication [24]: Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman,
and Euan Ashley, “Effect of lossy compression of quality scores on variant calling”,
Briefings in Bioinformatics, doi: 10.1093/bib/bbw011, 2016.
• Publication [31]: Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman,
and Euan Ashley, “Denoising of Quality Scores for Boosted Inference and Reduced
Storage”, Data Compression Conference (DCC), 2015.
• Publication [32]: Idoia Ochoa, Amir Ingber, and Tsachy Weissman, “Efficient similarity queries via lossy compression”, 51st Annual Allerton Conference on Communication, Control, and Computing, 2013.
• Publication [33]: Idoia Ochoa, Amir Ingber, and Tsachy Weissman, “Compression
schemes for similarity queries”, Data Compression Conference (DCC), 2014.
Chapter 2
Compression of assembled genomes
As the sequencing technologies advance, more genomes are expected to be sequenced and
assembled in the near future. Thus there is a need for compression of genomes guaranteed
to perform well simultaneously on different species, from simple bacteria to humans, that
can ease their transmission, dissemination and analysis.
In this chapter we introduce iDoComp, a compressor of assembled genomes that compresses an individual genome using a reference genome for both the compression and the
decompression. Note that most of the genomes to be compressed correspond to individuals
of a species from which a reference already exists on the database. Thus, it is natural to
assume and exploit the availability of such references.
In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example,
we observe compression gains of up to 60% in several cases, including H. Sapiens data,
when compared to the best compression performance among the previously proposed algorithms.
2.1 Survey of compression schemes for assembled genomes
Several compression algorithms for assembled genomes have been proposed in the last two
decades. On the one hand, dictionary-based algorithms such as BioCompress2 [34] and DNACompress [35] compress the genome by identifying low-complexity sub-strings such as repeats
or palindromes and replacing them by the corresponding codeword from the codebook. On
the other hand, statistics-based algorithms such as XM [36] generate a statistical model of
the genome and then use entropy coding that relies on the previously computed probabilities.
Although the aforementioned algorithms perform very well over data of relatively small
size, such as mitochondrial DNA, they are impractical for larger sequences (recall that the
size of the human genome is on the order of several gigabytes). Further, they focus on
compressing a single genome without the use of a reference sequence, and thus do not
exploit the similarities across genomes of the same species.
It was in 2009 that interest in reference-based compression started to rise with the publication of DNAzip [37] and the proposal from [38]. In DNAzip, the authors compressed the
genome of James Watson (JW) to a mere 4MB based on the mapping from the JW genome
to a human reference and using a public database of the most common Single Nucleotide
Polymorphisms (SNPs) existing in humans. [39] further improved the DNAzip approach
by performing a parametric fitting of the distribution of the mapping integers. The main
limitations of these two proposals are that they rely on a database of SNPs available only for
humans and further assume that the mapping from the target to the reference is given. Thus,
while they set a high performance benchmark for whole human genome compression, they
are currently not applicable beyond this specific setting.
[40] proposed the RLZ algorithm for reference-based compression of a set of genomes.
The authors improved the algorithm in a subsequent publication, yielding the RLZ-opt
algorithm [41]. The RLZ algorithms are based on parsing the target genome into the reference sequence in order to find longest matches. While in RLZ the parsing is done in a
greedy manner (i.e., always selecting the longest match), in the optimized version, RLZopt, the authors proposed a non-greedy parsing technique that improved the performance
of the previous version. Each of the matches is composed of two values: the position of
the reference where the match starts (a.k.a. offset) and the length of the match. Once the
set of matches is found, some heuristics are used to reduce the size of the set. For example,
short matches may be more efficiently stored as a run of base-pairs (a.k.a. literals) than as
a match (i.e., a position and a length). Finally, the remaining set together with the set of
literals is entropy encoded.
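A minimal sketch of the greedy parsing idea is given below (illustrative only: the longest match is found by brute force, whereas RLZ and GDC rely on suffix arrays or hashing); the target is factored into (offset, length) matches against the reference, with single literals emitted where no sufficiently long match exists.

def greedy_reference_parse(target, reference, min_match=4):
    """Greedily factor `target` into matches against `reference`.

    Returns a list of ('match', offset, length) and ('literal', base) entries.
    Simplified sketch: the longest match at each position is found by brute
    force, whereas RLZ and GDC use suffix arrays or hashing.
    """
    factors, i = [], 0
    while i < len(target):
        best_off, best_len = -1, 0
        for off in range(len(reference)):
            l = 0
            while (i + l < len(target) and off + l < len(reference)
                   and target[i + l] == reference[off + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= min_match:
            factors.append(("match", best_off, best_len))
            i += best_len
        else:
            factors.append(("literal", target[i]))
            i += 1
    return factors

reference = "ACGTACGTGGTTACGT"
target = "ACGTGGTTAAGT"
print(greedy_reference_parse(target, reference))
# [('match', 4, 9), ('literal', 'A'), ('literal', 'G'), ('literal', 'T')]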
[42] proposed the GDC algorithm, which is based on the RLZ-opt. One of the differences between the two is that GDC performs the non-greedy parsing of the target into the
reference by hashing rather than using a suffix array. It also performs different heuristics
for reducing the size of the set of matches, such as allowing for partial matches and allowing or denying large “jumps” in the reference for subsequent matches. GDC offers several
variants, some optimized for compression of large collections of genomes, e.g., the ultra
variant. Finally, [42] showed in their paper that GDC outperforms (in terms of compression
ratio) RLZ-opt and, consequently, all the previous algorithms proposed in the literature. It
is worth mentioning that the authors of RLZ did create an improved version of RLZ-opt
whose performance is similar to that of GDC. However, [42] showed that it was very slow
and that it could not handle a data set with 70 human genomes.
At the same time, two other algorithms, namely GRS and GReEn, [43] and [44], respectively, were proposed. The main difference between the aforementioned ones and GRS
and GReEn is that in the latter two the authors only consider the compression of a single
genome based on a reference, rather than a set of genomes. Moreover, they assume that
the reference is available and need not be compressed. It was shown in [42, 44] that GRS can only handle pairs of targets and references that are very similar. [44] proposed an algorithm based on arithmetic coding: the reference is used to compute the statistics, which are then used to arithmetically encode the target. The authors showed that GReEn was superior to both GRS
and the non-optimized RLZ. However, [18] showed in their review paper that there were
some cases where GRS clearly outperformed GReEn in both compression ratio and speed.
Interestingly, this phenomenon was observed only in cases of bacteria and yeast, which
have genomes of relatively small size.
In 2012, another compression algorithm was presented in [45], which showed improved compression results with respect to GReEn. However, the proposed algorithm has a relatively high running time when applied to large datasets, and it fails to run in several cases. Also in 2012, [46] proposed a compression algorithm for single genomes
that trades off compression time and space requirements while achieving comparable compression rates to that of GDC. The algorithm divides the reference sequence into blocks of
fixed size, and constructs a suffix tree for each of the blocks, which are later used to parse
the target into the reference.
Although it is straightforward to adapt GReEn to the database scenario in order to compare it with the other state-of-the-art algorithm GDC, the review paper [18] does not include any comparison between them. Moreover, the algorithm introduced in [46] is not mentioned there. On the other hand, the review paper [19] does compare all the algorithms, stating that GDC and [46] achieve the highest compression ratios; however, no empirical evidence supporting that statement is provided in the article. Finally, [47] showed that GDC achieved better compression ratios than [46]
in the considered data sets.
After having examined all the available comparisons in the literature, we consider
GReEn and GDC to be the state-of-the-art algorithms in reference-based genomic data
compression. Thus, we use these algorithms as benchmarks1 . However, we also add GRS
to the comparison base in the cases where [18] showed that GRS outperformed GReEn.
Although in this work we do not focus on compression of collections of genomes,
for completeness we introduce the main algorithms designed for this task. As mentioned
above, the version GDC-Ultra introduced in [42] specializes in compression of a collection
of genomes. In 2013, a new algorithm designed for the same purpose, FRESCO, was presented in [48]. The main innovations of FRESCO include a method for reference selection
and reference rewriting, and an implementation of a second-order compression. FRESCO
offers lower running times than GDC-Ultra, with comparable compression ratios. Finally,
in [47] they showed that in this scenario a boost in compression ratio is possible if the genomes are given as variations with respect to a reference, in VCF format [16], and the similarity of the variations across genomes is exploited.
In the next section we present the proposed algorithm iDoComp. It is based on a combination of ideas proposed in [37], [38] and [45].
2.2 Proposed method
In this section we start by describing the proposed algorithm iDoComp, whose goal is to compress an individual genome assuming a reference is available both for the compression and the decompression.
1 We do not use the algorithm proposed in [46] because we were unable to run the algorithm.
Figure 2.1: Diagram of the main steps of the proposed algorithm iDoComp (mapping generator: T ∈ Σ^t and S ∈ Σ^s → M; post-processing of M → M, S, I; entropy encoder → binary file).
We then present the data used to compare the performance of the different algorithms, and the specifications of the machine on which the simulations were conducted.
2.2.1 iDoComp
The input to the algorithm is a target string T ∈ Σ^t of length t over the alphabet Σ, and a reference string S ∈ Σ^s of length s over the same alphabet. Note that, in contrast to [41],
the algorithm does not impose the condition that the specific pair of target and reference
contain the same characters, e.g., the target T may contain the character N even if it is not
present in the reference S. As outlined above, the goal of iDoComp is to compress the
target sequence T using only the reference sequence S, i.e., no other aid is provided to the
algorithm.
iDoComp consists of three main steps (see Figure 2.1): i) the mapping generation, whose aim is to express the target genome T in terms of the reference genome S; ii) the post-processing of the mapping, geared toward decreasing the prospective size of the mapping; and iii) the entropy encoder, which further compresses the mapping and generates the compressed file. Next, we describe these steps in more detail.
1) Mapping generation
The goal of this step is to create the parsing of the sequence T relative to S. A parsing of
T relative to S is defined as a sequence (ω_1, ω_2, . . . , ω_N) of sub-strings of S that, when concatenated together in order, yield the target sequence T. For reasons that will become clear later, we slightly modify the above definition and re-define the parsing as (ω̃_1, ω̃_2, . . . , ω̃_N), where ω̃_i = (ω_i, C_i), ω_i being a sub-string of S and C_i ∈ Σ a mismatched character that appears after ω_i in T but not in S. Note that the concatenation of the ω̃_i, i.e., both the
sub-strings and mismatched characters, should still yield the target sequence T .
A very useful way of expressing the sub-string ω̃_i is as the triplet m_i = (p_i, l_i, C_i), with p_i, l_i ∈ (1, . . . , s), where p_i is the position in the reference where ω_i starts and l_i is the length of ω_i. If there is a letter X in the target that does not appear in the reference, then the match (p_{i−1} + l_{i−1}, 0, X) will be generated, where p_{i−1} and l_{i−1} are the position and length of the previous match, respectively2. In addition, note that if ω_i appears in more than one place in the reference, any of the starting positions is a valid choice.
With this notation, the parsing of T relative to S can be defined as the sequence of matches M = {m_i = (p_i, l_i, C_i)}_{i=1}^N.
In this work, we propose the use of suffix arrays to parse the target into the reference due
to their attractive memory requirements, especially when compared to other index structures
such as suffix trees [49]. This makes the compression and decompression of a human
genome doable on a computer with a mere 2 GB of RAM. Also, the use of suffix arrays is
only needed for compression, i.e., no suffix arrays are used for the decompression. Finally,
we assume throughout the chapter that the suffix array of the reference is already precomputed and stored on the hard drive.
Once the suffix array of the reference is loaded into memory, we perform a greedy parsing of the target as previously described to obtain the sequence of matches M = {m_i}_{i=1}^N.
[42] and [41] showed that a greedy parsing leads to suboptimal results. However, we are
not performing the greedy parsing as described in [40], since every time a mismatch is
found we record the mismatched letter and advance one position in the target. Since most
of the variations between genomes of different individuals within the same species are
SNPs (substitutions), recording the mismatch character leads to a more efficient “greedy”
mapping.
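As an illustration of this mapping step, the following sketch (hypothetical helper code, not the actual iDoComp implementation) greedily parses a target string against a reference, recording one mismatched character per match and advancing one position past it; a real implementation would answer the longest-match queries with the precomputed suffix array rather than the naive search used here.

def greedy_parse(target, reference):
    """Greedily parse `target` into matches (p, l, C) against `reference`.

    p: 1-based starting position of the matched sub-string in the reference,
    l: length of the match (possibly 0),
    C: the mismatched target character that follows the match.
    A real implementation would answer the longest-match queries with the
    suffix array of the reference; here we use a naive search for clarity.
    """
    matches = []
    i = 0
    while i < len(target):
        best_p, best_l = 0, 0
        # Naive longest-match search: try every starting position in the reference.
        for p in range(len(reference)):
            l = 0
            while (i + l < len(target) and p + l < len(reference)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_l:
                best_p, best_l = p, l
        # Record the mismatched character (if any) and advance one position past it.
        if i + best_l < len(target):
            mismatch = target[i + best_l]
        else:
            mismatch = ''          # the target ended exactly at a match boundary
        matches.append((best_p + 1, best_l, mismatch))
        i += best_l + 1
    return matches


def reconstruct(matches, reference):
    """Invert the parsing: concatenate the matched sub-strings and mismatches."""
    out = []
    for p, l, c in matches:
        out.append(reference[p - 1:p - 1 + l])
        out.append(c)
    return ''.join(out)


if __name__ == '__main__':
    ref = "ACGTACGTTTACGGA"
    tgt = "ACGTACCTTTACGGAN"     # one SNP (G->C) and a trailing N absent from ref
    m = greedy_parse(tgt, ref)
    print(m)                      # [(1, 6, 'C'), (8, 8, 'N')]
    assert reconstruct(m, ref) == tgt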
Moreover, note that at this stage the sequence of matches M suffices to reconstruct the
target sequence T given the reference sequence S. However, in the next step we perform
some post-processing over M in order to reduce its prospective size, which will translate
to better compression ratios. This is similar to the heuristic used by [42] and [41] for their
non-greedy mapping.
2 We assume p_0 = l_0 = 0.
2) Post-Processing of the sequence of matches M
After the sequence of matches M is computed, a post-processing step is performed on it. The goal is to reduce the total number of elements that will later be compressed by the entropy encoder. Recall that each of the matches m_i contained in M is composed of two integers in the range (1, . . . , s) and a character from the alphabet Σ. Since |Σ| ≪ s, the number of unique integers that appear in M will in general be larger than |Σ|. Thus, the compression of the integers will in general require more bits than that of the characters. Therefore, the aim of this step is mainly to reduce the number of different integers needed to represent T as a parse of S, which will translate into improved compression ratios.
Specifically, in the post-processing step we look for consecutive matches m_{i−1} and m_i that can be merged together and converted into an approximate match. By doing this we reduce the cardinality of M at the cost of storing the divergences of the approximate matches with respect to the exact ones. We classify these divergences as either SNPs (substitutions) or insertions, forming the new sets S and I, respectively.
For the case of the SNPs, if we find two consecutive matches m_{i−1} and m_i that can be merged at the cost of recording a SNP that occurs between them, we add to the set S an element of the form s_i = (p_i, C_i), where p_i is the position of the target where the SNP occurs, with T[p_i] = C_i. Then we merge matches m_{i−1} and m_i together into a new match m ← (p_{i−1}, l_{i−1} + l_i + 1, C_i). Hence, with this simple process we have reduced the number of integers from 4 to 3.
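The following simplified sketch (assumed code, covering only the exact-continuation case p_i = p_{i−1} + l_{i−1} + 1 and omitting the length and gap heuristics described below) illustrates how the merge turns two matches into one approximate match plus an entry in S:

def merge_snps(matches):
    """Merge consecutive matches (p, l, C) that differ from the reference by a
    single substitution, as in the post-processing step described above.

    Returns the reduced list of matches and the set S of SNPs, where each SNP
    is stored as (0-based position in the target, substituted character).
    """
    merged = []
    snps = []
    t_pos = 0  # current position in the target
    for p, l, c in matches:
        if merged:
            pp, pl, pc = merged[-1]
            # The previous mismatch pc was a substitution if the current match
            # resumes in the reference right after the substituted position.
            if p == pp + pl + 1:
                snps.append((t_pos - 1, pc))        # position of pc in the target
                merged[-1] = (pp, pl + l + 1, c)    # 4 integers reduced to 3
                t_pos += l + 1
                continue
        merged.append((p, l, c))
        t_pos += l + 1
    return merged, snps


if __name__ == '__main__':
    # Two matches produced by a greedy parser for a target with a single SNP:
    # (1, 6, 'C') followed by (8, 8, 'N') -> the 'C' at target position 6 is a SNP.
    print(merge_snps([(1, 6, 'C'), (8, 8, 'N')]))
    # -> ([(1, 15, 'N')], [(6, 'C')])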
We constrain the insertions to be of length one; that is, we do not explicitly store short runs of literals (together with their position and length). This is in contrast to the argument of [42] and [41] stating that storing short runs of literals is more efficient than storing their respective matches. However, as we show next, we store them as a concatenation of SNPs. Although this might seem inefficient, the motivation is that storing short runs of literals will in general add new unique integers, which incurs a high cost, since the entropy encoder (based on arithmetic coding) will assign them a larger number of bits. We found
that encoding them as SNPs and then storing the difference between consecutive positions
of SNPs is more efficient. This process is explained next in more detail.
As pointed out by [41], the majority of the matches m_i belong to the Longest Increasing Sub-Sequence (LISS) of the p_i. In other words, most of the consecutive p_i's satisfy p_i ≤ p_{i+1} ≤ . . . ≤ p_j, for i < j, and thus they belong to the LISS. From the m_i's whose p_i value does not belong to the LISS, we examine those whose length l_i is less than a given parameter L and whose gap to their contiguous instruction exceeds a given threshold. Among them, those whose number of SNPs is less than a given parameter ρ, or that are short enough (below a given length threshold), are classified as several SNPs.
That is, if the match m_i fulfills any of the above conditions, we merge the instructions m_{i−1} and m_i as described above. Note that the match m_i was pointing to the length-l_i sub-string starting at position p_i of the reference, whereas now (after merging it with m_{i−1}) it points to the one that starts at position p_{i−1} + l_{i−1} + 1. Therefore, we need to add to the set S as many SNPs as there are differences between these two sub-strings.
Note that this operation gets rid of the small-length matches whose p_i's are far from their "natural" position in the LISS. These particular matches would harm our compression scheme, as they generate new large integers in most cases. On the other hand, if the matches are either long, contain several SNPs, and/or are extremely close to their contiguous matches, then storing them as SNPs would not be beneficial. Therefore, the values of L, ρ and the two thresholds are chosen such that the expected size of the newly generated sub-set of SNPs is less than that of the match m_i under consideration. This procedure is similar to the heuristic used by [42] to allow or deny "jumps" in the reference while computing the parsing.
The flowchart of this part of the post-processing is depicted in Figure 2.2.
We perform an analogous procedure to find insertions in the sequence of matches M.
Since we only consider length-1 insertions, each insertion in the set I is of the form (p, C),
where p indicates the position in the target where the insertion occurs, and C the character
to be inserted. As mentioned before, the short runs of literals have been taken care of in the
last step of the SNP set generation.
After the post-processing described above is concluded, the sequence of matches M
and the two sets S and I are fed to the entropy encoder.
3) Entropy encoder
Figure 2.2: Flowchart of the post-processing of the sequence of matches M to generate the set S.
The goal of the entropy encoder is to compress the instructions contained in the sequence of matches M and the sets S, I generated in the two previous steps. Recall that the elements in M, S and I are given by integers and/or characters, which are compressed in a different manner. Specifically, we first create two vectors: π, which contains all the integers from M, S and I, and a character vector, which contains all the corresponding characters. In order to be able to determine to which instruction each integer and character belongs, at the beginning of π we add the cardinalities of M, S and I, as well as the number of chromosomes.
To store the integers, first note that all the positions p_i in S and I are in ascending order; thus we can freely store the p_i's as p_i ← p_i − p_{i−1}, for i ≥ 2, that is, as the difference between successive positions. We perform a similar computation with the p_i's of M. Specifically, we store each p_i as p_i ← |p_i − (p_{i−1} + l_{i−1})|, for i ≥ 2. However, since some of the matches may not belong to the LISS, there will be cases where p_{i−1} + l_{i−1} > p_i. Hence, a sign vector s is needed in this case to save the sign of the newly computed positions in M. Finally, the lengths l_i ∈ M are also stored in π.
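A minimal sketch of this transformation (illustrative helper code, not the actual implementation) is given below; it produces the delta-encoded positions of M together with the sign vector s, and shows how the decoder inverts the mapping once the lengths are known:

def delta_encode_match_positions(matches):
    """Transform the positions of the matches in M as described above:
    p_1 is kept as-is and, for i >= 2, p_i <- |p_i - (p_{i-1} + l_{i-1})|,
    with a sign vector recording whether p_i fell before the end of the
    previous match (which happens for matches outside the LISS)."""
    deltas, signs = [], []
    prev_p = prev_l = None
    for p, l, _c in matches:
        if prev_p is None:
            deltas.append(p)
            signs.append(1)
        else:
            expected = prev_p + prev_l
            deltas.append(abs(p - expected))
            signs.append(1 if p >= expected else -1)
        prev_p, prev_l = p, l
    return deltas, signs


def delta_decode_match_positions(deltas, signs, lengths):
    """Invert the transformation given the (already decoded) match lengths."""
    positions = []
    for i, (d, s) in enumerate(zip(deltas, signs)):
        if i == 0:
            positions.append(d)
        else:
            positions.append(positions[-1] + lengths[i - 1] + s * d)
    return positions


if __name__ == '__main__':
    matches = [(1, 6, 'C'), (8, 8, 'N'), (3, 4, 'A')]   # last match jumps backwards
    deltas, signs = delta_encode_match_positions(matches)
    lengths = [l for _p, l, _c in matches]
    assert delta_decode_match_positions(deltas, signs, lengths) == [p for p, _l, _c in matches]
    print(deltas, signs)          # [1, 1, 13] [1, 1, -1]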
Once the vector π is constructed, it is encoded by a byte-based adaptive arithmetic encoder, yielding the binary stream A_π(π). Specifically, we represent each integer with four bytes3, and encode each of the bytes independently, i.e., with a different model. This avoids the need to store the alphabet, which can be a large overhead in some cases. Moreover, the statistics of each of the bytes are updated sequentially (adaptively), and thus they do not need to be computed beforehand.
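The sketch below (illustrative only; it reports the ideal code length −log2 p that an adaptive arithmetic coder driven by these statistics would approach, rather than producing an actual bit stream) shows the four-byte decomposition with one adaptive frequency model per byte position:

import math

class AdaptiveByteModel:
    """One adaptive frequency model per byte position (0..3) of each integer.
    Counts start at 1 so that every byte value has non-zero probability."""
    def __init__(self):
        self.counts = [[1] * 256 for _ in range(4)]

    def code_length_and_update(self, value):
        """Return the ideal code length (in bits) of a 4-byte integer and
        update the statistics adaptively, as an arithmetic coder would."""
        bits = 0.0
        for pos in range(4):
            byte = (value >> (8 * pos)) & 0xFF
            table = self.counts[pos]
            bits += -math.log2(table[byte] / sum(table))
            table[byte] += 1            # sequential (adaptive) update
        return bits


if __name__ == '__main__':
    model = AdaptiveByteModel()
    # Delta-encoded positions are strongly biased towards small values, so the
    # high-order bytes quickly become almost free to encode.
    deltas = [1, 1, 13, 2, 1, 5, 1, 3, 1, 1]
    total = sum(model.code_length_and_update(d) for d in deltas)
    print(f"ideal size: {total:.1f} bits for {len(deltas)} integers")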
The character vector is constructed by storing all the characters belonging to M, S and I. First, note that since the reference is available at both the encoder and the decoder, both can access any position of the reference. Thus, for each of the characters C_i in this vector we can access its corresponding mismatched character in the reference, which we denote by R_i. We then generate a tuple of the form (R, C) for all the mismatched characters of the parsing, employ a different model of the adaptive arithmetic encoder for each of the different R's, and encode each C_i using the model associated with its corresponding R_i. Note that by doing this, one or two bits per letter can be saved in comparison with the traditional one-code-for-all approach.
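A simplified sketch of this conditional model follows (counts-based probabilities stand in for the adaptive arithmetic coder; the alphabet and the Laplace-style initialization are assumptions of the sketch):

import math
from collections import defaultdict

class ConditionalCharModel:
    """One adaptive model per reference character R: each mismatched target
    character C is coded with the model selected by its corresponding R."""
    def __init__(self, alphabet="ACGTN"):
        # Laplace-smoothed counts: counts[R][C]
        self.counts = defaultdict(lambda: {c: 1 for c in alphabet})

    def code_length_and_update(self, ref_char, target_char):
        table = self.counts[ref_char]
        bits = -math.log2(table[target_char] / sum(table.values()))
        table[target_char] += 1
        return bits


if __name__ == '__main__':
    model = ConditionalCharModel()
    # Tuples (R, C) collected from the parsing: mismatches tend to depend on R.
    tuples = [('A', 'G'), ('A', 'G'), ('C', 'T'), ('A', 'G'), ('C', 'T'), ('G', 'A')]
    bits = sum(model.code_length_and_update(r, c) for r, c in tuples)
    print(f"{bits:.2f} bits with per-R models")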
Finally, the binary output stream is the concatenation of A_π(π), the sign vector s, and the arithmetic-compressed character vector.
2.2.2 Data
In order to assess the performance of the proposed algorithm iDoComp, we consider pairwise compression applied to different datasets. Specifically, we consider the scenario where
a reference genome is used to compress another individual of the same species (the target
3 We chose four bytes as it is the least number of bytes needed to represent all possible integers.
genome)4 . This would be the case when there is already a database of sequences from a
particular species, and a new genome of the same species is assembled and thus needs to
be stored (and therefore compressed).
The data used for the pair-wise compression are summarized in Table 2.1. This scenario
was already considered by [37, 38, 43, 42, 44, 18] for assessing the performance of their
algorithms, and thus the data presented in Table 2.1 include the ensemble of all the datasets
used in the previously mentioned papers.
As evident in Table 2.1, we include in our simulations datasets from a variety of species.
These datasets also have very different characteristics in terms of, for example, the alphabet size, the number of chromosomes they include, and the total length of the target genome that needs to be compressed.
2.2.3 Machine specifications
The machine used to perform the experiments and estimate the running time of the different algorithms has the following specifications: 39 GB RAM, Intel Core i7-930 CPU at 2.80 GHz × 8, and Ubuntu 12.04 LTS.
2.3 Results
Next we show the performance of the proposed algorithm iDoComp in terms of both compression and running time, and compare the results with the previously proposed compression algorithms.
As mentioned, we consider pair-wise compression for assessing the performance of the
proposed algorithm. Specifically, we consider the compression of a single genome (target
genome) given a reference genome available both at the compression and the decompression. We use the target and reference pairs introduced in Table 2.1 to assess the performance
of the algorithm. Although in all the simulations the target and the reference belong to the
same species, note that this is not a requirement of iDoComp, which also works for the
case where the genomes are from different species.
4 The target and the reference do not necessarily need to belong to the same species, although better compression ratios are achieved if the target and the reference are highly similar.
Species | Chr. | Target assembly | Reference assembly | Retrieved from
L. pneumophila | 1 | NC 017526.1 | NC 017525.1 | ncbi.nlm.nih.gov
E. coli | 1 | NC 017652.1 | NC 017651.1 | ncbi.nlm.nih.gov
S. cerevisiae | 17 | sacCer3 | sacCer2 | ucsc.edu
C. elegans | 7 | ce10 | ce6 | genome.ucsc.edu
A. thaliana | 7 | TAIR10 | TAIR9 | arabidopsis.org
Oryza sativa | 12 | TIGR6.0 | TIGR5.0 | rice.plantbiology.msu.edu
D. melanogaster | 6 | dmelr41 | dmelr31 | fruitfly.org
H. sapiens 1 | 25 | hg19 | hg18 | ncbi.nlm.nih.gov
H. sapiens 2 | 25 | KOREF 20090224 | KOREF 20090131 | koreangenome.org
H. sapiens 3 | 25 | YH | hg18 | yh.genomics.org.cn / ncbi.nlm.nih.gov
H. sapiens 4 | 25 | hg18 | YH | ncbi.nlm.nih.gov / yh.genomics.org.cn
H. sapiens 5 | 25 | YH | hg19 | yh.genomics.org.cn / ncbi.nlm.nih.gov
H. sapiens 6 | 25 | hg18 | hg19 | ncbi.nlm.nih.gov

Table 2.1: Genomic sequence datasets used for the pair-wise compression evaluation. Each row specifies the species, the number of chromosomes it contains, and the target and reference assemblies with the corresponding locations from which they were retrieved.
To evaluate the performance of the different algorithms, we look at the compression
ratio, as well as at the running time of both the compression and the decompression. We
compare the performance of iDoComp with those of GDC, GReEn and GRS.
When performing the simulations, we run both GReEn and GRS with the default parameters. The results presented for GDC correspond to the best compression among the
advanced and the normal variant configurations, as specified in the supplementary data
presented in [42]. Note that the parameter configuration for H. sapiens differs from that of the other species; we modify it accordingly for the different datasets. Regarding iDoComp, all the simulations are performed with the same (default) parameters, which are hard-coded in the software.
For ease of exposition, for each simulation we only show the results of iDoComp, GDC
and the best among GReEn and GRS. The results are summarized in Table 2.2. For each
species, the target and the reference are as specified in Table 2.1. To be fair across the
different algorithms, especially when comparing the results obtained in the small datasets,
we do not include the cost of storing the headers in the computation of the final size for any
of the algorithms. The last two columns show the gain obtained by our algorithm iDoComp
with respect to the performance of GReEn/GRS and GDC. For example, a reduction from
100KB to 80KB represents a 20% gain (improvement). Note that with this metric a 0%
gain means the file size remains the same, a 100% gain is not possible, as this would mean the new file has size 0, and a negative gain means that the new file is bigger.
As seen in Table 2.2, the proposed algorithm outperforms the previously proposed algorithms in compression ratio in all cases. Moreover, we observe that whereas GReEn/GRS tend to achieve better compression on the small datasets and GDC on the H.
sapiens datasets, iDoComp achieves good compression ratios in all the datasets, regardless
of their size, the alphabet and/or the species under consideration.
In cases of bacteria (small size DNA), the proposed algorithm obtains compression
gains that vary from 30% against GReEn/GRS to up to 64% when compared with GDC. For
the S. cerevisiae dataset, also of small size, iDoComp does 55% (61%) better in compression
ratio than GRS (GDC). For the case of medium size DNA (C. elegans, A. thaliana, O. sativa
and D. melanogaster) we observe similar results. iDoComp again outperforms the other
algorithms in terms of compression, with gains up to 92%.
Finally, for the H. sapiens datasets, we observe that iDoComp consistently performs
better than GReEn, with gains above 50% in four out of the six datasets considered, and up
to 91%. With respect to GDC, we also observe that iDoComp produces better compression
results, with gains that vary from 3% to 63%.
Based on these results, we can conclude that GDC and the proposed algorithm iDoComp are the ones that produce the best compression results on the H. sapiens genomes. In order to get more insight into the compression capabilities of both algorithms when dealing with human genomes, we performed twenty extra pair-wise compressions (results not
shown), and found that on average iDoComp employs 7.7 MB per genome, whereas GDC
employs 8.4 MB. Moreover, the gain of iDoComp with respect to GDC for the considered
cases is on average 9.92%.
Regarding the running time, we observe that the compression and the decompression
time employed by iDoComp is linearly dependent on the size of the target to be compressed. This is not the case for GReEn, for example, whose running times are highly
variable. In GDC we also observe some variability in the time needed for compression.
However, the decompression time is more consistent among the different datasets (in terms
of the size), and it is in general the smallest among all the algorithms we considered. iDoComp and GReEn take approximately the same time to compress and decompress. Overall,
iDoComp’s running time is seen to be highly competitive with that of the existing methods. However, note that the time needed to generate the suffix array is not counted in the
compression time of iDoComp, whereas the compression time of the other algorithms may
include construction of index structures, like in the case of GDC.
Finally, we briefly discuss the memory consumption of the different algorithms. We
focus on the compression and decompression of the H. sapiens datasets, as they are the largest files and thus the memory consumption in those cases is the most significant.
iDoComp employs around 1.2 GB for compression, and around 2 GB for decompression.
GReEn consumes around 1.7 GB both for compression and decompression. Finally, the
algorithm GDC employs 0.9 GB for compression and 0.7 GB for decompression.
2.4 Discussion
Inspection of the empirical results of the previous section shows the superior performance
of the proposed scheme across a wide range of datasets, from simple bacteria to the more
complex humans, without the need for adjusting any parameters. This is a clear advantage
over algorithms like GDC, where the configuration must be modified depending on the
species being compressed.
Although iDoComp has some internal parameters, namely L, ρ and the two thresholds used in the post-processing step,5 the default values that are hard-coded in the software perform very well for all the datasets, as we have shown in the previous section. However, the user could modify these parameters in a data-dependent manner and achieve better compression ratios. Future work may include exploring the
extent of the performance gain (which we believe will be substantial) due to optimizing for
these parameters.
We believe that the improved compression ratios achieved by iDoComp are due largely
to the post-processing step of the algorithm, which modifies the set of instructions in a way
that is beneficial to the entropy encoder. In other words, we modify the elements contained
in the sets so as to facilitate their compression by the arithmetic encoder.
Moreover, the proposed scheme is universal in the sense that it works regardless of the
alphabet used by the FASTA files containing the genomic data. This is also the case with
GDC and GReEn, but not with previous algorithms like GRS or RLZ-opt which only work
with A, C, G, T and N as the alphabet.
It is also worth mentioning that the reconstructed files of both iDoComp and GDC are
exactly the original files, whereas the reconstructed files under GReEn do not include the
header and the sequence is expressed in a single line (instead of several lines).
Another advantage of the proposed algorithm is that the scheme employed for compression is very intuitive, in the sense that the compression consists mainly of generating
instructions composed of the sequence of matches M and the two sets S and I that suf-
fice to reconstruct the target genome given the reference genome. This information by
itself can be beneficial for researchers and gives insight into how two genomes are related
to each other. Moreover, the list of SNPs generated by our algorithm could be compared
5 See the Post-Processing step in Section 2 for more details.
with available datasets of known SNPs. For example, the NCBI dbSNP database contains
known SNPs of the H. sapiens species.
Finally, regarding iDoComp, note that we have not included in Table 2.2 the time
needed to generate the suffix array of the reference, only that needed to load it into memory,
which is already included in the total compression time. The reason is that we envision pair-wise compression algorithms such as iDoComp as the natural tool for compressing several individuals of the same species. In this scenario, one can always use the same reference for
compression, and thus the suffix array can be reused as many times as the number of new
genomes that need to be compressed.
Regarding compression of sets, any pair-wise compression algorithm can be trivially
used to compress a set of genomes. One has merely to choose a reference and then compress each genome in the set against the chosen reference. However, as was shown in [42]
with their GDC-ultra version of the algorithm, and as can be expected, an intelligent selection of the references can lead to significant boosts in the compression ratios. Therefore,
in order to obtain high compression ratios on sets it is of utmost importance to provide
the pair-wise compression algorithms with a good reference-finding method. This could
be thought of as an add-on that could be applied to any pair-wise compression algorithm.
For example, one could first analyze the genomes contained in the set to detect similarity
among genomes, and then use this information to boost the final compression6 . The latter
is a different problem that needs to be addressed separately.
2.5 Conclusion
We have introduced iDoComp, an algorithm for compression of assembled genomes. Specifically, the algorithm considers pair-wise compression, i.e., compression of a target genome
given that a reference genome is available both for the compression and the decompression.
This algorithm is universal in the sense of being applicable for any dataset, from simple
bacteria to more complex organisms such as humans, and accepts genomic sequences of
an extended alphabet. We show that the proposed algorithm achieves better compression
than the previously proposed algorithms in the literature, in most cases. These gains are up
6 A similar approach is used in [48].
to 92% in medium size DNA and up to 91% in humans when compared with GReEn and
GRS. When compared with GDC, the gains are up to 73% and 63% in medium size DNA
and humans, respectively.
Species | Raw Size [MB] | GReEn/GRS*: Size [KB] / C. time / D. time | GDC: Size [KB] / C. time / D. time | iDoComp: Size [KB] / C. time / D. time | Gain: GR. / GDC
L. pneumophila | 2.7 | 0.122* / 1* / 1* | 0.229† / 0.3 / 0.03 | 0.084 / 0.1 / 0.1 | 31% / 63%
E. coli | 5.1 | 0.119 / 1 / 0.6 | 0.242† / 0.6 / 0.06 | 0.086 / 0.2 / 0.2 | 30% / 64%
S. cerevisiae | 12.4 | 5.65* / 5* / 5* | 6.47† / 1 / 1 | 2.53 / 0.4 / 0.4 | 55% / 61%
C. elegans | 102.3 | 170 / 45 / 47 | 48.7 / 1 / 1 | 13.3 / 3 / 4 | 92% / 73%
A. thaliana | 121.2 | 6.6 / 54 / 56 | 6.32 / 21 / 2 | 2.09 / 4 / 5 | 68% / 67%
Oryza sativa | 378.5 | 125.5 / 140 / 146 | 128.6† / 80 / 6 | 105.4 / 11 / 15 | 16% / 18%
D. melanogaster | 120.7 | 390.6 / 51 / 2 | 433.7 / 23 / 1 | 364.4 / 4 / 4 | 7% / 16%
Homo sapiens 1 | 3,100 | 11,200 / 1687 / 1701 | 2,770 / 2636 / 67 | 1,025 / 95 / 130 | 91% / 63%
Homo sapiens 2 | 3,100 | 18,000 / 652 / 721 | 11,973 / 511 / 78 | 7,247 / 120 / 126 | 60% / 39%
Homo sapiens 3 | 3,100 | 10,300 / 434 / 495 | 6,840 / 141 / 65 | 6,290 / 118 / 125 | 39% / 8%
Homo sapiens 4 | 3,100 | 6,500 / 352 / 372 | 6,426 / 191 / 70 | 5,767 / 115 / 130 | 11% / 10%
Homo sapiens 5 | 3,100 | 35,500 / 1761 / 1846 | 11,873† / 249 / 62 | 11,560 / 122 / 130 | 67% / 3%
Homo sapiens 6 | 3,100 | 10,560 / 1686 / 1775 | 6,939 / 2348 / 70 | 5,241 / 100 / 120 | 50% / 24%

Table 2.2: Compression results for the pairwise compression. C. time and D. time stand for compression and decompression time [seconds], respectively. We use the International System of Units for the prefixes, that is, 1 MB and 1 KB stand for 10^6 and 10^3 bytes, respectively.
* denotes the cases where GRS outperformed GReEn. In these cases, i.e., L. pneumophila and S. cerevisiae, the compression achieved by GReEn is 495 KB and 304.2 KB, respectively.
† denotes the cases where GDC-advanced outperformed GDC-normal.
Chapter 3
Compression of aligned reads
As outlined in the introduction, NGS data can be classified into two main categories: i) raw NGS data, which is usually stored as a FASTQ file and contains the raw output of the sequencing machine; and ii) aligned NGS data, which in addition to the raw data contains its alignment to a reference. The latter is usually stored as a SAM file [1].
In this chapter we focus on the aligned (reference-based) scenario, since these data are both the largest (up to terabytes) and the ones generally used in downstream applications. Moreover, we believe that the size of these files constitutes the major bottleneck for the transmission and handling of NGS data among researchers and institutions, and therefore good compression algorithms are of primary importance.
We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can
improve considerably over the best compression ratio achieved by previously proposed algorithms. The results obtained by the proposed method indicate that the Pareto-optimal barrier for compression rate and speed claimed by [17] does not apply to high-coverage
aligned data. Furthermore, our improved compression ratio is achieved by splitting the
data in a manner conducive to operations in the compressed domain by downstream applications.
3.1 Survey of lossless compressors for aligned data
The first compression algorithm proposed for SAM files was BAM [1]. BAM was released
at the same time as its uncompressed counterpart SAM. However, rather than a specialized compression algorithm, it is a binarization of the SAM file that uses general-purpose
compression schemes as its compression engine.
Concerned by the exponentially increasing size of BAM files, [50] proposed
CRAM, a new compression scheme for aligned data. CRAM is presented as a toolkit, in the
sense that it combines a reference-based compression of sequence data with a data format
that is directly available for computational use in downstream applications.
In 2013 Goby was presented as an alternative to CRAM [51]. Goby is also presented as
a toolkit that performs reference-based compression with a data format that allows downstream applications to work over the goby files. The authors also provide software to perform some common NGS tasks as differential expression analysis or DNA methylation
analysis over the goby-compressed files.
Alternatively, more compression-oriented proposals have been published recently. These
works do not focus on creating a compression scheme that aids downstream applications to
work over the compressed data, but on maximizing the compression ratio.
In 2012, Quip, a lightweight, dependency-free compressor of high-throughput sequencing data, was proposed [52]. The main strength of Quip is that it accepts both non-aligned and aligned data, in contrast to both CRAM and Goby1. In the case of non-aligned data, if a reference is available, it performs its own lightweight and fast assembly and then compresses the result. Since here we focus on aligned data, in the rest of the chapter we consider Quip exclusively in its SAM file compressor mode.
Finally, [17] proposed SamComp, which is, to our knowledge, the best compression algorithm for aligned data. Two versions of the algorithm were proposed, namely,
SamComp1 and SamComp2. Throughout the chapter we consider SamComp1, as it provides better compression results and it is the one recommended by the authors [17]. SamComp is more restrictive than Quip in the sense that it only accepts SAM and/or BAM files
as input files. However, the same authors also proposed other methods that accept FASTQ
1 It is in fact the only algorithm appearing in all the categories when comparing compression algorithms by data type in the review paper of Bonfield et al. (2013) [17].
files and a reference (optional) as input. If a reference is available they perform their own
fast alignment before compression [17]. Note that the compression methods proposed in
[17] do not compress the SAM file itself, but only the necessary information to be able to
recover the corresponding FASTQ file.
The compression of NGS aligned raw data can be divided into several sub-problems of different nature, namely, the compression of the reads, the compression of the header and/or the identifiers, the compression of the quality scores, and the compression of the fields related to the alignment. Although all these sub-problems are addressed by the above algorithms2, in this chapter we focus exclusively on compressing the necessary information to reconstruct the reads. The reason is that the reads carry most, if not all, of the information used by downstream applications, and thus their compression is of primary importance. Moreover, as we will see in the following chapters, the size of the compressed quality scores can be drastically decreased by the use of lossy compression methods without compromising
the performance of the downstream applications (see [20, 21]), while the identifiers and the
headers are generally ignored.
In the context of compressing aligned reads, all the aforementioned algorithms, except
SamComp, perform similarly in terms of compression ratio. Moreover, each of them outperforms the others in at least one data set (see the results tables of [17, 52, 51, 50]). On
the other hand, SamComp clearly outperforms the rest of the algorithms, as shown in [17].
It is important to mention that the comparison is not strictly fair as SamComp is a specialized SAM compressor whereas the other algorithms are oriented to provide a stable toolkit,
which sometimes includes random access capabilities. However, SamComp shows that judicious design of models for the data yields a significant improvement in compression ratio
and thus similar compression techniques should be applied in toolkit-oriented programs
such as Goby or CRAM.
Following the approach of SamComp, the main contribution of this chapter is not to
provide a complete compression scheme for aligned data, as done in CRAM or Goby, but
to demonstrate the importance of data modeling when compressing aligned reads. That is,
the purpose of the proposed method is to give insight into the potential gains that can be
2 With the exception of [17], where a FASTQ file is reconstructed rather than a SAM file.
achieved by an improved data model. We believe that the techniques described in this chapter can be applied in the future to more complete compression toolkits. We show that, by
constructing effective models adapted to the particularities of the genomic data, significant
improvements in compression ratio are possible. We will show that these improvements
become considerable in high coverage data sets.
Next, we introduce the lossless data compression problem, and examine the approach
(and assumptions) made by the previously proposed algorithms when modeling the data.
3.1.1 The lossless data compression problem
Information theory states that, given a sequence s = [s_1, s_2, . . . , s_n] drawn from a probability distribution P_S(s), the minimum number of bits required on average to represent the sequence s is essentially (ignoring a negligible additive constant term) given by H(S) = E[−log(P_S(S))] bits3, where H(·) is the Shannon entropy, which only depends on the probability distribution P_S.
Thus, the lossless compression problem comprises two sub-problems: i) data modeling, that is, selection of the probability distribution (or model) P_S of the sequence; and ii) codeword assignment, that is, given a P_S that models the data, finding effective means of compressing close to the "ideal" −log(P_S(s)) number of bits for any given sequence s, such that the expected length of the compressed sequences can achieve the entropy (that is, the optimum compression). Arithmetic coding provides an effective means to sequentially assign a codeword of length close to −log(P_S(s)) to s. There are other codes that achieve the "ideal code length" for particular models, such as Golomb codes for geometric distributions and Huffman codes for D-adic distributions.
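As a small numerical illustration (a toy example, not taken from the data discussed here), the following sketch computes the entropy of an i.i.d. source over the nucleotide alphabet and the ideal code length −log2 P_S(s) of a particular sequence under that model:

import math

# Assumed toy distribution over the nucleotide alphabet.
p = {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}

# Shannon entropy H(S) = E[-log2 P_S(S)] in bits per symbol.
entropy = -sum(prob * math.log2(prob) for prob in p.values())

# Ideal code length of a specific sequence under the i.i.d. model.
s = "AACGATTA"
ideal_bits = -sum(math.log2(p[symbol]) for symbol in s)

print(f"H(S) = {entropy} bits/symbol")                       # 1.75
print(f"-log2 P_S(s) = {ideal_bits} bits for {len(s)} symbols")   # 15.0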
The lossless compression problem can therefore equivalently be thought of as one of finding good models for the data to be compressed. The model for a sequence s can be decomposed as the product of the models for each individual symbol s_t, where at each instant t, the model for s_t is computed using the sequence of previous symbols [s_{t−1}, . . . , s_1] as context. This characteristic makes it possible to generate the models sequentially, which is particularly relevant for the aligned data compression problem, as the files can be of
3 All the logarithms used in this chapter are base-2.
remarkable size, and more than one pass over the data could be prohibitive. Note that
the memory needed to store all the different contexts grows exponentially with the context length and becomes prohibitively large very quickly. To solve this issue, the simple
approach is to constrain the context lengths to at most m symbols, while more advanced
techniques rely on variable context lengths [53].
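The sketch below (a generic adaptive order-m model, not the implementation of any of the compressors discussed in this chapter) illustrates this sequential decomposition: each symbol s_t is assigned a probability conditioned on the previous m symbols, its ideal code length −log2 p is accumulated, and the counts of that context are updated in a single pass:

import math
from collections import defaultdict

def ideal_code_length(sequence, alphabet, m=2):
    """Ideal code length (bits) of `sequence` under an adaptive order-m model.
    Each context (the previous m symbols) keeps Laplace-smoothed counts."""
    counts = defaultdict(lambda: {a: 1 for a in alphabet})
    bits = 0.0
    for t, symbol in enumerate(sequence):
        context = sequence[max(0, t - m):t]
        table = counts[context]
        bits += -math.log2(table[symbol] / sum(table.values()))
        table[symbol] += 1       # adaptive update: only one pass over the data
    return bits


if __name__ == '__main__':
    data = "ACGT" * 64           # highly repetitive, so longer contexts predict well
    for order in (0, 1, 2):
        print(order, round(ideal_code_length(data, "ACGT", m=order), 1))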
In general, a context within a sequence may refer to entities more general than substrings preceding a given symbol. In this work we show how the symbols appearing in
the aligned file can be estimated using previously compressed symbols. This modeling of
the data ultimately leads to considerable improvements over the state-of-the-art algorithms
proposed so far in the literature.
3.1.2 Data modeling in aligned data compressors
Following a similar classification as in [51], we describe different approaches to data modeling in the context of aligned data, and then introduce the ones employed by the above-mentioned algorithms. Throughout the chapter we assume the SAM files are ordered by position, as this is also the assumption made by Goby, CRAM and SamComp; moreover, as shown in [17], this ordering leads to better compression ratios.
The most general approach to data modeling is performed by general-purpose compression algorithms, which serialize the whole file and compute a byte-based model over the serialized file. However, since the SAM file is divided into different fields, the first intuitive approach is to treat each of the fields separately, as each of them is expected to be ruled by a different model. A further improvement can be achieved by considering cross-field correlation. In this case, the value of some fields can be modeled (or estimated) using the value of other fields, thus achieving better compression ratios. Finally, as shown in [51], other modeling techniques could be used, such as generating a template for a field and then compressing the difference between the actual value and the template.
In this context, CRAM encodes each field separately and extensively uses Golomb coding. By using these codes CRAM implicitly models the data as independent and identically
distributed according to a geometric distribution. The main reason for using Golomb codes is their
simplicity and speed. However, we believe that the computational overhead carried by the
use of arithmetic codes is negligible, whereas the compression penalty paid by assuming
geometrically distributed fields may be more significant.
Goby offers several compression modes. In the fastest one, general compression methods are used over the serialized data, yielding the worst compression ratios. In the slower
modes a compression technique denoted by the authors as Arithmetic Coding and Template
(ACT) is used. In this case all the fields are converted to a list of integers and each list is
encoded independently using arithmetic codes. To our knowledge, no context is used when
compressing these lists. Moreover, we believe that the conversion to integer lists, although it makes the algorithm more scalable and schema-independent, damages the compression, as it does not exploit possible models or correlations between neighboring integers. In the
slowest mode, some inter-list modeling is performed over the integer lists to aid compression. The improvement upon CRAM is that no assumptions are made about the distribution
of the fields. Instead, the code learns the distribution of the integers within each list as it
compresses.
As a more data-specific compressor, Quip uses a different arithmetic coder for each field. However, it does not use any context when compressing the data (thus implicitly assuming independent and identically distributed data), and it treats each field independently. SamComp, on the other hand, makes extensive use of contexts and thus does not assume the independence of the data. The results of [17] show that the use of contexts aids compression significantly. The aforementioned proposals, with the exception of SamComp, lack accurate data models for the reads and make extensive use of generic compression methods (e.g., byte-oriented compressors), relying on models that do not assume
or exploit the possible dependencies across different fields.
In the present chapter, we show the importance of data modeling when compressing
aligned reads. We show that, by generating data models that are particularly designed for
the aligned data, we can improve the compression ratios achieved by the previous algorithms. Moreover, we show that the Pareto-optimal barrier for compression rate and speed claimed in [17] does not hold in general. This fact becomes more notable when
compressing aligned reads from high coverage data sets.
Finally, as the proposed scheme is envisaged as being part of a more general toolkit, it is
important to aim for a compression scheme that aids the future utilization of the compressed
data in downstream applications, as proposed in [50] and [51]. Thus, relevant information,
such as the number of Single Nucleotide Polymorphisms (SNPs) and their position within
the read, should be easily accessible from the compressed data. Quip and SamComp seem
to lack this important feature, as they perform, as part of the read compression, base-pair-by-base-pair modeling and compression. The main drawback of this approach is that in order
to find variations of a specific read and the positions of the reference where the variations
occur, one must first reconstruct the entire read to then extract the information. This can be
computationally intensive for the downstream applications. In this chapter, we show that
the data can be compressed in a way that could enable downstream applications to rapidly
extract relevant information from the compressed files, not only without compromising the
compression ratio, but actually improving it.
3.2 Proposed Compression Scheme
As one of the aims of the proposed compression scheme is to make it easier for downstream applications to work over the compressed data, we decompose the aligned information regarding each read into different lists, and then compress each of these lists using specific models. All operations (splitting the read into the different lists, computing the corresponding models, and compressing the values in the lists) are done read by read, thus performing only
one pass over the data and generating the compressed file as we read the data. Next we
show the different lists in which the read information is split and introduce the models used
to compress each of them.
For each read in the SAM file, we generate the following lists:
• List F: We store a single bit that indicates the strand of the read.
• List P: We store an integer that indicates the position where the read maps in the
reference.
• List M: We store a single bit that indicates if the read maps perfectly to the reference
or not. We extract this information from both the CIGAR field and the MD auxiliary
field4 . If the latter is not available, the corresponding mismatch information can be
directly extracted by loading the reference and comparing the read to it.
• List S: In case of a non-perfect match, if no indels (insertions and/or deletions) occur, we store the number of SNPs. We store a 0 otherwise.
• List I: If at least one indel occurs we store the number of insertions, deletions, and
substitutions (in the following we denote them as variations) occurring within the
read.
• List V: For each of the variations (note that each read may have multiple variations),
we store an integer that indicates the position where it occurs within the read.
• List C: Finally, for each insertion and substitution, we store the corresponding new
base pair, together with the one of the reference in case of a substitution.
The information contained in the above lists suffices to reconstruct the reads, assuming
the reference is available for decompression.
Note that the amount of information we store per read depends on the quality of the
mapping. For example,
- If a read maps perfectly to the reference we store just a 1 in M.
- If it only has SNPs, we store a 0 in M, the number of SNPs in S and their positions
and base pairs in V and C, respectively.
- If it does have insertions and/or deletions, we store a 0 in M, a 0 in S, the number
of insertions, deletions and SNPs in I, and their respective positions and base pairs in V and C, respectively.
4 The CIGAR is a string that contains information regarding the mismatches between the read and the region of the reference where it maps to. The MD field is a string used to indicate the different SNPs that occur in the read.
It can be verified that this transformation of the data is information lossless and thus no
compression capabilities are lost.
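To make the decomposition concrete, the sketch below (hypothetical code that operates on an already-parsed read record rather than on a raw SAM line) appends the information of one read to the lists F, P, M, S, I, V and C following the three cases above:

def add_read_to_lists(lists, strand, map_pos, variations):
    """Append one aligned read to the lists F, P, M, S, I, V, C.

    `strand` is 0/1, `map_pos` is the mapping position in the reference, and
    `variations` is a list of tuples (kind, pos_in_read, base, ref_base) with
    kind in {'S', 'I', 'D'} (substitution, insertion, deletion), assumed to
    have been extracted from the CIGAR/MD fields beforehand.
    """
    lists['F'].append(strand)
    lists['P'].append(map_pos)
    if not variations:                      # perfect match
        lists['M'].append(1)
        return
    lists['M'].append(0)
    has_indel = any(kind in ('I', 'D') for kind, *_ in variations)
    if not has_indel:                       # only SNPs
        lists['S'].append(len(variations))
    else:                                   # at least one indel
        lists['S'].append(0)
        lists['I'].append((sum(k == 'I' for k, *_ in variations),
                           sum(k == 'D' for k, *_ in variations),
                           sum(k == 'S' for k, *_ in variations)))
    for kind, pos, base, ref_base in variations:
        lists['V'].append(pos)
        if kind == 'S':
            lists['C'].append((base, ref_base))   # new base plus reference base
        elif kind == 'I':
            lists['C'].append(base)               # inserted base only


if __name__ == '__main__':
    lists = {name: [] for name in 'FPMSIVC'}
    add_read_to_lists(lists, 0, 10500, [])                                    # perfect
    add_read_to_lists(lists, 1, 10512, [('S', 34, 'T', 'C')])                 # one SNP
    add_read_to_lists(lists, 0, 10530, [('I', 7, 'A', None), ('S', 80, 'G', 'T')])
    print(lists)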
Next we describe the model used for each of these lists that will be fed to the arithmetic
encoder. Recall that every list is compressed in a sequential manner, that is, at every step
we compress the next symbol using the model computed from the previous symbols. The
computation of the model for a symbol that has not appeared yet is the well-known zero-frequency problem. We use different solutions to this problem for each of the lists.
1) Modeling of F
We use a binary model with no context, as empirical calculations suggest that the directions of the strands are almost independent and identically distributed.
2) Modeling of P
The main challenge in compression of this list is that the alphabet of the data is unknown, very large and almost uniformly distributed. To address this challenge, since the
positions are ordered, we first transform the original positions into gaps between consecutive positions (delta encoding). This transformation considerably reduces the size of the
alphabet and reshapes the distribution, strongly biasing it towards low numbers. This technique is also used in the previously proposed algorithms.
Regarding the unknown alphabet problem, the only information that is available beforehand is that the size of the alphabet is upper bounded by the length of the largest chromosome, which for example for humans is 3 · 10^6. Therefore, a uniform initialization of the model to avoid the zero-probability problem is prohibitive. Several solutions have been proposed to address this problem, such as the use of Golomb codes (which model the data with a geometric distribution) or the use of byte-oriented arithmetic encoders, which use an independent model for each of the 4 bytes that form the integer. However, since the distribution is not truly geometric and a byte-oriented arithmetic encoder has a non-negligible overhead, we propose a different approach.
We use a modified order-0 Prediction by Partial Matching (PPM) [53] scheme to model the data. At the beginning, the alphabet of the model is composed solely of an escape symbol e. When a new symbol appears, the encoder looks it up in the model and, if it cannot
find the symbol, it emits an escape symbol. Then the new symbol is stored in a separate
list, and the model is updated to contain the new symbol. In this way, the encoder works well both on low-coverage data sets, where many new symbols are expected, and on high-coverage data sets, where the alphabet is reduced dramatically.
We also use this list to indicate changes of chromosome and the end of the file.
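A minimal sketch of this escape mechanism follows (illustrative only: ideal code lengths are reported instead of actual arithmetic-coded output, and the 32-bit cost assumed for storing a new gap in the separate list is an arbitrary choice of the sketch):

import math

class EscapeModel:
    """Modified order-0 model for the position gaps: the alphabet initially
    contains only the escape symbol; an unseen gap costs one escape plus its
    raw storage in a separate list, while repeated gaps become cheap."""
    ESC = object()

    def __init__(self):
        self.counts = {self.ESC: 1}
        self.raw = []                  # new symbols stored in a separate list

    def encode(self, gap):
        total = sum(self.counts.values())
        if gap in self.counts:
            bits = -math.log2(self.counts[gap] / total)
        else:
            # emit the escape symbol, then store the new gap in the raw list
            bits = -math.log2(self.counts[self.ESC] / total) + 32
            self.raw.append(gap)
            self.counts[gap] = 0
        self.counts[gap] += 1          # adaptive update
        return bits


if __name__ == '__main__':
    model = EscapeModel()
    # In high coverage data the mapping-position gaps repeat a lot.
    gaps = [0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 2, 0]
    print(round(sum(model.encode(g) for g in gaps), 1), "bits")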
3) Modeling of M
To create the model we use the mapping position of the previous read (stored in P) and
its match value (stored in M) as context. The rationale behind this approach is that we
expect each region of the reference to be covered by several reads (especially for high-coverage data), and thus if the previous read mapped perfectly, it is likely that the next read will
also map perfectly if they start at close-by positions. The same intuition holds for the case
of reads that do not map perfectly. Finally, the matches are compressed using an arithmetic
encoder over a binary alphabet.
4) Compression of S
To model this list, we use the previously processed number of SNPs seen per read to
continuously update the model. However, due to the non-stationarity of the data, we update
the statistics often enough such that the local behavior of the data is reflected in the model.
5) Modeling of I
The modeling and compression of this list are done analogously to those of the previous list.
6) Modeling of V
The list of variation positions is, together with the mapping position list (P), the least compressible one, mainly due to the large number of elements it contains. In order to model
it we use several contexts. First, we generate a global vector that indicates, for each position
of the genome, if a variation has previously occurred. Since we know in which region of the
reference the read maps, we can use the information stored in the aforementioned vector
to know which variations have previously occurred in that region, and use it as context
to estimate the position of the next variation. We also use the position of the previous
variation within the read (in case of no previous variations a 0 is used), and the direction of
the strand of the read (stored in F), as contexts. Finally, note that if the read is of length L
and the previous variation has occurred at position L − 10, the current variation must occur between L − 9 and L. Therefore, the model updates its probabilities accordingly to assign a non-zero value only to these numbers.
The purpose of this model is to use the information of the previous variations to estimate where the current variations are going to occur. This works especially well for high
coverage data, as one can expect several reads mapping to the same region of the reference,
and thus having similar, if not the same, variations.
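The sketch below is a simplified stand-in for this model (the 4-to-1 weighting of previously seen variation positions is an arbitrary choice of the sketch): positions at or before the previous variation receive zero probability, and positions where the global vector indicates an earlier variation are up-weighted, which is what makes repeated variations cheap in high coverage data:

import math

def variation_position_bits(pos_in_read, prev_pos, read_len, ref_start, seen):
    """Ideal code length of one variation position within a read.

    `seen` is a global boolean vector over the reference marking positions
    where a variation has previously occurred; `ref_start` is where the read
    maps. Positions up to `prev_pos` are impossible, so they get zero mass.
    """
    weights = []
    for i in range(read_len):
        if i <= prev_pos:
            weights.append(0.0)                   # already passed in this read
        elif seen[ref_start + i]:
            weights.append(4.0)                   # a variation was seen here before
        else:
            weights.append(1.0)
    prob = weights[pos_in_read] / sum(weights)
    seen[ref_start + pos_in_read] = True          # update the global context
    return -math.log2(prob)


if __name__ == '__main__':
    seen = [False] * 1000
    # Two overlapping reads of length 100 carrying the same SNP at reference 250.
    b1 = variation_position_bits(50, -1, 100, 200, seen)   # first read: 6.64 bits
    b2 = variation_position_bits(40, -1, 100, 210, seen)   # same SNP: 4.69 bits
    print(round(b1, 2), round(b2, 2))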
7) Compression of C
Finally, for the compression of the base-pairs, we distinguish between insertions and
SNPs. Moreover, within the SNPs, we use a different model depending on the base-pair of
the reference that is to be modified. That is, given a SNP, we use the model associated with
the base-pair in the reference. This choice of the model comes from the observation that
the probability of having a SNP between non-conjugated base-pairs is higher than between
conjugated base-pairs.
The models described above for the different lists use previously seen information (not
restricted to the same list) to estimate the next values. These models rely on the assumption
that the reads contain redundant information, a fact particularly apparent in high coverage
data sets. We show in the simulation results that the compression ratio improvements with
respect to the previously proposed algorithms increase with the coverage of the data, as one
may have expected. This is an advantage of the proposed algorithm in light of the continued
drop in sequencing cost and the increased throughput of NGS technologies, which
will boost the generation of high coverage data sets.
3.2.1 Data
In order to assess the performance of the proposed algorithm and compare it with the previously proposed ones, we consider the raw sequencing data shown in Table 3.1. To generate
Low Coverage Data Sets

Name          Reference    Mapped Reads^a   Read Length   Coverage   Size [GB]
SRR062634_1   hg19         23.3 M           100           0.26×      6.9
SRR027520_1   hg19         20.4 M           76            0.17×      5.1
SRR027520_2   hg19         20.4 M           76            0.17×      5.1
SRR032209     mm GRCm38    13.5 M           36            0.17×      2.7
SRR043366_2   hg19         13.5 M           76            0.11×      3.4
SRR013951_2   hg19         9.5 M            76            0.08×      2.4
SRR005729_1   hg19         9.4 M            76            0.08×      2.5

High Coverage Data Sets

Name          Reference    Mapped Reads^a   Read Length   Coverage   Size [GB]
SRR065390_1   ce ws235     31.6 M           100           32×        9.5
SRR065390_2   ce ws235     31.2 M           100           32×        9.4
ERR262997_2   hg19         410.5 M          101           14×        122
ERR262996_1   hg19         272.8 M          101           10×        81

Table 3.1: Data sets used for the assessment of the proposed algorithm. The alignment program used to generate the SAM files is Bowtie2. The data sets are divided in two ensembles, low coverage data sets and high coverage data sets. The size corresponds to the SAM file.
^a M stands for millions.
the aligned data (SAM files), we used the alignment program Bowtie2⁵ [54]. For each of
the data sets we specify the reference used to perform the alignment (which belongs to the
same species as the data set under consideration), the number of reads that mapped to the
reference after the alignment, the read length, the coverage, and the size of the SAM file.
Recall that the SAM files contain only the reads that mapped to the reference. The references hg19, mm GRCm38 and ce ws235 belong to the H. Sapiens, the M. Musculus and the
C. Elegans species, respectively.
As shown in the table, we have divided the data sets into two ensembles, the low coverage data ensemble and the high coverage data ensemble.
⁵ Note that any other alignment program could have been used for this purpose. We chose Bowtie2 because it was the one employed by [17] to perform the alignments.
The low coverage data ensemble is formed by human (H. Sapiens) and mouse (M.
Musculus) sequencing data of coverage less than 1×. We chose these human data sets
because they were used in the previously proposed algorithms to assess their performances.
Although these data sets are easy to handle and fast to compress due to their small size (of
the order of a few GB), we believe that they do not represent the majority of the sequencing
data used by researchers and institutions. Furthermore, in order to analyze the performance
of the different compression algorithms, it is important to consider data sets of different
characteristics. With this in mind, in this chapter we also consider data sets of coverage up
to 32×, which are presented next.
The high coverage data ensemble is composed of sequencing data from two H. Sapiens and two C. Elegans data sets (the ERR and SRR data sets, respectively). The C. Elegans data sets were also used in previous publications. However, note that the total size of these files is still on the order of a few GB, as the C. Elegans genome is significantly smaller than that of the H. Sapiens.
All the data sets have been retrieved from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/).
3.2.2 Machine specifications
The machine used to perform the experiments has the following specifications: 39 GB RAM, Intel Core i7-930 CPU at 2.80 GHz × 8, and Ubuntu 12.04 LTS.
3.3 Results
Next we show the performance of the proposed compression method when applied to the
data sets shown in Table 3.1, and compare it with the previously proposed algorithms. As
mentioned above, we divide the results into two categories, namely, the low coverage data
sets and the high coverage ones. In the following, we express the gain in compression ratio
of an algorithm A with respect to another algorithm B as gain = 1 − size(A)/size(B). For example, a reduction from 100 MB to 80 MB represents a 20% gain (improvement).
Note that with this metric a 0% gain means the file size remains the same, a 100% improvement is not possible, as this would mean the new file is of size 0, and a negative value means that the new file is larger.
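To make the metric concrete, a short check with made-up file sizes:

```python
def gain(size_a, size_b):
    """Compression gain of algorithm A with respect to algorithm B."""
    return 1 - size_a / size_b

print(gain(80, 100))   # 0.20 -> 20% gain, as in the example above
print(gain(100, 100))  # 0.00 -> the file size remains the same
print(gain(120, 100))  # -0.20 -> the new file is larger
```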
3.3.1 Low coverage data sets
We start by considering the low coverage data sets introduced in Table 3.1 to assess the
performance of the proposed method. Figure 3.1 shows the compression ratio, in bits
per base pair, of the different algorithms proposed in the literature. For completeness,
we also show the performance of Fastqz when it uses a reference [17]. However, note
that the comparison is not strictly fair, as Fastqz performs its own fast alignment before
compression, instead of using the alignment information provided in the SAM file. The
data sets shown in the figure are ordered, from left to right, starting from the data set with the lowest coverage (SRR005729_1 with a coverage of 0.08×) and finishing with the highest (SRR062634_1 with a coverage of 0.26×).
Note that the results shown in the figure refer to the compression of the reads. For
Fastqz, Quip and SamComp this value can be computed exactly, whereas for Goby and
CRAM only an approximate value can be computed.
The results show that the compression ratio of Goby and Quip is similar, each of them
outperforming the other in at least one data set, but worse than that of CRAM. SamComp,
on the other hand, outperforms the other algorithms in the cases considered. Moreover,
as shown in the figure, the proposed method outperforms all the previously proposed
algorithms in all the data sets. The gain with respect to SamComp varies from 0.3% to 8.6%, depending on the data set. These gains become higher when the performance is compared with that of the other algorithms. It is important to remark that these
results show the compression ratio (shown in the figure as bits/bp) of the aligned reads, and
it does not include the quality values and/or the identifiers.
Finally, we observe a somewhat expected relation between the compression ratio and
the coverage of each data set (see Table 3.1): the higher the coverage, the more compressible the reads.
Regarding the running time, SamComp offers the best compression times, ranging from
45 to 160 seconds, followed by the proposed method, which takes between 124 and 163
[Figure: compression ratio in bits/bp of the Proposed Method, SamComp, CRAM, Goby, Quip, and Fastqz on each low coverage SRR data set.]
Figure 3.1: Performance of the proposed method and the previously proposed algorithms
when compressing aligned reads of low coverage data sets.
seconds. Goby, CRAM and Quip need more time for compression. However, note that
they are compressing the whole SAM file. The slowest is Fastqz, mainly because it is
performing its own alignment. Regarding the decompression time, all the algorithms except
Fastqz employ similar times, ranging from 70 to 400 seconds.
Note that since all the algorithms are sequential (that is, they compress and decompress
read by read), the memory consumption of all of them is independent of the number of
reads to compress and/or decompress. For example, for the largest human dataset of the low
coverage ensemble, SRR062634 1, the algorithm with the lowest memory consumption on
the compression is the proposed method with 0.3 GB, followed by SamComp with 0.5 GB.
Fastqz and Quip use around 1.3 GB, Goby 3.2 GB, and CRAM 4 GB. On the other hand,
for decompression, SamComp is the one with the lowest memory consumption (300 MB),
followed by the proposed method with 500 MB. Fastqz uses 700 MB, Quip 1.5 GB, and Goby 4 GB⁷. Note that this comparison is not fully fair, as the other algorithms compress
and decompress the whole SAM file, while the proposed algorithm focuses on the reads.
3.3.2 High coverage data sets
For this ensemble we set Fastqz, Quip, Goby and CRAM aside, as they are outperformed by
SamComp and the proposed method in terms of compression ratio, and focus exclusively
on the latter two.
The performance of SamComp and the proposed method for the high coverage data sets
introduced in Table 3.1 is summarized in Table 3.2. As for the low coverage data sets, the proposed method outperforms SamComp in all the cases considered. Moreover, we observe that the gain in compression ratio with respect to SamComp increases as the coverage of the data sets increases. For example, for the 10× coverage data set (ERR262996_1) we obtain a compression gain of 10%, whereas for the 32× coverage data set (SRR065390_2), this gain is boosted up to 17%. Note that this gain translates directly into savings in the MB needed
to store the files.
Regarding the running time, we observe that both algorithms employ similar time for
compression, which is around 2 minutes for the C. Elegans data sets, and between 20
⁷ We do not specify the memory consumption of CRAM during decompression because we were unable to run it.
and 30 minutes for the H. Sapiens data sets. The decompression times are similar to the
compression times.
3.4 Discussion
Inspection of the empirical results of the previous section shows the superior performance
of the proposed scheme across a wide range of data sets, from very low coverage to high
coverage. Next we discuss these results in more detail.
3.4.1 Low Coverage Data Sets
As shown in Figure 3.1, SamComp and the proposed method clearly outperform the previously proposed algorithms in all cases. However, we should emphasize that while the
aim of SamComp and the proposed method is to achieve the maximum possible compression, the aim of CRAM, Goby and Quip is to provide a more general and robust platform
(or toolkit) for compression. Moreover, as mentioned before, the compression scheme of
Goby and CRAM facilitates the manipulation of the compressed data by the downstream
applications.
In this context, we believe that the compression method performed by SamComp and
Quip does not allow downstream applications to rapidly extract important information, as
they perform a base-by-base compression. Thus, in order to find variations in the data, one
must first reconstruct the whole read to then find the variations. On the other hand, we proposed a compression method that can considerably facilitate the downstream applications
when working in the compressed domain.
Regarding the compression ratio, the authors of [17] argued that a Pareto-optimal region had been reached in terms of compression ratio after the SeqSqueeze (http://www.sequencesqueeze.org) competition of 2012, in which SamComp was the winner in the SAM file compression category. As shown in Figure 3.1, although our algorithm
performs strictly better than SamComp in all the data sets of this ensemble, the gain is very
small. This fact could validate the Pareto-optimality claim made in [17]. However, as we discuss in the next section, for high coverage data sets the Pareto-optimal curve is not
achieved, as significant improvements in compression ratio are possible. We believe the
reason is that, in low coverage data sets, little information can be inferred from previously
seen reads when modeling the subsequent reads, as the overlap between reads is in general small or nonexistent. This, however, is far from the case in high coverage data sets.
3.4.2 High Coverage Data Sets
As outlined before, the proposed method demonstrates that significant improvements in
compression ratio are possible in high coverage data sets. Specifically, we showed in Table
3.2 performance gains varying from 10% to 17% with respect to SamComp, which achieves
the best compression ratio among the previously proposed algorithms.
These improvements in compression ratio are significant since, in the high coverage scenario, small gains in compression ratio can translate to huge savings in storage space. For example, a 10% improvement (e.g., from 0.18 bits/bp to 0.16 bits/bp) over human data with very high coverage (around 200×) corresponds to a saving of approximately 12 GB per file, with a corresponding reduction in the time required to transfer the data. Thus, when compressing many large high coverage data sets, such improvements would lead to several petabytes of storage savings.
Furthermore, as mentioned before, the proposed scheme generates compressed files
from which important information can be easily extracted, potentially allowing downstream
applications to work over the compressed data. This is an important feature, as with the
increase in the size of the sequencing data, the burden of compressing and decompressing
the files in order to manipulate and/or analyze specific parts in them becomes increasingly
acute.
3.5 Conclusion
We are currently in the $1000 genome era and, as such, a significant increase in the sequencing data being generated is expected in the near future. These files are also expected to grow
in size as the different Next Generation Sequencing (NGS) technologies improve. Therefore, there is a growing need for honing the capabilities of compressors for the aligned data.
Further, compression that facilitates downstream applications working in the compressed
domain is becoming of primary importance.
With this in mind, we developed a new compression method for aligned reads that outperforms, in compression ratio, the previously-proposed algorithms. Specifically, we show
that applying effective models to the aligned data can boost the compression, especially
when high coverage data sets are considered. These gains in compression ratio would
translate to huge savings in storage, thus also facilitating the transmission of genomic data
across researchers and institutions. Furthermore, we compress the data in a manner that allows downstream applications to work on the compressed domain, as relevant information
can easily be extracted from specific locations along the compressed files.
Finally, we envisage the methods shown in this chapter to be useful in the construction
of future compression programs that consider the compression not only of the aligned reads,
but also the quality scores and the identifiers.
Data Set      Coverage   Raw Size^a [MB]   SamComp: Size [MB] / bits/bp / C.T. [sec]   Proposed Method: Size [MB] / bits/bp / C.T. [sec]   Gain
ERR262996_1   10×        27,500            608.6 / 0.18 / 1,210                        548.7 / 0.16 / 1,092                                10%
ERR262997_2   14×        41,460            935 / 0.18 / 1,781                          827.3 / 0.16 / 1,585                                11.5%
SRR065390_1   32×        3,160             47.7 / 0.12 / 148                           40.5 / 0.10 / 104                                   15%
SRR065390_2   32×        3,110             55.2 / 0.14 / 140                           45.9 / 0.12 / 108                                   17%

Table 3.2: Compression results for the high coverage ensemble. The results in bold show the compression gain obtained by the proposed method with respect to SamComp. We use the International System of Units for the prefixes, that is, 1 MB stands for 10^6 Bytes. C.T. stands for compression time.
^a Raw size refers solely to the size of the mapped reads (1 Byte per base pair).
Chapter 4
Lossy compression of quality scores
In this chapter we focus on the compression of the quality scores present in the raw sequencing data (i.e., FASTQ and SAM files). As already mentioned in the introduction, quality
scores have proven to be more difficult to compress than the reads. There is also evidence
that quality scores are corrupted by some amount of noise introduced during sequencing
[17]. These features are well explained by imperfections in the base-calling algorithms
which estimate the probability that the corresponding nucleotide in the read is in error
[55]. Further, applications that operate on reads often make use of the quality scores in a
heuristic manner. This is particularly true for sequence alignment algorithms [11, 12] and
variant calling [56, 57]. Based on these observations, lossy (as opposed to lossless) compression of quality scores emerges as a natural candidate for significantly reducing storage
requirements while maintaining adequate performance of downstream applications.
In this chapter we introduce two different methods for lossy compression of the quality scores. The first method, QualComp, transforms the quality scores into Gaussian-distributed values and then uses rate-distortion theory to allocate the available bits. The second method, QVZ, assumes the quality scores are generated by a Markov model, and generates optimal quantizers based on the empirical distribution of the quality scores to be compressed. We describe both methods in detail, and analyze their performance in terms of rate-distortion.¹
¹ We refer the reader to Chapter 5 for an extensive analysis of the effect that lossy compression of the quality scores has on variant calling.
4.1 Survey of lossy compressors for quality scores
Lossy compression for quality scores has recently started to be explored. Slimgene [58] fits
fixed-order Markov encodings for the differences between adjacent quality scores and compresses the prediction using a Huffman code (ignoring whether or not there are prediction
errors). Q-Scores Archiver [59] quantizes quality scores via several steps of transformations, and then compresses the lossy data using an entropy encoder.
Fastqz [17] uses a fixed-length code which represents quality scores above 30 using a
specific byte pattern and quantizes all lower quality scores to 2. Scalce [60] first calculates
the frequencies of different quality scores in a subset of the reads of a FASTQ file. Then the
quality scores that achieve local maxima in frequency are determined. Anytime these local
maximum values appear in the FASTQ file, the neighboring values are shifted to within a
small offset of the local maximum, thereby reducing the variance in quality scores. The
result is compressed using an arithmetic encoder.
BEETL [61] first applies the Burrows-Wheeler Transform (BWT) to reads and uses the
same transformation on the quality scores. Then, the nucleotide suffixes generated by the
BWT are scanned. Groups of suffixes that start with the same k bases while also sharing a
prefix of at least k bases are found. All of the quality scores for the group are converted to a
mean quality value, taken within the group or across all the groups. RQS/QUARTZ [23, 62]
first generates off-line a dictionary of commonly occurring k-mers throughout a population-sized read dataset of the species under consideration. It then computes the divergence of
the k-mers within each read to the dictionary, and uses that information to decide whether
to preserve or discard the corresponding quality scores.
PBlock [21] allows the user to determine a threshold for the maximum per-symbol
distortion. The first quality score in the file is chosen as the first ‘representative’. Quality
scores are then quantized symbol-by-symbol to the representative if the resulting distortion
would fall within the threshold. If the threshold is exceeded, the new quality score takes
the place of the representative and the process continues. The algorithm keeps track of
the representatives and run-lengths, which are compressed losslessly at the end. RBlock
[21] uses the same process, but the threshold instead sets the maximum allowable ratio of
any quality score to its representative as well as the maximum value of the reciprocal of
this ratio. [21] also compared the performance of existing lossy compression schemes for
different distortion measures.
Finally, Illumina proposed a new binning scheme that reduces the alphabet size of the
quality scores by applying an 8 level mapping (see Table 4.1). This binning scheme has
been implemented in the state-of-the-art compression tools CRAM [50] and DSRC2 [63].
Quality Score Bins   New Quality Score
N (no call)          N (no call)
2–9                  6
10–19                15
20–24                22
25–29                27
30–34                33
35–39                37
≥ 40                 40

Table 4.1: Illumina's proposed 8-level mapping.
4.2 Proposed Methods
In this section we first formalize the problem of lossy compression of quality scores, and
then describe the proposed methods QualComp and QVZ.
We consider the compression of the quality score sequences present in the genomic data. Each sequence consists of ASCII characters representing the scores, belonging to an alphabet $\mathcal{Q}$, for example $\mathcal{Q} = [33:73]$. These quality score sequences are extracted from the genomic file (e.g., FASTQ and SAM files) prior to compression. We denote the total number of sequences by $N$, and assume all the sequences are of the same length $n$. The quality score sequences are then denoted by $\{Q_i\}_{i=1}^N$, where $Q_i = [Q_{i,1}, \ldots, Q_{i,n}]$.
The goal is to design an encoder-decoder pair that describes the quality score vectors using only some amount of bits, while minimizing a given distortion $D$ between the original vectors $\{Q_i\}_{i=1}^N$ and the reconstructed vectors $\{\hat{Q}_i\}_{i=1}^N$. More specifically, we consider that each $Q_i$ is compressed using at most $nR$ bits, where $R$ denotes the rate (bits per quality score), and that the distortion $D$ is computed as the average distortion of each of the vectors, i.e., $D = \frac{1}{N}\sum_{i=1}^N D(i)$. Further, we consider a distortion function $d: (\mathcal{Q}, \hat{\mathcal{Q}}) \to \mathbb{R}^+$, which operates symbol by symbol (as opposed to block by block), so that $D(i) = d(Q_i, \hat{Q}_i) = \frac{1}{n}\sum_{j=1}^n d(Q_{i,j}, \hat{Q}_{i,j})$. Thus we can model the encoder-decoder pair as a rate-distortion scheme of rate $R$, where the encoder is described by the mapping $f_n: Q_i \to \{1, 2, \ldots, 2^{nR}\}$, which represents the compressed version of the vector $Q_i$ of length $n$ using $nR$ bits, and the decoder is described by the mapping $g_n: \{1, 2, \ldots, 2^{nR}\} \to \hat{Q}_i$, where $\hat{Q}_i = g_n(f_n(Q_i))$ denotes the reconstructed sequence.
4.2.1 QualComp
The proposed lossy compressor QualComp aims to minimize the Mean Squared Error
(MSE) between the original quality scores and the reconstructed ones, for a given rate R
specified by the user. That is, $D(i) = d(Q_i, \hat{Q}_i) = \frac{1}{n}\sum_{j=1}^n d(Q_{i,j}, \hat{Q}_{i,j}) = \frac{1}{n}\sum_{j=1}^n (Q_{i,j} - \hat{Q}_{i,j})^2$. The design of QualComp is guided by some results from rate-distortion theory. For
a detailed description of rate-distortion theory and proofs, please refer to [64]. We are interested in the following result:

Theorem 1: For an i.i.d. Gaussian vector source $X \sim \mathcal{N}(\mu_X, \Sigma_X)$, with $\Sigma_X = \mathrm{diag}[\sigma_1^2, \ldots, \sigma_n^2]$ (i.e., independent components), the optimal allocation of $nR$ bits that minimizes the MSE is given as the solution to the following optimization problem:

$$\min_{\rho = [\rho_1, \cdots, \rho_n]} \; \frac{1}{n} \sum_{j=1}^n \sigma_j^2\, 2^{-2\rho_j} \qquad (4.1)$$
$$\text{s.t.} \quad \sum_{j=1}^n \rho_j \le nR, \qquad (4.2)$$

where $\rho_j$ denotes the number of bits allocated to the $j$th component of $X$.
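The optimization in (4.1)–(4.2) is the classical reverse water-filling problem. The sketch below is only an illustration of one standard way to solve the relaxed (real-valued, non-negative) allocation numerically, not the routine used by QualComp:

```python
import numpy as np

def allocate_bits(variances, total_bits, tol=1e-9):
    """Reverse water-filling: rho_j = max(0, 0.5*log2(sigma_j^2 / lam)),
    with the water level lam chosen so that sum(rho_j) ~= total_bits."""
    variances = np.asarray(variances, dtype=float)

    def bits_for(lam):
        return np.maximum(0.0, 0.5 * np.log2(variances / lam)).sum()

    # Bisection on lam: bits_for() is decreasing in lam.
    lo, hi = 1e-12, variances.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bits_for(mid) > total_bits:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return np.maximum(0.0, 0.5 * np.log2(variances / lam))

# Example: allocate nR = 4 bits across components with unequal variances.
rho = allocate_bits([9.0, 4.0, 1.0, 0.25], total_bits=4.0)
print(rho, rho.sum())   # more bits go to the higher-variance components
```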
Next we describe how we use this result in the design of QualComp. In real data quality
scores take integer values in a finite alphabet $\mathcal{Q}$, but for the purpose of modeling we assume $\mathcal{Q} = \mathbb{R}$ (the set of real numbers). Although the quality scores of different reads may be
correlated, we model correlations only within a read, and consider quality scores across
different reads to be independent. Thus we assume that each quality score vector $Q_i$ is independent and identically distributed (i.i.d.) according to $P_Q$.
To the best of our knowledge, there are no known statistics of the quality score vectors.
However, given a vector source with a particular covariance matrix, the multivariate Gaussian is the least compressible. Furthermore, compression/coding schemes designed on the
basis of the Gaussian assumption, i.e., the worst-case distribution for compression, will also be good for
non-Gaussian sources, as long as the mean and the covariance matrix remain unchanged
[65]. Guided by this observation, we model the quality scores as being jointly Gaussian
with the same mean and covariance matrix, i.e., $P_Q \sim \mathcal{N}(\mu_Q, \Sigma_Q)$, where $\mu_Q$ and $\Sigma_Q$ are empirically computed from the set of vectors $\{Q_i\}_{i=1}^N$. Due to the correlation of quality scores within a read, $\Sigma_Q$ is not in general a diagonal matrix. Thus to apply the theorem we need to decorrelate the quality score vectors.
In order to decorrelate the quality score vectors, we first perform the singular value decomposition (SVD) of the matrix $\Sigma_Q$. This allows us to express $\Sigma_Q$ as $\Sigma_Q = V S V^T$, where $V$ is a unitary matrix that satisfies $V V^T = I$ and $S$ is a diagonal matrix whose diagonal entries $s_{jj}$, for $j \in [1:n]$, are known as the singular values of $\Sigma_Q$. We then generate a new set of vectors $\{Q'_i\}_{i=1}^N$ by performing the operation $Q'_i = V^T(Q_i - \mu_Q)$ for all $i$. This transformation, due to the Gaussian nature of the quality score vectors, makes the components of each $Q'_i$ independent and distributed as $\mathcal{N}(0, s_{jj})$, for $j \in [1:n]$, since $Q'_i \sim \mathcal{N}(0, S)$. This property allows us to use the result of Theorem 1. The number of bits allotted per quality score vector, $nR$, is a user-specified parameter. Thus we can formulate the bit allocation problem for minimizing the MSE as a convex optimization problem, and solve it exactly. That is, given a budget of $nR$ bits per vector, we allocate the bits by first transforming each $Q_i$ into $Q'_i$, for $i \in [1:N]$, and then allocating bits to the independent components of $Q'_i$ in order to minimize the MSE, by solving the following optimization problem:

$$\min_{\rho = [\rho_1, \cdots, \rho_n]} \; \frac{1}{n} \sum_{j=1}^n \sigma_j^2\, 2^{-2\rho_j} \qquad (4.3)$$
$$\text{s.t.} \quad \sum_{j=1}^n \rho_j \le nR, \qquad (4.4)$$

where $\rho_j$ represents the number of bits allocated to the $j$th position of $Q'_i$, for $i \in [1:N]$, i.e., the allocation of bits is the same for all the quality score vectors and thus the
optimization problem has to be solved only once. Ideally, this allocation should be done by vector quantization, i.e., by applying a vector quantizer with $N\rho_j$ bits to $\{Q'_{i,j}\}_{i=1}^N$, for $j \in [1:n]$. However, due to ease of implementation and negligible performance loss, we use a scalar quantizer. Thus for all $i \in [1:N]$, each component $Q'_{i,j}$, for $j \in [1:n]$, is normalized to a unit-variance Gaussian and then mapped to decision regions representable in $\rho_j$ bits. For this we need $\rho_j$ to be an integer. However, this will not be the case in general, so we randomly map each $\rho_j$ to $\rho'_j$, which is given by either the closest integer from below or from above, so that the average of $\rho'_j$ and $\rho_j$ coincide. In order to ensure the decoder gets the same value of $\rho'_j$, the same pseudorandom generator is used in both functions. The decision regions that minimize the MSE for different values of $\rho$ and their representative values are found offline by a Lloyd-Max procedure [66] on a scalar Gaussian distribution with mean zero and variance one. For example, for $\rho = 1$ we have $2^1$ decision regions, which correspond to values below zero (decision region 0) and above zero (decision region 1), with corresponding representative values $-0.7978$ and $+0.7978$. Therefore, if we were to encode the value $-0.344$ with one bit, we would encode it as a '0', and the decoder would decode it as $-0.7978$. The decoder, to reconstruct the new quality scores $\{\hat{Q}_i\}_{i=1}^N$, performs the operations complementary to those done by the encoder. The decoder constructs $\mathrm{round}(V Q' + \mu_Q)$ and replaces the quality scores corresponding to an unknown base pair (given by the character 'N') by the least reliable quality score.
The steps followed by the encoder are the following:

Encoding the quality scores of a FASTQ file using nR bits per sequence
Precompute:
1. Extract the quality score vectors $\{Q_i\}_{i=1}^N$ of length $n$ from the FASTQ file.
2. Compute $\mu_Q$ and $\Sigma_Q$ empirically from $\{Q_i\}_{i=1}^N$.
4. Compute the SVD: $\Sigma_Q = V S V^T$.
5. Given $S$ and the parameter $nR$, solve for the optimal $\rho = [\rho_1, \ldots, \rho_n]$ that minimizes the MSE.
For $i = 1$ to $N$:
1. $Q'_i = V^T(Q_i - \mu_Q)$.
2. For $j = 1$ to $n$:
2.1. $Q''_{i,j} = Q'_{i,j} / \sqrt{s_{jj}}$.
2.2. Randomly generate the integer $\rho'_j$ from $\rho_j$.
2.3. Map $Q''_{i,j}$ into its corresponding decision region.
2.4. Encode the decision region using $\rho'_j$ bits and write the bits to file.
Notice that the final size is given by nRN plus an overhead to specify the mean and
covariance of the quality scores, the length n and the number of sequences N . This can be
further reduced by performing a lossless compression using a standard universal entropy
code.
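As an illustration of the encoder loop above (not the actual QualComp implementation; the function names, the uniform stand-in quantizer, and the dithered rounding of the allocation are simplifications of this sketch), the transform-and-quantize pass could look as follows:

```python
import numpy as np

def qualcomp_encode_sketch(Q, rho, rng_seed=0):
    """Toy version of the QualComp encoder loop.
    Q:   N x n matrix of quality scores (as floats).
    rho: length-n vector of (possibly fractional) bits per position.
    Returns, per read, the list of decision-region indices."""
    rng = np.random.default_rng(rng_seed)          # decoder must use the same seed
    mu = Q.mean(axis=0)
    Sigma = np.cov(Q, rowvar=False)
    V, S, _ = np.linalg.svd(Sigma)                 # Sigma = V diag(S) V^T (Sigma is PSD)
    s = np.sqrt(np.maximum(S, 1e-12))

    encoded = []
    for q in Q:
        qp = V.T @ (q - mu)                        # decorrelate
        qpp = qp / s                               # normalize to unit variance
        # Randomly round each rho_j so that its average matches rho_j.
        rho_int = np.floor(rho) + (rng.random(len(rho)) < (rho - np.floor(rho)))
        regions = []
        for x, b in zip(qpp, rho_int.astype(int)):
            if b == 0:
                regions.append(0)                  # no bits: a single region
            else:
                # A uniform quantizer on [-3, 3] stands in for the Lloyd-Max regions.
                edges = np.linspace(-3, 3, 2 ** b + 1)[1:-1]
                regions.append(int(np.searchsorted(edges, x)))
        encoded.append(regions)
    return encoded

# Example with random data: 100 reads of length 20, 0.5 bits per position.
Q = np.random.default_rng(1).integers(33, 74, size=(100, 20)).astype(float)
codes = qualcomp_encode_sketch(Q, rho=np.full(20, 0.5))
```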
Clustering
Since the algorithm is based on the statistics of the quality scores, better statistics would
give lower distortion. With that in mind, and to capture possible correlation between the
reads, we allow the user to first cluster the quality score vectors, and then perform the lossy
compression in each of the clusters separately.
The clustering is based on the k-means algorithm [67], and it is performed as follows.
For each of the clusters, we initialize a mean vector V of length n, with the same value at
each position. The values are chosen to be equally spaced between the minimum quality
score and the maximum. For example, if the quality scores go from 33 to 73 and there are
3 clusters, the mean vectors will be initialized as all 33’s, all 53’s, and all 73’s. Then, each
of the quality score vectors will be assigned to the cluster that minimizes the MSE with
respect to its mean vector $V$, i.e., to the cluster that minimizes $\frac{1}{n}\sum_{i=1}^n (Q(i) - V(i))^2$. After
assigning each quality score vector to a cluster, the mean vectors are updated by computing
the empirical mean of the quality score vectors assigned to the cluster. This process is
repeated until none of the quality score vectors is assigned to a different cluster, or until a
maximum number of iterations is reached.
Finally, notice that R = 0 is not the same as discarding the quality scores, since the
decoder will not assign the same value to all the reconstructed quality scores. Instead,
the reconstructed quality score vectors within a cluster will be the same, and equal to the
empirical mean of the original quality score vectors within the cluster, but each quality
score within the vector will in general be different.
4.2.2 QVZ
For the design of QVZ, we model the quality score sequence $Q = [Q_1, Q_2, \ldots, Q_n]$ by a Markov chain of order one: we assume the probability that $Q_j$ takes a particular value depends on previous values only through the value of $Q_{j-1}$. We further assume that the quality score sequences are independent and identically distributed (i.i.d.). We use a Markov model based on the observation that quality scores are highly correlated with their neighbors within a single sequence, and we refrain from using a higher-order Markov model to avoid the increased overhead and complexity this would produce within our algorithm.
The Markov model is defined by its transition probabilities $P(Q_j \mid Q_{j-1})$, for $j \in \{1, 2, \ldots, n\}$, where $P(Q_1 \mid Q_0) = P(Q_1)$. QVZ finds these probabilities empirically from the entire data
set to be compressed and uses them to design a codebook. The codebook is a set of quantizers indexed by position and previously quantized value (the context). These quantizers
are constructed using a variant of the Lloyd-Max algorithm [66], and are capable of minimizing any quasi-convex distortion chosen by the user (i.e., not necessarily MSE). After
quantization, a lossless, adaptive arithmetic encoder is applied to achieve entropy-rate compression.
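As a concrete (and simplified) illustration of how such per-position empirical transition probabilities can be gathered, assuming the quality scores are given as a matrix of integer values:

```python
import numpy as np

def empirical_transitions(Q, alphabet_size):
    """Estimate P(Q_1) and per-position transition probabilities P(Q_j | Q_{j-1})
    for an order-1 Markov model of the columns of Q (N reads x n positions)."""
    N, n = Q.shape
    p1 = np.bincount(Q[:, 0], minlength=alphabet_size) / N
    # trans[j] holds P(current value | previous value) at position j.
    trans = np.zeros((n, alphabet_size, alphabet_size))
    for j in range(1, n):
        np.add.at(trans[j], (Q[:, j - 1], Q[:, j]), 1)
        row_sums = trans[j].sum(axis=1, keepdims=True)
        trans[j] = np.divide(trans[j], row_sums, out=np.zeros_like(trans[j]),
                             where=row_sums > 0)
    return p1, trans

# Example with random scores in {0, ..., 40}.
Q = np.random.default_rng(0).integers(0, 41, size=(1000, 50))
p1, trans = empirical_transitions(Q, alphabet_size=41)
```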
In summary, the steps taken by QVZ are:
1. Compute the empirical transition probabilities of a Markov-1 Model from the data.
2. Construct a codebook (section 4.2.2) using the Lloyd-Max algorithm (section 4.2.2).
3. Quantize the input using the codebook and run the arithmetic encoder over the result
(section 4.2.2).
Lloyd-Max quantizer
Given a random variable $X$ governed by the probability mass function $P(\cdot)$ over the alphabet $\mathcal{X}$ of size $K$, let $D \in \mathbb{R}^{K \times K}$ be a distortion matrix where each entry $D_{x,y} = d(x, y)$ is the penalty for reconstructing symbol $x$ as $y$. We further define $\mathcal{Y}$ to be the alphabet of the quantized values, of size $M \le K$.
Thus, a Lloyd-Max quantizer, denoted hereafter as $LM(\cdot)$, is a mapping $\mathcal{X} \to \mathcal{Y}$ that minimizes an expected distortion. Specifically, the Lloyd-Max quantizer seeks to find a collection of boundary points $b_k \in \mathcal{X}$ and reconstruction points $y_k \in \mathcal{Y}$, where $k \in \{1, 2, \ldots, M\}$, such that the quantized value of symbol $x \in \mathcal{X}$ is given by the reconstruction point of the region to which it belongs (see Fig. 4.1). For region $k$, any $x \in \{b_{k-1}, \ldots, b_k - 1\}$ is mapped to $y_k$, with $b_0$ being the lowest score in the quality alphabet and $b_M$ the highest score plus one. Thus the Lloyd-Max quantizer aims to minimize the expected distortion by solving

$$\{b_k, y_k\}_{k=1}^M = \operatorname*{argmin}_{b_k, y_k} \; \sum_{j=1}^M \sum_{x=b_{j-1}}^{b_j - 1} P(x)\, d(x, y_j). \qquad (4.5)$$
[Figure: a pmf $P(X)$ over $\mathcal{X}$ partitioned by boundary points $b_0, b_1, b_2, b_3$, with reconstruction points $y_1, y_2, y_3$.]
Figure 4.1: Example of the boundary points and reconstruction points found by a Lloyd-Max quantizer, for M = 3.
In order to approximately solve Eq. (4.5), which is an integer programming problem,
we employ an algorithm which is initialized with uniformly spaced boundary values and
reconstruction points taken at the midpoints of these bins. For an arbitrary $D$ and $P(\cdot)$, this problem requires an exhaustive search. We assume that the distortion measure $d(x, y)$ is quasi-convex over $y$ with a minimum at $y = x$, i.e., when $x \le y_1 \le y_2$ or $y_2 \le y_1 \le x$, $d(x, y_1) \le d(x, y_2)$. If the distortion measure is quasi-convex, an exchange argument
suffices to show the optimality of contiguous quantization bins and a reconstruction point
within the bin. The following steps are iterated until convergence:
1. Solving for $y_k$: We first minimize Eq. (4.5) partially over the reconstruction points given the boundary values. The reconstruction points are obtained as

$$y_k = \operatorname*{argmin}_{y \in \{b_{k-1}, \ldots, b_k - 1\}} \; \sum_{x=b_{k-1}}^{b_k - 1} P(x)\, d(x, y), \quad \forall\, k = 1, 2, \ldots, M. \qquad (4.6)$$

2. Solving for $b_k$: This step minimizes Eq. (4.5) partially over the boundary values given the reconstruction points. $b_k$ can range over $\{y_k + 1, \ldots, y_{k+1}\}$ and is chosen as the largest point where the distortion to the previous reconstruction value $y_k$ is smaller than the distortion to the next reconstruction value $y_{k+1}$, i.e.,

$$b_k = \max\Big\{ x \in \{y_k + 1, \ldots, y_{k+1}\} : P(x)\, d(x, y_k) \le P(x)\, d(x, y_{k+1}) \Big\}, \quad \forall\, k = 1, 2, \ldots, M - 1. \qquad (4.7)$$
Note that this algorithm, which is a variant of the Lloyd-Max quantizer, converges in at
most K steps.
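A compact sketch of this alternating minimization over a discrete alphabet is given below. It uses the standard "assign, then re-center" form of the Lloyd iteration, which is equivalent in spirit to the two steps above; it is an illustration only, not the exact QVZ routine.

```python
import numpy as np

def lloyd_max_discrete(pmf, M, d=lambda x, y: (x - y) ** 2, max_iter=50):
    """Discrete Lloyd-Max-style quantizer over the alphabet {0, ..., K-1}.
    Alternates (i) assigning each symbol to its best reconstruction point and
    (ii) re-optimizing each reconstruction point within its cell."""
    K = len(pmf)
    symbols = np.arange(K)
    # Initialization: reconstruction points spread uniformly over the alphabet.
    y = np.linspace(0, K - 1, M).round().astype(int)

    for _ in range(max_iter):
        # (i) Nearest-reconstruction assignment (ties go to the lower index).
        cost = np.array([[d(x, c) for c in y] for x in symbols])   # K x M
        assign = cost.argmin(axis=1)
        # (ii) Best reconstruction point inside each non-empty cell.
        new_y = y.copy()
        for k in range(M):
            cell = symbols[assign == k]
            if cell.size:
                new_y[k] = min(cell, key=lambda c: sum(pmf[x] * d(x, c) for x in cell))
        if np.array_equal(new_y, y):
            break
        y = np.sort(new_y)

    boundaries = [0] + [int(symbols[assign == k].max()) + 1
                        for k in range(M) if (assign == k).any()]
    return y, boundaries

# Example: a 41-symbol quality alphabet quantized into M = 4 regions under MSE.
pmf = np.random.default_rng(0).dirichlet(np.ones(41))
y, b = lloyd_max_discrete(pmf, M=4)
print("reconstruction points:", y)
print("boundaries:", b)
```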
Given a distortion matrix $D$, the defined Lloyd-Max quantizer depends on the number of regions $M$ and the input probability mass function $P(\cdot)$. Therefore we denote the Lloyd-Max quantizer with $M$ regions as $LM_M^P(\cdot)$, and the quantized value of a symbol $x \in \mathcal{X}$ as $LM_M^P(x)$.
An ideal lossless compressor applied to the quantized values can achieve a rate equal to the entropy of $LM_M^P(X)$, which we denote by $H(LM_M^P(X))$. For a fixed probability mass function $P(\cdot)$, the only varying parameter is the number of regions $M$. Since $M$ needs to be an integer, not all rates are achievable. Because we are interested in achieving an arbitrary rate $R$, we define an extended version of the LM quantizer, denoted as LME. The extended quantizer consists of two LM quantizers with the numbers of regions given by $\rho$ and $\rho + 1$, each of them used with probability $1 - r$ and $r$, respectively (where $0 \le r \le 1$). Specifically, $\rho$ is given by the maximum number of regions such that $H(LM_\rho^P(X)) < R$ (which implies $H(LM_{\rho+1}^P(X)) > R$). Then, the probability $r$ is chosen such that the average entropy (and hence the rate) is equal to $R$, the desired rate. More formally,

$$LME_R^P(x) = \begin{cases} LM_\rho^P(x), & \text{w.p. } 1 - r, \\ LM_{\rho+1}^P(x), & \text{w.p. } r, \end{cases}$$
$$\rho = \max\{x \in \{1, \ldots, K\} : H(LM_x^P(X)) \le R\}$$
$$r = \frac{R - H(LM_\rho^P(X))}{H(LM_{\rho+1}^P(X)) - H(LM_\rho^P(X))}. \qquad (4.8)$$
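For illustration, the choice of $\rho$ and $r$ in Eq. (4.8) can be computed directly from the entropies of the quantizer outputs, as in the sketch below. The entropies would come from the Lloyd-Max quantizers above; here they are passed in as a precomputed list, which is an assumption of this sketch.

```python
import random

def extended_lm_parameters(entropies, R):
    """entropies[m] = H(LM_m^P(X)) for m = 1, ..., K (entropies[0] unused).
    Returns (rho, r) such that (1 - r) * H_rho + r * H_{rho+1} = R."""
    # Largest number of regions whose output entropy does not exceed R.
    rho = max(m for m in range(1, len(entropies)) if entropies[m] <= R)
    if rho == len(entropies) - 1 or entropies[rho] == R:
        return rho, 0.0                  # R is achievable exactly with rho regions
    r = (R - entropies[rho]) / (entropies[rho + 1] - entropies[rho])
    return rho, r

def extended_lm_quantize(x, quantizers, rho, r, rng=random):
    """Apply LM_rho with probability 1 - r and LM_{rho+1} with probability r."""
    m = rho + 1 if rng.random() < r else rho
    return quantizers[m](x)

# Example with made-up entropies (in bits) for 1..5 regions.
H = [None, 0.0, 0.9, 1.5, 1.9, 2.2]
print(extended_lm_parameters(H, R=1.7))   # -> (3, 0.5): mix 3- and 4-region quantizers
```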
Codebook generation
Because we assume the data follows a Markov-1 model, for a given position $j \in \{1, \ldots, n\}$ we design as many quantizers $\hat{Q}_j^{\hat{q}}$ as there were unique possible quantized values $\hat{q}$ in the previous context $j - 1$. This collection of quantizers forms the codebook for QVZ. For an unquantized quality score $Q_j$ we denote the quantized version as $\hat{Q}_j$, so $\hat{Q} = [\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_n]$ is the random vector representing a quantized sequence. The quantizers are defined as

$$\hat{Q}_1 = LME^{P(Q_1)}_{\alpha H(Q_1)} \qquad (4.9)$$
$$\hat{Q}_j^{\hat{q}} = LME^{P(Q_j \mid \hat{Q}_{j-1} = \hat{q})}_{\alpha H(Q_j \mid \hat{Q}_{j-1} = \hat{q})}, \quad \text{for } j = 2, \ldots, n, \qquad (4.10)$$

where $\alpha \in [0, 1]$ is the desired compression factor: $\alpha = 0$ corresponds to zero-rate encoding, $\alpha = 1$ to lossless compression, and any value in between scales the input file size by that amount. Note that the entropies can be directly computed from the corresponding empirical probabilities.
Next we show how the probabilities needed for the LMEs are computed.
In order to compute the quantizers defined above, we require $P(Q_{j+1} \mid \hat{Q}_j)$, which must be computed from the empirical statistics $P(Q_{j+1} \mid Q_j)$ found earlier. The first step is to calculate $P(\hat{Q}_j \mid Q_j)$ recursively, and then to apply Bayes' rule and the Markov chain property to find the desired probability:

$$\begin{aligned}
P(\hat{Q}_j \mid Q_j) &= \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j, \hat{Q}_{j-1} \mid Q_j) \\
&= \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1}, Q_{j-1} \mid Q_j) \\
&= \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1} \mid Q_{j-1}, Q_j)\, P(Q_{j-1} \mid Q_j) \\
&= \sum_{\hat{Q}_{j-1}} P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1}) \sum_{Q_{j-1}} P(\hat{Q}_{j-1} \mid Q_{j-1})\, P(Q_{j-1} \mid Q_j) \qquad (4.11)
\end{aligned}$$

Eq. (4.11) follows from the fact that $\hat{Q}_{j-1} \leftrightarrow Q_{j-1} \leftrightarrow Q_j$ form a Markov chain. Additionally, $P(\hat{Q}_j \mid Q_j, \hat{Q}_{j-1} = \hat{q}) = P(\hat{Q}_j^{\hat{q}}(Q_j) = \hat{Q}_j)$, which is the probability that a specific quantizer produces $\hat{Q}_j$ given previous context $\hat{q}$. This can be found directly from $r$ (defined in Eq. (4.8)) and the possible values for $\hat{q}$. We now proceed to compute the required conditional probability as

$$\begin{aligned}
P(Q_{j+1} \mid \hat{Q}_j) &= \sum_{Q_j} P(Q_j \mid \hat{Q}_j)\, P(Q_{j+1} \mid Q_j, \hat{Q}_j) \\
&= \sum_{Q_j} P(Q_j \mid \hat{Q}_j)\, P(Q_{j+1} \mid Q_j) \qquad (4.12) \\
&= \frac{1}{P(\hat{Q}_j)} \sum_{Q_j} P(\hat{Q}_j \mid Q_j)\, P(Q_j, Q_{j+1}), \qquad (4.13)
\end{aligned}$$

where Eq. (4.12) follows from the same Markov chain as earlier. The terms in Eq. (4.13) are: i) $P(Q_j, Q_{j+1})$: the joint pmf, computed empirically from the data; ii) $P(\hat{Q}_j \mid Q_j)$: computed in Eq. (4.11); and iii) $P(\hat{Q}_j)$: a normalizing constant given by

$$P(\hat{Q}_j = \hat{q}) = \sum_{Q_j} P(\hat{Q}_j = \hat{q} \mid Q_j)\, P(Q_j).$$
The steps necessary to compute the codebook are summarized in Algorithm 1. Note
that support(Q) denotes the support of the random variable Q or the set of values that Q
takes with non-zero probability.
Algorithm 1 Generate codebook
Input: Transition probabilities $P(Q_j \mid Q_{j-1})$, compression factor $\alpha$
Output: Codebook: collection of quantizers $\{\hat{Q}_j^{\hat{q}}\}$
  $P \leftarrow P(Q_1)$
  Compute and store $\hat{Q}_1$ based on $P$ using Eq. (4.9)
  for all columns $j = 2$ to $n$ do
    Compute $P(\hat{Q}_{j-1} \mid Q_{j-1} = q)$ $\forall q \in \mathrm{support}(Q_{j-1})$
    Compute $P(Q_j \mid \hat{Q}_{j-1})$ $\forall \hat{q} \in \mathrm{support}(\hat{Q}_{j-1})$
    for all $\hat{q} \in \mathrm{support}(\hat{Q}_{j-1})$ do
      $P \leftarrow P(Q_j \mid \hat{Q}_{j-1} = \hat{q})$
      Compute and store $\hat{Q}_j^{\hat{q}}$ based on $P$ using Eq. (4.10)
    end for
  end for
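A simplified sketch of Algorithm 1 in matrix form is given below. It assumes the per-position empirical transition matrices of the earlier sketch and a generic `design_lme` routine standing in for Eqs. (4.9)–(4.10); both the function name and the data layout are assumptions of this illustration, not QVZ's actual interface.

```python
import numpy as np

def generate_codebook(p1, trans, alpha, design_lme):
    """Sketch of Algorithm 1 (codebook generation).
    p1:    P(Q_1), shape (A,).
    trans: trans[j][s, m] = P(Q_j = m | Q_{j-1} = s), shape (n, A, A); trans[0] unused.
    design_lme(pmf, alpha) -> (quantizer, cond), where quantizer maps a symbol to its
        reconstruction and cond[m, q_hat] = P(output = q_hat | input symbol = m)."""
    n, A, _ = trans.shape
    codebook = {}

    quant, cond = design_lme(p1, alpha)      # position 0: single quantizer on P(Q_1)
    codebook[(0, None)] = quant
    p_prev = p1                              # marginal of Q_{j-1}
    C_prev = cond                            # C_prev[q, q_hat] = P(Q_hat_{j-1}=q_hat | Q_{j-1}=q)

    for j in range(1, n):
        p_curr = p_prev @ trans[j]                           # marginal of Q_j
        joint = p_prev[:, None] * trans[j]                   # P(Q_{j-1}=q, Q_j=m)
        # P(Q_hat_{j-1}=q_hat | Q_j=m), used to propagate Eq. (4.11).
        p_hatprev_given_qj = (C_prev.T @ joint) / np.where(p_curr > 0, p_curr, 1)
        p_hatprev = C_prev.T @ p_prev                        # P(Q_hat_{j-1})
        C_curr = np.zeros((A, A))
        for q_hat in np.flatnonzero(p_hatprev > 0):
            # Eqs. (4.12)-(4.13): the pmf P(Q_j | Q_hat_{j-1}=q_hat) the quantizer is built on.
            pmf = (C_prev[:, q_hat] * p_prev) @ trans[j] / p_hatprev[q_hat]
            quant, cond_qhat = design_lme(pmf, alpha)
            codebook[(j, q_hat)] = quant
            # Accumulate P(Q_hat_j | Q_j) for the next iteration.
            C_curr += p_hatprev_given_qj[q_hat][:, None] * cond_qhat
        p_prev, C_prev = p_curr, C_curr
    return codebook
```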
Encoding
The encoding process is summarized in Algorithm 2. First, we generate the codebook and
quantizers. For each read, we quantize all scores sequentially, with each value forming
the left context for the next value. As they are quantized, scores are passed to an adaptive
arithmetic encoder, which uses a separate model for each position and context.
Algorithm 2 Encoding of quality scores
Input: Set of $N$ reads $\{Q_i\}_{i=1}^N$
Output: Set of quantizers $\{\hat{Q}_j^{\hat{q}}\}$ (codebook) and compressed representation of the reads
  Compute empirical statistics of input reads
  Compute codebook $\{\hat{Q}_j^{\hat{q}}\}$ according to Algorithm 1
  for all $i = 1$ to $N$ do
    $[Q_1, \ldots, Q_n] \leftarrow Q_i$
    $\hat{Q}_1 \leftarrow \hat{Q}_1(Q_1)$
    for all $j = 2$ to $n$ do
      $\hat{Q}_j \leftarrow \hat{Q}_j^{\hat{Q}_{j-1}}(Q_j)$
    end for
    Pass $[\hat{Q}_1, \ldots, \hat{Q}_n]$ to the arithmetic encoder
  end for
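The per-read pass in Algorithm 2 maps directly onto a loop like the one below (the arithmetic coder itself is abstracted away; the `codebook` layout follows the previous sketch and is an assumption):

```python
def quantize_reads(reads, codebook):
    """Quantize each read with the context-dependent codebook of Algorithm 2.
    Returns, per read, the (position, previous context, quantized value) triples
    that would be handed to per-(position, context) adaptive arithmetic models."""
    out = []
    for read in reads:
        q_hat_prev = None
        symbols = []
        for j, q in enumerate(read):
            quantizer = codebook[(j, q_hat_prev if j > 0 else None)]
            q_hat = quantizer(q)
            symbols.append((j, q_hat_prev, q_hat))
            q_hat_prev = q_hat
        out.append(symbols)
    return out
```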
Clustering
The performance of the compression algorithm depends on the conditional entropy of each
quality score given its predecessor. Earlier we assumed that the data was all i.i.d., but
it is more effective to allow each read to be independently selected from one of several
distributions. If we first cluster the reads into C clusters, then the variability within each
cluster may be smaller. In turn, the conditional entropy would decrease and fewer bits
would be required to encode Qj at a given distortion level, assuming that an individual
codebook is available unique to each cluster.
Thus QVZ has the option of clustering the data prior to compression. We have explored
two approaches for performing the clustering.
1. K-means:
The first approach uses the K-means algorithm [67], initialized using C quality value
sequences chosen at random from the data. It assigns each sequence to a cluster
by means of Euclidean distance. Then, the centroid of each cluster is computed as
the mean vector of the sequences assigned to it. Due to the lack of convergence
guarantees, we have incorporated a stop criterion that avoids further iterations once
the centroids of the clusters have moved less than U units (in Euclidean distance).
The parameter U is set to 4 by default, but it can be modified by the user. Finally,
storing which cluster each read belongs to incurs a rate penalty of at most log2 (C)/L
bits per symbol, which allows QVZ to reconstruct the series of reads in the same
order as they were in the uncompressed input file.
2. Mixture of Markov Models:
Given our assumption that the quality score sequences are generated by an order-1
Markov source, we can express the probability of a given sequence $Q_i$ as

$$P(Q_i) = \prod_{j=1}^n P(Q_{i,j} \mid Q_{i,j-1}, \ldots, Q_{i,1}) = P(Q_{i,1}) \prod_{j=2}^n P(Q_{i,j} \mid Q_{i,j-1}), \qquad (4.14)$$

where the last equality comes from the Markov assumption.
A discrete Markov source can be fully determined by its transition matrix $A$, where $A_{ms} = P(Q_{i,j} = m \mid Q_{i,j-1} = s)$ is the probability of going from state $s$ to state $m$, $\forall j$, and the prior state probability $\pi_s = P(Q_{i,1} = s)$, which is the probability of starting at state $s$. We further denote the model parameters as $\theta = \{A, \pi\}$. With this notation we can rewrite (4.14) as

$$P(Q_i; \theta) = \prod_{s=1}^{|\mathcal{Q}|} (\pi_s)^{\mathbb{1}[Q_{i,1} = s]} \prod_{j=2}^{n} \prod_{m=1}^{|\mathcal{Q}|} \prod_{s=1}^{|\mathcal{Q}|} (A_{ms})^{\mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s]}. \qquad (4.15)$$
Note that with the above definitions we have assumed that the stochastic process that
generates the quality score sequences is time invariant, i.e., that the value of $A_{ms}$ is independent of the time (position) $j$. However, strong correlations exist between
adjacent quality scores, as well as a trend that the quality scores degrade as a read
progresses.
[Figure: state-transition diagram of the temporal Markov model, with the states for each quality value replicated at every position j = 1, 2, ..., n.]
Figure 4.2: Our temporal Markov model.
In order to take into consideration the temporal behavior of the quality score sequences, we increase the number of states from $|\mathcal{Q}|$ to $|\mathcal{Q}| \times n$, one for each possible value of $Q$ and $j$. To represent the temporal dimension, we redefine the transition matrix as a three-dimensional matrix, where the first dimension represents the previous value of the quality score, the second one the current value of the quality score, and the third one the time $j$ within the sequence. That is, $A_{msj} = P(Q_{i,j} = m \mid Q_{i,j-1} = s)$ is the probability of transitioning from state $s$ to state $m$ at time $j$.
For the clustering step, we further assume that the quality score sequences have been
generated independently by one of K underlying Markov models, such that the whole
set of quality score sequences is generated by a mixture of Markov models. With some abuse of notation we now define $\theta = \{\pi^{(k)}, A^{(k)}\}_{k=1}^K$ to be the parameters of the $K$ Markov models, and $\theta_k = \{\pi^{(k)}, A^{(k)}\}$ to be the parameters of the $k$th Markov model. We further define $Z_i$ to be the latent random variable that specifies the identity of the mixture component for the $i$th sequence. Thus, the set of quality score sequences that has been generated by the sequencing machine is distributed as

$$Q \sim P(Q; \theta) = \prod_{i=1}^N P(Q_i; \theta) = \prod_{i=1}^N \sum_{k=1}^K P(Q_i \mid Z_i = k; \theta)\, \mu_k, \qquad (4.16)$$

where $\mu_k \triangleq P(Z_i = k)$, and $P(Q_i \mid Z_i = k; \theta)$ is the probability that the sequence $Q_i$ has been generated by the $k$th Markov model. Substituting (4.15) into (4.16) we get
that the likelihood of the data is given by

$$P(Q; \theta) = \prod_{i=1}^N \sum_{k=1}^K \mu_k \left( \prod_{s=1}^{|\mathcal{Q}|} (\pi_s^{(k)})^{\mathbb{1}[Q_{i,1} = s]} \prod_{j=2}^{n} \prod_{m=1}^{|\mathcal{Q}|} \prod_{s=1}^{|\mathcal{Q}|} (A_{msj}^{(k)})^{\mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s]} \right). \qquad (4.17)$$
The goal of the clustering step is to assign each sequence to the most probable model
that has generated it. However, since the parameters of the models are unknown, the
clustering step first computes the maximum likelihood estimation of the parameters
$\{A^{(k)}, \pi^{(k)}, \mu_k\}$ of each of the Markov models. Since the log likelihood $\ell(\theta) \triangleq \log P(Q; \theta)$ is intractable due to the summation appearing in (4.17), this operation is done by using the well-known Expectation-Maximization (EM) algorithm [68]. The EM algorithm iteratively maximizes the function

$$g(\theta, \theta^{(l-1)}) \triangleq \mathbb{E}_{Z \mid Q, \theta^{(l-1)}}\left[ \sum_i \log P(Q_i, Z_i; \theta) \right],$$
which is the expectation of the complete log likelihood with respect to the conditional
distribution of Z given Q and the current estimated parameters. It can be shown [68]
that for any mixture model this function is given by

$$g(\theta, \theta^{(l-1)}) = \sum_i \sum_k r_{ik} \log \mu_k + \sum_i \sum_k r_{ik} \log P(Q_i; \theta_k), \qquad (4.18)$$
where $r_{ik} \triangleq P(Z_i = k \mid Q_i, \theta^{(l-1)})$ is the responsibility that cluster $k$ takes for the
quality sequence i. In particular, for the case of a mixture of Markov models, the
previous equation is given by

$$g(\theta, \theta^{(l-1)}) = \sum_{i=1}^N \sum_{k=1}^K r_{ik} \log \mu_k + \sum_{i=1}^N \sum_{k=1}^K r_{ik} \sum_{s=1}^{|\mathcal{Q}|} \mathbb{1}[Q_{i,1} = s] \log(\pi_s^{(k)}) + \sum_{i=1}^N \sum_{k=1}^K r_{ik} \sum_{j=2}^{n} \sum_{m=1}^{|\mathcal{Q}|} \sum_{s=1}^{|\mathcal{Q}|} \log(A_{msj}^{(k)})\, \mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s], \qquad (4.19)$$

where the expansion of $\log P(Q_i; \theta_k)$ is obtained by taking the log of (4.15).
The initialization of the EM algorithm is performed by randomly selecting the parameters $\{\hat{A}^{(k)}, \hat{\pi}^{(k)}, \hat{\mu}_k\}$. Then, the algorithm iteratively proceeds as follows. In the E-step it computes $r_{ik}$, which for the case of Markov mixture models is given by

$$r_{ik} \propto \hat{\mu}_k\, \hat{\pi}^{(k)}(Q_{i,1})\, \hat{A}_2^{(k)}(Q_{i,1}, Q_{i,2})\, \hat{A}_3^{(k)}(Q_{i,2}, Q_{i,3}) \cdots \hat{A}_n^{(k)}(Q_{i,n-1}, Q_{i,n}), \qquad (4.20)$$

where

$$\hat{A}_j^{(k)}(Q_{i,j-1}, Q_{i,j}) = \prod_{m=1}^{|\mathcal{Q}|} \prod_{s=1}^{|\mathcal{Q}|} (\hat{A}_{msj}^{(k)})^{\mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s]}.$$
In the M-step it computes the parameters $\hat{\theta}$ that maximize $g(\theta, \theta^{(l-1)})$. In the case of a mixture of Markov models, these parameters can be computed using the method of Lagrange multipliers on $g(\theta, \theta^{(l-1)})$, where the constraints are that all the rows of $A^{(k)}$ and the vectors $\pi^{(k)}$ and $\mu$ must sum to one. For the case under consideration, it can be shown that the maximizing parameters computed in the M-step are given by

$$\hat{\mu}_k = \frac{1}{N} \sum_{i=1}^N r_{ik} \qquad (4.21)$$
$$\hat{\pi}_s^{(k)} = \frac{\sum_{i=1}^N r_{ik}\, \mathbb{1}[Q_{i,1} = s]}{\sum_{s'=1}^{|\mathcal{Q}|} \sum_{i=1}^N r_{ik}\, \mathbb{1}[Q_{i,1} = s']} \qquad (4.22)$$
$$\hat{A}_{msj}^{(k)} = \frac{\sum_{i=1}^N r_{ik}\, \mathbb{1}[Q_{i,j} = m,\, Q_{i,j-1} = s]}{\sum_{m'=1}^{|\mathcal{Q}|} \sum_{i=1}^N r_{ik}\, \mathbb{1}[Q_{i,j} = m',\, Q_{i,j-1} = s]}, \qquad (4.23)$$

with $j = 2, \ldots, n$, $s = 1, \ldots, |\mathcal{Q}|$, and $m = 1, \ldots, |\mathcal{Q}|$.
Furthermore, the EM algorithm guarantees that choosing $\theta$ to improve $g(\theta, \theta^{(l-1)})$ beyond $g(\theta^{(l-1)}, \theta^{(l-1)})$ will improve $\ell(\theta)$ beyond $\ell(\theta^{(l-1)})$, so that the likelihood is non-decreasing across iterations. We stop the algorithm once the change in the value of $g$ between iterations is small enough, or after a fixed number of iterations.
Once the EM algorithm terminates, the value of $r_{ik}$ tells us the responsibility of each mixture component $k$ for the sequence $i$. The clustering step uses this information to perform the clustering. Specifically, each sequence is assigned to the cluster $C_k$ with $k$ such that $r_{ik} \ge r_{ik'}$, $\forall k' \ne k$.
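A compact sketch of this E-step/M-step loop is given below. Working in log space for numerical stability and the small smoothing constants are implementation choices of the sketch, not details stated above.

```python
import numpy as np

def em_markov_mixture(Q, K, A_size, n_iter=50, seed=0):
    """EM for a mixture of K position-dependent order-1 Markov models.
    Q: N x n integer matrix of quality scores in {0, ..., A_size-1}.
    Returns responsibilities r (N x K) and the cluster assignment of each read."""
    rng = np.random.default_rng(seed)
    N, n = Q.shape
    mu = np.full(K, 1.0 / K)
    pi = rng.dirichlet(np.ones(A_size), size=K)                 # K x A
    A = rng.dirichlet(np.ones(A_size), size=(K, n, A_size))     # K x n x A_prev x A_curr

    for _ in range(n_iter):
        # E-step: log responsibilities, Eq. (4.20) in log space.
        log_r = np.log(mu + 1e-12)[None, :] + np.log(pi[:, Q[:, 0]]).T   # N x K
        for j in range(1, n):
            log_r += np.log(A[:, j, Q[:, j - 1], Q[:, j]]).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: Eqs. (4.21)-(4.23).
        mu = r.mean(axis=0)
        pi = np.array([np.bincount(Q[:, 0], weights=r[:, k], minlength=A_size)
                       for k in range(K)]) + 1e-12
        pi /= pi.sum(axis=1, keepdims=True)
        A = np.full((K, n, A_size, A_size), 1e-12)
        for k in range(K):
            for j in range(1, n):
                np.add.at(A[k, j], (Q[:, j - 1], Q[:, j]), r[:, k])
        A /= A.sum(axis=3, keepdims=True)
    return r, r.argmax(axis=1)

# Example on random data: 500 reads of length 30 over a 41-symbol alphabet, K = 3.
Q = np.random.default_rng(1).integers(0, 41, size=(500, 30))
r, clusters = em_markov_mixture(Q, K=3, A_size=41)
```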
4.3 Results
In order to assess the rate-distortion performance of the proposed algorithms QualComp
and QVZ, we compare them with the state-of-the-art lossy compression algorithms PBlock
and RBlock [21], which provide the best rate-distortion performance among existing lossy
compression algorithms for quality scores that do not use any extra information for compression. We also consider CRAM [50] and DSRC2 [63], which apply Illumina’s binning
scheme. For completeness, we also compare the lossless performance of QVZ with that of
CRAM, DSRC2 (in their lossless mode), and gzip.
4.3.1 Data
The datasets used for our analysis are NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam, which corresponds to chromosome 20 of a H. Sapiens individual, and a ChIP-Seq dataset from a M. Musculus individual (SRR032209). The H. Sapiens dataset was downloaded from the GATK bundle.² We generated the SAM file from the BAM file and then extracted the quality score sequences from it. The data set contains 51,585,658 sequences, each of length 101. The M. Musculus dataset was downloaded from the DNA Data Bank of Japan (DDBJ),³ and contains 18,828,274 sequences, each of length 36.
² https://www.broadinstitute.org/gatk/
³ http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=%20SRR032209
4.3.2 Machine Specifications
The machine used to perform the experiments has the following specifications: 39 GB
RAM, Intel Core i7-930 CPU at 2.80 GHz × 8, and Ubuntu 12.04 LTS.
4.3.3 Analysis
First, we describe the options used to run each algorithm. QVZ was run with the default parameters, multiple rates, and different numbers of clusters (for both clustering approaches, that is, k-means and the Mixture of Markov Models). Similarly, QualComp was run with different numbers of clusters and multiple rates. PBlock and RBlock were run with different thresholds (that is, different values of p and r, respectively). CRAM and DSRC2 were run with the lossy mode that implements Illumina's proposed binning scheme. Finally, we also ran each of the mentioned algorithms in lossless mode, except QualComp, since it does not support lossless compression (see results below).
We start by looking at the performance of QualComp. Fig. 4.3 shows its performance
when applied to the M. Musculus dataset, with one, two, and three clusters, and several
rates. As expected, increasing the rate for a given number of clusters decreases the MSE,
especially for small rates. Similarly, increasing the number of clusters for the same rate
decreases the MSE. As can be observed, the decrease in the MSE is mainly noticeable for
small rates, having less effect for moderate to high rates.
It is worth noticing that QualComp is not able to achieve lossless compression. This is due to the transformation performed on the quality score sequences prior to compression, and the assumption made by QualComp that $\mathcal{Q} = \mathbb{R}$. As a result, QualComp may perform worse than other lossy compressors for rates close to those of lossless compression (see results below).
Regarding the performance of QVZ, recall that it can minimize any quasi-convex distortion, as long as the distortion matrix is provided. In addition, for ease of use, the implementation contains three built-in distortion metrics: i) the average Mean Squared Error (MSE), where $d(x, y) = |x - y|^2$; ii) the average L1 distortion, where $d(x, y) = |x - y|$; and iii) the average Lorentzian distortion, where $d(x, y) = \log_2(1 + |x - y|)$. Hereafter we refer to each of them as QVZ-M, QVZ-A, and QVZ-L, respectively.
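For reference, the corresponding distortion matrices (each entry $D_{x,y} = d(x, y)$, as defined in Section 4.2.2) can be built in a few lines; this is just an illustration of the three built-in metrics, not QVZ's internal representation:

```python
import numpy as np

def distortion_matrix(alphabet_size, metric="mse"):
    """Distortion matrix D with D[x, y] = d(x, y) for the three built-in metrics."""
    x = np.arange(alphabet_size)
    diff = np.abs(x[:, None] - x[None, :])
    if metric == "mse":
        return diff.astype(float) ** 2          # QVZ-M
    if metric == "l1":
        return diff.astype(float)               # QVZ-A
    if metric == "lorentzian":
        return np.log2(1.0 + diff)              # QVZ-L
    raise ValueError(f"unknown metric: {metric}")

D = distortion_matrix(41, "lorentzian")   # e.g., a 41-symbol quality alphabet
```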
[Figure: average MSE distortion versus bits per quality score for QualComp with 1, 2, and 3 clusters on the M. Musculus (SRR032209) dataset.]
Figure 4.3: Rate-Distortion performance of QualComp for MSE distortion, and the M.
Musculus dataset. cX stands for QualComp run with X number of clusters.
Fig. 4.4 shows the performance of QVZ for the H. Sapiens dataset, when k-means
is employed to perform the clustering step. We also show in the same plot the performance of QualComp (when run with 3 clusters), as well as that of the previously proposed
algorithms. As can be seen, QVZ outperforms QualComp and the previously proposed algorithms for all three choices of distortion metric. The lossy modes of CRAM and DSRC2
can each achieve only one rate-distortion point, and both are outperformed by QVZ. We
further observe that although QualComp outperforms RBlock and PBlock for low rates
(in all three distortions), the latter two achieve a smaller distortion for higher rates. This
is in line with our intuition that QualComp does not provide a competitive performance
for moderate to high rates. QVZ, however, outperforms all previously proposed algorithms at both low and high rates. QVZ's advantage becomes especially apparent for distortions other than MSE.
It is also significant that QVZ achieves zero distortion at a rate at which the other lossy algorithms still exhibit positive distortion. In other words, QVZ reaches lossless compression at a lower rate than RBlock or PBlock. Moreover, QVZ also outperforms the lossless compressors
CRAM and gzip, and achieves similar performance to that of DSRC2 (see Table 4.2).
Finally, we observe that applying the k-means clustering prior to compression in QVZ
is especially beneficial at low rates. For higher rates, the performance of 1, 3 and 5 clusters
is almost identical. Note that these results are in line with those obtained with QualComp.
[Figure: three panels showing average MSE, L1, and Lorentzian distortion versus compressed size in MB for PBlock, RBlock, QualComp, QVZ (c1, c3, c5), CRAM (lossy and lossless), DSRC2 (lossy and lossless), and gzip.]
Figure 4.4: Rate-Distortion curves of PBlock, RBlock, QualComp and QVZ, for MSE, L1
and Lorentzian distortions. In QVZ, c1, c3 and c5 denote 1, 3 and 5 clusters (when using
k-means), respectively. QualComp was run with 3 clusters.
            QVZ (3 clusters)   PBlock / RBlock   DSRC2   CRAM    gzip
Size [MB]   1,632              3,229             1,625   2,000   1,999
Table 4.2: Lossless results of the different algorithms for the NA12878 data set.
Next we analyze the Mixture of Markov Model approach for clustering the quality
scores prior to compression with QVZ, and compare it with the k-means approach. Fig.
4.5 shows the rate-distortion performance of both approaches when applied to the H. Sapiens dataset, for the three considered distortion metrics. As can be observed, the Mixture
of Markov Model clustering approach exhibits superior performance under all considered
metrics, when compared to the k-means approach. For example, for a rate of 1 bit per
quality score, the Mixture of Markov Model clustering approach achieves half the MSE
distortion incurred by the k-means approach.

[Figure: three panels showing average MSE, L1, and Lorentzian distortion versus bits per quality score for QVZ-c3, QVZ-c10, QVZ-K3, and QVZ-K10.]
Figure 4.5: Rate-Distortion curves of QVZ for the H. Sapiens dataset, when the clustering step is performed with k-means (c3 and c10), and with the Mixture of Markov Model
approach (K3 and K10). In both cases we used 3 and 10 clusters.
Regarding the running time, QualComp takes around 90 minutes to compute the necessary statistics, and 20 minutes to then compress the quality scores present in the H. Sapiens dataset. The decompression is done in 15 minutes. QVZ, on the other hand, requires approximately 13 minutes to compress the same dataset, and 12 minutes to decompress it. These numbers were obtained without performing the clustering step. DSRC2
requires 20 minutes to compress and decompress, whereas CRAM employs 14 minutes to
compress and 4 minutes to decompress. Finally, both PBlock and RBlock take around 4 minutes to compress and decompress, making them the algorithms with the shortest running times
among those that we analyzed. The running times of gzip to compress and decompress are
7 and 30 minutes, respectively.
In terms of memory usage, QVZ uses 5.7 GB to compress the analyzed data set and less
than 1 MB to decompress, whereas QualComp employs less than 1 MB for both operations.
PBlock and RBlock have higher memory usage than QualComp, but this is still below 40 MB
to compress and decompress. DSRC2 uses 3 GB to compress and 5 GB to decompress,
whereas CRAM employs 2 GB to compress and 3 GB to decompress. Finally, gzip uses
less than 1 MB for both operations.
4.4 Discussion
From the results presented in the previous section, we can conclude that QualComp offers
competitive performance for small rates, whereas it is outperformed for moderate to high
rates. Due to the transformation performed by QualComp on the quality scores, lossless
compression cannot be achieved, even if the rate is set to a very high value. This is part
of the reason why QualComp performs worse for rates close to that of lossless. On the
other hand, QVZ is able to achieve lossless compression. In addition, it outperforms the
previously proposed lossy compressors for all rates.
Another advantage of QVZ with respect to previously proposed methods is that it can
minimize any quasi-convex distortion metric. This feature is important for lossy compressors of quality scores, since the criterion under which the goodness of the reconstruction
should be assessed is still not clear. It makes sense to pick a distortion measure by examining how different distortion measures affect the performance of downstream applications,
but the abundance of applications and variations in how quality scores are used makes
this choice too dependent on the specifics of the applications considered. These trade-offs
suggest that an ideal lossy compressor for quality scores should not only provide the best
possible compression and accommodate downstream applications, but it should provide
flexibility to allow a user to pick a desired distortion measure and/or rate.
Based on the results presented in the previous section, it becomes apparent that performing clustering prior to compression can improve the rate-distortion performance, especially
for small rates. k-means is a valid approach for performing the clustering step, and as we
have seen, improves the performance for both QualComp and QVZ. However, for QVZ,
the clustering approach based on a Mixture of Markov Models outperforms that of k-means.
This suggests that the clustering step should take into account the statistics employed by
the lossy compressor, so as to boost the performance.
Finally, we have demonstrated through simulation that the binning scheme proposed by
Illumina can be outperformed by lossy compressors that take the statistics of the quality
scores into account for designing the compression scheme.
4.5
Conclusions
To partially tackle the problem of storage and dissemination of genomic data, in this chapter we have developed QualComp and QVZ, two new lossy compressors for the quality
scores. One advantage of the proposed methods with respect to previously proposed lossy
compressors is that they allow the user to specify the rate prior to compression. Whereas
QualComp aims to minimize the Mean Squared Error (MSE), QVZ can work for several
distortion metrics, including any quasi-convex distortion metric provided by the user, a feature not supported by the previously proposed algorithms. Moreover, QVZ also allows for
lossless compression, and a seamless transition from lossy to lossless compression with increasing
rate. QualComp exhibits better rate-distortion performance for small rates than previously
proposed methods, whereas QVZ exhibits better rate-distortion performance for all rates.
The results presented in this chapter demonstrate the significant savings in storage that can
be achieved by lossy compression of quality scores.
Chapter 5
Effect of lossy compression of quality
scores on variant calling
In this chapter we evaluate the effect of lossy compression of quality scores on variant
calling (SNP and INDEL detection). To that end, we first propose a methodology for the
analysis, and then use it to compare the performance of the recently proposed lossy compressors for quality scores. Specifically, we investigate how the output of the variant caller
when using the original data differs from that obtained when quality scores are replaced by
those generated by a lossy compressor.
We demonstrate that lossy compression can significantly alleviate the storage while
maintaining variant calling performance comparable to that with the original data. Further,
in some cases lossy compression can lead to variant calling performance that is superior
to that of using the original file. We envisage our findings and framework serving as a
benchmark in future development and analyses of lossy genomic data compressors.
5.1
Methodology for variant calling
In this section we describe the proposed methodology to test the effect of lossy compressors
of quality scores on variant calling. The methodologies suggested for SNPs and INDELs
differ, and thus we introduce each of them separately.
5.1.1
SNP calling
Based on the most recent literature that compares different SNP calling pipelines ([69, 70,
71, 72, 73]) we have selected three pipelines for our study. Specifically, we propose the use
of: i) the SNP calling pipeline suggested by the Broad Institute, which uses the Genome
Analysis Toolkit (GATK) software package [14, 57, 74]; ii) the pipeline presented in the
High Throughput Sequencing LIBrary (htslib.org), which uses the Samtools suite developed by The Wellcome Trust Sanger Institute [1]; and iii) the recently proposed variant
caller named Platypus developed by Oxford University [75]. In the following we refer to
these pipelines as GATK1 , htslib.org2 and Platypus, respectively.
In all pipelines we use BWA-mem [76] to align the FASTQ files to the reference (NCBI
build 37, in our case), as stated in all best practices.
Regarding the GATK pipeline, we note that the best practices recommend further filtering the variants found by the Haplotype Caller by applying either the Variant Quality
Score Recalibration (VQSR) or the Hard Filter. The VQSR filter is only recommended if
the data set is big enough (more than 100K variants), since otherwise one of the steps of
the VQSR, the Gaussian mixture model, may be inaccurate. Therefore, in our analysis we
consider the use of both the VQSR and the Hard Filter after the Haplotype Caller, both as
specified in the best practices.
5.1.2
INDEL detection
To evaluate the effect of lossy compression of base quality scores on INDEL calling, we
employ popular INDEL detection pipelines: Dindel [77], Unified Genotyper, Haplotype
Caller [14, 57, 74] and Freebayes [78]. First, reads were aligned to the reference genome,
NCBI build 37, with BWA-mem [76]. We replaced the quality scores of the corresponding SAM/BAM file by those obtained after applying various lossy compressors, and then
we performed the INDEL calling with each of the four tools. Note that several of these
pipelines can be used to call both SNPs and INDELs, but the commands or parameters are
different for each variant type.
1. https://www.broadinstitute.org/gatk/guide/best-practices
2. More commonly referred to as samtools. http://www.htslib.org/workflow
5.1.3
Datasets for SNP calling
A crucial part of the analysis is selecting a dataset for which a consensus of SNPs exists
(hereafter referred to as “ground truth”), as it serves as the baseline for comparing the performance of the different lossy compressors against the lossless case. Thus, for the SNP
calling analysis, we consider datasets from the H. Sapiens individual NA12878, for which
two “ground truths” of SNPs have been released. In particular, we consider the datasets ERR174324 and ERR262997, which correspond to a 15x-coverage paired-end WGS dataset and a 30x-coverage paired-end WGS dataset, respectively. From each of them we extracted chromosomes 11 and 20; we restricted the analysis to these chromosomes to speed up the computations. We chose chromosome 20 because it is the one normally used for assessment3, and chromosome 11 as a representative of a longer chromosome. Regarding the two “ground truths”, they are the one released by the Genome in a Bottle consortium (GIAB) [79], which has been adapted by the National Institute of Standards and Technology (NIST); and the ground truth released by Illumina as part of the Platinum Genomes project.4 Fig. 5.1 summarizes the differences between the two. As
can be observed, most of the SNPs contained in the NIST ground truth are also included
in Illumina’s ground truth, for both chromosomes. Note also that the number of SNPs on
chromosome 20 is almost half of chromosome 11, for both “ground truths”5 .
5.1.4
Datasets for INDEL detection
To evaluate the effect of lossy compression on INDEL detection, we simulated four datasets.
Each dataset is composed of one chromosome with approximately 3000 homozygous INDELs. To mimic biologically realistic variants, we generated distributions of INDEL sizes
and frequencies, and insertion to deletion ratios, all conditioned on location (coding vs
non-coding) using the Mills and 1000Genomes INDELs provided in the GATK bundle.
We drew from these distributions to create our simulated data.
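To make the sampling step concrete, the sketch below illustrates one way such conditional draws could be implemented. The length distributions, insertion probabilities and the coding/non-coding split used here are placeholder values for illustration only, not the distributions estimated from the Mills and 1000 Genomes INDELs.

```python
import random

# Hypothetical sketch of the conditional sampling step (placeholder frequencies, not the
# distributions estimated from the Mills and 1000 Genomes INDELs in the GATK bundle).
EMPIRICAL = {
    # location: (candidate lengths, their weights, probability that the event is an insertion)
    "coding":     ([3, 6, 9, 12], [0.50, 0.30, 0.15, 0.05], 0.40),
    "non_coding": ([1, 2, 3, 4, 5, 10], [0.40, 0.20, 0.15, 0.10, 0.10, 0.05], 0.45),
}

def sample_indel(location, rng):
    lengths, weights, p_ins = EMPIRICAL[location]
    length = rng.choices(lengths, weights=weights, k=1)[0]
    kind = "insertion" if rng.random() < p_ins else "deletion"
    return kind, length

rng = random.Random(0)
# roughly 3000 homozygous INDELs per simulated chromosome, mostly in non-coding regions
indels = [sample_indel("coding" if rng.random() < 0.02 else "non_coding", rng)
          for _ in range(3000)]
print(indels[:3])
```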
3. http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it
4. http://www.illumina.com/platinumgenomes
5. As is clear from the discussion in this subsection, the term ground truth should be taken with a grain of salt and as such should appear in quotation marks throughout. We omit these marks henceforth for simplicity.
Figure 5.1: Difference between the GIAB NIST “ground truth” and the one from Illumina,
for (a) chromosomes 11 and (b) 20.
We generated approximately 30× coverage of the chromosome with 100bp paired-end
sequencing reads (using an Illumina-like error profile) for these simulated datasets using
ART [80].
5.1.5
Performance metrics
The output of each of the pipelines is a VCF file [16], which contains the set of the called
variants. We can compare these variants with those contained in the ground truth. True
Positives (T.P.) refer to those variants contained both in the VCF file under consideration
and the ground truth (a match in both position and genotype must occur for a SNP call to be declared a T.P., while for INDELs the criterion was more lenient: any INDEL within 10bp of the true location was considered a T.P., similar to the approach of [72]); False Positives
(F.P.) refer to variants contained in the VCF file but not in the ground truth; and False
Negatives (F.N.) correspond to variants that are present in the ground truth dataset but not
in the VCF file under consideration. The more T.P. (or equivalently the fewer F.N.) and
the fewer F.P. the better. To evaluate the impact of lossy compression on variant calling,
we compare the number of T.P. and F.P. from various lossy compression approaches to the
number of T.P. and F.P. obtained from lossless compression. Ideally, we would like to apply
a lossy compressor to the quality scores, such that the resulting file is smaller than that of
the losslessly compressed, while obtaining a similar number of T.P. and F.P. We will show
that not only is this possible, but that in some cases we can simultaneously obtain more T.P.
and fewer F.P than with the original data.
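As an illustration of the counting rules just described, the following sketch (not the evaluation scripts used in this work) counts T.P., F.P. and F.N. for SNPs, which require an exact position and genotype match, and for INDELs, which only need to fall within 10bp of a true INDEL.

```python
# Hypothetical helper functions illustrating the counting rules above (not the evaluation
# scripts used in this work). SNP calls are dictionaries (chrom, pos) -> genotype;
# INDEL calls are sets of (chrom, pos).

def count_snp_calls(called, truth):
    # A SNP is a T.P. only if position and genotype both match the ground truth.
    tp = sum(1 for key, gt in called.items() if truth.get(key) == gt)
    fp = len(called) - tp
    fn = sum(1 for key, gt in truth.items() if called.get(key) != gt)
    return tp, fp, fn

def count_indel_calls(called, truth, window=10):
    # An INDEL is a T.P. if it falls within `window` bp of a true INDEL on the same chromosome.
    def near(site, others):
        chrom, pos = site
        return any(c == chrom and abs(p - pos) <= window for c, p in others)
    tp = sum(1 for site in called if near(site, truth))
    fp = len(called) - tp
    fn = sum(1 for site in truth if not near(site, called))
    return tp, fp, fn

print(count_snp_calls({("chr20", 100): "0/1", ("chr20", 200): "1/1"},
                      {("chr20", 100): "0/1", ("chr20", 300): "0/1"}))  # (1, 1, 1)
```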
To analyze the performance of the lossy compressors on the proposed pipelines, we
employ the widely used metrics sensitivity and precision, which include in their calculation
the true positives, false positives and false negatives, as described below:
• Sensitivity: measures the proportion of all the positives that are correctly called, computed as $\mathrm{Sensitivity} = \frac{\mathrm{T.P.}}{\mathrm{T.P.} + \mathrm{F.N.}}$.
• Precision: measures the proportion of called positives that are true, computed as $\mathrm{Precision} = \frac{\mathrm{T.P.}}{\mathrm{T.P.} + \mathrm{F.P.}}$.
Depending on the application, one may be inclined to boost the sensitivity at the cost
of slightly reducing the precision, in order to be able to find as many T.P. as possible. Of
course, there are also applications where it is more natural to optimize for precision than
sensitivity. In an attempt to provide a measure that combines the previous two, we also
calculate the f-score:
• F-score: corresponds to the harmonic mean of the sensitivity and precision, computed as $\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Sensitivity} \times \mathrm{Precision}}{\mathrm{Sensitivity} + \mathrm{Precision}}$.
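The following minimal sketch computes the three metrics from the T.P., F.P. and F.N. counts defined above; the example counts are made-up numbers for illustration.

```python
# Minimal sketch of the three metrics; the counts below are made-up numbers for illustration.

def sensitivity(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_score(tp, fp, fn):
    s, p = sensitivity(tp, fn), precision(tp, fp)
    return 2 * s * p / (s + p)

tp, fp, fn = 950, 30, 50          # e.g., 1000 true SNPs, 950 recovered, 30 spurious calls
print(round(sensitivity(tp, fn), 4),   # 0.95
      round(precision(tp, fp), 4),     # ~0.9694
      round(f_score(tp, fp, fn), 4))   # ~0.9596
```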
In the discussion above we have considered that all the variants contained in a VCF file
are positive calls. However, another approach is to consider only the subset of variants in
the VCF file that satisfy a given constraint to be positive calls. In general, this constraint
consists of having the value of one of the parameters associated with a variant above a
certain threshold. This approach is used to construct the well-known Receiver Operating Characteristic (ROC) curves. In the case under consideration, the ROC curve shows the performance
of the variant caller as a classification problem. That is, it shows how well the variant
caller differentiates between true and false variants when filtered by a certain parameter.
Specifically, it plots the False Positive Rate (F.P.R.) versus the True Positive Rate (T.P.R.)
(also denoted as sensitivity) for all thresholding values. Given an ROC plot with several
curves, a common method for comparing them is by calculating the area under the curve
(AUC) of each of them, such that larger AUCs are better.
There are several drawbacks with this approach. The main one, in our opinion, relates
to how to compare the AUC of different VCF files. Note that in general, different VCF
CHAPTER 5. EFFECT OF LOSSY COMPRESSION OF QUALITY SCORES
83
files contain a different number of calls. Thus, it is not informative to compute the ROC
curve of each VCF file independently, and then compare the respective AUCs. A more
rigorous comparison can be performed by forcing all the VCF files under consideration to
contain the same number of calls. This can be achieved by computing the union of all the
calls contained in the VCF files, and adding to each VCF file the missing ones, such that
they all contain the same number of calls. The authors of [62] followed this approach to perform pair-wise comparisons. However, it does not scale well to a large number of VCF files. Moreover, if one more VCF file is generated after the analysis, all the AUCs must be re-computed (assuming the new VCF file contains at least one call not
included in the previous ones). The other main drawback that we encountered relates to
the selection of the thresholding parameter. For instance, in SNP calling, when using the
GATK pipeline, the QD (Quality by Depth) field is as valid a parameter as the QUAL field,
but each of them results in different AUCs.
Given the above discussion, we believe that this approach is mainly suitable to analyze
the VCF files that contain a clear thresholding parameter, like those VCF files obtained
by the GATK pipeline after applying the VQSR filter, since in this case there is a clear
parameter to be selected, namely the VQSLOD.
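For concreteness, the sketch below builds an ROC curve by sweeping a threshold over a per-variant score (such as the VQSLOD field) and computes the AUC with the trapezoidal rule; the input scores and labels are illustrative.

```python
# Sketch of the thresholding procedure: sweep a threshold over a per-variant score
# (e.g., the VQSLOD field), trace out the ROC curve, and compute the AUC with the
# trapezoidal rule. The inputs below are illustrative.

def roc_points(scores, is_true):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(is_true)
    n_neg = len(is_true) - n_pos
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for i in order:
        if is_true[i]:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc(fpr, tpr):
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))

fpr, tpr = roc_points(scores=[9.1, 3.2, 7.5, 0.4], is_true=[True, False, True, True])
print(auc(fpr, tpr))  # ~0.667 for this toy input
```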
5.2
Results
We analyze the output of the variant caller (i.e., the VCF file) for each of the introduced
pipelines when the quality scores are replaced by those generated by a lossy compressor.
We focus on those lossy compressors that use only the quality scores for compression, as
it would be too difficult to draw conclusions about the underlying source that generates the
quality scores from analyzing algorithms like [62], where the lossy compression is done
mainly using the information from the reads. To our knowledge, and based on the results
presented in the previous chapter, RBlock, PBlock [21] and QVZ are the algorithms that
perform better among the existing lossy compressors that solely use the quality scores to
compress. Therefore, those are the algorithms that we consider for our study. In addition,
we consider Illumina's proposed binning⁶, which is implemented by both DSRC2 [63] and CRAM [50]. Hereafter we report its performance as implemented by DSRC2.
There are some important differences between the lossy compressors selected for our
study. For example, the compression scheme of Illumina’s proposed binning does not
depend on the statistics of the quality scores, whereas QVZ and P/R-Block do. Also, in
both Illumina’s proposed binning and P/R-Block the maximum absolute distance between
a quality score and its reconstructed one (after decompression) can be controlled by the
user, whereas in QVZ this is not the case. The reason is that QVZ designs the quantizers
to minimize a given average distortion based on a rate constraint, and thus even though
on average the distortion is small, some specific quality scores may have a reconstructed
quality score that is far from the true one. Also, note that whereas Illumina’s proposed
binning applies more precision to high quality scores, R-Block does the opposite, and P-Block applies the same precision to all quality scores. Finally, in Illumina's proposed binning
and P/R-Block the user cannot estimate the size of the compressed file in advance, whereas
this is possible in QVZ.
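To illustrate the difference in spirit, the following toy binning function maps each quality score to a fixed representative value independently of the data statistics. The bin boundaries and representatives used here are an assumption made for illustration; they are not taken from Illumina's whitepaper.

```python
# Toy fixed binning of Phred quality scores in the spirit of Illumina's proposal. The bin
# boundaries and representatives below are an assumption made for illustration; they are
# not taken from the Illumina whitepaper. The point is that the mapping is fixed and does
# not depend on the statistics of the data, unlike QVZ or P/R-Block.

BINS = [(2, 6), (10, 15), (20, 22), (25, 27), (30, 33), (35, 37), (40, 40)]

def bin_quality(q):
    if q < 2:
        return q                      # very low qualities kept as-is
    for low, representative in reversed(BINS):
        if q >= low:
            return representative
    return q

print([bin_quality(q) for q in (2, 11, 24, 31, 39, 41)])  # [6, 15, 22, 27, 33, 40]
```

Note that a fixed mapping of this kind bounds the maximum absolute distortion of every quality score by construction, the property shared with P/R-Block and contrasted above with QVZ's average-distortion guarantee.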
Due to the extensive number of simulations performed, we only show the results for
QVZ with MSE distortion and three clusters, RBlock and the Illumina proposed binning.
We selected these as they are good representatives of the overall results.
5.2.1
SNP calling
Figures 5.2 and 5.3 show the average sensitivity, precision and f-score, together with the
compression ratio (in bits per quality score), over the 4 datasets and for the three pipelines
when the golden standard is that of NIST and Illumina, respectively. For ease of visualization, we only show the results obtained with the lossless compressed data, and the one
lossily compressed with QVZ (applied with MSE distortion and 3 clusters as computed
with k-means, denoted as QVZ-Mc3), RBlock, and Illumina's proposed binning. We chose these algorithms because their results are representative of the rest. The lossless compressed rate is computed using QVZ in lossless mode.
When reading the results, it is important to note the ground truth that was used for
6. http://www.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf
Figure 5.2: Average sensitivity, precision and f-score of the four considered datasets using
the NIST ground truth. Different colors represent different pipelines, and different points
within an algorithm represent different rates. Q40 denotes the case of setting all the quality
scores to 40.
the evaluation, as the choice of ground truth can directly affect the results. Recall that, as
shown in Fig. 5.1, Illumina's ground truth contains the majority of the SNPs contained in
the NIST-GIAB, plus some more. Thus, assuming both ground truths are largely correct,
a SNP caller is likely to achieve a higher sensitivity with the NIST ground truth, while the
precision will probably be lower. The opposite holds when comparing the output of a SNP
caller against the Illumina ground truth: we will probably obtain a lower sensitivity and a
higher precision.
We further define the variability observed in the output of the different SNP calling
pipelines as the methodological variability, and the variability introduced by the lossy compressor within a pipeline as the lossy variability. We show that the lossy variability is orders
of magnitude smaller than the methodological variability; this indicates that the changes in
calling accuracy introduced by lossy compressing the quality scores are negligible.
As shown in the figures, the variability obtained between different variant callers (methodological variability) is significantly larger than the variability introduced by the lossy compressors (for most rates), i.e., the lossy variability. Specifically, for rates larger than 1 bit
per quality score, we observe that the effect that lossy compressors have on SNP calling
Figure 5.3: Average sensitivity, precision and f-score of the four considered datasets using
the Illumina ground truth. Different colors represent different pipelines, and different points
within an algorithm represent different rates. Q40 denotes the case of setting all the quality
scores to 40.
is several orders of magnitude smaller than the variability that already exists within the
different variant calling pipelines. For smaller rates, we observe a degradation in performance when using QVZ, and the lossy variability becomes more noticeable in this case.
Recall that QVZ minimizes the average distortion, and thus at very small rates some of the
quality scores may be highly distorted. If the highly distorted quality scores happen to play
an important role in calling a specific variant, the overall performance may be affected.
On the other hand, RBlock permits the user to specify the maximum allowed individual
distortion, and less degradation is obtained in general for small rates. Note also that for
rates higher than 1 bit per quality score the performance of both QVZ and RBlock is similar. Illumina’s proposed binning achieves around 1 bit per quality score on average, and
achieves a performance comparable to that of QVZ and RBlock. Finally, we found that
swapping the original quality scores with ones generated uniformly at random7 , or with
7. Results not shown.
all set to a fixed value (Q40 in the figure), significantly degraded the performance. These
observations demonstrate that the quality scores are actively used in all the pipelines when
calling variants, and thus discarding them is not a viable option.
Regarding the selection of the ground truth, we observe a higher sensitivity with the
NIST ground truth, and a higher precision with the Illumina ground truth. Note that these
results are in line with the above discussion regarding the choice of ground truth.
The performance of QVZ can be further improved if clustering is performed by means
of Mixture of Markov Models rather than k-means. Fig. 5.4 compares the performance
of both methods when 3 clusters are used. As can be observed, the choice of clustering
method has little effect for rates above 1 bit per quality score. However, for small rates,
using the Mixture of Markov Models for clustering significantly improves the performance,
especially for the GATK and Platypus pipelines.
Figure 5.4: Comparison of the average sensitivity, precision and f-score of the four considered datasets and two ground truths, when QVZ is used with 3 clusters computed with
k-means (QVZ-Mc3) and Mixture of Markov Models (QVZ-MMM3). Different colors represent different pipelines, and different points within an algorithm represent different rates.
To gain insight into the possible benefits of using lossy compression, we show the
distribution of the f-score difference between the lossy and lossless case for different lossy
compressors and rates (thus a positive number indicates an improvement over the lossless
case). The distribution is computed by averaging over all simulations (24 values in total; 4
datasets, 3 pipelines and 2 ground truths). Fig. 5.5 shows the box-plot and the mean value
of the f-score difference for six different compression rates. Since QVZ performs better for
high rates, we show the results for QVZ-Mc3 with parameters 0.9, 0.8 and 0.6 (left-most
side of the figure). Analogously, for high compression ratios we show the results of RBlock
with parameters 30, 20, and 10 (right-most side of the figure).
Figure 5.5: Box plot of f-score differences between the lossless case and six lossy compression algorithms for 24 simulations (4 datasets, 3 pipelines and 2 ground truths). The
x-axis shows the compression rate achieved by the algorithm. The three left-most boxes
correspond to QVZ-Mc3 with parameters 0.9, 0.8 and 0.6, while the three right-most boxes
correspond to RBlock with parameters 30, 20 and 10. The blue line indicates the mean
value, and the red one the median.
Remarkably, for all the rates the median is positive, which indicates that in at least 50%
of the cases lossy compression improved upon the uncompressed quality scores. Moreover,
the mean is also positive, except for the point with highest compression. This suggests that
lossy compression may be used to reduce the size of the quality scores without compromising the performance on the SNP calling.
The above reported results show that lossy compression of quality scores (up to a certain
threshold on the rate) does not affect the performance on variant calling. Moreover, the box
plot of Fig. 5.5 indicates that in some cases an improvement with respect to the original
data can be obtained. We now look into these results in more detail, by focusing on the
individual performance of each of the variant calling pipelines.
We choose to show the results using tables as they help visualize which lossy compressors and/or parameters work better for a specific setting. We color in red (will appear
as a shaded cell) the values of the sensitivity, precision and f-score that improve upon the
uncompressed.
Table 5.1 shows the results for algorithms RBlock, QVZ-Mc3 (MSE distortion criteria and 3 clusters) and Illumina binning-DSRC2 for the GATK with hard filtering pipeline
when using the NIST ground truth. The two columns refer to the average results of Chromosomes 11 and 20 of the ERR262996 and ERR174310 datasets, respectively. For ease
of exposition, we omit the results of QVZ using other distortions and rates, and those of
PBlock, as well as the results of individual chromosomes. Similarly, we omit the results
for the htslib.org and Platypus pipelines, but we comment on the results.
It is worth noting that with the GATK pipeline, several compression approaches simultaneously improve the sensitivity, precision, and f-score when compared to the uncompressed (original) quality scores. For example, in the 30× coverage dataset, RBlock improves the performance while reducing the size by more than 76%. In the 15× coverage dataset QVZ improves upon the uncompressed and reduces its size by 20%. With the htslib.org pipeline, it is interesting to see that most of the points improve the sensitivity, meaning that they are able to find more T.P. than with the uncompressed quality scores. Finally, with the Platypus pipeline, the parameters that improve in general are the precision and the f-score, which indicates that a larger percentage of the calls are T.P. rather than F.P. Some points also improve upon the uncompressed.
ERR262996 (30×): Chr11, Chr20 (average)

                  Sensitivity   Precision   F-Score   Compression (%)
Lossless          0.7422        0.9829      0.8456      0
RBlock 3          0.7425        0.9830      0.8457    -16.7
RBlock 8          0.7424        0.9828      0.8456     31
RBlock 10         0.7422        0.9833      0.8457     42
RBlock 20         0.7418        0.9831      0.8457     67.4
RBlock 30         0.7424        0.9827      0.8458     76.4
QVZ-Mc3 0.9       0.7424        0.9828      0.8456     10.2
QVZ-Mc3 0.8       0.7421        0.9830      0.8459     14.7
QVZ-Mc3 0.6       0.7422        0.9830      0.8457     37.6
QVZ-Mc3 0.4       0.7422        0.9831      0.8458     57.5
QVZ-Mc3 0.2       0.7422        0.9833      0.8454     77.7
Illumina-DSRC2    0.7421        0.9829      0.8457     54.78

ERR174310 (15×): Chr11, Chr20 (average)

                  Sensitivity   Precision   F-Score   Compression (%)
Lossless          0.7332        0.9699      0.8351      0
RBlock 3          0.7331        0.9700      0.8350    -16.9
RBlock 8          0.7330        0.9708      0.8353     31.6
RBlock 10         0.7330        0.9708      0.8353     43
RBlock 20         0.7328        0.9710      0.8352     70.7
RBlock 30         0.7326        0.9709      0.8351     78.9
QVZ-Mc3 0.9       0.7332        0.9699      0.8351     10
QVZ-Mc3 0.8       0.7333        0.9699      0.8351     19
QVZ-Mc3 0.6       0.7326        0.9700      0.8347     38.3
QVZ-Mc3 0.4       0.7321        0.9691      0.8341     58.6
QVZ-Mc3 0.2       0.7309        0.9679      0.8328     77.9
Illumina-DSRC2    0.7345        0.9694      0.8357     55.66

Table 5.1: Sensitivity, precision, f-score and compression ratio for the 30× and 15× coverage datasets for the GATK pipeline, using the NIST ground truth.
When the ground truth is provided by Illumina, with the GATK pipeline R/P-Block improves mainly the sensitivity and f-score, with PBlock also improving the precision in the 30× coverage dataset. QVZ seems to perform better in this case, improving upon the uncompressed for several rates. It also achieves better performance than Illumina's proposed binning at a similar compression rate. With the htslib.org pipeline R/P-Block improves mainly the sensitivity, while QVZ improves the precision and the f-score (in the 30× coverage dataset). The performance on Platypus is similar to the one obtained when the NIST ground truth is used instead.
In summary, in terms of the distortion metric that QVZ aims to minimize, MSE works
significantly better for small rates (in most of the cases), whereas for higher rates the three
analyzed distortions offer similar performance. Thus the compression rate appears to matter much more for the variability in performance than the choice of distortion criterion.
RBlock offers in general better performance than PBlock for similar compression rates.
Finally, in most of the analyzed cases, Illumina’s binning is outperformed by at least one
other lossy compressor, while offering a similar compression rate. Overall, for high compression ratios (30%-70%), RBlock seems to perform the best, whereas QVZ is preferred
for lower compression rates (>70%).
In the previously analyzed cases we have assumed that all the SNPs contained in the
VCF file are positive calls, since the pipelines already follow their “best practice” to generate the corresponding VCF file. As discussed in the Methodology, another possibility is
to select a parameter and consider positive calls only for those whose parameter is above
a certain threshold. Varying the threshold results in the ROC curve. We believe this approach is of interest to analyze the VCF files generated by the GATK pipeline followed by
the VQSR filter, with thresholding parameter given by the VQSLOD field, and thus we present
the results for this case.
Fig. 5.6 shows the ROC curve of chromosome 11 of the 30x coverage dataset (ERR262996),
with the NIST ground truth. The results correspond to those obtained when the quality
scores are the original ones (lossless), and the ones generated by QVZ-Mc3 (MSE distortion and 3 clusters), PBlock with parameter 8, RBlock with parameter 25 and the Illumina
binning (as the results of applying the DSRC2 algorithm). As shown in the figure, each of the algorithms outperforms the rest in at least one point of the curve. This is not the
case for the Illumina Binning, as it is outperformed by at least one other algorithm in all
points. Moreover, the AUCs of all the lossy compressors except the Illumina Binning exceed that of the lossless case.
[ROC curves (True Positive Rate vs. False Positive Rate). Legend: QVZ-Mc3 [θ = 0.4] (AUC = 0.6662), PBlock [p = 8] (AUC = 0.66667), RBlock [r = 25] (AUC = 0.66548), DSRC2-Illumina Bin. (AUC = 0.66428), Lossless (AUC = 0.6649).]
Figure 5.6: ROC curve of chromosome 11 (ERR262996) with the NIST ground truth and
the GATK pipeline with the VQSR filter. The ROC curve was generated with respect to the
VQSLOD field. The results are for the original quality scores (uncompressed), and those
generated by QVZ-Mc3 (MSE distortion and 3 clusters), PBlock (p = 8) and RBlock (r =
25).
5.2.2
INDEL detection
We show that lossy compression of quality scores leads to smaller files while enabling
INDEL detection algorithms to achieve accuracies similar to the accuracies obtained with
data that has been compressed losslessly.
We simulated four datasets that each consisted of the CEU major alleles for chromosome 22 [81] with approximately 3000 homozygous INDELs that were biologically realistic in length, location, and insertion-to-deletion ratio.
Fig. 5.7 shows the sensitivity, precision, and f-score achieved by each INDEL detection pipeline using input data from the aforementioned compression approaches, together
with the compression ratio in bits per quality score. Note that the figure displays the
means across the four simulated datasets. In terms of sensitivity, all four INDEL detection pipelines (HaplotypeCaller, UnifiedGenotyper, Dindel, and Freebayes) resulted in a
lossy variability, as described above, that does not exceed the methodological variability.
All compression algorithm and INDEL detection pipeline combinations had high precision
(all but 1 obtained precision > 0.995). Besides the DSRC2 compression approach applied
to HaplotypeCaller, lossy compression did not result in variability in precision.
Figure 5.7: Average (of four simulated datasets) sensitivity, precision and f-score for INDEL detection pipelines. Different colors represent different pipelines, and different points
within an algorithm represent different rates.
Table 5.2 displays the sensitivity for an example INDEL detection pipeline, Dindel;
results are shown for each compression approach for each simulated dataset individually,
along with the mean and standard deviation across datasets. The mean sensitivity for the
lossless compression was 0.9796. Interestingly, RBlock (with r parameter set to 8 or 10)
achieves a slightly higher average sensitivity of 0.9798. The remaining compression approaches have mean sensitivities ranging from 0.9644 to 0.9796. The standard deviation across datasets was low, ranging from 0.0016 to 0.0033.
5.3
Discussion
We have shown that lossy compressors can reduce file size at a minimal cost – or even
benefit – to sensitivity and precision in SNP and INDEL detection.
We have analyzed several lossy compressors introduced recently in the literature that do
not use any biological information (such as the reads) for compression. The main difference
among them relates to the way they use the statistics of the quality scores for compression.
For example, Illumina’s proposed binning is a fixed mapping that does not use the underlying properties of the quality scores. In contrast, algorithms like QVZ are fully based on
the statistics of the quality scores to design the corresponding quantizers for each case.
Based on the results shown in the previous section, we conclude that in many cases lossy
compression can significantly reduce the genomic file sizes (with respect to the losslessly
compressed) without compromising the performance on the variant calling. Specifically,
we observe that the variability in the calls output by different existing SNP and INDEL
callers is generally orders of magnitude larger than the variability introduced by lossy compressing the quality scores, especially for moderate to high rates. For small rates (below roughly 0.5 bits per quality score), lossy compressors that minimize the average distortion, such as QVZ, suffer a degradation in performance, because some of the quality scores can become highly distorted. At high rates, the analyzed lossy compressors perform similarly,
except for Illumina’s proposed binning, which is generally outperformed by the other lossy
compressors. This suggests that using the statistics of the quality scores for compression is
beneficial, and that not all datasets should be treated in the same way.
The degradation in performance observed when setting the quality scores to a random value or all to the maximum demonstrates that the quality scores do matter, and thus discarding them is not a viable option in our opinion. We recommend applying lossy compression with moderate to high rates to ensure the quality scores are not highly distorted. In algorithms such as PBlock and RBlock, the user can directly specify the maximum allowed distortion. In algorithms that minimize an average distortion, such as QVZ, we recommend employing at least around one bit per quality score.
Finally, in several cases we have observed that lossy compression actually leads to superior results compared to lossless compression, i.e., they generate more true positives and
fewer false positives than with the original quality scores, when compared to the corresponding ground truth. One important remark is that none of the analyzed lossy compressors make use of biological information for compression, in contrast to other algorithms
such as the one introduced in [62]. We believe this is of importance, as one could argue
that the latter algorithms are tailored towards variant calling, and thus their results should be read with care. The fact that we are able to show improved variant calling performance in some cases with algorithms that do not use any biological information further
shows the potential of lossy compression of quality scores to improve on any downstream
application.
Our findings put together with the fact that, when losslessly compressed, quality scores
comprise more than 50% of the compressed file [17], seem to indicate that lossy compression of quality scores could become an acceptable practice in the future for boosting
compression performance or when operating in bandwidth constrained environments. The
main challenge in such a mode may be to decide which lossy compressor and/or rate to use
in each case. Part of this is due to the fact that the results presented so far are experimental,
and we have yet to develop theory that will guide the construction or choice of compressors
geared toward improved inference.
5.4
Conclusion
Recently there has been a growing interest in lossy compression of quality scores as a way
to reduce raw genomic data storage costs. However, the genomic data under consideration
is used for biological inference, and thus it is important to first understand the effect that
lossy compression has on the subsequent analysis performed on it. To date, there is no
clear methodology to do so, as can be inferred from the variety of analyses performed in
the literature when new lossy compressors are introduced. To alleviate this issue, in this
chapter we have described a methodology to analyze the effect that lossy compression of
quality scores has on variant calling, one of the most widely used downstream applications
in practice. We hope the described methodology will be of use in the future when analyzing
new lossy compressors and/or new datasets.
Specifically, the proposed methodology considers the use of different pipelines for SNP
calling and INDEL calling, and datasets for which true variants exist (“ground truth”). We
have used this methodology to analyze the behavior of the state-of-the-art lossy compressors, which to our knowledge constitutes the most complete analysis to date. The results
demonstrate the potential of lossy compression as a means to reduce the storage requirements while obtaining performance close to that based on the original data. Moreover, in
many cases we have shown that it is possible to improve upon the original data.
Our findings and the growing need for reducing the storage requirements suggest that
lossy compression may be a viable mode for storing quality scores. However, further research should be performed to better understand the statistical properties of the quality
scores, to enable the principled design of lossy compressors tailored to them. Moreover,
methodologies for the analysis on other important downstream applications should be developed.
Compression approach   Dataset 1   Dataset 2   Dataset 3   Dataset 4   Mean     Standard deviation
Lossless               0.9817      0.9788      0.9805      0.9775      0.9796   0.0019
Illumina - DSRC2       0.9661      0.9662      0.9666      0.9621      0.9652   0.0021
Mc3 - 0.3              0.9776      0.9737      0.9747      0.9696      0.9739   0.0033
Mc3 - 0.7              0.9817      0.9775      0.9799      0.9758      0.9787   0.0026
Mc3 - 0.9              0.9817      0.9788      0.9805      0.9764      0.9794   0.0023
Pblock - 2             0.9817      0.9788      0.9805      0.9775      0.9796   0.0019
Pblock - 8             0.9810      0.9778      0.9802      0.9775      0.9791   0.0017
Pblock - 16            0.9654      0.9662      0.9652      0.9607      0.9644   0.0025
Rblock - 3             0.9817      0.9788      0.9805      0.9775      0.9796   0.0019
Rblock - 8             0.9817      0.9788      0.9805      0.9781      0.9798   0.0016
Rblock - 10            0.9817      0.9788      0.9805      0.9781      0.9798   0.0016

Table 5.2: Sensitivity for INDEL detection by the Dindel pipeline with various compression approaches for the 4 simulated datasets.
Chapter 6
Denoising of quality scores
The results presented in the previous chapter, which show that lossy compression of the
quality scores can lead to variant calling performance that improves upon the uncompressed, suggest that denoising of the quality scores is possible and of potential benefit.
However, reducing the noise in the quality scores has remained largely unexplored.
With that in mind, in this chapter we propose a denoising scheme to reduce the noise
presented in the quality scores, and demonstrate improved inference with the denoised
data. Specifically, we show that replacing the quality scores with those generated by the
proposed denoiser results in more accurate variant calling in general. Moreover, we show
that reducing the noise leads to a smaller entropy than that of the original quality scores,
and thus a significant boost in compression is also achieved. Thus the angle of the present
work is denoising for improved inference, with boosted compression performance as an
important benefit stemming from data processing principles. Such schemes to reduce the
noise of genomic data while easing its storage and dissemination can significantly benefit
the field of genomics.
6.1
Proposed method
We first formalize the problem of denoising quality scores, and then describe the proposed
denoising scheme in detail. We conclude this section with a description of the evaluation
criteria.
6.1.1
Problem Setting
Let $X_i = [X_{i,1}, X_{i,2}, \ldots, X_{i,n}]$ be a sequence of true quality scores of length $n$, and $X = \{X_i\}_{i=1}^N$ a set of quality score sequences. We further let $Q = \{Q_i\}_{i=1}^N$ be the set of noisy quality score sequences that we observe and want to denoise¹, where $Q_{i,j} = X_{i,j} + Z_{i,j}$ and $Q_i = [Q_{i,1}, Q_{i,2}, \ldots, Q_{i,n}]$. Note that $\{Z_{i,j} : 1 \le i \le N,\ 1 \le j \le n\}$ represents the noise added during the sequencing process. This noise comes from different sources of generally unknown statistics, some of which are not reflected in the mathematical models used to estimate the quality scores [82, 55].

Our goal is to denoise the noisy quality scores $Q$ to obtain a version closer to the true underlying quality score sequences $X$. We further denote the output of the denoiser by $\hat{X} = \{\hat{X}_i\}_{i=1}^N$, with $\hat{X}_i = [\hat{X}_{i,1}, \hat{X}_{i,2}, \ldots, \hat{X}_{i,n}]$.
6.1.2
Denoising Scheme
The suggested denoising scheme is depicted in Fig. 6.1. It consists of a lossy compressor applied to the noisy quality scores $Q$, the corresponding decompressor, and a post-processing operation that uses both the reconstructed quality scores $\hat{Q}$ and the original ones. The output of the denoiser is the sequence of noiseless quality scores $\hat{X}$. In order to compute the final storage size, a lossless compressor for quality scores is applied to the denoised signal $\hat{X}$. Note that we cannot simply store the output of the lossy compressor and use that as the final size, since the post-processing operation also needs access to the original quality scores. That is, the denoiser needs to perform both the lossy compression and the decompression, and incorporate the original (uncompressed) noisy data for computing its final output.

The proposed denoiser is based on the one outlined in [83], which is universally optimal in the limit of large amounts of data when applied to a stationary ergodic source corrupted by additive white noise. Specifically, consider a stationary ergodic source $X^n$ and its noisy corrupted version $Y^n$ given by
$$Y_i = X_i + Z_i,$$
1. For example, the quality score sequences found in a FASTQ or SAM file.
[Block diagram: the noisy quality scores $Q$ are passed through the lossy compressor and the corresponding lossy decompressor, producing $\hat{Q}$; the post-processing operation takes both $\hat{Q}$ and $Q$ and outputs $\hat{X}$; a lossless compressor applied to $\hat{X}$ yields the final size.]
Figure 6.1: Outline of the proposed denoising scheme.
for $1 \le i \le n$, where $Z^n$ is an additive white noise process. Then, the first step towards recovering the noiseless signal $X^n$ consists of applying a lossy compressor under the distortion measure $\rho : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}^+$ given by
$$\rho(y, \hat{y}) \triangleq \log \frac{1}{P_Z(y - \hat{y})}, \qquad (6.1)$$
where $P_Z(\cdot)$ is the probability mass function of the random variable $Z$. Moreover, the lossy compressor should be tuned to distortion level $H(Z)$, that is, the entropy of the noise.
In the case of the quality scores, the statistics of the noise are unknown, and thus we
cannot set the right distortion measure and level at the lossy compressor. In the presence
of such uncertainty, one could make the worst case assumption that the noise is Gaussian
[84] of unknown variance, which would translate into a distortion measure given by the
square of the error (based on Eq. (6.1)). However, even with this assumption, the correct
distortion level depends on the unknown variance. Thus, instead, we take advantage of
the extensive work performed on lossy compressors for quality scores in the past, and use
them for the lossy compression step. Since we do not know the right distortion level to
set at the lossy compressor, we apply each of them with different distortion levels. This
decision, although lacking theoretical guarantees, works in practice, as demonstrated in the
following section. Moreover, it makes use of the lossy compressors that have already been
proposed and tested.
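To spell out the step mentioned above, the short derivation below (added here for clarity) shows how the Gaussian worst-case assumption turns the distortion measure of Eq. (6.1) into squared error up to terms that do not depend on the reconstruction; strictly speaking, the Gaussian density is only an approximation of the discrete noise pmf, and the appropriate distortion level $H(Z)$ still depends on the unknown variance.

```latex
% Added illustration: if the noise is modeled as Gaussian with (unknown) variance \sigma^2,
% so that P_Z(z) \approx \tfrac{1}{\sqrt{2\pi\sigma^2}} e^{-z^2/(2\sigma^2)}, then Eq. (6.1) gives
\rho(y,\hat{y}) \;=\; \log \frac{1}{P_Z(y-\hat{y})}
               \;\approx\; \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)
               \;+\; \frac{(y-\hat{y})^2}{2\sigma^2},
% i.e., up to an additive constant and a positive scaling that depend only on \sigma^2,
% the distortion measure reduces to the squared error (y-\hat{y})^2.
```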
The second step consists of performing a post-processing operation based on the noisy signal $Y^n$ and the reconstructed sequence $\hat{Y}^n$. For a given integer $m = 2m_0 + 1 > 0$, $y^m \in \mathcal{Y}^m$ and $\hat{y} \in \hat{\mathcal{Y}}$, define the joint empirical distribution as
$$\hat{p}^{(m)}_{Y^n \hat{Y}^n}(y^m, \hat{y}) \triangleq \frac{\left|\left\{ m_0 + 1 \le i \le n - m_0 : \left(Y_{i-m_0}^{i+m_0}, \hat{Y}_i\right) = (y^m, \hat{y}) \right\}\right|}{n - m + 1}. \qquad (6.2)$$
Thus, Eq. (6.2) represents the fraction of times $Y_{i-m_0}^{i+m_0} = y^m$ while $\hat{Y}_i = \hat{y}$, for all $i$. Once the joint empirical distribution is computed, the denoiser generates its output as
$$\hat{X}_i = \operatorname*{argmin}_{\hat{x} \in \hat{\mathcal{X}}} \sum_{x \in \hat{\mathcal{Y}}} \hat{p}^{(m)}_{Y^n \hat{Y}^n}\!\left(Y_{i-m_0}^{i+m_0}, x\right) d(\hat{x}, x), \qquad (6.3)$$
for $1 \le i \le n$. Note that $d : \hat{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}^+$ is the original loss function under which the performance of the denoiser is to be measured, and $\hat{\mathcal{X}}$ is the alphabet of the denoised sequence $\hat{X}^n$.
For the case of the quality scores, the joint empirical distribution can be computed mostly as described in Eq. (6.2). However, since now we have a set of quality score sequences, we redefine it as
$$\hat{p}^{(m)}_{Q,\hat{Q}}(q^m, \hat{q}) \triangleq \frac{\left|\left\{ (i,j) : \left(Q_{i,j-m_0}, \ldots, Q_{i,j+m_0}, \hat{Q}_{i,j}\right) = (q^m, \hat{q}) \right\}\right|}{nN}, \qquad (6.4)$$
where $Q_{i,j} = 0$ for $j < 1$ and $j > n$. Finally, the output of the denoiser is given by
$$\hat{X}_{i,j} = \operatorname*{argmin}_{\hat{x} \in \hat{\mathcal{X}}} \sum_{\hat{q} \in \hat{\mathcal{Q}}} \hat{p}^{(m)}_{Q,\hat{Q}}\!\left(Q_{i,j-m_0}, \ldots, Q_{i,j+m_0}, \hat{q}\right) d(\hat{x}, \hat{q}), \qquad (6.5)$$
for $1 \le i \le N$ and $1 \le j \le n$, with $d$ being the squared distortion. Note also that the alphabets of the original, reconstructed and denoised quality scores are the same, i.e., $\mathcal{Q} = \hat{\mathcal{Q}} = \hat{\mathcal{X}}$.
Finally, as mentioned above, we apply a lossless compressor to the output of the decoder
to compute the final size.
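The following sketch (a simplified illustration, not the implementation used in this work) puts Eqs. (6.4) and (6.5) together: it builds the empirical counts of a $(2m_0+1)$-long window of noisy scores together with the reconstructed score at its center, and then outputs, for each position, the symbol minimizing the expected squared distortion under those counts. Normalization by $nN$ is omitted since it does not affect the argmin.

```python
from collections import Counter

# Simplified sketch of the post-processing operation in Eqs. (6.4)-(6.5) (not the
# implementation used in this work). It builds the empirical counts of a (2*m0+1)-long
# window of noisy scores Q together with the reconstructed score Q_hat at its center,
# and then replaces every score by the value minimizing the expected squared distortion
# under those counts. Normalization by nN is omitted since it does not change the argmin.

def denoise(Q, Q_hat, alphabet, m0=1):
    counts = Counter()
    for q_row, qh_row in zip(Q, Q_hat):
        padded = [0] * m0 + list(q_row) + [0] * m0      # Q_{i,j} = 0 outside 1..n
        for j in range(len(q_row)):
            counts[(tuple(padded[j:j + 2 * m0 + 1]), qh_row[j])] += 1

    X_hat = []
    for q_row, qh_row in zip(Q, Q_hat):
        padded = [0] * m0 + list(q_row) + [0] * m0
        out = []
        for j in range(len(q_row)):
            context = tuple(padded[j:j + 2 * m0 + 1])
            cost = lambda x_hat: sum(counts[(context, q)] * (x_hat - q) ** 2 for q in alphabet)
            out.append(min(alphabet, key=cost))
        X_hat.append(out)
    return X_hat

Q     = [[30, 31, 30, 2, 30], [30, 30, 29, 30, 30]]     # toy noisy quality scores
Q_hat = [[30, 30, 30, 30, 30], [30, 30, 30, 30, 30]]    # toy lossy reconstruction
print(denoise(Q, Q_hat, alphabet=range(42), m0=1))
```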
As outlined in [85], the intuition behind the proposed scheme is as follows. First, note that adding noise to a signal always increases its entropy, since
$$I(X^n + Z^n; Z^n) = H(X^n + Z^n) - H(X^n + Z^n \mid Z^n) = H(X^n + Z^n) - H(X^n) \ge 0, \qquad (6.6)$$
which implies $H(Y^n) \ge H(X^n)$, with $Y^n = X^n + Z^n$. Also, lossy compression of $Y^n$ at distortion level $D$ can be done by searching among all reconstruction sequences within radius $D$ of $Y^n$, and choosing the most compressible one. Thus, if the distortion level is set appropriately, a reasonable candidate for the reconstruction sequence is the noiseless sequence $X^n$. The role of the lossy compressor is to partially remove the noise and to learn the source statistics in the process, such that the post-processing operation can be thought of as performing Bayesian denoising. Therefore, we also expect the denoised quality scores to be more compressible than the original ones, due to the reduced entropy.
6.1.3
Evaluation Criteria
To measure the quality of the denoiser we cannot compare the set of denoised sequences $\hat{X}$ to the true sequences $X$, as the latter are unavailable. Instead, we analyze the effect on variant calling when the original quality scores are replaced by the denoised ones. For
the analysis, we follow the methodology described in Chapter 5. Recall that it consists of
several pipelines and datasets specific for SNP calling and INDEL detection.
In brief, the considered pipelines for SNP calling are GATK [14, 57, 74], htslib.org
[1] and Platypus [75], and for INDEL detection we used Dindel [77], Unified Genotyper,
Haplotype Caller [14, 57, 74] and Freebayes [78]. The datasets used for SNP calling corre-
spond to the H. Sapiens individual NA12878. In particular, chromosomes 11 and 20 of the
pair-end whole genome sequencing datasets ERR174324 (15x) and ERR262997 (30x). For
these data two consensus of SNPs are available, the one released by the GIAB (Genome In
A Bottle) consortium [79], and the one released by Illumina. The dataset used for INDEL
detection corresponds to a chromosome containing 3000 heterozygous INDELs from which
100bp paired-end sequencing reads were generated with ART [80]. All datasets in this
study have a consensus sequence, making it possible to analyze the accuracy of the variant
calls. We expect that using the denoised data in lieu of the original data would yield higher
sensitivity, precision and f-score.
6.2
Results and Discussion
We analyze the performance of the proposed denoiser for both SNP calling and INDEL
detection. For the lossy compressor block we used the algorithms RBlock [21], PBlock
[21], QVZ and Illumina’s proposed binning. Since the right distortion level at which they
should operate is unknown, we ran each of them with different parameters (i.e., different
distortion levels)2 . Regarding the post-processing operation, we set m in Eq. (6.4) to be
equal to three in all the simulations. This choice was made to reduce the running time and
complexity, because of the large alphabet of the quality scores. As the entropy encoder we
applied QVZ in lossless mode, which offers competitive performance (see Chapter 4).
Figure 6.2: Reduction in size achieved by the denoiser when compared to the original data
(when losslessly compressed).
Due to the extensive amount of simulations, here we focus on the most representative
results. For completeness, in addition to comparing the performance of the proposed denoiser with that of the original data, we also compared it with the performance obtained
2. Except for Illumina's proposed binning, which generates only one point in the rate-distortion plane.
when only lossy compression is applied to the data (i.e., without the post-processing operation). We observed that the post-processing operation improves the performance beyond
that achieved by applying only lossy compression in most cases. Moreover, the denoised
data occupies less than the original one, corroborating our expectation that the denoiser
reduces the noise of the quality scores and thus the entropy, and consistent with the data
processing principle. As an example, Fig. 6.2 compares the size of chromosome 20 of the dataset
ERR262997 (used for the analysis on SNP calling), with that generated by the denoiser
with different lossy compressors targeting different distortion levels (x-axis). As can be
observed, for all distortion levels above 4, the reduction in size is between 30% and 44%.
Interestingly, similar results were obtained with all the tested datasets, which suggests that
more than 30% of the entropy (of the original data) is due to noise.
In the following we focus on the performance of the denoiser in terms of its effect on
SNP calling and INDEL detection.
6.2.1
SNP calling
We observe that the results for the chromosomes 11 and 20 of the 30x coverage dataset are
very similar for all the considered pipelines, and thus for ease of exposition, we restrict our
attention to chromosome 20. Regarding the 15x coverage dataset, we focus on chromosome
11 and the SNP consensus produced by Illumina (similar results were obtained with the
GIAB consensus).
Fig. 6.3 shows the results for the 30x coverage dataset on the GATK pipeline. As can be
observed, for MSE distortion levels between 0 and 20 approximately, and any lossy compressor, the denoiser improves all three metrics: f-score, sensitivity and precision. Among
the analyzed pipelines, GATK is the most consistent and the one offering the best results.
This suggests that the GATK pipeline uses the quality scores in the most informative way.
For htslib.org and Platypus, we also observe that the points that improve upon the original one exhibit an MSE distortion less than 20 in general. However, in this case the lossy
compressors perform differently. For example, QVZ improves the precision and f-score
with the htslib.org pipeline and the sensitivity with the platypus one. On the other hand,
PBlock and RBlock achieve the best results in terms of precision and f-score with the Platypus
Figure 6.3: Denoiser performance on the GATK pipeline (30x dataset, chr. 20). Different
points of the same color correspond to running the lossy compressor with different parameters.
pipeline and sensitivity with htslib.org.
With the 15x dataset, the denoiser achieves in general better performance when using
the lossy compressors Rblock and Pblock. For example, Fig. 6.4 shows the f-score for the
GATK and htslib.org pipelines. As can be observed, the denoised data improves upon the
uncompressed in both cases with Rblock and Illumina’s proposed binning, and with Pblock
when the distortion level is below 20. With QVZ the denoiser achieves better precision with
the GATK and htslib.org pipelines, and better sensitivity with Platypus.
Finally, it is worth noticing the potential of the post-processing operation to improve
upon the performance when applying only lossy compression. We observed that this is true
for all the four considered datasets (chromosomes 11 and 20 and coverages 15x and 30x),
and the three pipelines (GATK, htslib.org and Platypus). To give some concrete examples, with the Platypus pipeline the post-processing operation boosts the sensitivity when applying any lossy compressor, for all datasets. The general improvement
is more noticeable for the 15x coverage datasets, where all metrics improve in most of the
Figure 6.4: Denoiser performance on the GATK and hstlib.org pipelines (15x dataset, chr.
11).
cases.
6.2.2 INDEL detection
Among the analyzed pipelines, the denoiser exhibits the best performance on the Haplotype
Caller pipeline. For example, in terms of f-score, we observe that the proposed scheme with
Illumina’s binning and Rblock as the lossy compressor achieves better performance than the
original data. QVZ and Pblock also improve for the points with smaller distortions. Similar
results are obtained for the sensitivity and precision. Moreover, in this case the potential
of applying the post-processing operation after any of the considered lossy compressors
becomes particularly apparent, as the performance always improves (see Fig. 6.5).
We also observe improved performance using Freebayes with QVZ in terms of sensitivity, precision and f-score, and an improved precision with the remaining lossy compressors.
With the GATK-UG and Dindel pipelines, Rblock achieves the best performance, improving upon the original data under all three performance metrics.
Figure 6.5: Improvement achieved by applying the post-processing operation (performance of the lossy compressed data vs. the denoised data). The x-axis represents the performance in sensitivity, precision, and f-score achieved by solely applying lossy compression, and the y-axis represents the same but when the post-processing operation is applied after the lossy compressor. The grey line corresponds to x = y, and thus all the points above it correspond to an improved performance.
6.3 Conclusion
In this chapter we have proposed a denoising scheme for quality scores. The proposed
scheme is composed of a lossy compressor, followed by the corresponding decompressor
and a post-processing operation. Experimentation on real data suggests that the proposed
scheme has the potential to improve the quality of the data, as measured by its effect on downstream inferential applications, while at the same time significantly reducing the storage
requirements.
Further study of the denoising of quality scores is therefore merited. We hope the promising results presented in this chapter serve as a baseline for future research in this direction. Such research should include improved modeling of the statistics
of the noise, the construction of denoisers tuned to such models, and further experimentation on real data and with additional downstream applications.
Chapter 7
Compression schemes for similarity
queries
In this chapter we study the problem of compressing sequences in a database so that similarity queries can still be performed efficiently in the compressed domain. Specifically,
we focus on queries of the form: “which sequences in the database are similar to a given
sequence y?”, which are of practical interest in genomics.
More formally, we consider schemes that generate, for each sequence x in the database,
a short fixed-length signature, denoted by T(x), that is stored in the compressed database.
Then, given a query sequence y, we answer the question of whether x and y are similar
based only on the signature T(x), rather than on the original sequence x.
When answering a query, there are two types of errors that can be made: a false positive, when a sequence is misidentified as similar to the query sequence; and a false negative,
when a similar sequence stays undetected. We impose the restriction that false negatives
are not permitted, as even a small probability of a false negative translates into a substantial
probability of misdetection of some sequences in the large database, which is unacceptable
in many applications. On the other hand, false positives do not cause an error per se, as
the precise level of similarity is assessed upon retrieval of the full sequence from the large
database. However, they introduce a computational burden due to the need for further verification (retrieval), so we would like to reduce their probability as much as possible. Fig.
7.1 shows a typical usage case.
Figure 7.1: Answering queries from signatures: a user first makes a query to the compressed database, and upon receiving the indexes of the sequences that may possibly be
similar, discards the false positives by retrieving the sequences from the original database.
This problem has been studied from an information-theoretic perspective in [25], [86]
for discrete sources, and in [27] for Gaussian sources. These papers analyze the fundamental tradeoff between compression rate, sequence length and reliability of queries performed
on the compressed data. Although these limits provide a bound on the optimal performance of any given scheme, the achievability proofs are non-constructive, which raises the
question of how to design such schemes in practice.
With that in mind, in this chapter we propose two schemes for this task, based in part
on existing lossy compression algorithms. We show that these schemes achieve the fundamental limits in some statistical models of relevance. In addition, the proposed schemes
are easy to analyze and implement, and they can work with any discrete database and any
similarity measure satisfying the triangle inequality.
Variants of this problem have been previously considered in the literature. For example, the Bloom filter [87], which is restricted to exact matches, enables membership queries
from compressed data. Another related notion is that of Locality-Sensitive Hashing (LSH)
[88, Chapter 3], which is a framework for the nearest neighbor search (NNS) problem. The
key idea of LSH is to hash points in such a way that the probability of collision is higher
for points that are similar than for those that are far apart. Other methods for NNS include
vector approximation files (VA-File) [89], which employ scalar quantization. An extension
of this method is the so-called compression/clustering-based search [90], which performs
vector quantization implemented through clustering. While these techniques trade off accuracy against computational complexity and space, and allow false negatives, in our
setting false negatives are not allowed, yet significant compression can still be achieved.
7.1 Problem Formulation and Fundamental Limits

7.1.1 Problem Description
Given two sequences x and y of length n, we measure their similarity by computing the
distortion $d(x, y) = \frac{1}{n}\sum_{i=1}^{n} \rho(x_i, y_i)$, where $\rho : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{+}$ is an arbitrary
distortion measure. We say that two sequences x and y are D-similar (or simply similar
when clear from the context) when $d(x, y) \le D$.
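To make the definition concrete, the following minimal Python sketch (illustrative only, not part of the original work; the function names are ours) computes d(x, y) for a per-symbol distortion and checks D-similarity, using Hamming distortion as the default.

```python
def distortion(x, y, per_symbol=lambda a, b: 0.0 if a == b else 1.0):
    """d(x, y) = (1/n) * sum_i rho(x_i, y_i); defaults to Hamming distortion."""
    assert len(x) == len(y), "sequences must have equal length n"
    return sum(per_symbol(a, b) for a, b in zip(x, y)) / len(x)

def is_similar(x, y, D):
    """x and y are D-similar when d(x, y) <= D."""
    return distortion(x, y) <= D

# Example on two 4-ary (DNA-like) sequences of length 10.
x = "ACGTACGTAC"
y = "ACGTTCGTAC"
print(distortion(x, y))       # 0.1 (one mismatch out of ten symbols)
print(is_similar(x, y, 0.2))  # True
```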
We consider databases consisting of M discrete sequences of length n, i.e., $\{x_i\}_{i=1}^{M}$,
with $x_i = [x_{i,1}, \ldots, x_{i,n}]$. The proposed architecture generates, for each sequence x, a
signature T(x), so that the compressed database is $\{T(x_i)\}_{i=1}^{M}$. Then, given a query sequence y, the scheme decides whether x is D-similar to y based only on its
compressed version T(x), rather than on the original sequence x. Note that a scheme is
completely defined by its signature assignment and the corresponding decision rule.
More formally, a rate-R identification system (T, g) consists of a signature assignment
$T : \mathcal{X}^n \to [1 : 2^{nR}]$ and a decision function $g : [1 : 2^{nR}] \times \mathcal{Y}^n \to \{\text{no}, \text{maybe}\}$. We
use the notation {no, maybe} instead of {no, yes} to reflect the fact that false positives are
permitted, while false negatives are not. This is formalized next. A system is said to be
D-admissible if

$$g(T(x), y) = \text{maybe} \quad \forall\, x, y \ \text{s.t.}\ d(x, y) \le D. \tag{7.1}$$

Since a D-admissible scheme does not produce false negatives, a natural figure of merit,
which we wish to minimize, is the frequency at which false positives occur.
We recall next the fundamental limits on performance in this problem, as we will refer
to them in the following sections when assessing the performance of the proposed schemes.
7.1.2 Fundamental limits
Let X and Y be random vectors of length n, representing the sequence from the database
and the query sequence, respectively. We assume X and Y are independent, with entries
drawn independently from $P_X$ and $P_Y$, respectively. Define the false positive event as
$\text{fp} = \{g(T(X), Y) = \text{maybe} \mid d(X, Y) > D\}$. For a D-admissible scheme,

$$P(g(T(X), Y) = \text{maybe}) = P(d(X, Y) \le D) + P(\text{fp})\, P(d(X, Y) > D). \tag{7.2}$$

Note that $P(\text{fp})$ is the only term that depends on the scheme used, as the other terms
depend strictly on the probability distributions of X and Y. Hence minimizing $P(\text{fp})$ over
all D-admissible schemes is equivalent to minimizing $P(g(T(X), Y) = \text{maybe})$. Thus, for
a given D, the fundamental limits characterize the trade-off between the compression rate
R and $P(g(T(X), Y) = \text{maybe})$.
Note that as $n \to \infty$, $P(d(X, Y) \le D)$ goes to one or to zero (according to whether
D is above or below the expected level of similarity between X and Y). The problem is
non-trivial only when the event of similarity is atypical, which is the case on which we focus. In this
case, as is evident from Eq. (7.2), $P(\text{maybe}) \to 0$ iff $P(\text{fp}) \to 0$.
Definition 1. For given distributions $P_X$, $P_Y$ and similarity threshold D, a rate R is said
to be D-achievable if there exists a sequence of rate-R admissible schemes $(T^{(n)}, g^{(n)})$ s.t.
$\lim_{n\to\infty} P\big(g^{(n)}\big(T^{(n)}(X), Y\big) = \text{maybe}\big) = 0$.

Definition 2. For a similarity threshold D, the identification rate $R_{ID}(D)$ is the infimum of
D-achievable rates. That is, $R_{ID}(D) \triangleq \inf\{R : R \text{ is } D\text{-achievable}\}$.
For the case considered here (discrete sources, fixed-length signature assignment, and zero false negatives), the identification rate is characterized in [86, Theorem 1]
as

$$R_{ID}(D) = \min_{P_{U|X} :\ \sum_{u \in \mathcal{U}} P_U(u)\, \bar{\rho}(P_{X|U}(\cdot|u),\, P_Y) \ge D} I(X; U), \tag{7.3}$$

where U is any random variable with finite alphabet $\mathcal{U}$ ($|\mathcal{U}| = |\mathcal{X}| + 2$ suffices to obtain the
true value of $R_{ID}(D)$) that is independent of Y. Here $\bar{\rho}(P_X, P_Y) = \min E[\rho(X, Y)]$ is a distance
between distributions, with $\rho$ being the distortion under which similarity is measured and
the minimization taken w.r.t. all jointly distributed random variables X, Y with marginal
distributions $P_X$ and $P_Y$, respectively.
Finally, we define $D_{ID}(R)$ as the inverse function of $R_{ID}(D)$, i.e., the similarity threshold below which any similarity level can be achieved at a given rate R.
Characterizing the identification rate and exponent is a hard problem in general. In
[25], where the variable-length coding equivalent of our setting was considered, the authors present an achievable rate (they do not consider the converse to the identification rate
problem). The results of [25] for the identification exponent rely on an auxiliary random
variable of unbounded cardinality, thus making the quantities uncomputable in general. For
the quadratic-Gaussian case the identification rate and exponent were found in [26], [27],
and for discrete binary sources and Hamming distortion they were found in [91].
7.2 Proposed schemes

We propose two practical schemes that achieve the limits introduced above in some cases.
The first scheme is based on Lossy Compressors (LC) and the second one on a Type Covering lemma (TC), and they both use a decision rule based on the triangle inequality (△).
Based on this, hereafter we refer to them as the LC−△ and TC−△ schemes, respectively.
Note that whereas a scheme based on lossy compressors (the LC−△ scheme) is straightforward to implement, implementing the type covering lemma based scheme (the TC−△
scheme) in practice is more challenging.
Next we introduce both schemes and analyze their optimality.
7.2.1 The LC−△ scheme

Description
The signature of the LC−△ scheme is based on fixed-length lossy compression algorithms.
These are characterized by an encoding function $f_n : x \to [1 : 2^{nR'}]$ and a decoding
function $g_n : [1 : 2^{nR'}] \to \hat{x}$, where $\hat{x} = g_n(f_n(x))$ denotes the reconstructed sequence.
Specifically, the signature of a sequence x is composed of the output $i \in [1 : 2^{nR'}]$ of the
lossy compressor and the distortion between x and $\hat{x}$, i.e., $T(x) = \{i, d(x, \hat{x})\}$ (see Fig.
7.2). The total rate of the system is $R = R' + \Delta R$, where $R'$ is the rate of the lossy
compressor and $\Delta R$ represents the extra rate needed to represent and store the distortion value
$d(x, \hat{x})$.

Figure 7.2: Signature assignment of the LC−△ scheme for each sequence x in the database: the compressor output $i = f_n(x) \in [1 : 2^{nR'}]$ and the distortion $d(x, \hat{x})$, with $\hat{x} = g_n(i)$, together form the signature $T(x) = \{i, d(x, \hat{x})\}$.
Regarding the decision function $g : [1 : 2^{nR}] \times \mathcal{Y}^n \to \{\text{no}, \text{maybe}\}$, recall that it
must satisfy Eq. (7.1). Given the signature assignment described above, the decision rule
for sequence x and query sequence y is based on the tuple $(T(x), y) = (\{i, d(x, \hat{x})\}, y)$.
Notice that $\hat{x}$ can be recovered from the signature as $g_n(i)$. The decision rule is given by

$$g(T(x), y) = \begin{cases} \text{maybe}, & d(x, \hat{x}) - D \le d(\hat{x}, y) \le d(x, \hat{x}) + D; \\ \text{no}, & \text{otherwise}, \end{cases} \tag{7.4}$$

which satisfies Eq. (7.1) for any given distortion measure satisfying the triangle inequality.
In an attempt to reduce the rate of the system (i.e., decrease the size of the compressed
database) without affecting the performance, one can decrease the value of $\Delta R$ by quantizing the distortion $d(x, \hat{x})$. In that case, assuming $d_0 \le d(x, \hat{x}) \le d_1$,

$$g(T(x), y) = \begin{cases} \text{maybe}, & d_0 - D \le d(\hat{x}, y) \le d_1 + D; \\ \text{no}, & \text{otherwise}, \end{cases} \tag{7.5}$$

which preserves the admissibility of the scheme. While $\Delta R$ can be made arbitrarily small (as
$n \to \infty$), for finite n there is a trade-off between its value and $P(\text{maybe})$. This will
become relevant in the simulations.
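As an illustration of how the signature of Fig. 7.2 and the decision rule of Eq. (7.4) fit together, the sketch below implements the LC−△ logic for binary sequences under Hamming distortion. The lossy compressor is a deliberately crude placeholder (keep the first bit of each block and repeat it on reconstruction); any fixed-length lossy compressor returning an index and a reconstruction could be substituted, and the function names are illustrative.

```python
import numpy as np

def hamming(a, b):
    """Normalized Hamming distortion between two equal-length binary arrays."""
    return float(np.mean(a != b))

def toy_lossy_compress(x, block=4):
    """Placeholder fixed-length lossy compressor (illustration only):
    keep the first bit of each block, so the rate is R' = 1/block bits/symbol."""
    kept = x[::block]                        # encoder output, playing the role of the index i
    xhat = np.repeat(kept, block)[:len(x)]   # decoder reconstruction g_n(i)
    return kept, xhat

def make_signature(x, block=4):
    """T(x) = {compressor output, d(x, xhat)}, as in the LC-triangle scheme."""
    kept, xhat = toy_lossy_compress(x, block)
    return {"index": kept, "dist": hamming(x, xhat), "block": block}

def query(signature, y, D):
    """Decision rule of Eq. (7.4): 'maybe' iff
    d(x, xhat) - D <= d(xhat, y) <= d(x, xhat) + D."""
    xhat = np.repeat(signature["index"], signature["block"])[:len(y)]
    d_xxhat, d_xhaty = signature["dist"], hamming(xhat, y)
    return "maybe" if d_xxhat - D <= d_xhaty <= d_xxhat + D else "no"

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 512)
sig = make_signature(x)
print(query(sig, x, D=0.2))                         # the sequence itself is never missed
print(query(sig, rng.integers(0, 2, 512), D=0.05))  # an unrelated query is typically rejected
```

Because the rule only widens the interval around $d(x, \hat{x})$ by D on each side, a sequence that is actually D-similar to the query can never be answered "no", which is precisely the admissibility property of Eq. (7.1).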
Asymptotic analysis
Recall from rate distortion theory [64] that an optimal lossy compressor with rate R attains,
for long enough sequences and with high probability, a distortion between x and $\hat{x}$ arbitrarily close to the distortion-rate function D(R). Further, consider the looser decision rule
$g(T(x), y) = \text{no}$ if $d(\hat{x}, y) > d(x, \hat{x}) + D$. Note that the scheme is still admissible (zero
false negatives) with this decision rule. Under these premises, as shown in [86], an LC−△
scheme of rate R can attain any similarity threshold below $D_{ID}^{LC-\triangle}(R)$, with

$$D_{ID}^{LC-\triangle}(R) \triangleq E[\rho(\hat{X}, Y)] - E[\rho(X, \hat{X})] = E[\rho(\hat{X}, Y)] - D(R), \tag{7.6}$$

where $E[\rho(\hat{X}, Y)]$ is completely determined by $P_{\hat{X}}$ (induced by the lossy compressor) and
$P_Y$. Finally, let $R_{ID}^{LC-\triangle}(D)$ be the inverse function of $D_{ID}^{LC-\triangle}(R)$, i.e., the compression rate
achieved for a similarity threshold D.
As shown in [86], for binary symmetric sources and Hamming distortion, $R_{ID}(D) = R_{ID}^{LC-\triangle}(D)$,
i.e., the scheme achieves the fundamental limit. However, the scheme is suboptimal in general, in the sense that $R_{ID}(D) < R_{ID}^{LC-\triangle}(D)$.
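As a small worked instance of Eq. (7.6) (our own numerical sketch, not taken from [86]): for X, Y ~ Bern(0.5) and Hamming distortion, $E[\rho(\hat{X}, Y)] = 1/2$ for any reconstruction distribution, and the distortion-rate function of the binary symmetric source is $D(R) = h_2^{-1}(1 - R)$, so Eq. (7.6) reduces to $D_{ID}^{LC-\triangle}(R) = 1/2 - h_2^{-1}(1 - R)$. The snippet below evaluates this curve.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h2_inv(r, tol=1e-10):
    """Inverse of h2 on [0, 1/2] via bisection: returns d with h2(d) = r."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h2(mid) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def d_id_lc_triangle(R):
    """Eq. (7.6) for X, Y ~ Bern(0.5) and Hamming distortion:
    E[rho(Xhat, Y)] = 0.5 and D(R) = h2_inv(1 - R)."""
    return 0.5 - h2_inv(1.0 - R)

for R in (0.1, 0.24, 0.3, 0.5):
    print(f"R = {R:4.2f}  ->  D_ID^LC-triangle(R) = {d_id_lc_triangle(R):.3f}")
```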
7.2.2 The TC−△ scheme

Motivation
A closer look at Eq. (7.6) suggests the following intuitive idea: in the distortion-rate case,
we wish to minimize the distortion subject to a constraint on the mutual information, the
optimization being with respect to the transition probability $P_{\hat{X}|X}$. This is in agreement with Eq.
(7.6), as we also want to minimize $E[\rho(X, \hat{X})]$. However, the quantity $E[\rho(\hat{X}, Y)]$ also depends on $P_{\hat{X}}$ (determined by $P_{\hat{X}|X}$ and $P_X$). This suggests optimizing both terms together.
As shown in [86], this is possible, and the key is to use a type covering lemma (TC) to generate $\hat{x}$ (and not just the one that minimizes the distortion between X and $\hat{X}$). Specifically,
any similarity threshold below $D_{ID}^{TC-\triangle}(R)$ can be attained by a TC−△ scheme of rate R,
where

$$D_{ID}^{TC-\triangle}(R) \triangleq \max_{P_{\hat{X}|X} :\ I(X;\hat{X}) \le R} E[\rho(\hat{X}, Y)] - E[\rho(X, \hat{X})]. \tag{7.7}$$
Figure 7.3: Binary sources and Hamming distortion (rate in bits versus similarity threshold D): if $P_X = P_Y = \text{Bern}(0.5)$, then $R_{ID}^{LC-\triangle}(D) = R_{ID}^{TC-\triangle}(D) = R_{ID}(D)$, whereas if $P_X = P_Y = \text{Bern}(0.7)$, then $R_{ID}^{LC-\triangle}(D) > R_{ID}^{TC-\triangle}(D) = R_{ID}(D)$.
As in the previous case, we denote by $R_{ID}^{TC-\triangle}(D)$ the inverse function of $D_{ID}^{TC-\triangle}(R)$.
It is easy to see that $R_{ID}^{TC-\triangle}(D) \le R_{ID}^{LC-\triangle}(D)$. Furthermore, for memoryless binary
sources and Hamming distortion, $R_{ID}^{TC-\triangle}(D) = R_{ID}(D)$, and both are strictly lower than
$R_{ID}^{LC-\triangle}(D)$ for non-symmetric sources, the difference being particularly pronounced at low
distortion, as shown in [86] (see Fig. 7.3).
The question now is how to create a practical TC−△ scheme that achieves $R_{ID}^{TC-\triangle}(D)$,
which would imply that the scheme achieves a smaller compression rate than an LC−△
scheme, and that it is optimal for general binary sources and Hamming distortion. While
creating a practical scheme that achieves $R_{ID}^{LC-\triangle}(D)$ is straightforward, how to implement
a TC−△ scheme is not clear in general. We propose a valid TC−△ scheme, which we
introduce next.
Description
Based on the previous results, for each sequence x in the database we want to generate a
signature assignment from which we can reconstruct a sequence $\hat{x}$ such that the empirical
distribution between x and $\hat{x}$ is equal to the one associated with the solution to the optimization problem shown in Eq. (7.7). This will imply that the scheme attains $R_{ID}^{TC-\triangle}(D)$, which
is better than $R_{ID}^{LC-\triangle}(D)$, and even optimal for memoryless binary sources and Hamming
distortion.
We propose a practical scheme for this task based on lossy compression algorithms.
Specifically, we show that the desired distribution can be achieved by carefully choosing
the distortion measure to be applied by the lossy compressor. In other words, if

$$P^{*}_{\hat{X}|X} = \arg\max_{P_{\hat{X}|X} :\ I(X;\hat{X}) \le R} E[\rho(\hat{X}, Y)] - E[\rho(X, \hat{X})], \tag{7.8}$$

we are seeking a distortion measure $\rho^{*}(X, \hat{X})$ such that

$$P^{*}_{\hat{X}|X} = \arg\min_{P_{\hat{X}|X} :\ I(X;\hat{X}) \le R} E[\rho^{*}(X, \hat{X})], \tag{7.9}$$

i.e., the conditional probability induced by the lossy compressor is equal to $P^{*}_{\hat{X}|X}$.

We show that Eq. (7.9) holds if $\rho^{*}(X, \hat{X}) = \log \frac{1}{P^{*}_{X|\hat{X}}}$, where $P^{*}_{X|\hat{X}}$ is induced from
$P^{*}_{\hat{X}|X}$ and $P_X$, and $I(X; \hat{X}) = R$. Note that $\rho^{*}(X, \hat{X})$ is reminiscent of logarithmic loss
[92]. This is based on the following lemma¹:

Lemma 1. Let $X \sim P_X$, and let $P_X(x) > 0$ for all $x \in \mathcal{X}$. For a channel $P_{\hat{X}|X}$, let $P_{X|\hat{X}}$
be the reversed channel, and consider a rate distortion problem with distortion measure

$$\rho(x, u) = \log \frac{1}{P_{X|\hat{X}}(x|u)}. \tag{7.10}$$

Then, for the rate constraint $I(X; U) \le I(X; \hat{X})$, the optimal test channel $P^{*}_{U|X}$ is
equal to $P_{\hat{X}|X}$.²

¹ Private communication, Thomas Courtade.
² Note that here U denotes the reconstruction symbol, $P_{\hat{X}|X}$ the desired distribution, and the optimization is done over $P_{U|X}$, whereas in Eqs. (7.8) and (7.9) $P^{*}_{\hat{X}|X}$ denotes the optimal distribution and the optimization is done over $P_{\hat{X}|X}$.
Proof. First, note that

$$E[\rho(X, U)] = \sum_{x,u} P_{X,U}(x, u) \log \frac{1}{P_{X|\hat{X}}(x|u)} \tag{7.11}$$
$$= \sum_{u} P_U(u)\, D\big(P_{X|U}(\cdot|u)\,\big\|\,P_{X|\hat{X}}(\cdot|u)\big) + H(X|U), \tag{7.12}$$
and that the rate constraint implies $H(X|U) \ge H(X|\hat{X})$. Therefore,

$$E[\rho(X, U)] \ge \sum_{u} P_U(u)\, D\big(P_{X|U}(\cdot|u)\,\big\|\,P_{X|\hat{X}}(\cdot|u)\big) + H(X|\hat{X}). \tag{7.13}$$

Thus,

$$\min_{P_{U|X} :\ I(X;U) \le I(X;\hat{X})} E[\rho(X, U)] = H(X|\hat{X}), \tag{7.14}$$

and the minimum is attained if and only if $P_{X|U} = P_{X|\hat{X}}$.
Going back to our setting, note that the optimization problem shown in Eq. (7.7), which
solves for $R_{ID}^{TC-\triangle}(D)$, has the constraint $I(X; \hat{X}) \le R$. The maximizing probability in
Eq. (7.8) will in general achieve $I(X; \hat{X}) = R$, and thus we can apply the lemma.
Therefore, for the signature assignment the proposed TC−△ scheme effectively employs a good lossy compressor for the distortion measure $\rho(x, \hat{x}) = \log \frac{1}{P^{*}_{X|\hat{X}}(x|\hat{x})}$, where $P^{*}_{X|\hat{X}}$
is induced by $P^{*}_{\hat{X}|X}$, given by Eq. (7.8), and $P_X$. With an optimal lossy compressor, and assuming $I(X; \hat{X}) = R$, the joint type of the sequences x and $\hat{x}$ will be close to $P^{*}_{\hat{X}|X}$, which
achieves $R_{ID}^{TC-\triangle}$ and is thus optimal for the case of general binary sources and Hamming
distortion. In the next section we show that the performance of the proposed scheme approaches the fundamental performance limit and is notably better than that of the LC−△
scheme.
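To make the construction concrete, the following sketch (ours; the channel values are placeholders rather than the actual maximizer of Eq. (7.8)) derives the reverse channel $P^{*}_{X|\hat{X}}$ from a given $P^{*}_{\hat{X}|X}$ and $P_X$ via Bayes' rule and builds the per-letter distortion matrix of Eq. (7.10).

```python
import numpy as np

def tc_triangle_distortion(P_xhat_given_x, p_x):
    """Build the distortion matrix rho*(x, xhat) = log 1 / P*_{X|Xhat}(x|xhat)
    (Eq. (7.10)) from a forward channel P*_{Xhat|X} and source marginal P_X."""
    P_xhat_given_x = np.asarray(P_xhat_given_x, dtype=float)  # shape (|X|, |Xhat|)
    p_x = np.asarray(p_x, dtype=float)
    joint = p_x[:, None] * P_xhat_given_x          # P(x, xhat)
    p_xhat = joint.sum(axis=0)                     # marginal of the reconstruction
    P_x_given_xhat = joint / p_xhat[None, :]       # reverse channel via Bayes' rule
    return -np.log2(P_x_given_xhat)                # rho*(x, xhat), in bits

# Placeholder example: X ~ Bern(0.7) and an asymmetric test channel P*_{Xhat|X}
# (illustrative values, not the maximizer of Eq. (7.8)).
p_x = np.array([0.3, 0.7])                         # P(X=0), P(X=1)
P_xhat_given_x = np.array([[0.85, 0.15],
                           [0.05, 0.95]])
rho_star = tc_triangle_distortion(P_xhat_given_x, p_x)
print(np.round(rho_star, 3))
# The lossy compressor of the TC-triangle scheme is then run with this distortion
# matrix, while queries are still answered under Hamming distortion as in Eq. (7.4).
```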
7.3 Simulation results
In this section we examine the performance of both the LC−△ and the TC−△ schemes.
We consider datasets composed of M binary sequences of length n, and Hamming distortion for computing the similarity between sequences. We generate the sequences in the
database as $X \sim \prod_{i=1}^{n} P_X(x_i)$, with $P_X = \text{Bern}(p)$. These sequences are independent
of the query sequences, generated as $Y \sim \prod_{i=1}^{n} P_Y(y_i)$, with $P_Y = \text{Bern}(q)$. With these
assumptions, for each sequence $x_i$ in the database, $i \in [1 : M]$, given its signature $T(x_i)$,
we can compute the probability that $g(T(x_i), y) = \text{maybe}$ (for a similarity threshold D),
denoted by $P(\text{maybe}\,|\,T(x_i))$, analytically, with the following formula:

$$P(\text{maybe}\,|\,T(x_i)) = \sum_{d=\lceil n(d_0 - D)\rceil}^{\lfloor n(d_1 + D)\rfloor}\ \sum_{i=0}^{d} \binom{n_0}{i}\binom{n - n_0}{d - i}\, q^{\,n - n_0 - d + 2i}\, (1 - q)^{\,n_0 + d - 2i}, \tag{7.15}$$

where $n_0$ denotes the number of zeros of $\hat{x}_i$, and $d_0$ and $d_1$ are the delimiters of the decision
region to which $d(x_i, \hat{x}_i)$ belongs. If no quantization is applied, $d_0 = d_1 = d(x_i, \hat{x}_i)$.
Finally, we compute the probability of maybe for the database as the average over all the
sequences it contains, i.e., $P(\text{maybe}) = \frac{1}{M}\sum_{i=1}^{M} P(\text{maybe}\,|\,T(x_i))$. Note that we want this
probability to be as small as possible.
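Eq. (7.15) is straightforward to evaluate numerically with standard-library binomial coefficients; a possible sketch (ours, with placeholder values for the signature-dependent quantities $n_0$, $d_0$, $d_1$) is shown below.

```python
import math

def p_maybe_given_signature(n, n0, d0, d1, D, q):
    """Eq. (7.15): probability that a Bern(q) query y of length n is answered
    'maybe' for a signature whose reconstruction xhat has n0 zeros and whose
    distortion lies in the region [d0, d1] (d0 = d1 when no quantization is used)."""
    total = 0.0
    d_low = max(0, math.ceil(n * (d0 - D)))
    d_high = min(n, math.floor(n * (d1 + D)))
    for d in range(d_low, d_high + 1):   # d = number of mismatches between xhat and y
        for i in range(0, d + 1):        # i = mismatches on the zero positions of xhat
            if i > n0 or d - i > n - n0:
                continue                 # impossible configurations contribute nothing
            total += (math.comb(n0, i) * math.comb(n - n0, d - i)
                      * q ** (n - n0 - d + 2 * i) * (1 - q) ** (n0 + d - 2 * i))
    return total

# Illustrative call: n = 512 and a Bern(0.5) query as in Section 7.3.1; n0 and
# d0 = d1 are placeholder values for a hypothetical signature, with D = 0.2.
print(p_maybe_given_signature(n=512, n0=256, d0=0.22, d1=0.22, D=0.2, q=0.5))
```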
Regarding the quantization of $d(x, \hat{x})$, we approximate the distribution of $d(X, \hat{X})$ as a
Gaussian $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are computed empirically (for each rate). We then use
the k-means algorithm to find the $2^k$ decision regions ($\Delta R = k/n$, i.e., k bits are allocated
for the description of the quantized distortion). Thus, for each distortion, we store only the
decision region to which it belongs.
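A minimal version of this quantization step might look as follows (a sketch under the stated assumptions: the empirical distortions of the database are already available, and a plain 1-D Lloyd/k-means iteration stands in for whatever clustering routine was actually used). Each signature then stores only the k-bit region index, and the decision rule of Eq. (7.5) uses the region boundaries d0 and d1.

```python
import numpy as np

def quantize_distortions(dists, k, iters=50, seed=0):
    """1-D k-means with 2**k regions: returns, for each distortion, its region
    index (k bits) and the [d0, d1] boundaries of every region."""
    dists = np.asarray(dists, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(dists, size=2 ** k, replace=False)
    for _ in range(iters):                                   # Lloyd iterations
        labels = np.argmin(np.abs(dists[:, None] - centers[None, :]), axis=1)
        for j in range(2 ** k):
            if np.any(labels == j):
                centers[j] = dists[labels == j].mean()
    labels = np.argmin(np.abs(dists[:, None] - centers[None, :]), axis=1)
    regions = [(dists[labels == j].min(), dists[labels == j].max())
               if np.any(labels == j) else (None, None) for j in range(2 ** k)]
    return labels, regions

# Placeholder distortions d(x, xhat) for M = 1000 sequences (Gaussian-looking,
# as assumed in the text); k = 3 bits, i.e., Delta_R = 3/n.
rng = np.random.default_rng(1)
dists = np.clip(rng.normal(0.2, 0.02, size=1000), 0.0, 1.0)
labels, regions = quantize_distortions(dists, k=3)
print(regions[labels[0]])   # the (d0, d1) region stored for the first sequence
```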
Finally, we also analyze the performance of the LC−△ scheme when applied to q-ary
sources.
7.3.1 Binary symmetric sources and Hamming distortion
Note that the performances of the LC−△ and the TC−△ schemes are equivalent in this
case. For the analysis, we consider a dataset composed of M = 1000 binary sequences of
length n = 512, with p = q = 0.5. As the fixed-length lossy compression algorithm, we
use a binary-Hamming version of the successive refinement compression scheme [93].
Regarding the quantization of $d(x, \hat{x})$, there exists a tradeoff between the quantization
level and the probability of maybe. Fig. 7.4(a) shows the results for different quantization
levels (denoted by k) and a similarity threshold D = 0.20 (i.e., 80% similarity). As expected, no single value of k is best across all overall compression
rates R. Therefore, in the subsequent figures the presented results correspond to the best
value of k for each rate. As can be observed, we can reduce the size of the database by 76%
(R = 0.24) and retrieve on average 1% of the sequences per query. With a 70% reduction we
can get a $P(\text{maybe})$ of $10^{-4}$ (on average one sequence in every 10,000 is retrieved). One can
get even more compression with the same $P(\text{maybe})$ for lower values of D. For example,
95% compression with a 1% average retrieval is achieved for D = 0.05.
Figure 7.4: Binary symmetric sequences and similarity threshold D = 0.2: (a) performance of the proposed architecture with quantized distortion ($P(\text{maybe})$ versus rate R, for quantization levels k = 1 to 8 and with no quantization); (b) comparison with the LSH AND-OR and OR-AND schemes (P(fp) versus P(fn)) for rate R = 0.3.
Finally, we include a comparison with LSH [88]. We use the accepted family of functions $\mathcal{H} = \{h_i, i \in [1 : n]\}$, with $h_i(x) = x(i)$, the $i$th coordinate of x, and consider both
the AND-OR and the OR-AND constructions described in [88, Chapter 3]. Note that the
comparison is not completely fair, as LSH allows false negatives (fn's), compresses the
query sequence, and its design is not optimized for the problem considered here.
This is reflected in Fig. 7.4(b), where we show the achievable probabilities of fn's and
false positives (fp's) for both schemes, considering the database introduced above and rate
R = 0.3. As can be observed, it is not possible to have both probabilities go to zero at
the same time, whereas the proposed scheme achieves, for the same rate, a P(fp) close to
$10^{-4}$ with zero fn's.
7.3.2 General binary sources and Hamming distortion
We compare the performance of the TC−△ and LC−△ schemes, assuming $P_X = P_Y = \text{Bern}(p)$ with $p \ne 0.5$. For a fair comparison, we simulate both schemes with the
lossy compressor presented in [94], which allows us to specify the distortion measure to be used. The
LC−△ scheme uses Hamming distortion, whereas the TC−△ scheme uses the distortion measure given by Eq. (7.10), with $P_{X|\hat{X}}$ computed from $P_X$ and $P_{\hat{X}|X}$ as defined in
Eq. (7.8) (for each rate). Note that this distortion is used only by the lossy compressor: the decision rule g(T(x), y) in both schemes still uses Hamming distortion to
measure similarity between sequences, and the triangle inequality property for computing
the decision threshold.

Figure 7.5: Performance of the proposed schemes ($P(\text{maybe})$ versus rate R) for sequences of length n = 512, similarity thresholds D = {0.05, 0.1}, and $P_X = P_Y = \text{Bern}(0.7)$ and Bern(0.8).
We show simulation results in Fig. 7.5 for a dataset composed of M = 1000 sequences
of length n = 512, and $P_X = P_Y = \text{Bern}(0.7)$ and Bern(0.8). We also plot the three
rates ($R_{ID} = R_{ID}^{TC-\triangle} < R_{ID}^{LC-\triangle}$) and an approximation for each scheme, computed as
follows. For a given rate R, the approximation for the LC−△ scheme assumes $P_{\hat{X}|X}$
is given by $\arg\min_{P_{\hat{X}|X} :\, I(X;\hat{X}) \le R} E[\rho(X, \hat{X})]$ (the rate-distortion optimization problem), with $\rho$
representing Hamming distortion. On the other hand, for the TC−△ scheme, $P_{\hat{X}|X}$ is
assumed to be equal to that of Eq. (7.8). We then compute the $P(\text{maybe})$ of each scheme
using Eq. (7.15), with $d_0 = d_1 = E[\rho(X, \hat{X})]$, with $\rho$ representing Hamming distortion,
and $n_0 = n P_{\hat{X}}(\hat{x} = 0)$.
As can be observed, the TC−△ scheme performs better than the LC−△ scheme in
all cases, as is suggested by the theory. For example, for $X, Y \sim \text{Bern}(0.7)$, D = 0.05
and R = 0.13 (87% compression), while the LC−△ scheme achieves $P(\text{maybe}) = 10^{-4}$,
the TC−△ scheme achieves $10^{-5}$. Similarly, for D = 0.1 and R = 0.2, the $P(\text{maybe})$
decreases from $10^{-3}$ to $10^{-4}$, i.e., on average one sequence in every 10,000 is retrieved, instead
of one in every 1000. For the case $X, Y \sim \text{Bern}(0.8)$ we observe similar results. With D = 0.05
(95% similarity) and $P(\text{maybe}) = 10^{-2}$, the TC−△ scheme attains 93% compression
(R = 0.07), whereas the LC−△ scheme achieves only 84% compression (R = 0.16),
i.e., a reduction in rate of 55%. Furthermore, R = 0.07 is close to $R_{ID}^{LC-\triangle}$. Similarly,
for D = 0.1 and $P(\text{maybe}) = 10^{-4}$ the decrease in rate is from 0.35 to 0.3 bits, which
represents an improvement in compression of 14.2%. Finally, notice that for a given rate,
the smaller the similarity threshold D, the smaller the $P(\text{maybe})$.
7.3.3 q-ary sources and Hamming distortion
The LC−△ scheme can be easily extended to the case of q-ary sources. Note that the
decision rule of Eq. (7.4) still applies in this case. One important example of a source for which this scheme would be of special relevance is DNA data, where the alphabet
is of size four, {A, C, G, T}.
We consider a database composed of M = 1000 i.i.d. uniform 4-ary sequences of
length n = 100, and apply the proposed architecture with the lossy compression algorithm
presented in [94]. To see how the scheme works on real data, we also generate a database
composed of 1000 DNA sequences of length 100, taken from BIOZON [2]. The empirical
distribution is given by pA = 0.25, pC = 0.23, pG = 0.29, pT = 0.23. We emphasize that
the proposed architecture makes the scheme D-admissible independently of the probabilistic model behind the sequences of the database, if any. We consider i.i.d. and uniformly
distributed query sequences to compute the probability of maybe in both cases.
The results for both datasets are shown in Fig. 7.6. As can be observed, the performance
on the simulated data and on the DNA dataset is very similar. We highlight some results for
the DNA database. For D = 0.1, we get a probability of maybe of 0.001 with a reduction
in size of 83.5% (R = 0.33). For D = 0.2 and R = 0.47, we get a probability of maybe of
0.01.
Figure 7.6: Performance of the LC−△ scheme ($P(\text{maybe})$ versus rate R) for D = {0.1, 0.2}, applied to two databases composed of 4-ary sequences: one generated uniformly i.i.d. and the other comprised of real DNA sequences from [2].
7.4 Conclusion
In this chapter we have investigated schemes for compressing a database so that similarity
queries can be performed efficiently on the compressed database. These schemes are of
practical interest in genetics, where it is important to find sequences that are similar. The
fundamental limits for this problem have been characterized in past work, and they serve
as the basis for performance evaluation.
Specifically, we have introduced two schemes for this task, the LC−△ and the TC−△
schemes, both based on lossy compression algorithms. The performance of
the LC−△ scheme, although close to the fundamental limits in some cases (e.g., binary
symmetric sources and Hamming distortion), is suboptimal in general. The TC−△ scheme
builds upon the previous one and achieves a better compression rate in many cases. For example, for general memoryless binary sequences and Hamming distortion, the TC−△
scheme exhibits, on simulated data, performance approaching the fundamental limits, substantially improving over the LC−△ scheme. The TC−△ scheme is also based on lossy
compression algorithms, but in this case the distortion measure to be applied by the lossy
compressor is judiciously designed; it is not Hamming distortion, despite the fact that
similarity for the query is measured under Hamming distortion. Finally, both schemes are applicable
to any discrete database and any similarity measure satisfying the triangle inequality.
Chapter 8
Conclusions
This dissertation has been motivated by the ever-growing amount of genomic data being generated. These data must be stored, processed, and analyzed, which poses significant challenges. To partially address this issue, in this thesis we have investigated methods
to ease the storage and distribution of genomic data, as well as methods to facilitate
access to these data in databases.
Specifically, we have designed lossless and lossy compression schemes for genomic
data, improving upon previously proposed compressors. Lossy compressors have
traditionally been analyzed in terms of their rate-distortion performance. However, since
genomic data is used for biological inference, an analysis of their effect on downstream
analyses is necessary in this case. With that in mind, we have also presented an extensive
analysis of the effect that lossy compression of genomic data has on variant calling,
one of the most important analyses performed in practice. The results provided in this
thesis show that lossy compression can significantly reduce the size of the
data while providing performance on variant calling that is comparable to that obtained
with the losslessly compressed data. Moreover, these results show that in some cases lossy
compression can lead to inference that improves upon the uncompressed data, which suggests
that the data could be denoised. In that regard, we have also proposed the first denoising
scheme for this type of data, and demonstrated improved inference with the denoised data.
In addition, we have shown that it is possible to do so while reducing the size of the data
beyond its lossless limit. Finally, we have proposed two schemes to compress the sequences
in a database such that similarity queries can still be performed in the compressed domain.
The proposed schemes achieve significant compression, allowing the database to be replicated
in several locations and thus providing easier and faster access to the data.
There are several interesting research directions for future work related to the results
presented in this thesis. We conclude by listing some of them.
• Genomic data compression: There remain several challenges in genomic data compression. In particular, the compressors may need to incorporate important features
beyond end-to-end compression. For example, it may be desirable to trade off compression performance for random access capabilities. Random access may be necessary in applications where one is interested in accessing only one part of the genome,
while avoiding the need to decompress the whole file. In addition, being able to perform operations in the compressed domain can speed up the analyses
performed on the data, especially as the size of the data grows.
• Error-Correction: Current sequencing technologies are imperfect and, as a result,
the reads in the sequencing files contain errors. Correcting these errors can
improve the analyses performed on the data by downstream applications, which will
generate more accurate results. For example, one could use a Bayesian framework
and ideas from Coding Theory to model the process that generates the reads, and
develop a robust approach for correcting these errors based on this model.
• Boosting downstream analyses: There are many downstream applications that use
genomic data, each with a different purpose. Currently, several algorithms exist for any
given downstream analysis. These algorithms rarely produce the same results when
applied to the same data. This can be explained by the lack of accurate models
and the compromises made in favor of complexity reduction. An example of this
is the set of algorithms used to identify variants, which serve important applications such
as medical decision making. Improving the variant callers can have significant impact in practice. It would be desirable to design algorithms for this task guided by
sound theory with performance guarantees, and to extend this line of research to other
important downstream applications beyond variant calling.
• Compression schemes for similarity queries: There are several applications that may
require similarity queries based on a similarity metric other than Hamming. For
example, in genomics, similarity between genomes is often measured with the edit
distance, which accounts for substitutions, insertions, and deletions. Extending
these types of schemes to such metrics would increase their practicality.
In addition, it would be desirable to be able to perform queries directly on the compressed data, without the need to decompress the whole sequences.
Bibliography
[1] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor
Marth, Goncalo Abecasis, Richard Durbin, et al. The sequence alignment/map format
and samtools. Bioinformatics, 25(16):2078–2079, 2009.
[2] Aaron Birkland and Golan Yona. Biozon: a system for unification, management and
analysis of heterogeneous biological data. BMC bioinformatics, 7(1):1, 2006.
[3] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C Zody,
Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William FitzHugh, et al.
Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921,
2001.
[4] Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, and et. al. Big
data: Astronomical or genomical? PLoS Biol, 13(7):e1002195, 2015.
[5] C Re, A Ro, and A Re. Will computers crash genomics? Science, 5:1190, 2010.
[6] HPJ Buermans and JT Den Dunnen. Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta (BBA)-Molecular Basis of
Disease, 1842(10):1932–1941, 2014.
[7] Peter JA Cock, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M Rice. The sanger fastq file format for sequences with quality scores, and the
solexa/illumina fastq variants. Nucleic acids research, 38(6):1767–1771, 2010.
[8] Marc Lohse, Anthony Bolger, Axel Nagel, Alisdair R Fernie, John E Lunn, Mark
Stitt, and Björn Usadel. Robina: a user-friendly, integrated software solution for rna-seq-based transcriptomics. Nucleic acids research, page gks540, 2012.
[9] Murray P Cox, Daniel A Peterson, and Patrick J Biggs. Solexaqa: At-a-glance quality assessment of illumina second-generation sequencing data. BMC bioinformatics,
11(1):485, 2010.
[10] Heng Li, Jue Ruan, and Richard Durbin. Mapping short dna sequencing reads and
calling variants using mapping quality scores. Genome research, 18(11):1851–1858,
2008.
[11] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and
memory-efficient alignment of short dna sequences to the human genome. Genome
biology, 10(3):1, 2009.
[12] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–
wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[13] Gerton Lunter and Martin Goodson. Stampy: a statistical algorithm for sensitive and
fast mapping of illumina sequence reads. Genome research, 21(6):936–939, 2011.
[14] Aaron McKenna et al. The genome analysis toolkit: a mapreduce framework for
analyzing next-generation DNA sequencing data. Genome research, 2010.
[15] Jinghui Zhang, David A Wheeler, Imtiaz Yakub, Sharon Wei, Raman Sood, William
Rowe, Paul P Liu, Richard A Gibbs, and Kenneth H Buetow. Snpdetector: a software
tool for sensitive and accurate snp detection. PLoS Comput Biol, 1(5):e53, 2005.
[16] Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks,
Mark A DePristo, Robert E Handsaker, Gerton Lunter, Gabor T Marth, Stephen T
Sherry, et al. The variant call format and vcftools. Bioinformatics, 27(15):2156–
2158, 2011.
[17] James K Bonfield and Matthew V Mahoney. Compression of fastq and sam format
sequencing data. PloS one, 8(3):e59190, 2013.
[18] Zexuan Zhu, Yongpeng Zhang, Zhen Ji, Shan He, and Xiao Yang. High-throughput
dna sequence data compression. Briefings in Bioinformatics, page bbt087, 2013.
[19] Sebastian Deorowicz and Szymon Grabowski. Data compression for sequencing data.
Algorithms for Molecular Biology, 8(1):1, 2013.
[20] Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy
Weissman, and Golan Yona. Qualcomp: a new lossy compressor for quality scores
based on rate distortion theory. BMC bioinformatics, 14(1):1, 2013.
[21] Rodrigo Cánovas, Alistair Moffat, and Andrew Turpin. Lossy compression of quality
scores in genomic data. Bioinformatics, 30(15):2130–2136, 2014.
[22] Greg Malysa, Mikel Hernaez, Idoia Ochoa, Milind Rao, Karthik Ganesan, and Tsachy
Weissman. Qvz: lossy compression of quality values. Bioinformatics, page btv330,
2015.
[23] Y William Yu, Deniz Yorukoglu, and Bonnie Berger. Traversing the k-mer landscape
of ngs read datasets for quality score sparsification. In Research in Comp. Molecular
Bio., 2014.
[24] Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley. Effect of lossy compression of quality scores on variant calling. Briefings in
Bioinformatics, page doi: 10.1093/bib/bbw011, 2016.
[25] Rudolf Ahlswede, E-h Yang, and Zhen Zhang. Identification via compressed data.
IEEE Transactions on Information Theory, 43(1):48–70, 1997.
[26] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Quadratic similarity queries
on compressed data. In Data Compression Conference (DCC), 2013, pages 441–450.
IEEE, 2013.
[27] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Compression for quadratic
similarity queries. IEEE Transactions on Information Theory, 61(5):2729–2747,
2015.
[28] Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. iDoComp: a compression
scheme for assembled genomes. Bioinformatics, page btu698, 2014.
[29] Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. Aligned genomic data compression via improved modeling. Journal of bioinformatics and computational biology,
12(06), 2014.
[30] Mikel Hernaez, Idoia Ochoa, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley.
A cluster-based approach to compression of quality scores. Submitted to the Data
Compression Conference (DCC), 2015.
[31] Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, and Euan Ashley.
Denoising of quality scores for boosted inference and reduced storage. Submitted to
the Data Compression Conference (DCC), 2015.
[32] Idoia Ochoa, Amir Ingber, and Tsachy Weissman. Efficient similarity queries via
lossy compression. In Allerton, pages 883–889, 2013.
[33] Idoia Ochoa, Amir Ingber, and Tsachy Weissman. Compression schemes for similarity queries. In Data Compression Conference (DCC), 2014, pages 332–341. IEEE,
2014.
[34] Stéphane Grumbach and Fariza Tahi. A new challenge for compression algorithms:
genetic sequences. Information Processing & Management, 30(6):875–886, 1994.
[35] Xin Chen, Ming Li, Bin Ma, and John Tromp. Dnacompress: fast and effective dna
sequence compression. Bioinformatics, 18(12):1696–1698, 2002.
[36] Minh Duc Cao, Trevor I Dix, Lloyd Allison, and Chris Mears. A simple statistical
algorithm for biological sequence compression. In Data Compression Conference,
2007. DCC’07, pages 43–52. IEEE, 2007.
[37] Scott Christley, Yiming Lu, Chen Li, and Xiaohui Xie. Human genomes as email
attachments. Bioinformatics, 25(2):274–275, 2009.
[38] Marty C Brandon, Douglas C Wallace, and Pierre Baldi. Data structures and compression algorithms for genomic sequence data. Bioinformatics, 25(14):1731–1738,
2009.
[39] Dmitri S Pavlichin, Tsachy Weissman, and Golan Yona. The human genome contracts
again. Bioinformatics, page btt362, 2013.
[40] Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Relative lempel-ziv compression
of genomes for large-scale storage and retrieval. In String Processing and Information
Retrieval, pages 201–206. Springer, 2010.
[41] Shanika Kuruppu, Simon J Puglisi, and Justin Zobel. Optimized relative lempel-ziv
compression of genomes. In Proceedings of the Thirty-Fourth Australasian Computer
Science Conference-Volume 113, pages 91–98. Australian Computer Society, Inc.,
2011.
[42] Sebastian Deorowicz and Szymon Grabowski.
Robust relative compression of
genomes with random access. Bioinformatics, 27(21):2979–2986, 2011.
[43] Congmao Wang and Dabing Zhang. A novel compression tool for efficient storage of
genome resequencing data. Nucleic acids research, 39(7):e45–e45, 2011.
[44] Armando J Pinho, Diogo Pratas, and Sara P Garcia. Green: a tool for efficient compression of genome resequencing data. Nucleic acids research, 40(4):e27–e27, 2012.
[45] BG Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, and
Tsachy Weissman. Reference based genome compression. In Information Theory
Workshop (ITW), 2012 IEEE, pages 427–431. IEEE, 2012.
[46] Sebastian Wandelt and Ulf Leser. Adaptive efficient compression of genomes. Algorithms for Molecular Biology, 7(1):1, 2012.
[47] Sebastian Deorowicz, Agnieszka Danek, and Szymon Grabowski. Genome compression: a novel approach for large collections. Bioinformatics, page btt460, 2013.
[48] Sebastian Wandelt and Ulf Leser. Fresco: Referential compression of highly similar
sequences. Computational Biology and Bioinformatics, IEEE/ACM Transactions on,
10(5):1275–1288, 2013.
[49] Dan Gusfield. Algorithms on strings, trees and sequences: computer science and
computational biology. Cambridge university press, 1997.
[50] Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney. Efficient
storage of high throughput dna sequencing data using reference-based compression.
Genome research, 21(5):734–740, 2011.
[51] Fabien Campagne, Kevin C Dorff, Nyasha Chambwe, James T Robinson, and Jill P
Mesirov. Compression of structured high-throughput sequencing data. PloS one,
8(11):e79871, 2013.
[52] Daniel C Jones, Walter L Ruzzo, Xinxia Peng, and Michael G Katze. Compression of
next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic
acids research, 40(22):e171–e171, 2012.
[53] Khalid Sayood. Introduction to data compression. Newnes, 2012.
[54] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2.
Nature methods, 9(4):357–359, 2012.
[55] Shreepriya Das and Haris Vikalo. Onlinecall: fast online parameter estimation and
base calling for illumina’s next-generation sequencing. Bioinformatics, 28(13), 2012.
[56] Heng Li. A statistical framework for snp calling, mutation discovery, association
mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, 2011.
[57] Mark A DePristo, Eric Banks, et al. A framework for variation discovery and genotyping using next-generation dna sequencing data. Nature genetics, 43(5), 2011.
[58] Christos Kozanitis, Chris Saunders, Semyon Kruglyak, Vineet Bafna, and George
Varghese. Compressing genomic sequence fragments using slimgene. Journal of
Computational Biology, 18(3):401–413, 2011.
[59] Raymond Wan, Vo Ngoc Anh, and Kiyoshi Asai. Transformations for the compression of fastq quality scores of next-generation sequencing data. Bioinformatics,
28(5):628–635, 2012.
[60] Faraz Hach, Ibrahim Numanagić, Can Alkan, and S Cenk Sahinalp. Scalce: boosting
sequence compression algorithms using locally consistent encoding. Bioinformatics,
28(23):3051–3057, 2012.
[61] Lilian Janin, Giovanna Rosone, and Anthony J Cox. Adaptive reference-free compression of sequence quality scores. Bioinformatics, page btt257, 2013.
[62] Y William Yu, Deniz Yorukoglu, Jian Peng, and Bonnie Berger. Quality score compression improves genotyping accuracy. Nature biotechnology, 33(3):240–243, 2015.
[63] Łukasz Roguski and Sebastian Deorowicz. Dsrc2 industry-oriented compression of
fastq files. Bioinformatics, 30(15):2213–2215, 2014.
[64] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley &
Sons, 2012.
[65] Amos Lapidoth. On the role of mismatch in rate distortion theory. IEEE Trans. Inf.
Theory, 43(1):38–47, 1997.
[66] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information
theory, 28(2):129–137, 1982.
[67] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical
statistics and probability, volume 1, pages 281–297. Oakland, CA, USA., 1967.
[68] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[69] Michael D Linderman, Tracy Brandt, Lisa Edelmann, Omar Jabado, Yumi Kasai,
Ruth Kornreich, Milind Mahajan, Hardik Shah, Andrew Kasarskis, and Eric E Schadt.
Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC medical genomics, 7(1):1, 2014.
[70] Xiangtao Liu, Shizhong Han, Zuoheng Wang, Joel Gelernter, and Bao-Zhu Yang.
Variant callers for next-generation sequencing data: a comparison study. PloS one,
8(9):e75619, 2013.
[71] Xiaoqing Yu and Shuying Sun. Comparing a few snp calling algorithms using low-coverage sequencing data. BMC bioinformatics, 14(1):1, 2013.
[72] Jason O’Rawe, Tao Jiang, Guangqing Sun, Yiyang Wu, Wei Wang, Jingchu Hu, Paul
Bodily, Lifeng Tian, Hakon Hakonarson, W Evan Johnson, et al. Low concordance
of multiple variant-calling pipelines: practical implications for exome and genome
sequencing. Genome medicine, 5(3):1, 2013.
[73] Hugo YK Lam, Michael J Clark, Rui Chen, Rong Chen, Georges Natsoulis, Maeve
O’Huallachain, Frederick E Dewey, Lukas Habegger, Euan A Ashley, Mark B Gerstein, et al. Performance comparison of whole-genome sequencing platforms. Nature
biotechnology, 30(1):78–82, 2012.
[74] Geraldine A Auwera, Mauricio O Carneiro, et al. From fastq data to high-confidence
variant calls: the genome analysis toolkit best practices pipeline. Current Protocols
in Bioinformatics, pages 11–10, 2013.
[75] Andy Rimmer, Hang Phan, Iain Mathieson, et al. Integrating mapping-, assembly-and
haplotype-based approaches for calling variants in clinical sequencing applications.
Nature genetics, 46(8):912–918, 2014.
[76] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:1303.3997, 2013.
[77] Cornelis A Albers, Gerton Lunter, et al. Dindel: accurate indel calls from short-read data. Genome research, 21(6):961–973, 2011.
[78] Erik Garrison and Gabor Marth. Haplotype-based variant detection from short-read
sequencing. arXiv preprint arXiv:1207.3907, 2012.
[79] Justin M Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, and Marc Salit. Integrating human sequence data sets provides a resource
of benchmark snp and indel genotype calls. Nature biotechnology, 32(3):246–251,
2014.
[80] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. Art: a next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, 2012.
[81] Frederick E Dewey, Rong Chen, Sergio P Cordero, Kelly E Ormond, Colleen Caleshu,
Konrad J Karczewski, Michelle Whirl-Carrillo, Matthew T Wheeler, Joel T Dudley,
Jake K Byrnes, et al. Phased whole-genome genetic risk in a family quartet using a
major allele reference sequence. PLoS Genet, 7(9):e1002280, 2011.
[82] Wei-Chun Kao, Kristian Stevens, et al. Bayescall: A model-based base-calling
algorithm for high-throughput short-read sequencing. Genome research, 19(10),
2009.
[83] Tsachy Weissman and Erik Ordentlich. The empirical distribution of rate-constrained
source codes. IEEE Trans. Inf. Theory, 51, 2005.
[84] Himanshu Asnani, Ilan Shomorony, et al. Network compression: Worst-case
analysis. In Inf. Theory Proceedings (ISIT), 2013 IEEE Intern. Symp. on, pages 196–
200, 2013.
[85] Shirin Jalali and Tsachy Weissman. Denoising via MCMC-based lossy compression.
IEEE Trans. Signal Process., 60(6), 2012.
[86] Amir Ingber and Tsachy Weissman. The minimal compression rate for similarity
identification. arXiv preprint arXiv:1312.2063, 2013.
[87] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[88] Anand Rajaraman and Jeffrey David Ullman. Mining of massive datasets. Cambridge
University Press, 2012.
[89] Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y
Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
[90] Sharadh Ramaswamy and Kenneth Rose. Adaptive cluster distance bounding for
high-dimensional indexing. IEEE Transactions on Knowledge and Data Engineering,
23(6):815–830, 2011.
[91] Amir Ingber, Thomas Courtade, and Tsachy Weissman. Compression for exact match
identification. In Information Theory Proceedings (ISIT), 2013 IEEE International
Symposium on, pages 654–658. IEEE, 2013.
[92] Thomas A Courtade and Tsachy Weissman. Multiterminal source coding under logarithmic loss. In Information Theory Proceedings (ISIT), 2012 IEEE International
Symposium on, pages 761–765. IEEE, 2012.
[93] Ramji Venkataramanan, Tuhin Sarkar, and Sekhar Tatikonda. Lossy compression via
sparse linear regression: Computationally efficient encoding and decoding. IEEE
Transactions on Information Theory, 60(6):3265–3278, 2014.
[94] Ankit Gupta and Sergio Verdú. Nonlinear sparse-graph codes for lossy compression.
IEEE Transactions on Information Theory, 55(5):1961–1975, 2009.