Searching for the diamonds in the ocean of sequencing reads
Maggie C.Y. Lau (Geosciences)
Biological sciences have faced a “Big Data” problem since the emergence of next-generation sequencing,
which generates terabytes of data in less than a week. The hundreds of millions of reads (nucleotide
sequences) reveal which organisms are present in the studied samples, what they can do or are doing, and how.
Next-generation sequencing has been applied to study human genomes, microbiomes, sewage
treatment plants, air, soils, etc. My research employs this technology:
• To investigate the response of microorganisms in Arctic and Antarctic terrestrial
systems to global warming
• To reveal the metabolic potential and activity of microorganisms that are analogs of
life in early Earth history and on other planets
• To examine the role of microorganisms in biogeochemical cycles and their
distribution patterns
• To discover the metabolic capabilities of yet-to-be-cultivated organisms
Research computing is necessary for processing the large quantity of data in order to produce useful
information to address the aforementioned scientific questions. The procedure includes, but is not
limited to, basic text manipulation, quality filtering, sequence assembly (i.e. joining short reads into
longer contiguous sequences), gene prediction, sequence annotation (i.e. assigning taxonomic and
functional identity) and phylogenetic analysis. The large input files already demand a growing
amount of storage space, and analyzing the complex datasets requires more memory than a
personal computer can provide. Moreover, many commercial and open-source algorithms available
for bioinformatics analyses have been developed to use multiple processors, which shortens the
computational time needed for memory-hungry and time-demanding tasks. Access to high
performance computing units has enabled multiple bioinformatics projects to be performed in
parallel and analytical approaches to be customized for each project. These projects have yielded
significant findings that advance our knowledge in the aforementioned areas.
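As an illustration of the quality-filtering step named above, the following is a minimal sketch and not the pipeline used in this research. It assumes reads stored in four-line FASTQ records with Phred+33 quality encoding; the function names, the thresholds (`min_mean_q`, `min_length`) and the example reads are hypothetical choices made for this example.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def quality_filter(records, min_mean_q=20, min_length=4):
    """Keep reads that are long enough and whose mean quality meets the threshold."""
    for header, seq, qual in records:
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


# Three made-up reads: one good, one with low quality, one too short.
EXAMPLE = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n!!!!!!!!\n"
    "@read3\nAC\n+\nII\n"
)

kept = list(quality_filter(parse_fastq(StringIO(EXAMPLE))))
for header, seq, _ in kept:
    print(header, seq)
```

Because both the parser and the filter are generators, reads are processed one at a time, so memory use stays small even for very large input files. Real datasets would normally go through dedicated tools that also trim adapters and low-quality read ends.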
Bioinformatics is an indispensable tool in all biological sciences. We expect research computing to
play a critical role in the success of our geomicrobiology research and of the related courses. Our
demand for high-speed data transfer and large-scale computing resources will continue to grow in
the coming years.