Identification of Environmental Alphaproteobacteria

M. Sc. Thesis—Quan Yao McMaster—Biology
IDENTIFICATION OF ENVIRONMENTAL ALPHAPROTEOBACTERIA
WITH CONSERVED SIGNATURE PROTEINS IN METAGENOMIC
DATASETS
BY
QUAN YAO, B.Sc.
A Thesis
Submitted to the School of Graduate Studies
in Partial Fulfillment of the Requirements
For the Degree
Master of Science
McMaster University
© Copyright by Quan Yao, Dec 2013
MASTER OF SCIENCE (2013)
McMaster University
(Biology)
Hamilton, Ontario
TITLE: Identification of Environmental Alphaproteobacteria with Conserved
Signature Proteins in Metagenomic Datasets
AUTHOR: Quan Yao, B.Sc. (Ocean University of China)
SUPERVISOR: Professor H.E. Schellhorn
NUMBER OF PAGES: ix, 94
Abstract
Microbial metagenomics is the exploration of taxonomical diversity of microbial
communities in environmental habitats using large, exhaustive DNA sequence datasets.
However, due to inherent limitations of sequencing technology and the complexity of
environmental genomes, current analytical approaches do not reveal the existence of all
microbes that may be present. In this study, a new classification approach is proposed
based upon unique proteins that are specific for different clades of Alphaproteobacteria to
predict the presence and absence of species from these groups of bacteria in published
metagenomic datasets. In this work, 264 previously identified, published conserved
signature proteins (CSPs) characteristic of individual taxonomic clades of
Alphaproteobacteria are used as probes to detect the presence of bacteria in metagenomic
datasets. Although public genome sequence information has increased manifold since
these CSPs were initially identified 6 years ago, the results indicate that nearly all of these
CSPs (259 of 265) remain specific for their previously characterized clades. Furthermore,
they are confirmed to be present in the newly identified and sequenced members of these
clades. In view of their specificity and predictive ability in different monophyletic clades
of Alphaproteobacteria, the sequences of these CSPs provide reliable probes to determine
the presence or absence of these Alphaproteobacteria in metagenomic datasets. In this
work, CSPs are used to assess Alphaproteobacteria diversity in 10
published metagenomic datasets (bioreactor, compost, wastewater, activated sludge,
groundwater, freshwater sediment, microbial mat, marine, hydrothermal vent and whale
fall metagenomes), which cover diverse environments and ecosystems. The results indicate that
BLAST searches with these CSPs can efficiently identify
Alphaproteobacteria species in these metagenomic datasets and that substantial differences exist
in the distribution and relative abundance of different Alphaproteobacteria
species among the tested datasets. Thus the CSPs, which are specific for different
microbial taxa, provide novel and powerful means for identification of microbes and for
their taxonomic profiling in metagenomic datasets.
Acknowledgements
First, I must thank my supervisor, Dr. Herb Schellhorn, who gave many valuable
suggestions and recommendations during my research, and who generously
took us to the conference of the Canadian Society of Microbiologists in Ottawa,
where we had a great opportunity to share our research and communicate with the
world’s top researchers. The second summer at Dr. Schellhorn’s cottage is an
unforgettable memory, where we enjoyed a fascinating retreat after a year of hard work.
Equally important, I would like to thank my co-supervisor, Dr. Gupta, for his continuous
support of my work and the inspiration he sparked in my mind, and my committee chair,
Dr. Igdoura, for his kindness and assistance in my defense.
Secondly, I have to thank my lab mates, who accompanied me over the past 2 years both
in the lab and off campus. I want to acknowledge Lingzi, Mohammed, Shirley, Sohail,
Steve, Rachel, and Pardis. The coffee break chats on casual and entertaining topics, the
cooperative work we accomplished when we encountered bottlenecks in
research, and the in-depth exchanges of ideas and thoughts about philosophy, the universe
and ourselves are all an indispensable part of my life and have helped to establish my
values and beliefs.
Finally, I must thank my parents for their encouragement throughout my life. Without their
guidance and instruction, I could never have achieved the goals that I dreamed of. Their
love for me is my lifelong treasure and provides the motivation to help me overcome
future obstacles in my career.
Table of Contents
Part I. Uniqueness of Alphaproteobacteria specific CSPs
  Chapter 1 Introduction
    1.1 Significance of Alphaproteobacteria
    1.2 Conserved signature proteins as phylogenetic markers
    1.3 Standards for taxonomic hierarchy
  Chapter 2 Materials and methods
    2.1 Confirmation of the uniqueness of CSPs
    2.2 Grouping of CSP into Taxonomic levels
  Chapter 3. Results
    3.1 Confirmation of the uniqueness of CSPs
    3.2 Grouping of CSP into Taxonomic levels
  Chapter 4 Discussion
    4.1 Confirmation of the uniqueness of CSP
    4.2 Grouping of CSP into Taxonomic levels
    4.3 Future experiments
Part II Identification of Alphaproteobacteria specific CSPs in metagenomic samples
  Chapter 1 Introduction
    1.1 Metagenome, environmental genomes
    1.2 Taxonomic classification of metagenomic reads: methods and challenges
    1.3 Application of metagenomics
    1.4 Project objectives
  Chapter 2 Materials and methods
    2.1 Metagenome selection
    2.2 Identification of CSP in metagenomic samples
    2.3 Comparative analysis of Alphaproteobacteria in metagenomes
  Chapter 3 Results
    3.1 Metagenome selection
    3.2 Identification of CSPs in metagenomic samples
    3.3 Comparative analysis of Alphaproteobacteria in metagenomes
  Chapter 4 Discussion
    4.1 Metagenome selection
    4.2 Identification of CSPs in metagenomic samples
    4.3 Comparative analysis of Alphaproteobacteria in metagenomes
    4.4 Overall conclusions
    4.5 Future directions
References
List of Figures
Figure 1: Summary heatmap of 16 Alphaproteobacteria specific CSPs in 10 metagenomes
Figure 2: Alphaproteobacteria specific CSPs identified in 10 metagenomes
Figure 3: Similarity of significant hits in 10 metagenomes
Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP distribution in 10 metagenomes
Figure 5: The relative abundance of Alphaproteobacteria and its different sub-clades in the studied metagenomes based upon BLASTp searches with CSPs
Figure 6: Comparative results of Alphaproteobacteria distribution in 4 metagenomes derived from (A) CSPs-based binning and (B) similarity-based binning.
List of Tables
Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in 2007 and 2013
Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide Sequences
Table 3: CSPs specific to Alphaproteobacteria
Table 4: CSPs specific to Rhizobiales
Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae
Table 6: CSPs specific to Rhodobacterales
Table 7: CSPs specific to Caulobacterales
Table 8: CSPs specific to Sphingomonadales
Table 9: CSPs specific to Rhodospirillales
Table 10: CSPs specific to Rickettsiales
Table 11: Characteristics of Metagenomic Datasets Investigated in this Study
Part I. Uniqueness of Alphaproteobacteria specific CSPs
Chapter 1 Introduction
1.1 Significance of Alphaproteobacteria
Alphaproteobacteria is one of the largest classes of the phylum Proteobacteria, which
also comprises 4 other major classes: Betaproteobacteria, Gammaproteobacteria,
Deltaproteobacteria and Epsilonproteobacteria (Kersters et al., 2006).
Alphaproteobacteria contains 6 main orders: Rhizobiales, Rhodobacterales,
Caulobacterales, Sphingomonadales, Rhodospirillales and Rickettsiales, each with
distinct characteristics (Williams et al., 2007). Alphaproteobacterial species
are morphologically, physiologically and metabolically diverse and are adapted to different
habitats in both terrestrial and marine environments (Rathsack et al., 2011;
Williams et al., 2007). Most characterized Alphaproteobacteria species are Gram-negative
bacteria (Olson et al., 2002). Many of them have developed mechanisms to adopt an
intracellular lifestyle, either as plant mutualists or as animal pathogens (Dumler et al., 2001).
Some Alphaproteobacterial species can grow at low nutrient levels (Kang et al.,
2010). Alphaproteobacteria employ several important metabolic strategies such as
photosynthesis, nitrogen fixation, ammonia oxidation and methylotrophy (Campagne et
al., 2012). They are also morphologically diverse, with stellate, spiral and prosthecate
forms (Hallez et al., 2004). Alphaproteobacteria are the most abundant cellular organisms in
marine environments (Williams et al., 2007). Pelagibacter ubique, which was isolated in 2002, was
found to comprise 1/4 of all plankton cells in the ocean (Sowell et al., 2008).
Rhizobiales is the largest order of Alphaproteobacteria. It constitutes 1/3 of all
sequenced Alphaproteobacteria species (Carvalho et al., 2010). Rhizobiales species
have developed several strategies to adapt to both intracellular and extracellular niches (Carvalho et
al., 2010). Plant mutualists such as Rhizobium, Sinorhizobium and Bradyrhizobium are
capable of fixing nitrogen in symbiosis with most leguminous plants (Fischer, 1996).
Plant and animal pathogens such as Agrobacterium, Bartonella and Brucella are
obligate or facultative intracellular parasites of either plants or animals and
have been studied extensively (Bowman, 2011). Bartonella henselae, the chief causative
agent of cat scratch disease (CSD), is a Gram-negative bacillus (English, 1988).
Intimate contact with infected cats, through scratches, bites or saliva, can
transmit B. henselae (Andersson and Kempf, 2004). Fortunately, infection by
Bartonella sp. causes only a mild illness, which can be easily treated with common antibiotics
(Holley, 1991). Brucella, another obligate parasite of mammals, are small, nonmotile
coccobacilli and are more severe pathogens than Bartonella sp. (Alsmark et al.,
2004). They are usually transmitted to animals through the gastrointestinal (GI) tract,
respiration and skin wounds, subsequently causing brucellosis in many animals owing to
their ability to survive phagocytosis (Breitschwerdt and Kordick, 2000). Severe infections
may affect the central nervous system or circulatory system, and antibiotic treatment, such
as a combination of doxycycline and rifampin, is necessary for at least 6 weeks; the
treatment period depends mainly on the timing of treatment and the severity of illness (Raoult
et al., 2003).
Most Rhodobacterales are purple non-sulfur bacteria, belonging to a larger group
called photolithotrophic bacteria (Dang et al., 2008). They employ several metabolic
mechanisms, including photosynthesis, nitrogen fixation and fermentation, under either
aerobic or anaerobic conditions (Dang et al., 2008). Rhodobacter sphaeroides, first
isolated from deep lakes and stagnant waters (Choudhary and Kaplan, 2000), is
remarkable for two unique characteristics: an innate oxygen sensing system based on
invaginations and two chromosomes responsible for distinct functions. The metabolic
versatility of Rhodobacterales species enables them to dominate many ecological
niches, and they are especially abundant in oceans (Oh and Kaplan, 2001).
Caulobacterales are typically found in low-nutrient aquatic environments such as lakes
and rivers (Riemann et al., 2008). They have a characteristic stalk that can anchor them to the
surfaces of nearby organisms (Poindexter and Staley, 1996). This attachment
strategy increases their nutrient uptake because it exposes them to a continuously
changing flow of fluid (Poindexter and Staley, 1996). In addition, Caulobacterales can
exploit the host’s excretions as extra nutrients when environmental nutrients are depleted
(Abraham et al., 2008).
Sphingomonadales are oval or rod-shaped bacteria characterized by
sphingolipids located in the outer membrane of the cell wall (Yabuuchi and Kosako,
2005). Some of them are pleomorphic, with cell shapes that change over time,
while other members are phototrophic (Yurkov and Beatty, 1998).
Most Sphingomonadales species are widely distributed in diverse terrestrial and aquatic
habitats owing to their ability to survive in low-nutrient environments (Boersma et al.,
2009). Sphingomonadales can be applied to bioremediation because some of the species
isolated from contaminated environments feed on toxic aromatic compounds as their
main nutrient source (Boersma et al., 2009).
Rhodospirillales comprise 2 distinct families: Acetobacteraceae and
Rhodospirillaceae (Gupta and Mok, 2007a). The soil bacterium
Azospirillum uses the nutrients excreted by plants and in exchange fixes atmospheric
nitrogen into ammonia for its host plants (Steenhoudt and Vanderleyden, 2000).
Acetobacter and Gluconobacter are industrially important aerobic organisms widely used
in the production of vinegar from wine, converting ethyl alcohol into
acetic acid (Gullo and Giudici, 2008). Rhodospirillum is a facultatively anaerobic bacterium
(Yildiz et al., 1991). When oxygen is exhausted, Rhodospirillum activates its
photosynthetic apparatus to acquire nutrition (Yildiz et al., 1991). However, the
mechanism by which photosynthesis is repressed under aerobic conditions is poorly understood
(Matsuda et al., 1984).
The order Rickettsiales is mostly composed of human pathogens and marine bacteria
(Fredricks, 2006). Members of its representative genus, Rickettsia, are Gram-negative, rod-shaped
pathogenic bacteria (Zomorodipour and Andersson, 1999). These obligate intracellular
parasites reproduce only within mammalian cells. Laboratory isolation and purification are
feasible with tissue culture or embryos. Rickettsia enter host cells by inducing
phagocytosis (Sahni and Rydkina, 2009). Once they reach the cytoplasm of the
cell, they reproduce by binary fission.
Infection by Rickettsia impairs the permeability of blood capillaries, which is
clinically characterized by a spotted rash (Walker et al., 2003). Another obligate
pathogen of clinical significance is the genus Ehrlichia. Ehrlichia cause parasitemia
by living in blood cells (Arraga-Alvarado et al., 2003). They are often transmitted from
animals to humans through the bites of infected ticks, which eventually results in ehrlichiosis
(Arraga-Alvarado et al., 2003). Apart from their pathogenic features, Rickettsiales are
also the closest relatives of eukaryotic mitochondria based on high genomic
similarity (Gray, 2012).
1.2 Conserved signature proteins as phylogenetic markers
Conserved signature proteins (CSPs) are a type of rare genomic change (RGC) often
applied to phylogenetic analysis and taxonomic classification, because they are whole
proteins uniquely present in certain groups of bacteria and not found anywhere else (Gao
et al., 2006; Gupta and Lorenzini, 2007). Although most identified CSPs are of unknown
function, their distribution patterns at different phylogenetic depths provide reliable
evidence to distinguish taxonomically coherent clades (Bhandari et al., 2012). Because CSPs,
like other RGCs, are mostly inherited vertically rather than horizontally, they are applied to
elucidate the evolutionary relationships among closely related clades (Bhandari et al.,
2012). Recent studies showed that these CSPs could be identified in newly sequenced
species (Bhandari et al., 2012; Gao and Gupta, 2012). Because of their clade specificity and
conservation, it is postulated that the CSPs may also be present in uncharacterized
Alphaproteobacterial species. Environmental Alphaproteobacterial species may thus carry
such molecular markers, which demonstrate their affiliation to their laboratory relatives.
A previous analysis of approximately 60 Alphaproteobacteria genomes identified 265
CSPs specific to different phylogenetic clades (Gupta and Mok, 2007a). Serving as
reliable molecular markers, these CSPs are utilized to predict the presence of
Alphaproteobacteria species in environmental samples if similar sequences are identified.
1.3 Standards for taxonomic hierarchy
The most widely accepted criterion currently used for taxonomic purposes is based on the
branching pattern of 16S rRNA trees (Nguimbi et al., 2003). This is because the 16S rRNA gene is
present in almost all bacterial species and has a dual character:
conserved and variable regions alternate along the gene (Nguimbi et
al., 2003). The conserved regions of 16S rRNA are used to infer common ancestry,
while the variable regions differentiate one species from another (Moine et al.,
2000). Currently, the domain Bacteria is classified into 23 major groups according to the
phylogenetic tree of 16S rRNA (Ludwig et al., 1998). However, the numbers of species in
different phyla are not evenly distributed and are biased by the fact that some genera have
been studied more intensively than others. For instance, Proteobacteria, Actinobacteria,
Firmicutes, Cyanobacteria and Bacteroidetes are the 5 largest phyla, which comprise
90-95% of all bacteria studied in the laboratory (Binnewies et al., 2006). In contrast, some
small phyla such as Ignavibacteriae, Caldiserica, Chrysiogenetes, Dictyoglomi and
Thermodesulfobacteria account for less than 1% of the bacteria studied (Binnewies et
al., 2006). Furthermore, owing to the low resolution of the 16S rRNA gene marker
below the genus level, phylogenetic trees based on a single gene cannot robustly resolve all
the issues regarding evolutionary events of different bacterial species (Kunisawa, 2007).
Hence, the taxonomic hierarchy of the domain Bacteria is primarily subjective and there is, as
yet, no consistent agreement on their phylogeny (Gupta, 2005a).
To describe the evolutionary relationships of bacteria appropriately, phylogenetic
trees based on topological models such as rooted, unrooted and bifurcating trees
can be constructed (Williams et al., 2011). In an idealized rooted phylogenetic tree, all
bacteria are derived from a common ancestral bacterium, and this earliest bacterium is
found at the root of the tree (Arisue et al., 2005). Each branch indicates the
divergence of a large bacterial clade such as a phylum or class in evolutionary history. The
closer a branch is to the root, the earlier the divergence event occurred. More recent branches
denote the further evolution of different sub-clades such as order, family, genus and
species. Bacteria on the same branch have more characteristics in common than those
on different branches (Doolittle and Bapteste, 2007).
The purpose of identifying CSPs is to provide reliable evidence for each node of a
phylogenetic tree and to support the validity of its inferred branching pattern
(Gupta and Griffiths, 2002). Previous studies have identified many
CSPs specific to different clades within Alphaproteobacteria. These molecular markers
resolved the phylogeny of Alphaproteobacterial species well (Gupta, 2005b; Gupta and
Mok, 2007b; Kainth and Gupta, 2005). With the increased availability of large datasets,
a sufficient number of CSPs could support the construction of a comprehensive and
reliable phylogenetic tree for both Alphaproteobacteria and the whole domain Bacteria.
1.4 Project objectives
Alphaproteobacteria-specific CSPs have proven useful in inferring
phylogenetic trees and branching patterns within Alphaproteobacteria clades (Gupta and
Mok, 2007a). Although the majority of CSPs are hypothetical proteins, they
may confer functions or characteristics that distinguish species belonging to
Alphaproteobacteria clades from all others. The aim of this project is to confirm the
specificity of previously identified Alphaproteobacteria-specific CSPs at different
phylogenetic depths by performing BLAST searches against the latest nr protein database.
Then, according to the distribution of the CSPs in bacterial taxonomy, all confirmed
CSPs are grouped based on their specificity. Finally, a CSP database that represents
different clades of Alphaproteobacteria from the class level to the family level will be
constructed to serve as signature markers for detecting bacteria in environmental samples.
Chapter 2 Materials and methods
2.1 Confirmation of the uniqueness of CSPs
In view of the large increase in the number of sequenced bacterial genomes in the past
6 years, the existing CSPs may now be found in newly sequenced species, whether or not they are
members of Alphaproteobacteria. Therefore, systematic BLASTp searches (Altschul et al.,
1990) were performed with each CSP against the NCBI non-redundant protein sequences
(nr) database (all non-redundant GenBank CDS translations + PDB + SwissProt + PIR +
PRF excluding environmental samples from WGS projects) with an E-value threshold of
1e-04 to confirm their specificity. In parallel, a BLASTn search was
conducted with the nucleotide sequences of the corresponding CSPs to compare the uniqueness
of the amino acid sequences with that of the nucleotide sequences. By convention, BLAST hits with
associated E-values >1e-04 do not support orthology; thus the hits exceeding this E-value
threshold are excluded from phylogenetic analysis. However, in some cases, when query
proteins are too short to yield sufficient information (bits of information) to produce a
discriminating E-value, higher E-values can be employed (Sharon et al., 2005). A
potential CSP is considered to be clade-specific if all significant BLAST hits are
derived from within a monophyletic clade of Alphaproteobacteria, or if there is a large
difference between the E-value of the last hit belonging to Alphaproteobacterial
relatives and that of the first hit from non-Alphaproteobacteria (Gupta and Mok, 2007a).
All significant hits of CSPs meeting the criteria described above were further analyzed
as described below.
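
The two specificity criteria described above can be illustrated with a minimal Python sketch. It is not the procedure used in this work, only an approximation under stated assumptions: the hit records are hypothetical toy values, the names Hit, significant_hits and is_clade_specific are invented for illustration, and the size of a "large" E-value difference (min_gap_orders, here 10 orders of magnitude) is an assumed parameter, since no numerical value is specified in the text.

```python
import math
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-4  # E-value threshold stated in the text


@dataclass
class Hit:
    organism: str
    e_value: float
    in_clade: bool  # True if the organism belongs to the target clade


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E-value does not exceed the cutoff, best hit first."""
    return sorted((h for h in hits if h.e_value <= cutoff),
                  key=lambda h: h.e_value)


def is_clade_specific(hits, cutoff=E_VALUE_CUTOFF, min_gap_orders=10.0):
    """Apply the two criteria; min_gap_orders is an assumed, illustrative value."""
    sig = significant_hits(hits, cutoff)
    inside = [h for h in sig if h.in_clade]
    outside = [h for h in sig if not h.in_clade]
    if not inside:
        return False
    if not outside:
        return True  # criterion 1: every significant hit lies within the clade
    worst_inside = max(h.e_value for h in inside)
    best_outside = min(h.e_value for h in outside)
    # E-values may be reported as 0.0, so clamp before taking the logarithm.
    gap = (math.log10(max(best_outside, 1e-300))
           - math.log10(max(worst_inside, 1e-300)))
    return gap >= min_gap_orders  # criterion 2: large E-value difference


# Hypothetical toy data, not real BLAST results.
hits_clean = [Hit("in_1", 1e-80, True), Hit("in_2", 1e-60, True),
              Hit("out_1", 1e-2, False)]
hits_ambiguous = [Hit("in_1", 1e-80, True), Hit("out_1", 1e-75, False)]
hits_gap = [Hit("in_1", 1e-80, True), Hit("in_2", 1e-60, True),
            Hit("out_1", 1e-6, False)]

for name, hits in [("clean", hits_clean), ("ambiguous", hits_ambiguous),
                   ("gap", hits_gap)]:
    print(name, is_clade_specific(hits))
```

In this sketch the out-of-clade hit in hits_clean is discarded by the E-value cutoff, whereas in hits_gap it is significant but separated from the in-clade hits by a wide margin.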
2.2 Grouping of CSP into Taxonomic levels
We determined the taxonomic placement of the significant hits for each CSP from
BLASTp searches. A CSP should have multiple, similar sequences that are shared among
several closely related species. The taxonomic report produced by a BLASTp search
yields the distribution of the query CSP across all Bacteria. The lowest common ancestor (LCA) of
the reported taxa was identified. LCA analysis indicates the most recent taxon from which all
descendant organisms are derived (Travers et al., 2004). For example, if a CSP is
identified in 50 species, and these species belong to 3 genera X, Y, Z under 2 families M,
N under 1 order A, this CSP will be defined as an order A-specific CSP. It will not be named
a genus X-specific or family M-specific CSP because this marker is not uniquely present
in a single genus or family but is also present in genera Y, Z and family N. The principles of
LCA analysis yield the most parsimonious definition for the specificity of this CSP
(Travers et al., 2004). A few organisms outside the clade may also share some CSPs found
within a monophyletic clade of Alphaproteobacteria. These cases are likely due to lateral gene
transfer (LGT) events, and the protein markers may still be regarded as clade-specific
markers (Beiko and Ragan, 2008). Occasionally, a few CSPs might be found
sporadically distributed in several distantly related bacterial clades. These signature
markers were likely misidentified because of the limited number of sequenced
Alphaproteobacterial species available at that time, and such non-specific markers are excluded from
the CSP database.
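
The LCA assignment in the example above can be sketched in a few lines of Python. This is only an illustration: the function name lowest_common_ancestor is hypothetical, lineages are assumed to be listed from the broadest rank to the narrowest, and the placement of genera X and Y in family M and genus Z in family N is an assumption made for the example, since the text does not state which genus belongs to which family.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest (rank, name) pair shared by all lineages.

    Each lineage is a list of (rank, name) pairs ordered from the broadest
    rank to the narrowest. Returns None if nothing is shared.
    """
    if not lineages:
        return None
    lca = None
    # Walk down the ranks in parallel and stop at the first disagreement.
    for level in zip(*lineages):
        if all(entry == level[0] for entry in level):
            lca = level[0]
        else:
            break
    return lca


# Hypothetical lineages mirroring the example in the text.
lineages = [
    [("order", "A"), ("family", "M"), ("genus", "X")],
    [("order", "A"), ("family", "M"), ("genus", "Y")],
    [("order", "A"), ("family", "N"), ("genus", "Z")],
]

print(lowest_common_ancestor(lineages))
```

Because the three lineages already disagree at the family rank, the marker is assigned to order A rather than to any single family or genus.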
Table 1: Alphaproteobacteria specificity and predictive ability of CSPs identified in
2007 and 2013

                                          # of sequenced genomes   # of identified CSPs   Accession # and
Clade Specificity                         2007       2013          2007       2013        other information
Alphaproteobacteria                       60         250           4          4           Table 3A
Alphaproteobacteria except Rickettsiales  45         180           7          7           Table 3B
Rhizobiales                               24         96            3          3           Table 4A
Clade 1 Rhizobiales                       14         58            16         16          Table 4B
Rhizobiaceae and Phyllobacteriaceae       6          30            18         18          Table 5C
Bradyrhizobiaceae, Xanthobacteraceae      10         20            74         74          Table 5A, 5B
Rhodobacterales                           8          26            35         35          Table 6A
Rhodobacteraceae                          3          4             13         13          Table 6B
Caulobacterales                           3          7             11         11          Table 7
Sphingomonadales                          5          14            31         31          Table 8
Rhodospirillales                          5          27            4          0*          N/A
Acetobacteraceae                          3          17            14         17*         Table 9A
Rhodospirillaceae                         2          10            14         14          Table 9B
Rickettsiales                             15         69            3          2*          Table 10A
Anaplasmataceae                           7          23            15         16*         Table 10B
Rickettsiaceae                            7          45            3          3           Table 10C

Note: Values marked with * (underlined in the original) highlight the changes in CSP specificity over the period.
Table 2: Comparison of the Results of BLAST Search with Protein and Nucleotide
Sequences

Accession   # of Hits [1]  Protein Specificity   Gene ID   # of Hits [2]  Nucleotide Specificity
NP_422086   621            α-proteobacteria      943808    8              Caulobacteraceae
NP_105743   276            Clade1 Rhizobiales    1228404   13             Mesorhizobium and Sinorhizobium
NP_102577   76             Rhizobiaceae          1225240   2              Mesorhizobium
YP_317328   32             Bradyrhizobiaceae     3674956   2              Nitrobacter
YP_611978   92             Rhodobacterales       4075456   1              Ruegeria sp. TM1040
YP_614100   21             Rhodobacteraceae      4077857   1              Silicibacter sp. TM1040
YP_495301   76             Sphingomonadales      3916060   1              Novosphingobium aromaticivorans
AAW62008    45             Acetobacteraceae      3249894   1              Gluconobacter oxydans
YP_428643   23             Rhodospirillaceae     3837017   2              Rhodospirillum rubrum
NP_220498   92             Rickettsiales         883719    42             Rickettsia

[1] Significant hits (hits with E-values below 1e-04) of protein sequences were obtained
using BLASTp
[2] Significant hits (hits with E-values below 1e-04) of nucleotide sequences were obtained
using BLASTn
Chapter 3. Results
3.1 Confirmation of the uniqueness of CSPs
Most CSPs remained specific to their original taxa even though the number of sequenced
Alphaproteobacteria species has increased almost fourfold (Table 1). In the CSP
database, 4 Alphaproteobacteria-specific CSPs that were formerly shared by 60 sequenced
Alphaproteobacteria species are now uniquely shared by more than 250
Alphaproteobacteria species, including many members sequenced between 2007 and 2013.
Similar results were seen for the other 7 Alphaproteobacteria-specific CSPs (which are
absent from the order Rickettsiales). The 37 Rhizobiales-specific CSPs were also
confirmed to be specific for most Rhizobiales species. For example, 3
Rhizobiales-specific CSPs were identified in almost all 96 sequenced Rhizobiales species.
In detail, 16 Clade 1 Rhizobiales CSPs were commonly shared by 11 Rhizobiaceae species, 8
Phyllobacteriaceae species, 2 Aurantimonadaceae species, 16 Brucellaceae species and
12 Bartonellaceae species (another 18 CSPs were present only in Rhizobiaceae and
Phyllobacteriaceae species). Likewise, another important clade, comprising
Bradyrhizobiaceae and Xanthobacteraceae, yielded a similar pattern: 74 CSPs were present
in 18 Bradyrhizobiaceae and 3 Xanthobacteraceae species. BLAST search results for the
other 4 major orders of Alphaproteobacteria also validated the prediction that CSPs
previously identified from a limited number of sequenced Alphaproteobacteria would be
present in newly sequenced Alphaproteobacterial species. Thirty-five
Rhodobacterales-specific CSPs were highly conserved in 41 Rhodobacterales species, while
13 CSPs previously regarded as specific to Silicibacter and Roseobacter were also present
in other Rhodobacteraceae, such as Phaeobacter and Ruegeria species. These 13 CSPs are
now defined as Rhodobacteraceae-specific CSPs. Eleven Caulobacterales-specific CSPs were
found to be unique to 9 Caulobacterales species and 4 Hyphomonadaceae species. Thirty-one
Sphingomonadales-specific CSPs are now uniquely present in 3 Erythrobacteraceae species
and 17 Sphingomonadaceae species. Most Rhodospirillales-specific and
Rickettsiales-specific CSPs were conserved within their groups. However, 4
Rhodospirillales-specific CSPs proved to be specific only to Acetobacteraceae, and 1
Rickettsiales-specific CSP proved to be specific to Anaplasmataceae species (underlined
in Table 1). Only 1 non-specific CSP was identified (accession number AAW61951), which
was formerly regarded as specific to Acetobacteraceae. This was the only CSP that did not
meet the classification criterion, and as a result the CSP database contained 264
qualified CSPs in total.
Important differences in clade specificity were observed for the same genes when BLAST
searches were performed with nucleotide sequence data rather than protein sequence data
(Table 2). For example, for two signature proteins specific for the family
Anaplasmataceae (viz. NP_966526 and NP_965909), BLASTp searches with the amino acid
sequences gave significant hits for all of the sequenced species from the family
Anaplasmataceae (e.g. Wolbachia, Anaplasma, Ehrlichia, etc.). In contrast, when the
BLAST searches were carried out with the gene sequences for the same proteins, all
significant hits were restricted to either Wolbachia or Anaplasma species, depending
upon whether the Wolbachia or the Anaplasma gene sequence was used as the query.
Similarly, for a signature protein that is specific for Caulobacterales (viz. NP_419305),
the BLASTp search with its amino acid sequence identified >30 significant hits covering
all of the sequenced Caulobacterales species, while the BLASTn search with its nucleotide
sequence identified only 6 significant hits, most of which were from the genus
Caulobacter. Similar differences were observed in the BLAST results for the signature
proteins of other bacterial clades. Thus, the use of gene sequences as markers may
grossly underestimate the taxonomic diversity of microbial species in environments
compared with the diversity revealed by the use of CSPs.
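The significance criterion used in these comparisons (E-value below 1e-04, as stated in the
notes to Table 2) can be illustrated with a minimal sketch. It assumes that the BLAST
results are available as tab-separated text with the E-value in the 11th column, which is
the default 12-column tabular layout of BLAST+; the thesis does not state which output
format was used, and the example rows below are invented for illustration only.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-4  # threshold taken from the notes to Table 2


def significant_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (subject_id, e_value) pairs with an E-value below the cutoff.

    Assumes 12 tab-separated columns per hit, with the subject identifier
    in column 2 and the E-value in column 11.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        e_value = float(row[10])
        if e_value < cutoff:
            hits.append((row[1], e_value))
    return hits


# Invented example rows (hypothetical subjects, not real search results).
EXAMPLE = (
    "NP_419305\tsubject_A\t98.0\t258\t5\t0\t1\t258\t1\t258\t1e-150\t520\n"
    "NP_419305\tsubject_B\t45.0\t200\t110\t3\t10\t209\t5\t204\t3e-20\t95\n"
    "NP_419305\tsubject_C\t28.0\t60\t43\t1\t100\t159\t40\t99\t0.5\t30\n"
)

hits = significant_hits(EXAMPLE)
print(len(hits), "significant hits:", [subject for subject, _ in hits])
```

Counting the hits that survive this filter for a protein query and for the corresponding
gene query gives the two hit counts compared in Table 2.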
3.2 Grouping of CSPs into taxonomic levels
Once all qualified CSPs had been filtered, they could be grouped according to their
taxonomic specificity. All of these CSPs are specific either to the class
Alphaproteobacteria or to different orders and families within Alphaproteobacteria. In
the CSP database, they are divided into 8 major groups: 11 Alphaproteobacteria-specific
CSPs (Table 3), 37 Clade-1 Rhizobiales-specific CSPs (Table 4), 74 Bradyrhizobiaceae- and
Xanthobacteraceae-specific CSPs (Table 5), 48 Rhodobacterales-specific CSPs (Table 6),
11 Caulobacterales-specific CSPs (Table 7), 31 Sphingomonadales-specific CSPs (Table 8),
31 Rhodospirillales-specific CSPs (Table 9) and 21 Rickettsiales-specific CSPs (Table
10).
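As a simple consistency check, the sizes of the 8 groups can be tallied against the total
of 264 qualified CSPs. The sketch below counts the entries as they are listed in Tables 3
to 10 (sub-table by sub-table); the dictionary keys are labels chosen here for
illustration.

```python
# Number of entries listed in each table, split by sub-table where applicable.
csp_counts = {
    "Alphaproteobacteria (Table 3)": 4 + 7,
    "Rhizobiales (Table 4)": 3 + 16 + 18,
    "Bradyrhizobiaceae and Xanthobacteraceae (Table 5)": 12 + 62,
    "Rhodobacterales (Table 6)": 35 + 13,
    "Caulobacterales (Table 7)": 11,
    "Sphingomonadales (Table 8)": 31,
    "Rhodospirillales (Table 9)": 17 + 14,
    "Rickettsiales (Table 10)": 2 + 16 + 3,
}

total = sum(csp_counts.values())
for group, count in csp_counts.items():
    print(f"{group}: {count}")
print("Total:", total)
```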
Table 3: CSPs specific to Alphaproteobacteria
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to all Alphaproteobacteria
CC2102 | NP_420905 | 162 | CC3319 | NP_422113 | 89
CC3292 | NP_422086 | 224 | CC1365 | NP_420178 | 161
B. CSPs unique to Alphaproteobacteria except Rickettsiales
CC1211 | NP_420025 | 167 | CC0520 | NP_419339 | 284
CC1886 | NP_420693 | 223 | CC3010 | NP_421804 | 216
CC2245 | NP_421048 | 190 | CC0100 | NP_418919 | 576
CC3470 | NP_422264 | 253
Table 4: CSPs specific to Rhizobiales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to Rhizobiales
BQ12030 | YP_032733 | 91 | BQ00720 | YP_031797 | 83
BQ07670 | YP_032395 | 336
B. CSPs unique to Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and
Aurantimonadaceae
mll0062 | NP_101943 | 107 | mll1268 | NP_102895 | 108
mll4068 | NP_105027 | 144 | mll2847 | NP_104087 | 186
mll7791 | NP_108034 | 263 | mll2898 | NP_104130 | 144
mlr0777 | NP_102510 | 186 | mll4298 | NP_105201 | 171
mlr0789 | NP_102519 | 207 | mll5001 | NP_105743 | 324
mlr3016 | NP_104217 | 166 | mll8359 | NP_108472 | 415
msl6526 | NP_107016 | 80 | mlr1823 | NP_103319 | 198
mll0122 | NP_101988 | 349 | mlr0094 | NP_101965 | 299
C. CSPs unique to Rhizobiaceae and Phyllobacteriaceae
mll0080 | NP_101954 | 172 | mll0459 | NP_102252 | 108
mll0867 | NP_102577 | 168 | mll1779 | NP_103286 | 141
mll9619 | NP_109472 | 296 | mll6195 | NP_106741 | 174
mlr5174 | NP_105883 | 181 | mll8758 | NP_106740 | 205
mll6303 | NP_106835 | 292 | mlr3037 | NP_104236 | 281
mll6703 | NP_107159 | 198 | mll2007 | NP_103455 | 289
mlr1904 | NP_103376 | 146 | mlr1999 | NP_103450 | 111
mlr3274 | NP_104418 | 461 | mlr2029 | NP_103476 | 238
mlr4951 | NP_105704 | 84 | mlr6601 | NP_107075 | 141
Table 5: CSPs specific to Bradyrhizobiaceae and Xanthobacteraceae
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to Bradyrhizobiaceae and Xanthobacteraceae
bll6014 | NP_772654 | 193 | Nwi_1674 | YP_318287 | 185
Nwi_1093 | YP_317707 | 195 | Nwi_1705 | YP_318318 | 63
Nwi_1227 | YP_317841 | 106 | Nwi_1711 | YP_318324 | 77
Nwi_1786 | YP_318399 | 126 | Nwi_1785 | YP_318398 | 422
Nwi_1788 | YP_318401 | 190 | Nwi_1793 | YP_318406 | 165
Nwi_2147 | YP_318753 | 82 | Nwi_1800 | YP_318413 | 84
B. CSPs unique to Bradyrhizobiaceae
Nwi_2179 | YP_318785 | 161 | Nwi_2021 | YP_318632 | 172
Nwi_2432 | YP_319038 | 110 | Nwi_2063 | YP_318673 | 186
Nwi_2476 | YP_319081 | 85 | Nwi_2064 | YP_318674 | 148
Nwi_2572 | YP_319177 | 171 | Nwi_2163 | YP_318769 | 156
Nwi_2623 | YP_319228 | 87 | Nwi_2173 | YP_318779 | 109
Nwi_2707 | YP_319312 | 198 | Nwi_2183 | YP_318789 | 129
bll5899 | NP_772539 | 131 | Nwi_2208 | YP_318814 | 174
blr6106 | NP_772746 | 141 | Nwi_2244 | YP_318850 | 164
Nwi_0278 | YP_316897 | 398 | Nwi_2247 | YP_318853 | 230
Nwi_0503 | YP_317122 | 108 | Nwi_2379 | YP_318985 | 450
Nwi_0528 | YP_317147 | 66 | Nwi_2381 | YP_318987 | 63
Nwi_0605 | YP_317224 | 71 | Nwi_2414 | YP_319020 | 89
Nwi_0710 | YP_317328 | 248 | Nwi_2489 | YP_319094 | 259
Nwi_0925 | YP_317539 | 86 | Nwi_2492 | YP_319097 | 122
Nwi_0966 | YP_317580 | 260 | Nwi_2500 | YP_319105 | 152
Nwi_1084 | YP_317698 | 385 | Nwi_2506 | YP_319111 | 72
Nwi_1092 | YP_317706 | 145 | Nwi_2509 | YP_319114 | 98
Nwi_1107 | YP_317721 | 121 | Nwi_2531 | YP_319136 | 96
Nwi_1108 | YP_317722 | 121 | Nwi_2575 | YP_319180 | 399
Nwi_1336 | YP_317949 | 146 | Nwi_2577 | YP_319182 | 135
Nwi_1139 | YP_317753 | 321 | Nwi_2588 | YP_319193 | 62
Nwi_1247 | YP_317861 | 113 | Nwi_2630 | YP_319235 | 141
Nwi_1270 | YP_317883 | 137 | Nwi_2676 | YP_319281 | 217
Nwi_1275 | YP_317888 | 126 | Nwi_2677 | YP_319282 | 102
Nwi_1454 | YP_318067 | 160 | Nwi_2769 | YP_319374 | 127
Nwi_1498 | YP_318111 | 142 | Nwi_2789 | YP_319394 | 112
Nwi_1512 | YP_318125 | 409 | Nwi_2984 | YP_319586 | 68
Nwi_1581 | YP_318194 | 99 | Nwi_2959 | YP_319561 | 87
Nwi_1582 | YP_318195 | 83 | Nwi_3035 | YP_319637 | 582
Nwi_1586 | YP_318199 | 182 | Nwi_3140 | YP_319739 | 156
Nwi_1649 | YP_318262 | 101 | Nwi_3141 | YP_319740 | 104
Table 6: CSPs specific to Rhodobacterales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to Rhodobacterales
TM1040_0093 | YP_612088 | 168 | TM1040_1988 | YP_613982 | 105
TM1040_0184 | YP_612179 | 289 | TM1040_2263 | YP_614257 | 761
TM1040_0236 | YP_612231 | 270 | TM1040_2370 | YP_614364 | 221
TM1040_0471 | YP_612466 | 179 | TM1040_2425 | YP_614419 | 278
TM1040_0586 | YP_612581 | 329 | TM1040_2466 | YP_614460 | 241
TM1040_0587 | YP_612582 | 291 | TM1040_2487 | YP_614481 | 272
TM1040_0697 | YP_612692 | 80 | TM1040_2582 | YP_614576 | 122
TM1040_0750 | YP_612745 | 154 | TM1040_2999 | YP_614993 | 121
TM1040_0752 | YP_612747 | 130 | TM1040_3077 | YP_611313 | 175
TM1040_1063 | YP_613058 | 112 | TM1040_3749 | YP_611978 | 343
TM1040_1064 | YP_613059 | 135 | TM1040_3759 | YP_611988 | 207
TM1040_1247 | YP_613242 | 161 | TM1040_3764 | YP_611993 | 276
TM1040_1350 | YP_613345 | 179 | TM1040_1558 | YP_613553 | 70
TM1040_1406 | YP_613401 | 181 | TM1040_1735 | YP_613730 | 138
TM1040_1567 | YP_613562 | 351 | TM1040_2157 | YP_613732 | 360
TM1040_1842 | YP_613837 | 148 | TM1040_2443 | YP_613733 | 212
TM1040_1967 | YP_613961 | 732 | TM1040_2680 | YP_613734 | 202
TM1040_1844 | YP_613839 | 256
B. CSPs unique to Rhodobacteraceae
TM1040_1099 | YP_613094 | 149 | TM1040_3189 | YP_611425 | 93
TM1040_1423 | YP_613418 | 124 | TM1040_3202 | YP_611438 | 109
TM1040_1451 | YP_613446 | 194 | TM1040_3208 | YP_611444 | 100
TM1040_1986 | YP_613980 | 193 | TM1040_3226 | YP_611462 | 270
TM1040_2106 | YP_614100 | 105 | TM1040_3529 | YP_611763 | 288
TM1040_2139 | YP_614133 | 102 | TM1040_3626 | YP_611855 | 192
TM1040_3075 | YP_611311 | 84
Table 7: CSPs specific to Caulobacterales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
CC0486 | NP_419305 | 258 | CC1066 | NP_419882 | 126
CC2480 | NP_421283 | 253 | CC1586 | NP_420397 | 214
CC2764 | NP_421560 | 415 | CC2207 | NP_421010 | 222
CC3101 | NP_421895 | 379 | CC2628 | NP_421428 | 147
CC0512 | NP_419331 | 289 | CC2639 | NP_421438 | 309
CC1064 | NP_419880 | 296
Table 8: CSPs specific to Sphingomonadales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
Saro_0018 | YP_495301 | 300 | Saro_0044 | YP_495327 | 129
Saro_0052 | YP_495335 | 193 | Saro_0154 | YP_495437 | 97
Saro_0087 | YP_495370 | 221 | Saro_0415 | YP_495697 | 140
Saro_0150 | YP_495433 | 133 | Saro_0458 | YP_495740 | 319
Saro_0232 | YP_495514 | 448 | Saro_1078 | YP_496357 | 223
Saro_0409 | YP_495691 | 175 | Saro_1126 | YP_496405 | 286
Saro_1088 | YP_496367 | 220 | Saro_1160 | YP_496439 | 103
Saro_1144 | YP_496423 | 243 | Saro_1163 | YP_496442 | 70
Saro_1291 | YP_496569 | 190 | Saro_1748 | YP_497022 | 221
Saro_1378 | YP_496656 | 227 | Saro_1785 | YP_497059 | 117
Saro_1914 | YP_497188 | 156 | Saro_1972 | YP_497246 | 72
Saro_2130 | YP_497403 | 184 | Saro_2036 | YP_497309 | 414
Saro_2788 | YP_498058 | 296 | Saro_2037 | YP_497310 | 99
Saro_2958 | YP_498227 | 251 | Saro_2333 | YP_497604 | 568
Saro_3138 | YP_498407 | 159 | Saro_2548 | YP_497818 | 290
Saro_3213 | YP_498482 | 246
Table 9: CSPs specific to Rhodospirillales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to Acetobacteraceae
GOX0633 | AAW60410 | 347 | GOX1222 | AAW60983 | 304
GOX0695 | AAW60472 | 165 | GOX1224 | AAW60985 | 207
GOX0963 | AAW60735 | 311 | GOX2275 | AAW62008 | 201
GOX1258 | AAW61019 | 186 | GOX2316 | AAW62049 | 628
GOX0143 | AAW59936 | 198 | GOX2452 | AAW62183 | 143
GOX1616 | AAW61357 | 430 | GOX2454 | AAW62185 | 466
GOX0343 | AAW60126 | 232 | GOX1233 | AAW60994 | 272
GOX1212 | AAW60973 | 472 | GOX2456 | AAW62187 | 497
GOX1215 | AAW60976 | 133
B. CSPs unique to Rhodospirillaceae
Rru_A0125 | YP_425217 | 449 | Rru_A2592 | YP_427676 | 231
Rru_A0152 | YP_425244 | 138 | Rru_A2828 | YP_427912 | 169
Rru_A0531 | YP_425622 | 588 | Rru_A3562 | YP_428643 | 349
Rru_A1689 | YP_426776 | 178 | Rru_A3636 | YP_428717 | 464
Rru_A1756 | YP_426843 | 139 | Rru_A3662 | YP_428743 | 119
Rru_A2112 | YP_427199 | 237 | Rru_A3739 | YP_428820 | 464
Rru_A2510 | YP_427597 | 184 | Rru_A3800 | YP_428881 | 153
Table 10: CSPs specific to Rickettsiales
Gene ID | Accession # | Length | Gene ID | Accession # | Length
A. CSPs unique to Rickettsiales
WD0161 | NP_965979 | 70 | WD0715 | NP_966474 | 94
B. CSPs unique to Anaplasmataceae
WD0083 | NP_965909 | 271 | WD0821 | NP_966574 | 156
WD0827 | NP_966580 | 191 | WD0863 | NP_966613 | 147
WD0157 | NP_965975 | 242 | WD0771 | NP_966526 | 460
WD0148 | NP_965966 | 139 | WD0764 | NP_966520 | 138
WD0772 | NP_966527 | 202 | WD1025 | NP_966750 | 97
WD0412 | NP_966202 | 143 | WD1056 | NP_966779 | 92
WD0467 | NP_966253 | 106 | WD1220 | NP_966932 | 204
WD0757 | NP_966513 | 290 | WD1230 | NP_966942 | 243
C. CSPs unique to Rickettsiaceae
RP187 | NP_220576 | 194 | RP030 | NP_220424 | 219
RP192 | NP_220581 | 128
Chapter 4. Discussion
4.1 Confirmation of the uniqueness of CSPs
The purpose of this study was to determine whether the CSPs identified in earlier studies
could still be regarded as specific for their intended groups, so that the results
obtained with them in metagenomic analysis would be reliable. The results of the repeated
BLAST searches indicate that most of these CSPs are still specific for the previously
reported taxonomic units, with a small number of exceptions (Table 1). Among the 265
CSPs examined in this study, only 6 proteins no longer retain their originally reported
specificity. One Rickettsiales-specific CSP (accession number NP_966526) is now an
Anaplasmataceae-specific CSP (Table 10B). Similarly, four CSPs that were previously
regarded as unique to the order Rhodospirillales (Gupta and Mok, 2007a) have now been
determined to be uniquely present in either Gluconobacter (accession numbers AAW60410
and AAW60472) or Acetobacteraceae (accession numbers AAW60735 and AAW61019) (Table 9A).
Thus, no CSP specific to the whole order Rhodospirillales has yet been identified.
Another Acetobacteraceae-specific CSP (accession number AAW61951) was found to be a
sporadically distributed protein that is also present in some distantly related
bacterial groups, including Verrucomicrobia and Planctomycetes. These CSPs were probably
misassigned earlier because of the limited number of sequenced Rhodospirillales species
(only 3 Acetobacteraceae species and 2 Rhodospirillaceae species were available in 2007)
(Gupta and Mok, 2007a). The majority of CSPs maintain their original taxonomic
specificity and have been identified in the expected bacterial species that were fully
sequenced after 2007 (Table 1).
4.2 Grouping of CSPs into taxonomic levels
The CSPs used in this work were first identified when sequence information was available
for only a limited number of Alphaproteobacterial species (Gupta and Mok, 2007a). Hence,
an initial undertaking in this work was to confirm their group specificities. The BLAST
search results confirmed that most of these proteins were still specific for the
originally indicated taxonomic clades despite a manifold increase in the number of
sequenced Alphaproteobacteria genomes (Table 1). Most of these signature markers are
present in the genomes of newly sequenced Alphaproteobacteria species belonging to the
appropriate taxonomic groupings, but not in any other bacteria. Because of their
observed specificities for different clades of Alphaproteobacteria, these CSPs serve as
distinctive characteristics that mark the divergence of Alphaproteobacteria clades in
evolutionary history. These molecular markers thus provide reliable evidence to support
the branching pattern of Alphaproteobacteria in an evolutionary context.
The grouping of molecular markers is based on phylogenetic analysis of the CSPs’
specificity. Each molecular marker is shared by several closely related taxa at a given
taxonomic rank, such as class, order or family. Phylum-specific and genus-specific
markers were not considered in this study. Since sufficient CSPs have been identified
previously and they represent almost all major clades of Alphaproteobacteria, these CSPs
can be divided into 8 groups based on their taxonomic rankings. They are specific either
to Alphaproteobacteria or to sub-clades of Alphaproteobacteria. The CSP database
consists of three tiers. Tier 1 CSPs are specific to the class Alphaproteobacteria. Tier
2 CSPs are specific to different orders of Alphaproteobacteria, such as Rhizobiales,
Rhodobacterales and Caulobacterales. Tier 3 CSPs are specific to constituent families
within these orders. With these three tiers of CSPs, it is possible to diagnose the
presence of organisms in a hierarchical manner. Tier 2 CSPs are not evenly distributed
among the 6 orders of Alphaproteobacteria. The largest order, Rhizobiales, contains 111
CSPs (Tables 4 and 5), which comprise about 42% of all CSPs, while Caulobacterales
contains merely 11 CSPs. The disparity in CSP numbers among orders results from bias in
the selection of fully sequenced Alphaproteobacterial genomes: pathogenic and
agriculturally important Alphaproteobacterial species are studied more extensively.
Apart from the CSPs unique to the class, order and family levels, phylum-specific and
genus-specific CSPs are also available for Proteobacteria and Brucella. Since this
project concentrates mainly on the class Alphaproteobacteria, CSPs specific to
Betaproteobacteria/Gammaproteobacteria are not considered during database construction.
As Brucella is an intracellular pathogen, it is likely that Brucella-specific CSPs cannot
be readily detected in environmental samples, and thus they are not included in the CSP
database.
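The hierarchical use of the three tiers can be sketched as a simple lookup. This is only
an illustration under stated assumptions: the mapping holds a handful of accessions
copied from Tables 3, 6, 7 and 10, the function name clades_by_tier is hypothetical, and
XX_000000 is a placeholder for a hit that is not in the CSP database.

```python
from collections import defaultdict

# accession -> (tier, clade); tier 1 = class, tier 2 = order, tier 3 = family.
# Only a few entries from Tables 3, 6, 7 and 10 are shown.
CSP_TIERS = {
    "NP_420905": (1, "Alphaproteobacteria"),
    "NP_419305": (2, "Caulobacterales"),
    "YP_612088": (2, "Rhodobacterales"),
    "YP_614100": (3, "Rhodobacteraceae"),
    "NP_965909": (3, "Anaplasmataceae"),
}


def clades_by_tier(detected_accessions):
    """Group the clades indicated by detected CSPs under their tier."""
    found = defaultdict(set)
    for accession in detected_accessions:
        if accession in CSP_TIERS:  # ignore hits that are not CSPs
            tier, clade = CSP_TIERS[accession]
            found[tier].add(clade)
    return {tier: sorted(clades) for tier, clades in sorted(found.items())}


# Hypothetical set of CSPs detected in one sample.
report = clades_by_tier(["NP_420905", "YP_612088", "YP_614100", "XX_000000"])
print(report)
```

In this hypothetical sample, a class-level signal is supported by an order-level signal
for Rhodobacterales and a family-level signal for Rhodobacteraceae, which is the kind of
nested evidence the three tiers are meant to provide.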
4.3 Future experiments
The next objective of my project is to detect the presence of different
Alphaproteobacteria clades in metagenomic samples. The following experiments need to be
designed:
(i) Selection of suitable metagenomes for Alphaproteobacteria detection. Parameters
such as the relative abundance of Alphaproteobacteria in metagenomic datasets will be
taken into account for metagenome selection. Qualified metagenomes will be used for
organism identification.
(ii) Application of the CSP database to metagenomes. This will test whether the CSP
database can be used to identify environmental bacteria.
(iii) Comparative analysis of metagenomes for taxonomic profiling of
Alphaproteobacteria. Results obtained with CSPs will be compared with those of
traditional similarity-based binning to verify whether CSP-based similarity searches
produce comparably reliable results.
Once accomplished, the experiments described above are expected to address the issues
and objectives of this project.
Part II Identification of Alphaproteobacteria specific CSPs in metagenomic samples
Chapter 1. Introduction
1.1 Metagenome, environmental genomes
A metagenome is the composite of the genomes of all organisms in an environmental sample
(Thomas et al., 2012). Metagenomics investigates the microbial world by applying
sequencing methods and bioinformatic technologies to environmental microbial
communities, bypassing the need for isolation and culturing of individual microbial
members (Ghazanfar et al., 2010). Only about 1.0% of all micro-organisms on Earth can be
cultured successfully in artificial media (Ferrari et al., 2005). For instance, soil
microbial communities are estimated to comprise 5,000 to 20,000 different species, but
only 50 to 200 of them can be isolated and cultured (Handelsman, 2004). Metagenomic
studies may therefore provide more information about microbial diversity in the
environment (Gilbert and Dupont, 2011).
All sequence-based metagenomic studies follow a similar procedure:
(1) Total genomic DNA is extracted directly from environmental samples such as soil,
permafrost, marine water, termite gut and human intestine, without isolation and
culturing (Solonenko et al., 2013). Before downstream analysis of the sequence data,
quality control (QC) and duplicate clustering (DC) are performed to reduce potential
artificial sequences present in unassembled raw reads. The QC filter calculates the
average quality score of each read; based on these statistics, overall quality is
assessed and the high-quality reads are retained for further analysis (Lindner et al.,
2013). Duplicate clustering is another important preparatory step that identifies
duplicates in the raw read data. These duplicates are mainly sequencing artifacts in the
metagenomic library, such as vectors and plasmids. Duplicate clustering also reduces the
redundancy of metagenomic reads to yield a non-redundant dataset (Li et al., 2012). Since
raw metagenomic reads are almost non-redundant owing to the complexity of environmental
bacterial communities, DC does not bias the results of subsequent experiments (Lindner
et al., 2013). However, most duplicates in transcriptomes are not artifacts, so running
the DC workflow on meta-transcriptomic datasets is not recommended (Li et al., 2012).
(2) Metagenomic samples are sequenced either through vector sequencing or direct
sequencing (Morgan et al., 2010). In the former protocol, environmental DNA is
fragmented into small pieces, which are subsequently inserted into vectors in
Escherichia coli to build a metagenomic library (Lussier et al., 2011). Direct
sequencing skips the construction of a metagenomic library and sequences the fragmented
microbial genomes from environmental samples directly (Kisand et al., 2012).
(3) The purpose of metagenomic assembly is to assemble similar sequences from related
genomes while preventing the assembly of similar sequences from unrelated genomes (Ruby
et al., 2013). The metagenomic reads are assembled into contigs and scaffolds (Nijkamp
et al., 2013). However, sequence assembly is a major bottleneck in metagenomic studies:
repeats lead to ambiguity in genome recovery, insufficient coverage leaves many gaps
within genomes, and sequencing errors are an inherent flaw that precedes any
bioinformatic analysis (Huang et al., 2012). In many metagenomic studies, analysis is
therefore performed directly on raw reads without sequence assembly (Takacs-Vesbach et
al., 2013).
(4) RNA genes and open reading frames (ORFs) are predicted with the basic local
alignment search tool (BLAST) (Altschul et al., 1990). BLAST is an algorithm for
comparing the extent of similarity between two sequences, and it applies to both amino
acid and nucleotide sequences. A BLAST search compares the query sequences to a database
of sequences to identify known sequences that are similar to the queries above a cutoff
threshold (Altschul et al., 1990). Apart from alignment-based similarity searches with
BLAST, hidden Markov models are an alternative for predicting rRNA-specific structures,
and six-frame translation is applied to identify all potential ORFs within a DNA
sequence of any size (Siepel and Haussler, 2004). Gene prediction of RNA and ORFs
uncovers taxonomic information and functional categories in metagenomic reads (Leimena
et al., 2013).
(5) After the phylogeny of tRNA and protein ORFs has been predicted, all annotated
sequences are classified according to their most likely taxonomic origin and functional
category (Strous et al., 2012). For taxonomic clustering, all metagenomic reads showing
similar phylogenetic affiliations are placed on a particular taxon in the bacterial
taxonomy (Dröge and McHardy, 2012). There are two algorithms for calculating the
phylogenetic affiliation of metagenomic sequences. One relies on the best hit of a BLAST
search to determine the taxonomic origin of a read, while the other, which is more
parsimonious and reliable, takes the lowest common ancestor of all significant hits
above the threshold to assign the taxonomic placement of metagenomic reads (Albertsen et
al., 2013). As for functional binning, all annotated genes are mapped to database
resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the SEED
classification, based on higher functional categories and subordinate biological
subsystems (Mitra et al., 2011).
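The difference between the two taxonomic assignment strategies mentioned in step (5) can
be illustrated with a minimal sketch. It assumes that each hit carries a root-to-leaf
lineage and an E-value, and it reuses the 1e-04 cutoff applied elsewhere in this thesis;
the function names and the example E-values are hypothetical, and real tools differ in
their details.

```python
def best_hit_taxon(hits):
    """Assign the read to the taxon of the single hit with the lowest E-value."""
    lineage, _ = min(hits, key=lambda hit: hit[1])
    return lineage[-1]


def lowest_common_ancestor(hits, cutoff=1e-4):
    """Assign the read to the deepest taxon shared by all significant hits."""
    lineages = [lineage for lineage, e_value in hits if e_value < cutoff]
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):  # walk down the ranks from the root
        if all(rank == ranks[0] for rank in ranks):
            shared = ranks[0]
        else:
            break
    return shared


# Hypothetical hits for one read: (lineage from root to genus, E-value).
PREFIX = ("Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rickettsiales")
example_hits = [
    (PREFIX + ("Anaplasmataceae", "Wolbachia"), 1e-40),
    (PREFIX + ("Anaplasmataceae", "Anaplasma"), 1e-30),
    (PREFIX + ("Anaplasmataceae", "Ehrlichia"), 1e-25),
    (PREFIX + ("Rickettsiaceae", "Rickettsia"), 0.5),  # not significant
]

best = best_hit_taxon(example_hits)
lca = lowest_common_ancestor(example_hits)
print("Best hit:", best)
print("Lowest common ancestor:", lca)
```

The best-hit rule commits the read to a single genus, whereas the lowest common ancestor
rule stops at the family shared by all significant hits, which is the more conservative
placement.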
Unveiling the taxonomic and functional diversity of the microbial community in a
particular environment enables us to answer two questions: “Who is there?” and “What are
they doing?” (Handelsman, 2004). By constructing networks between environmental
sequences and microbial attributes, it is feasible to predict the potential presence of
similar or identical species and functional pathways in other similar environments (Ghai
et al., 2013). Understanding the composition of microbial communities and their
interaction networks allows identification of the core bacterial metabolic pathways
that sustain the balanced development of bacterial communities, thus providing valuable
information for environmentalists seeking to inhibit the production of toxic compounds
or to enhance the production of beneficial metabolites for the well-being of the
ecosystem (Brennerova et al., 2009).
1.2 Taxonomic classification of metagenomic reads: methods and challenges
Measuring species diversity in metagenomes provides the answer to “Who is there?”
(Chistoserdova, 2013). Binning is the process needed to connect each metagenomic
sequence to a certain taxon, and traditional binning comprises two approaches:
composition-based binning and similarity-based binning (Dröge and McHardy, 2012).
In composition-based binning, metagenomic software is used to exploit inherent features
of sequences, such as GC content, codon usage bias and tetranucleotide frequency (Roller
et al., 2013; Teeling et al., 2004). These approaches can distinguish new species in the
environment, the so-called operational taxonomic units (OTUs), which is valuable because
most species in natural environments have not been successfully cultured and are beyond
laboratory characterization (Wooley et al., 2010).
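Two of the compositional features named above, GC content and tetranucleotide frequency,
can be computed as in the following minimal sketch. The function names and the short
example read are hypothetical; published binning tools apply further normalization and
statistics that are not reproduced here.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(bases) for bases in product("ACGT", repeat=4)]
    total = sum(counts[kmer] for kmer in kmers)  # windows with ambiguous bases are ignored
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


read = "ATGCGCGCATATGCGC"  # hypothetical 16-base read
gc = gc_content(read)
freqs = tetranucleotide_frequencies(read)
print("GC content:", gc)
print("Most frequent tetranucleotide:", max(freqs, key=freqs.get))
```

Reads or contigs with similar feature vectors can then be clustered together, which is
the basic idea behind composition-based binning.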
Similarity-based binning, also called alignment-based binning, matches metagenomic
sequences to reference databases; methods such as BLAST (Altschul et al., 1990),
PhymmBL (Brady and Salzberg, 2009) and MetaPhlAn (Segata et al., 2012) are employed in
metagenomic research. These methods not only identify and measure the relative abundance
and diversity of known microbial organisms in the environment, but also reveal the
functional impact of bacterial communities, because extensive laboratory studies on
individual microbial species have characterized the functions of genes and proteins
within these cultivable species (Leung et al., 2011).
Both binning strategies are important and complement each other in metagenomic
taxonomic profiling. The former discovers unknown species in wild environments, while
the latter investigates known species (Wu and Ye, 2011). However, neither can fully
reveal environmental species diversity, given that 99% of environmental microbes have
not yet been cultured (Schloss and Handelsman, 2005). Usually, similarity-based binning
is more accurate and sensitive than composition-based binning, but its performance
depends heavily on the reference resources (Xia et al., 2011). Composition-based binning
clusters all sequences into groups but fails to build an association between metagenomic
reads and individual bacterial taxa (Thomas et al., 2012).
Other issues, such as computation time and computing requirements, are persistent
problems still waiting to be resolved (Thomas et al., 2012). A BLASTX analysis was once
performed on permafrost soil samples comprising 176 million Illumina DNA reads
(Mackelprang et al., 2011), which eventually cost 800,000 CPU hours on a work station
server (64 cores, 512 GB main memory) (Huson and Xie, 2013). In view of these
limitations, a hybrid of the two binning approaches is preferable for accurate
taxonomic classification (Mohammed et al., 2011).
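To put the reported computing cost into perspective, the figures quoted above can be
converted into a cost per read and an approximate wall-clock time. The sketch below is
simple arithmetic and assumes, optimistically, that the work scales perfectly across all
64 cores.

```python
cpu_hours = 800_000    # reported cost of the BLASTX analysis
reads = 176_000_000    # number of Illumina reads analysed
cores = 64             # cores of the server quoted in the text

cpu_seconds_per_read = cpu_hours * 3600 / reads
wall_clock_hours = cpu_hours / cores       # assumes perfect parallel scaling
wall_clock_days = wall_clock_hours / 24

print(f"{cpu_seconds_per_read:.1f} CPU seconds per read")
print(f"{wall_clock_hours:.0f} hours, or about {wall_clock_days:.0f} days, on {cores} cores")
```

Under this assumption the analysis corresponds to roughly 16 CPU seconds per read and
well over a year of continuous computation on a single 64-core server.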
1.3 Application of metagenomics
Metagenomics has a broad range of potential applications in which current knowledge is
transferred to the solution of practical problems. Some pioneering attempts have proved
successful in fields such as energy, agriculture, the environment, medicine and
engineering (National Research Council (US) Committee on Metagenomics: Challenges and
Functional Applications, 2007).
Microbial communities in the human gut have an essential impact on human health.
However, the composition of gastrointestinal microbes and the mechanisms by which they
influence the human body remain cryptic (Bäckhed et al., 2012). In view of this,
metagenomic technology is used to characterize the human microbiome (Lepage et al.,
2013). One of the largest projects involving the human gut microbiome, Metagenomics of
the Human Intestinal Tract (MetaHIT), was initiated by the European Commission to
explore the relationships between changes in the human microbiome and human health by
gathering genomic sequences of all microbial organisms on 15 to 18 different body sites
from 250 European individuals (Qin et al., 2010). The primary goal of the MetaHIT project
is to determine a core human microbiome that maintains human health (Ursell et al.,
2012). Another clinical study within the MetaHIT project aims to classify the profound
phylogenetic variation of gastrointestinal microbes between healthy people and patients
suffering from diseases and disorders such as Crohn’s disease, irritable bowel syndrome
(IBS) and obesity (Moloney et al., 2013). The results showed that two bacterial phyla,
Bacteroidetes and Firmicutes, dominate the distal gut, comprising >90% of known bacteria
(Le Chatelier et al., 2013). Gene frequency profiling identified 1244 metagenomic
functional clusters of crucial importance to the health of the human intestinal tract,
whose functions are divided into two categories: housekeeping clusters and
intestine-specific clusters (Qin et al., 2010). Housekeeping functions are indispensable
in the human gut and are required by all microbial members because they play key roles
in the main metabolic pathways, including the citric acid cycle and amino acid
synthesis, whereas gut-specific functions deal with adhesion to host proteins and sugar
harvesting (Qin et al., 2010). One of the discoveries regarding IBS is that the number of
microbiome genes in patients is 25% lower than in healthy controls, and bacterial
diversity is also lower in IBS patients. It is strongly indicated that gut-associated
disease and obesity result from a reduction in gut microbiome diversity (Qin et al.,
2010). Despite the potential of human gut metagenomics, it is notable that only 7.6 to
21.2% of the metagenomic reads can be matched to bacterial genomes in GenBank, and many
novel bacterial species in the human distal gut have not yet been studied (Qin et al.,
2010). The characterization of the unknown microbiome may shed light on new medical
therapies for human gastrointestinal diseases (Kinross et al., 2011).
Metagenomics also advances the exploration of new green energy sources. Bioenergy is
expected to be the next-generation fuel that could replace fossil fuels (Hess et al.,
2011). Biofuels are derived from biomass conversion, which transforms plant material
such as grain, starch, sugar, oil, cellulose, hemicellulose and lignin into cellulosic
ethanol, methane and hydrogen (van der Lelie et al., 2012). The transformation process
relies upon microbial cohorts from habitats ranging from herbivorous mammals, insects
and birds to rainforest soils (Allgaier et al., 2010). The microbial communities in
these habitats share core cellulolytic genes coding for enzymes that degrade biomass
(Scully et al., 2013). In view of the importance of such enzymes and the inexhaustible
pool of environmental microbes, metagenomics aims to analyze sophisticated microbial
consortia in order to discover novel enzymes that fulfil industrial requirements, namely
biomass-deconstructing enzymes with higher productivity and lower cost (Hess et al.,
2011). Meanwhile, metagenomic technologies permit comparative analysis between
convergent microbial ecosystems, which in turn improves the understanding of different
biomass degradation mechanisms (Lu et al., 2012). Metagenomic approaches not only
identify the diverse enzymes of interest, but also help to control the activation and
repression of these catalytic processes (Zhang et al., 2013). Industrial-scale biofuel
production is likely to reduce the release of greenhouse gases and improve environmental
quality (Sommer et al., 2010).
38
M. Sc. Thesis—Quan Yao McMaster—Biology
Microbial communities in soils are recognized as the most diverse and complex
bacterial ecosystems, with 10^9 to 10^10 microbial cells in one gram of soil (Vogel et al.,
2009). In spite of the enormous amount of sequence information that soil holds (one gigabase
per gram of soil), its taxonomic composition and functional categories are poorly
understood (Vogel et al., 2009). Many bacteria develop stable symbiotic relationships
with specific plants and provide diverse ecological services as symbionts or epibionts for
plant growth, including atmospheric nitrogen fixation, nutrient cycling, pathogen
resistance, and trace element enrichment (Rascovan et al., 2013). Functional
metagenomic pipelines seek to decipher the complex interactions and
communication between soil microbes and plants by screening for novel genes of
interest in microbial communities (Rout and Callaway, 2012). Insight into the rare
uncultivable bacterial members responsible for mutualism and intra-species competitive
inhibition also offers a new perspective on plant disease resistance and improved farming
practices (Rosen et al., 2009). Applying metagenomic techniques
to agriculture enables the improvement and maintenance of crop health, provided that the
dynamic equilibrium between microbes and plants is kept under control (Rascovan et al.,
2013).
Apart from the applications described above, metagenomic approaches also tackle
environmental issues. In the field of environmental remediation, new policies and
strategies based on metagenomic principles are advocated for monitoring the impact of
pollutants and cleaning up environmental contamination (Yergeau et al., 2012). One
metagenomic project targets a wastewater treatment plant where microorganisms
remove excess inorganic phosphate from wastewater, a treatment process called
enhanced biological phosphorus removal (EBPR) (Nielsen et al., 2012). Another
wastewater treatment project, in a common effluent treatment plant (CETP), investigates
the activated biomass occupied by particular microbial communities in this niche (Kapley
et al., 2007). Metagenomic studies aim to identify novel bacterial members in
these niches and to explore new catabolic pathways that help reduce the chemical oxygen
demand (COD), so that wastewater treated by the activated sludge process can
subsequently be released safely into the environment (Ravi P More, 2013). Although the
metabolic traits of this process are not yet well understood, increased understanding of
how microbial communities deal with pollutants provides a theoretical foundation for
environmentalists to assess sites potentially vulnerable to contaminants and to
develop appropriate strategies that increase the chance of removing pollution for habitat
rehabilitation (Gomez-Alvarez et al., 2012).
1.4 Project objectives
To profile bacterial diversity and functional categories, it is necessary
to map metagenomic sequences to reference databases (Mitra et al., 2011). Since 99% of
environmental microbial species cannot be cultured in laboratories, it is a major challenge to
identify wild-type species based on the limited genomic information from sequenced
cultured isolates (Albertsen et al., 2013). Based on the results of the first project,
several molecular markers have been found to be present in newly sequenced species of
different clades within Alphaproteobacteria. Given that these molecular markers are
ubiquitously present in all potential Alphaproteobacteria species, it is highly likely that
environmental Alphaproteobacteria also carry these signatures.
The second project consists of 3 related components. The first step is to choose several
metagenomic samples that may contain Alphaproteobacteria; for this purpose, a large-scale
screening test was performed on 200 metagenomic samples. Subsequently, a more detailed
and comprehensive profiling of Alphaproteobacteria clades was performed on the
selected metagenomes. Finally, a comparative analysis was carried out to compare the
relative abundance of Alphaproteobacteria among the selected metagenomes. With these
experiments completed, the overall performance of molecular signatures in identifying
environmental microorganisms can be assessed and a new molecular-marker-based
method can be developed to determine the taxonomic classification of metagenomes.
Chapter 2 Materials and methods
2.1 Metagenome selection
A systematic tBLASTn search was conducted with an E-value threshold of 1 × 10^-4, using 4
Alphaproteobacteria class-specific CSPs as queries against 201 metagenomes. These 201
metagenomic samples consist of 15,049,531 genomic sequences in total and are divided
into either ecological metagenomes or organismal metagenomes on the NCBI genomic
BLAST webpage. All significant hits passing the threshold were collected and a
metagenome taxonomy report was generated to show the distribution of CSPs in
candidate metagenomes. Metagenomes containing sequences highly similar to the CSPs suggest that
Alphaproteobacteria might be abundant in those habitats and are thus preferable for this study.
According to the sequence similarity and the number of positive hits discovered in
candidate metagenomes, qualified metagenomic projects were selected for
later Alphaproteobacteria profiling.
2.2 Identification of CSPs in metagenomic samples
A systematic tBLASTn search was performed with all 264 CSPs against the 10 qualified
metagenomic projects. An E-value threshold of 1 × 10^-4 was employed in this experiment,
with the default filter (low-complexity regions) to eliminate statistically significant but
biologically uninteresting hits from the BLAST output (Coletta et al., 2010). Then, the best
bit scores and the numbers of positive hits of all CSPs were collected to evaluate the quality and
quantity of the information derived from the BLAST results for further analysis. The bit score is a
numerical value that describes the overall quality of an alignment: the higher the score, the
better the alignment (Altschul et al., 1990).
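The collection of best bit scores and positive hit numbers per CSP can be sketched in Python. This is a minimal illustration and not the procedure used in this study: it assumes that the tBLASTn results have been exported as tab-separated text in the standard 12-column BLAST tabular layout (query ID in column 1, E-value in column 11, bit score in column 12). The function name `summarize_hits`, the query labels and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

E_VALUE_CUTOFF = 1e-4  # threshold stated in the text


def summarize_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (best_bit_score, n_significant_hits)}.

    Assumes 12-column tab-separated BLAST output (hypothetical export).
    """
    best = defaultdict(float)
    count = defaultdict(int)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id = row[0]
        e_value = float(row[10])
        bit_score = float(row[11])
        if e_value <= cutoff:  # keep only significant hits
            count[query_id] += 1
            best[query_id] = max(best[query_id], bit_score)
    return {q: (best[q], count[q]) for q in count}


# Hypothetical example rows (not data from this study).
example = (
    "CSP_A\tread_1\t85.0\t120\t18\t0\t1\t120\t1\t360\t1e-50\t212\n"
    "CSP_A\tread_2\t60.0\t110\t44\t0\t5\t114\t10\t339\t3e-20\t94\n"
    "CSP_A\tread_3\t30.0\t40\t28\t0\t1\t40\t1\t120\t0.5\t30\n"
    "CSP_B\tread_4\t35.0\t50\t32\t1\t1\t50\t1\t150\t2e-3\t41\n"
)
summary = summarize_hits(example)
print(summary)
```

In this sketch a CSP with no hit at or below the cutoff is simply absent from the result, which corresponds to a zero entry in a hit-count table.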
2.3 Comparative analysis of Alphaproteobacteria in metagenomes
The distribution of Alphaproteobacteria clades was plotted based on the best bit scores
and the numbers of positive hits for all CSPs. A heatmap was then created according to the
distribution pattern of the 264 CSPs. The average bit score and the total number of
positive hits for all CSPs in each metagenome were also calculated to indicate the overall
relative abundance of Alphaproteobacteria in each metagenome. Afterwards, a
comparative analysis of CSPs was conducted to show the detailed proportions of
different Alphaproteobacteria clades across the 10 metagenomes. To validate the reliability of
the CSP-based methodology, the relative abundance obtained with the CSP-based approach in this
study was compared with the similarity-based taxonomic classification on public metagenomic
servers. Taxonomic hit distributions of the 10 metagenomes were
obtained from either the Metagenome Rapid Annotation using Subsystem Technology
server (MG-RAST) (Meyer et al., 2008) or Integrated Microbial Genomes with
Microbiome Samples (IMG/M) (Markowitz et al., 2012).
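The per-metagenome summary statistics can be illustrated with a short Python sketch. It is an illustration only: it assumes per-CSP results stored as (best bit score, number of hits) pairs and, as a further assumption not stated in the text, averages the best bit score only over CSPs with at least one significant hit. The function names, CSP labels, clade mapping and values are hypothetical.

```python
from statistics import mean


def metagenome_summary(per_csp):
    """Return (average best bit score, total significant hits) for one metagenome.

    per_csp maps a CSP label to (best_bit_score, n_hits).
    Assumption: CSPs without hits are excluded from the average.
    """
    detected = [score for score, hits in per_csp.values() if hits > 0]
    average_bit_score = mean(detected) if detected else 0.0
    total_hits = sum(hits for _, hits in per_csp.values())
    return average_bit_score, total_hits


def hits_by_clade(per_csp, clade_of):
    """Sum significant hits over the CSPs assigned to each clade."""
    totals = {}
    for csp, (_, hits) in per_csp.items():
        clade = clade_of[csp]
        totals[clade] = totals.get(clade, 0) + hits
    return totals


# Hypothetical values for three CSPs in one metagenome.
example = {"CSP_1": (200.0, 10), "CSP_2": (100.0, 5), "CSP_3": (0.0, 0)}
clade_of = {"CSP_1": "class", "CSP_2": "Rhizobiales", "CSP_3": "Rhizobiales"}

avg_score, total = metagenome_summary(example)
per_clade = hits_by_clade(example, clade_of)
print(avg_score, total, per_clade)
```

Averaging over all CSPs, including those without hits, would be an equally defensible convention and would lower the average for metagenomes in which many CSPs are absent.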
Table 11 Characteristics of Metagenomic Datasets Investigated in this Study

Metagenomic project    # of Contigs(1)   Total length (Mb)   Average length (bp)   Raw sequencing data (Gb)   α-proteobacteria %(2)
Wastewater             172,804           421.6               2,440                 157.50                     16.8%
Marine                 54,509            77.4                1,420                 1.17                       30.7%
Bioreactor             748,672           317.9               425                   1.44                       17.2%
Compost                218,885           104.9               479                   0.28                       27.0%
Activated sludge       36,270            27.9                769                   N/A                        N/A
Whale fall             84,317            89.6                1,062                 0.14                       23.8%
Freshwater sediment    252,427           214.8               850                   8.20                       5.3%
Microbial mat          112,984           84.2                745                   0.12                       21.9%
Hydrothermal vent      26,573            24.9                937                   0.03                       19.8%
Groundwater            37,367            104.7               2,801                 7.20                       4.6%

1. Contigs are assembled metagenomic sequences
2. The percentage of α-proteobacteria is calculated based on the ratio of reads annotated
to α-proteobacteria to all metagenomic reads
Chapter 3 Results
3.1 Metagenome selection
A total of 201 metagenomic datasets were accessible on the NCBI genomic BLAST webpage. To
determine which metagenomic datasets were dominated by Alphaproteobacteria, a large-scale
BLAST search was undertaken with Alphaproteobacteria-specific CSPs. The
results indicated that Alphaproteobacteria were most likely present in roughly
10 metagenomic projects. The 4 CSPs used for preliminary screening have been confirmed
to be the most conserved and specific molecular signatures shared by all available
Alphaproteobacteria species (Table 3A). The tBLASTn results of these CSPs indicated
that 220 BLAST hits were identified in 10 metagenomic projects (the accession # on
NCBI and project ID on MG-RAST and IMG/M servers are listed in brackets). They
were Microbial mat metagenome (PRJNA29795, 4440964.3) (Harris et al., 2012), Marine
metagenome (PRJNA16339, 4443701.3) (DeLong et al., 2006), Wastewater metagenome
(PRJNA167559, 4455295.3) (Mielczarek et al., 2013), Freshwater sediment metagenome
(PRJNA30541, 2006543005) (Kalyuzhnaya et al., 2008), Hydrothermal vent metagenome
(PRJNA37895, 4461585.3) (Brazelton and Baross, 2009), Bioreactor metagenome
(PRJNA73603, 20220044000) (van der Lelie et al., 2012), Activated sludge metagenome
(PRJNA61401, N/A) (Kapley et al., 2007), Compost metagenome (PRJNA41493,
4446153.3) (Allgaier et al., 2010), Whale fall metagenome (PRJNA81625, 4441619.3)
(Tringe et al., 2005) and Groundwater metagenome (PRJNA114691, 3300000815)
(Wrighton et al., 2012). Although positive hits were also sporadically distributed in other
metagenomic projects such as the freshwater metagenome and the mosquito metagenome, these
two metagenomic projects were not pursued further because BLAST analysis indicated
that neither the bit scores nor the numbers of hits were sufficient to classify these metagenomic
datasets as Alphaproteobacteria-abundant metagenomes.
The 10 metagenomic projects described above comprised more than 50
metagenomic samples, so the most representative metagenomic sample was selected
from each metagenomic project. Six of the 10 metagenomic projects
(the wastewater, hydrothermal vent, activated sludge,
compost, bioreactor and groundwater metagenomes) contained
only one sample each, which automatically became the representative
metagenomic sample for the corresponding project. The whale fall,
freshwater sediment and microbial mat metagenomes were made up of 3, 5
and 10 datasets respectively. Since the samples in each of these projects
concentrated on a single topic, the metagenomic datasets in each project could be combined
as a single sample for research. The marine metagenome comprised 35 metagenomic samples
gathered from all around the world. Further analysis indicated that the most significant hits of
Alphaproteobacteria-specific CSPs were identified in the North Pacific Subtropical Gyre
Planktonic Microbial Community, so it was selected as the representative marine
metagenome sample in this project.
The sample size of each metagenomic project, the number of contigs and the total length
of all reads were collected from the WGS master webpage (shotgun assembly sequences for
genome and transcriptome). From Table 11, it can be seen that the number of assembled
sequences per metagenomic project ranges from tens of thousands to hundreds of
thousands. The total length of the metagenomic reads ranges from tens of
millions to hundreds of millions of base pairs. The average length of a metagenomic
read is calculated as the total length divided by the number of contigs, and
ranges from roughly 500 bp to 2500 bp. The
metagenomic reads are appropriate for CSP-based similarity searches because their average
length is comparable to the length of CSPs. To obtain an overall
picture of how many Alphaproteobacteria are assumed to be present in these
metagenomic projects, the organism abundance of each metagenome was looked up on the MG-RAST
and IMG/M servers. The numbers of reads annotated to Alphaproteobacteria were
collected to calculate the relative proportion of all Alphaproteobacteria species in each
metagenome (Table 11). The relative abundance of Alphaproteobacteria based on the
proportion of related metagenomic reads ranges from 5% to 30%, which indicates
that Alphaproteobacteria is one of the major groups in the selected metagenomic projects.
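The two calculations described in this paragraph (average length as total length divided by the number of contigs, and relative abundance as the share of annotated reads) can be checked with a small Python sketch. The contig counts and total lengths are the Bioreactor and Compost values from Table 11; the function names and the read counts in the abundance example are hypothetical.

```python
def average_length_bp(total_length_mb, n_contigs):
    """Average sequence length in bp from total length (Mb) and sequence count."""
    return total_length_mb * 1_000_000 / n_contigs


def relative_abundance_percent(clade_reads, all_reads):
    """Share of reads annotated to a clade, as a percentage of all reads."""
    return 100.0 * clade_reads / all_reads


# Bioreactor and Compost rows of Table 11.
bioreactor = average_length_bp(317.9, 748_672)
compost = average_length_bp(104.9, 218_885)
print(round(bioreactor), round(compost))

# Hypothetical read counts, chosen only to illustrate the ratio.
share = relative_abundance_percent(1_720, 10_000)
print(share)
```

Because the total lengths in Table 11 are rounded to 0.1 Mb, recomputed averages can differ from the tabulated ones by about one base pair.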
3.2 Identification of CSPs in metagenomic samples
After selecting 10 appropriate metagenomes, the distribution of all CSPs in these
metagenomic reads was investigated. The bit scores, as well as the numbers of significant
hits obtained for 16 CSPs unique to different clades of Alphaproteobacteria, are
summarized in Figure 1. In addition, two heatmaps were built to depict the detailed
distribution of Alphaproteobacterial clades in the 10 metagenomes (Figure 2 and 3). In this
study, 11 CSPs specific for Alphaproteobacteria at the class level (i.e. they are specifically
found in all or most Alphaproteobacteria) were applied to identify the presence of
Alphaproteobacteria in metagenomic samples. Significant hits of these CSPs with high bit
scores were identified in the metagenomic datasets, and in most cases multiple
metagenomic reads were found to exhibit positive hits. However, the total number of
significant hits for these 11 CSPs, as well as the bit scores, showed considerable
variation between metagenomic datasets. Notably, the bioreactor, wastewater and whale fall
metagenomes have more Alphaproteobacterial sequences than the other 7 metagenomes
(Figure 2 and 3). These differences may be related to the size of the datasets themselves
as well as the relative abundance of Alphaproteobacteria in these metagenomes. These
findings indicate that CSPs specific for the class Alphaproteobacteria are
ubiquitously present in the 10 metagenomic datasets and particularly enriched in three of them.
Multiple CSPs that are specific for either all Rhizobiales or two major clades within
this order have been identified, comprising 3 CSPs specific for Rhizobiales, 16 CSPs
specific for Brucellaceae, Bartonellaceae, Phyllobacteriaceae, Rhizobiaceae and
Aurantimonadaceae (called clade-1 Rhizobiales) and 18 CSPs specific for Rhizobiaceae
and Phyllobacteriaceae. The tBLASTn searches with these CSPs demonstrated
that their significant hits were highly concentrated in the wastewater
metagenome, followed by the bioreactor, compost and whale fall metagenomes. In contrast,
these CSPs were either sporadically distributed or totally absent in the other
metagenomes studied (Figure 2 and 3).
Another important sub-clade within Rhizobiales is the Bradyrhizobiaceae and
Xanthobacteraceae group. All 74 CSPs examined for this group are consistently present in either
the Bradyrhizobiaceae family or both the Bradyrhizobiaceae and Xanthobacteraceae families.
The tBLASTn results indicated that their distribution differed somewhat from that of the
clade-1 Rhizobiales-specific CSPs. The maximum number of BLAST hits was observed
in this case for the bioreactor metagenome, while the marine and the compost
metagenomes also yielded comparable numbers of significant hits (Figure 2 and 3).
The distributions of CSPs specific for the orders Rhodobacterales, Caulobacterales,
Sphingomonadales, Rhodospirillales and Rickettsiales were investigated separately. For
the 35 Rhodobacterales-specific CSPs, multiple significant hits were detected in the
following 6 metagenomes: wastewater, marine, microbial mat, hydrothermal vent, whale
fall and groundwater. Furthermore, the average bit scores of CSPs in these
6 metagenomes were higher than those in the other 4 metagenomes, which gave more
confidence in the reliability of these results and indicated that Rhodobacterales species
were important constituents of these metagenomes (Table 11). Also, the distribution of
significant BLAST hits based on 11 Caulobacterales-specific CSPs indicated that the
Caulobacterales were likely enriched in the bioreactor, wastewater and whale fall
metagenomes (Figure 2 and 3).
The results of tBLASTn searches for the 31 Sphingomonadales-specific
CSPs in the 10 metagenomes are displayed in Figure 2 and 3. These CSPs were
highly concentrated in the bioreactor and wastewater metagenomes, and moderately represented in
the marine and whale fall metagenomes. It was inferred that Sphingomonadales species might
prefer engineered or marine habitats over the other environments examined in this
study.
The analyses of tBLASTn results with Rhodospirillales-specific CSPs indicated that
Rhodospirillales were most abundant in the bioreactor metagenome, although the
corresponding CSPs were also present in low numbers in the marine, compost, freshwater
sediment and whale fall metagenomes (Figure 2 and 3).
Twenty-one CSPs specific for the order Rickettsiales were also employed to detect potential
pathogens in environmental datasets. Only 2 significant hits were observed, in the wastewater
and freshwater sediment metagenomes respectively (Figure 2 and 3), probably
because intracellular pathogenic bacteria are not common in environmental
metagenomic samples.
3.3 Comparative analysis of Alphaproteobacteria in metagenomes
The best bit scores and the numbers of significant hits from the BLAST search
results of all CSPs were collected. In summary, 4 metagenomic datasets enriched in
Alphaproteobacteria were identified: the bioreactor, wastewater,
marine and whale fall metagenomes. The results
for the remaining metagenomes are shown in Figure 4. All these significant hits were derived
from either Alphaproteobacteria class-specific CSPs or clade-specific CSPs. For instance,
among the 410 hits found in the bioreactor metagenome, 125 were from the 11
Alphaproteobacteria class-specific CSPs, 73 were from Bradyrhizobiaceae and 98 were
from Sphingomonadales. In the wastewater metagenome, the 551 significant hits discovered
by CSPs were mainly from the Alphaproteobacteria class (179 hits), Rhizobiales (130 hits),
Rhodobacterales (87 hits) and Sphingomonadales (109 hits). As for the whale fall
metagenome, more than 75% of the significant hits were derived from
Alphaproteobacteria class-specific CSPs (109 hits) and Rhodobacterales (114 hits).
By calculating the total number of significant hits discovered for all CSPs and
grouping them by order, a comparative analysis was made to show the
detailed distribution of Alphaproteobacteria clades in each metagenome. According to
Figure 5, Alphaproteobacteria were most abundant in the bioreactor, wastewater and whale
fall metagenomes, not only for the whole class, but also for different orders of
Alphaproteobacteria. For example, in the bioreactor metagenome, the relative abundances of
Rhizobiales, Bradyrhizobiaceae, Sphingomonadales and Rhodospirillales were higher
than in the other metagenomes. Though the wastewater metagenome was also enriched
in Alphaproteobacteria, its composition of enriched organisms differed from that of the
bioreactor metagenome. In the wastewater metagenome, Rhizobiales, Rhodobacterales and
Sphingomonadales were the most dominant groups of Alphaproteobacteria, but the
abundance of Bradyrhizobiaceae was lower than in the bioreactor. According to the CSP
distribution, only Rhodobacterales and Sphingomonadales were abundant in the whale fall
metagenome, although other clades were also moderately present in this
metagenome.
To compare the organism abundance between CSP-based binning and similarity-based
binning, the taxonomic classification from the MG-RAST and IMG/M servers (Figure
6) was collected. Four metagenomes (bioreactor, wastewater, whale fall and marine) were
compared in this study. In the bioreactor metagenome, the relative abundances of
Alphaproteobacteria clades based on CSP distribution were 5% for
Rhizobiales, 11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 3% for
Caulobacterales, 14% for Sphingomonadales and 7% for Rhodospirillales. The relative
proportions for the same metagenome derived from the IMG/M server were 7% for Rhizobiales,
11% for Bradyrhizobiaceae, 2% for Rhodobacterales, 6% for Caulobacterales, 11% for
Sphingomonadales and 10% for Rhodospirillales. The two sets of results were highly correlated.
The organism abundance for the other 3 metagenomes on the MG-RAST server
was also similar to the results based on the CSP search (Figure 6).
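The text does not state how the agreement between the two sets of proportions was measured. One possible way to quantify it, shown here as a hedged sketch rather than as the method used in this study, is a Pearson correlation coefficient over the six clade percentages reported for the bioreactor metagenome. The function name `pearson_r` and the variable names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


clades = ["Rhizobiales", "Bradyrhizobiaceae", "Rhodobacterales",
          "Caulobacterales", "Sphingomonadales", "Rhodospirillales"]
csp_based = [5, 11, 2, 3, 14, 7]    # % from CSP distribution (bioreactor)
img_m = [7, 11, 2, 6, 11, 10]       # % from the IMG/M server (bioreactor)

r = pearson_r(csp_based, img_m)
print(f"Pearson r over {len(clades)} clades: {r:.2f}")
```

For these six values the coefficient is approximately 0.88. With only six data points this should be read as a rough indication of agreement, not as a formal validation.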
[Figure 1 heatmap. Rows: 16 CSPs, two per clade: Alphaproteobacteria (NP_422086, NP_420178),
Rhizobiales (YP_031797, YP_032395), Bradyrhizobiaceae (YP_317328, YP_317580),
Rhodobacterales (YP_614257, YP_611978), Caulobacterales (NP_419305, NP_421895),
Sphingomonadales (YP_495301, YP_496569), Rhodospirillales (AAW62049, YP_425217),
Rickettsiales (NP_965979, NP_966474). Columns: the 10 metagenomes (bioreactor, wastewater,
marine, compost, microbial mat, hydrothermal vent, activated sludge, freshwater sediment,
whale fall, groundwater). Panels: best bit score; significant hits.]
Figure 1: Summary heatmap of 16 Alphaproteobacteria specific CSPs in 10
metagenomes
The upper heatmap specifies the best bit score within each metagenome assigned to the
listed taxa. The lower heatmap indicates the numbers of significant hits within each
metagenome that are assigned to the listed taxa. Color formatting indicates high and low
values. Zero values are in green. Values between 1 and 10 are in yellow. Red indicates the
highest values in the chart.
CSP
Bioreactor
Wastewater
Marine
Compost
Microbial Mat
Hydrothermal Vent
Activated Sludge
Freshwater Sediment
Whale fall
Groundwater
Rhizobiales
α-proteobacteria
M. Sc. Thesis—Quan Yao McMaster—Biology
NP_420905
NP_422086
NP_422113
NP_420178
NP_420025
NP_420693
NP_421048
NP_422264
NP_419339
NP_421804
NP_418919
YP_031797
YP_032733
YP_032395
NP_101943
NP_105027
NP_108034
NP_102510
NP_102519
NP_104217
NP_107016
NP_101988
NP_102895
NP_104087
NP_104130
NP_105201
NP_105743
NP_108472
15
23
8
11
9
6
14
6
10
15
8
3
1
4
0
0
4
0
0
0
2
4
0
0
2
0
4
3
17
18
20
18
17
13
17
18
11
21
9
11
5
11
4
3
8
3
5
4
4
8
5
1
6
4
5
4
3
7
1
6
5
5
5
1
0
8
4
1
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
3
2
2
3
1
1
3
1
0
3
4
2
1
5
0
0
2
0
1
0
0
1
0
1
0
0
1
0
2
3
2
3
1
1
1
0
2
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
2
2
2
1
1
1
1
0
0
2
0
0
0
0
1
0
0
0
0
1
1
0
0
0
0
1
0
2
1
2
2
3
1
3
1
1
1
1
1
0
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
2
3
0
2
0
0
0
1
2
1
3
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
7
8
8
9
11
5
11
8
12
9
21
1
1
2
0
0
2
0
0
0
0
1
0
0
1
0
1
0
2
1
1
2
0
1
0
2
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
55
Bradyrhizobiaceae and Xanthobacteraceae
M. Sc. Thesis—Quan Yao McMaster—Biology
NP_103319
NP_101965
NP_101954
NP_102577
NP_109472
NP_105883
NP_106835
NP_107159
NP_103376
NP_104418
NP_105704
NP_102252
NP_103286
NP_106741
NP_106740
NP_104236
NP_103455
NP_103450
NP_103476
NP_107075
NP_772654
YP_317707
YP_317841
YP_318399
YP_318401
YP_318753
YP_318785
YP_319038
YP_319081
YP_319177
YP_319228
YP_319312
NP_772539
NP_772746
YP_316897
YP_317122
YP_317147
1
4
0
1
0
0
1
1
1
0
0
0
0
0
0
2
0
0
0
0
0
0
1
3
3
0
0
0
3
1
1
6
0
0
2
0
0
6
4
4
1
2
5
0
1
9
1
1
0
1
2
0
0
0
0
2
0
1
1
4
0
0
0
0
0
2
1
2
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
2
0
0
4
0
0
56
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
M. Sc. Thesis—Quan Yao McMaster—Biology
YP_317224
YP_317328
YP_317539
YP_317580
YP_317698
YP_317706
YP_317721
YP_317722
YP_317949
YP_317753
YP_317861
YP_317883
YP_317888
YP_318067
YP_318111
YP_318125
YP_318194
YP_318195
YP_318199
YP_318262
YP_318287
YP_318318
YP_318324
YP_318398
YP_318406
YP_318413
YP_318632
YP_318673
YP_318674
YP_318769
YP_318779
YP_318789
YP_318814
YP_318850
YP_318853
YP_318985
YP_318987
1
2
0
2
0
0
1
0
1
0
0
0
0
0
0
0
1
0
2
1
2
6
1
0
3
0
0
0
0
0
1
0
0
0
2
20
0
0
0
1
0
4
0
1
1
1
1
0
0
0
0
0
0
1
0
2
0
0
1
0
0
2
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
8
0
0
1
1
1
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
4
0
0
0
1
0
0
0
0
1
0
5
0
57
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
6
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
6
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Rhodobacterales
M. Sc. Thesis—Quan Yao McMaster—Biology
YP_319020
YP_319094
YP_319097
YP_319105
YP_319111
YP_319114
YP_319136
YP_319180
YP_319182
YP_319193
YP_319235
YP_319281
YP_319282
YP_319374
YP_319394
YP_319586
YP_319561
YP_319637
YP_319739
YP_319740
YP_612088
YP_612179
YP_612231
YP_612466
YP_612581
YP_612582
YP_612692
YP_612745
YP_612747
YP_613058
YP_613059
YP_613242
YP_613345
YP_613401
YP_613562
YP_613837
YP_613961
0
0
2
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
1
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
5
1
0
1
1
1
1
1
1
4
2
0
1
1
1
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
0
0
0
0
0
0
0
1
0
1
0
0
1
0
1
0
0
0
0
0
0
1
0
0
2
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
58
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
4
0
0
1
2
1
4
1
3
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
2
1
0
1
2
0
0
1
1
0
2
1
0
1
1
4
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
5
2
2
4
3
2
1
0
2
4
0
2
0
4
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
2
1
0
0
1
0
0
1
1
1
0
0
0
Caulobacterales
M. Sc. Thesis—Quan Yao McMaster—Biology
YP_613982
YP_614257
YP_614364
YP_614419
YP_614460
YP_614481
YP_614576
YP_614993
YP_611313
YP_611978
YP_611988
YP_611993
YP_613553
YP_613730
YP_613732
YP_613733
YP_613734
YP_613731
YP_613094
YP_611425
YP_613418
YP_613446
YP_613980
YP_614100
YP_614133
YP_611311
YP_611438
YP_611444
YP_611462
YP_611763
YP_611855
NP_419305
NP_421283
NP_421560
NP_421895
NP_419331
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
0
4
4
0
0
0
0
0
0
0
0
0
0
0
0
0
2
3
0
11
0
1
1
1
2
1
2
1
1
1
1
2
2
1
1
5
0
0
39
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
6
3
0
0
0
1
0
0
0
2
0
2
2
0
0
0
1
0
13
17
0
2
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
59
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
2
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
1
0
0
0
2
2
1
1
0
0
0
2
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
1
2
1
1
1
1
2
1
0
1
2
2
0
4
1
2
3
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
5
5
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
3
3
2
5
4
3
3
5
1
3
2
5
1
2
5
3
3
20
0
0
0
2
0
2
0
0
0
0
0
0
0
3
3
1
1
0
0
1
0
1
0
0
1
1
0
1
0
1
1
1
0
0
3
13
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Sphingomonadales
M. Sc. Thesis—Quan Yao McMaster—Biology
NP_419880
NP_419882
NP_420397
NP_421010
NP_421428
NP_421438
YP_495301
YP_495335
YP_495370
YP_495433
YP_495514
YP_495691
YP_496367
YP_496423
YP_496569
YP_496656
YP_497188
YP_497403
YP_498058
YP_498227
YP_498407
YP_498482
YP_495327
YP_495437
YP_495697
YP_495740
YP_496357
YP_496405
YP_496439
YP_496442
YP_497022
YP_497059
YP_497246
YP_497309
YP_497310
YP_497604
YP_497818
0
0
0
2
0
0
4
3
1
0
10
4
8
1
3
1
2
2
1
4
2
1
1
1
9
1
5
0
6
0
4
0
0
0
4
16
4
2
1
0
1
0
2
5
5
3
2
4
4
5
2
5
2
4
4
3
4
3
3
1
3
5
3
4
3
4
3
7
1
2
1
2
6
6
0
1
0
1
0
0
2
0
1
1
2
0
0
0
1
0
1
0
0
0
0
0
3
0
0
1
1
2
0
0
0
0
1
0
0
2
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
60
0
0
0
0
1
0
0
0
0
0
0
0
0
0
[Figure 2 heat map (continued): numbers of significant hits for the Rhodospirillales- and Rickettsiales-specific CSPs (rows, listed by accession) in the 10 metagenomes (columns).]
Figure 2: Alphaproteobacteria specific CSPs identified in 10 metagenomes
Numbers of significant hits within each metagenome are assigned to the listed taxa. Color
formatting indicates high and low values. Negative results are in green. Positive results
are in yellow. Red indicates the highest values in the chart.
[Figure 3 heat map: rows are the CSPs, listed by accession and grouped by clade (α-proteobacteria, Rhizobiales, Bradyrhizobiaceae and Xanthobacteraceae, Rhodobacterales, Sphingomonadales, Caulobacterales, Rhodospirillales, Rickettsiales); columns are the Bioreactor, Wastewater, Marine, Compost, Microbial Mat, Hydrothermal Vent, Activated Sludge, Freshwater Sediment, Whale fall and Groundwater metagenomes.]
Figure 3: Similarity of significant hits in 10 metagenomes
The best bit score within each metagenome is assigned to the listed taxa. Color formatting
indicates high and low values. Negative results are in green. Positive results are in yellow.
Red indicates the highest values in the chart.
Figure 4: Overall relative abundance of Alphaproteobacteria based on CSP
distribution in 10 metagenomes
Note: The dots indicate the average bit score obtained from the BLAST searches and
show the average extent of similarity between metagenomic reads and CSPs.
Figure 5: The relative abundance of Alphaproteobacteria and its different subclades in the studied metagenomes based upon BLASTp searches with CSPs
Note: The colored bars indicate the numbers of significant hits that were detected in each
metagenome with CSPs that are specific for different groups of Alphaproteobacteria.
Figure 6: Comparative results of Alphaproteobacteria distribution in 4
metagenomes derived from (A) CSPs-based binning and (B) similarity-based
binning.
Note: The lower pie charts were obtained from the MG-RAST and IMG/M databases. The color
scheme denoting the different Alphaproteobacteria subgroups is shown below.
Chapter 4 Discussion
4.1 Metagenome selection
The 10 metagenomes selected in this work from the NCBI metagenomic database
represent different environmental habitats around the world and cover all 3 metagenomic
ecosystem categories: 4 from engineered ecosystems (bioreactor, compost, wastewater and
activated sludge metagenomes), 5 from environmental ecosystems (groundwater,
freshwater sediment, microbial mat, marine and hydrothermal vent metagenomes) and 1
from a host-associated ecosystem (whale fall metagenome). The habitats of
Alphaproteobacterial communities are highly divergent, including saline water,
sediment, marine and fossil habitats, green-waste compost and wastewater treatment plants.
Public taxonomic classifications of these metagenomes, from either the MG-RAST or the JGI
platform, identified numerous Alphaproteobacteria-associated sequences, further
suggesting that Alphaproteobacteria can adapt to the diverse environments described above.
Notably, the metagenomic projects differ from each other in total
length, number of contigs and average contig length (Table 11). These differences have a
considerable influence on downstream bioinformatics analysis. For instance, the total
length of whole genome shotgun (WGS) sequences in this study ranges from 24.9 Mbases
to 421.6 Mbases, whereas the corresponding raw sequencing data reach tens or
hundreds of Gbases. Data are lost during bioinformatics steps such as quality control
and duplicate clustering. Likewise, an enormous number of metagenomic sequences are
discarded because the ever-increasing size of metagenomic projects has surpassed
the volume of any existing public database, so that these sequences cannot be matched to
any reference sequence (Thomas et al., 2012). The number of contigs
assembled in these metagenomes ranges from 26573 to 748672. Sequence coverage is a
key factor in producing assembled contigs. However, the mixture of genomes poses a
challenge to the assembly process and leads to a low yield of assembled contigs, because
metagenomic sequences are less redundant than single-genome sequences. Lastly, the
average length of contigs in each metagenomic project ranges from 425 bp to 2801 bp. The
longer a metagenomic sequence is, the higher the mapping accuracy (Wommack et al.,
2008). The depth of sequencing determines the length of assembled contigs, so it is
possible to assemble a single draft genome from metagenomic sequences only if sequencing
is deep enough to provide sufficient coverage for joining DNA fragments.
However, sequencing reveals only a small portion of the microbes in an
environment, because incomplete sequencing is a major and unavoidable limitation of most
metagenomic studies. As a consequence, the species that can be predicted from
metagenomic datasets are still very limited and are likely biased by the information
asymmetry between databases and metagenomes (Wooley et al., 2010).
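The three summary statistics compared above (total length, number of contigs and average contig length) can be derived directly from an assembly in FASTA format. The following is a minimal sketch and not the procedure used in this work; the function name `assembly_stats` and the example records are hypothetical.

```python
def assembly_stats(fasta_lines):
    """Return (total_length, n_contigs, mean_length) for FASTA-formatted lines."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous contig, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    total = sum(lengths)
    n_contigs = len(lengths)
    mean_length = total / n_contigs if n_contigs else 0.0
    return total, n_contigs, mean_length


# Hypothetical two-contig assembly, for illustration only.
example = [">contig_1", "ACGT", "AC", ">contig_2", "GG"]
print(assembly_stats(example))
```

In practice the lines would come from an open file handle, which can be passed to the function in place of the list.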
4.2 Identification of CSPs in metagenomic samples
An important advantage of using CSPs for metagenomic profiling is that the presence
of these protein markers can be detected more reliably than that of the corresponding gene
markers. When the gene markers of the corresponding CSPs are used in similar searches, the
number of significant hits obtained is much lower than that obtained using protein markers
(Table 2). This can be explained by both the redundancy of the genetic code and the variation
of gene sequences in metagenomes (Kembel et al., 2011). In view of this, CSPs may be
able to decipher the taxonomic origin of some unassigned metagenomic sequences beyond
what nucleotide markers can do. Unlike MetaPhlAn, which is similarity-based
binning software relying on unique clade-specific gene markers, the CSP-based methodology
emphasizes identifying and exploiting clade-specific protein markers at levels ranging
from phylum to genus. However, very few genus-specific CSPs have so far been identified
within Alphaproteobacteria, which leads to low resolution of taxonomic
profiling at lower taxonomic levels for metagenomic projects. More genus-specific CSPs
are likely to be identified as more reference genomes become publicly available.
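The effect of codon redundancy on marker detection can be illustrated with a toy example: two gene sequences that differ only by synonymous substitutions encode the same peptide, so a protein-level comparison still finds a perfect match where a nucleotide-level comparison does not. The sequences below are invented for illustration and are not taken from any CSP; the codon table is a small subset of the standard genetic code.

```python
# Subset of the standard genetic code, just enough for this toy example.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical genes that differ only at synonymous codon positions.
gene_1 = "ATGGCTCTGAAA"
gene_2 = "ATGGCCTTAAAG"

print(identity(gene_1, gene_2))                        # nucleotide identity (8 of 12 positions)
print(identity(translate(gene_1), translate(gene_2)))  # protein identity
```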
According to the heat-map distribution of the 264 Alphaproteobacterial CSPs, most of
the class-specific CSPs could be detected in all 10 metagenomic projects, whose habitats
abound with Alphaproteobacteria. Compared with the formidable task of
assembling each individual genome in a metagenomic sample, it is more feasible to build a
molecular marker database that contains all commonly shared genes within
Alphaproteobacteria. The results also indicate that CSPs at higher taxonomic levels, such
as the class and phylum levels, tend to be detected readily in metagenomic
datasets. Alphaproteobacteria are enriched in 3 metagenomic projects (bioreactor,
wastewater and whale fall metagenomes) (Figure 4). The detailed distribution of the 6 major
orders of Alphaproteobacteria indicates that Rhodobacterales is the most abundant
clade in 6 of the metagenomic projects studied (Figure 5). The relative abundance of
Alphaproteobacteria reported by MG-RAST and IMG/M also supports the CSP-based results for
4 metagenomes (Figure 6). The proportion of Alphaproteobacterial
clades in these 4 metagenomic projects is highly correlated with the proportion predicted by
Alphaproteobacteria-specific CSPs (Figure 6). This finding indicates a
potential application of CSPs: Alphaproteobacteria-specific CSPs are able to predict the
distribution pattern of Alphaproteobacterial clades in metagenomic samples.
4.3 Comparative analysis of Alphaproteobacteria in metagenomes
The heat maps in Figures 2 and 3 reflect the overall distribution pattern of all 264 CSPs
in the 10 metagenomes. The 11 CSPs unique to the class Alphaproteobacteria
are ubiquitously present in all 10 metagenomes. Their average bit scores are higher than
those of the order-specific CSPs, and the total number of significant hits identified for
them exceeds that of all other CSPs. Based on this finding, it is concluded that
class-specific CSPs are much easier to detect than order- or family-specific CSPs. This is
because all Alphaproteobacterial species present are assumed to contribute class-specific
CSPs to the metagenomic datasets. The predictive ability of Alphaproteobacteria
class-specific CSPs is therefore much stronger than that of order- or family-specific CSPs.
The best bit scores of Rhodobacterales-specific CSPs in 6 metagenomic samples are very high
compared with those of other clade-specific CSPs. This suggests that Rhodobacterales is the
dominant Alphaproteobacterial member in those metagenomes and that the high concentration of
Rhodobacterales increases the coverage during sequence assembly, thus producing more
complete genomic sequences of Rhodobacterales. Similar results are seen for
Sphingomonadales-specific and Rhizobiales-specific CSPs in the wastewater
metagenome. In summary, the occurrence rate of Alphaproteobacteria-specific CSPs is
influenced by the specificity of the CSPs as well as by the concentration of the
corresponding bacterial clade.
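The two quantities compared in the heat maps, the number of significant hits and the best bit score per CSP, can be tallied from BLAST tabular output. The sketch below assumes the default 12-column tabular format (`-outfmt 6`), in which the e-value is the 11th field and the bit score the 12th. The cutoff `max_evalue=1e-3`, the function name and the example rows are hypothetical and do not reproduce the threshold or data used in this study.

```python
from collections import defaultdict


def summarize_hits(tabular_lines, max_evalue=1e-3):
    """Map each query (CSP accession) to (number of significant hits, best bit score).

    Expects BLAST tabular lines with 12 tab-separated fields, where
    field 1 is the query id, field 11 the e-value and field 12 the bit score.
    """
    counts = defaultdict(int)
    best = defaultdict(float)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= max_evalue:  # hypothetical significance cutoff
            counts[query] += 1
            best[query] = max(best[query], bitscore)
    return {query: (counts[query], best[query]) for query in counts}


# Invented example rows: two significant hits and one non-significant hit.
rows = [
    "NP_420905\tread_1\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-20\t128.0",
    "NP_420905\tread_2\t55.0\t90\t40\t0\t1\t90\t1\t90\t1e-10\t94.0",
    "NP_420905\tread_3\t30.0\t40\t28\t0\t1\t40\t1\t40\t5.0\t20.0",
]
print(summarize_hits(rows))
```

Running the function once per metagenome yields one column of each heat map; queries with no significant hit are absent from the returned dictionary and correspond to the zero cells.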
A comprehensive profiling of Alphaproteobacteria was performed based on the CSP
distribution in the metagenomic projects. Alphaproteobacteria dominate 3 metagenomes. This
suggests that in these three metagenomes, different clades of Alphaproteobacteria may
each carry out particular functions that maintain the balance and development of
the habitat. Alphaproteobacteria are less abundant in the other 7 metagenomes, which
either are occupied mainly by 1 order or contain several orders at low concentration
(Figure 5). The microbial mat, hydrothermal vent and groundwater metagenomes are three
typical habitats that are mainly composed of Rhodobacterales only. These habitats are
characterized by extreme environmental conditions such as hypersalinity, high temperature
and exposure to radiation. The detection of Rhodobacterales-specific CSPs suggests that
members of this order have a strong ability to adapt to harsh environments. In the
remaining 4 metagenomes (the marine, compost, activated sludge and freshwater
sediment metagenomes), the overall concentration of Alphaproteobacteria is
not very high, but the diversity of Alphaproteobacteria is higher than in
the three metagenomes discussed above. Several Alphaproteobacterial clades are present
at low concentration in these metagenomes, which suggests that Alphaproteobacteria
carry out auxiliary functions that help maintain the equilibrium of the habitat. In
brief, different environments with distinct growth conditions are preferred by
different Alphaproteobacterial species. This link between organism and environment
may help predict the presence of similar lineages before fieldwork and laboratory
experiments are carried out.
An important goal of this study was to validate the CSP methodology for organism
identification and abundance prediction. The relative abundances obtained by
CSP-based binning and by traditional similarity-based binning on public metagenomic
servers are highly correlated. The distribution of Alphaproteobacteria and its sub-clades
in the bioreactor metagenome matches the proportions reported on the IMG/M server
(Figure 6). In the wastewater metagenome, 3 groups of Alphaproteobacteria (Rhizobiales,
Rhodobacterales and Sphingomonadales) are found to be the major members both by
CSP searches and by similarity searches on the MG-RAST server. Comparisons between
CSP-based binning and similarity-based binning on MG-RAST for the whale fall
and marine metagenomes show similar agreement. CSP-based binning
is therefore a reliable way to predict the relative abundance of Alphaproteobacterial
clades in metagenomic samples. Since the CSP database is smaller and more specific than the
NCBI non-redundant database, taxonomic clustering of environmental datasets can be
achieved faster and more accurately.
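One way to quantify the agreement between the two binning approaches is to convert per-clade hit counts into relative abundances and compute the Pearson correlation between the resulting proportions. The sketch below shows that calculation under stated assumptions: the clade counts are invented placeholders, not values from Figure 6, and the function names are hypothetical.

```python
from math import sqrt


def relative_abundance(hit_counts):
    """Convert a {clade: hit count} mapping into {clade: proportion of all hits}."""
    total = sum(hit_counts.values())
    if total == 0:
        return {clade: 0.0 for clade in hit_counts}
    return {clade: n / total for clade, n in hit_counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented counts for one metagenome (placeholders, not data from this study).
csp_based = {"Rhizobiales": 30, "Rhodobacterales": 50, "Sphingomonadales": 20}
similarity_based = {"Rhizobiales": 280, "Rhodobacterales": 520, "Sphingomonadales": 200}

clades = sorted(csp_based)
p_csp = relative_abundance(csp_based)
p_sim = relative_abundance(similarity_based)
r = pearson([p_csp[c] for c in clades], [p_sim[c] for c in clades])
print(round(r, 3))
```

With only a handful of clades per metagenome, such a coefficient is a coarse summary and is best read alongside the proportions themselves.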
4.4 Overall conclusions
In previous centuries, the study of microbiology was mainly restricted to single
species in laboratory culture (Madigan et al., 2008). Since the vast majority of microbes
cannot be grown in the laboratory, research on microbial community interactions
beyond laboratory culture has lagged behind (Hugenholtz et al., 1998). Nevertheless, under
environmental conditions, microbial activities such as photosynthesis, organic degradation
and nitrogen fixation are carried out by complex microbial communities that have
evolved over millions of years to adapt to different habitats and ecosystems (Davey and
O’Toole, 2000). In order to understand the complex interactions within a microbial
community, it is necessary to explore species diversity as well as relative abundance
in the environment (Kuramitsu et al., 2007). In this study, 264 CSPs were used to
investigate Alphaproteobacterial diversity in 10 metagenomic projects. The results
indicate that most CSPs could be detected in different metagenomic projects. By
analyzing and comparing the distributions of bit scores and numbers of significant hits, a
comprehensive profile of Alphaproteobacterial diversity in the metagenomic
datasets was obtained. In essence, CSP-based binning is a refinement of traditional
similarity-based binning that improves its efficiency and effectiveness:
computational expense is reduced while mapping accuracy increases.
Although CSPs cannot robustly resolve issues such as bacterial quantification or
species- and strain-level diagnosis, they shed light on the profiling of bacterial clades
above the species level and provide a new way to predict the relative abundance of
microbial clades in different metagenomes using clade-specific protein markers.
4.5 Future directions
Beyond the work completed here, further analyses could extend the results above.
CSPs specific to other bacterial phyla such as Actinobacteria, Cyanobacteria and
Bacteroidetes have already been identified in previous studies. With these molecular
markers, it is possible to predict the presence and relative abundance of the corresponding
bacteria in more metagenomic projects. By constructing a database that contains all CSPs
unique to every taxon from the genus level to the phylum level in the domain Bacteria, a
comprehensive framework for metagenomic taxonomic classification can be created to
profile the presence and relative abundance of all microorganisms in metagenomic
datasets.
References
Abraham, W.-R., Macedo, A.J., Lünsdorf, H., Fischer, R., Pawelczyk, S., Smit, J., and
Vancanneyt, M. (2008). Phylogeny by a polyphasic approach of the order Caulobacterales,
proposal of Caulobacter mirabilis sp. nov., Phenylobacterium haematophilum sp. nov.
and Phenylobacterium conjunctum sp. nov., and emendation of the genus
Phenylobacterium. Int. J. Syst. Evol. Microbiol. 58, 1939–1949.
Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K.L., Tyson, G.W., and Nielsen,
P.H. (2013). Genome sequences of rare, uncultured bacteria obtained by differential
coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538.
Allgaier, M., Reddy, A., Park, J.I., Ivanova, N., D’haeseleer, P., Lowry, S., Sapra, R.,
Hazen, T.C., Simmons, B.A., VanderGheynst, J.S., et al. (2010). Targeted discovery of
glycoside hydrolases from a switchgrass-adapted compost community. PLoS One 5,
e8812.
Alsmark, C.M., Frank, A.C., Karlberg, E.O., Legault, B.A., Ardell, D.H., Canback, B.,
Eriksson, A.S., Naslund, A.K., Handley, S.A., Huvet, M., et al. (2004). The louse-borne
human pathogen Bartonella quintana is a genomic derivative of the zoonotic agent
Bartonella henselae. Proc. Natl. Acad. Sci. U. S. A. 101, 9716–9721.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local
alignment search tool. J. Mol. Biol. 215, 403–410.
Andersson, S.G., and Kempf, V.A. (2004). Host cell modulation by human, animal and
plant pathogens. Int. J. Med. Microbiol. 293, 463–470.
Arisue, N., Hasegawa, M., and Hashimoto, T. (2005). Root of the Eukaryota tree as
inferred from combined maximum likelihood analyses of multiple molecular sequence
data. Mol. Biol. Evol. 22, 409–420.
Arraga-Alvarado, C., Palmar, M., Parra, O., and Salas, P. (2003). Ehrlichia platys
(Anaplasma platys) in dogs from Maracaibo, Venezuela: an ultrastructural study of
experimental and natural infections. Vet. Pathol. 40, 149–156.
Bäckhed, F., Fraser, C.M., Ringel, Y., Sanders, M.E., Sartor, R.B., Sherman, P.M.,
Versalovic, J., Young, V., and Finlay, B.B. (2012). Defining a healthy human gut
microbiome: current concepts, future directions, and clinical applications. Cell Host
Microbe 12, 611–622.
Beiko, R.G., and Ragan, M.A. (2008). Detecting lateral genetic transfer: a phylogenetic
approach. Methods Mol. Biol. 452, 457–469.
Bhandari, V., Naushad, H.S., and Gupta, R.S. (2012). Protein based molecular markers
provide reliable means to understand prokaryotic phylogeny and support Darwinian mode
of evolution. Front. Cell. Infect. Microbiol. 2, 98.
Binnewies, T.T., Motro, Y., Hallin, P.F., Lund, O., Dunn, D., La, T., Hampson, D.J.,
Bellgard, M., Wassenaar, T.M., and Ussery, D.W. (2006). Ten years of bacterial genome
sequencing: comparative-genomics-based discoveries. Funct. Integr. Genomics 6, 165–185.
Boersma, F.G.H., Warmink, J.A., Andreote, F.A., and van Elsas, J.D. (2009). Selection of
Sphingomonadaceae at the base of Laccaria proxima and Russula exalbicans fruiting
bodies. Appl. Environ. Microbiol. 75, 1979–1989.
Bowman, D.D. (2011). Introduction to the alpha-proteobacteria: Wolbachia and
Bartonella, Rickettsia, Brucella, Ehrlichia, and Anaplasma. Top. Companion Anim. Med.
26, 173–177.
Brady, A., and Salzberg, S.L. (2009). Phymm and PhymmBL: metagenomic phylogenetic
classification with interpolated Markov models. Nat. Methods 6, 673–676.
Brazelton, W.J., and Baross, J.A. (2009). Abundant transposases encoded by the
metagenome of a hydrothermal chimney biofilm. ISME J. 3, 1420–1424.
Breitschwerdt, E.B., and Kordick, D.L. (2000). Bartonella Infection in Animals:
Carriership, Reservoir Potential, Pathogenicity, and Zoonotic Potential for Human
Infection. Clin. Microbiol. Rev. 13, 428–438.
Brennerova, M.V., Josefiova, J., Brenner, V., Pieper, D.H., and Junca, H. (2009).
Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial
communities from soil highly contaminated with jet fuel under air-sparging
bioremediation. Environ. Microbiol. 11, 2216–2227.
Campagne, S., Damberger, F.F., Kaczmarczyk, A., Francez-Charlot, A., Allain, F.H.-T.,
and Vorholt, J.A. (2012). Structural basis for sigma factor mimicry in the general stress
response of Alphaproteobacteria. Proc. Natl. Acad. Sci. U. S. A. 109, E1405–14.
Carvalho, F.M., Souza, R.C., Barcellos, F.G., Hungria, M., and Vasconcelos, A.T.R.
(2010). Genomic and evolutionary comparisons of diazotrophic and pathogenic bacteria
of the order Rhizobiales. BMC Microbiol. 10, 37.
Le Chatelier, E., Nielsen, T., Qin, J., Prifti, E., Hildebrand, F., Falony, G., Almeida, M.,
Arumugam, M., Batto, J.-M., Kennedy, S., et al. (2013). Richness of human gut
microbiome correlates with metabolic markers. Nature 500, 541–546.
Chistoserdova, L. (2013). Is metagenomics resolving identification of functions in
microbial communities? Microb. Biotechnol.
Choudhary, M., and Kaplan, S. (2000). DNA sequence analysis of the photosynthesis
region of Rhodobacter sphaeroides 2.4.1. Nucleic Acids Res. 28, 862–867.
Coletta, A., Pinney, J.W., Solís, D.Y.W., Marsh, J., Pettifer, S.R., and Attwood, T.K.
(2010). Low-complexity regions within protein sequences have position-dependent roles.
BMC Syst. Biol. 4, 43.
Dang, H., Li, T., Chen, M., and Huang, G. (2008). Cross-ocean distribution of
Rhodobacterales bacteria as primary surface colonizers in temperate coastal marine
waters. Appl. Environ. Microbiol. 74, 52–60.
Davey, M.E., and O’Toole, G.A. (2000). Microbial biofilms: from ecology to molecular
genetics. Microbiol. Mol. Biol. Rev. 64, 847–867.
DeLong, E.F., Preston, C.M., Mincer, T., Rich, V., Hallam, S.J., Frigaard, N.-U.U.,
Martinez, A., Sullivan, M.B., Edwards, R., Brito, B.R., et al. (2006). Community
genomics among stratified microbial assemblages in the ocean’s interior. Science 311,
496–503.
Doolittle, W.F., and Bapteste, E. (2007). Pattern pluralism and the Tree of Life hypothesis.
Proc. Natl. Acad. Sci. U. S. A. 104, 2043–2049.
Dröge, J., and McHardy, A.C. (2012). Taxonomic binning of metagenome samples
generated by next-generation sequencing technologies. Brief. Bioinform. 13, 646–655.
Dumler, J.S., Barbet, A.F., Bekker, C.P., Dasch, G.A., Palmer, G.H., Ray, S.C., Rikihisa,
Y., and Rurangirwa, F.R. (2001). Reorganization of genera in the families Rickettsiaceae
and Anaplasmataceae in the order Rickettsiales: unification of some species of Ehrlichia
with Anaplasma, Cowdria with Ehrlichia and Ehrlichia with Neorickettsia, descriptions of
six new species combi. Int. J. Syst. Evol. Microbiol. 51, 2145–2165.
English, C.K. (1988). Cat-Scratch Disease. JAMA 259, 1347.
Ferrari, B.C., Binnerup, S.J., and Gillings, M. (2005). Microcolony cultivation on a soil
substrate membrane system selects for previously uncultured soil bacteria. Appl. Environ.
Microbiol. 71, 8714–8720.
Fischer, H.M. (1996). Environmental regulation of rhizobial symbiotic nitrogen fixation
genes. Trends Microbiol. 4, 317–320.
Fredricks, D.N. (2006). Introduction to the Rickettsiales and other intracellular
prokaryotes. In The Prokaryotes: A Handbook on the Biology of Bacteria, M. Dworkin, S.
Falkow, E. Rosenberg, K.H. Schleifer, and E. Stackebrandt, eds. (New York: Springer),
pp. 457–466.
Gao, B., and Gupta, R.S. (2012). Microbial systematics in the post-genomics era. Antonie
Van Leeuwenhoek 101, 45–54.
Gao, B., Parmanathan, R., and Gupta, R.S. (2006). Signature proteins that are distinctive
characteristics of Actinobacteria and their subgroups. Antonie Van Leeuwenhoek 90, 69–91.
Ghai, R., Mizuno, C.M., Picazo, A., Camacho, A., and Rodriguez-Valera, F. (2013).
Metagenomics uncovers a new group of low GC and ultra-small marine Actinobacteria.
Sci. Rep. 3, 2471.
Ghazanfar, S., Azim, A., Ghazanfar, M.A.M.A., Iqbal, M., and Anjum, I.B. (2010).
Metagenomics and its application in soil microbial community studies: biotechnological
prospects. J. Anim. … 6, 611–622.
Gilbert, J.A., and Dupont, C.L. (2011). Microbial Metagenomics: Beyond the Genome.
Ann. Rev. Mar. Sci. 3, 347–371.
Gomez-Alvarez, V., Revetta, R.P., and Santo Domingo, J.W. (2012). Metagenome
analyses of corroded concrete wastewater pipe biofilms reveal a complex microbial
system. BMC Microbiol. 12, 122.
Gray, M.W. (2012). Mitochondrial evolution. Cold Spring Harb. Perspect. Biol. 4,
a011403.
Gullo, M., and Giudici, P. (2008). Acetic acid bacteria in traditional balsamic vinegar:
phenotypic traits relevant for starter cultures selection. Int. J. Food Microbiol. 125, 46–53.
Gupta, R.S. (2000). The phylogeny of proteobacteria: relationships to other eubacterial
phyla and eukaryotes. FEMS Microbiol. Rev. 24, 367–402.
Gupta, R.S. (2005a). Critical issues in prokaryotic phylogeny and taxonomy. ASM News
71, 393–394.
Gupta, R.S. (2005b). Protein signatures distinctive of alpha proteobacteria and its
subgroups and a model for alpha-proteobacterial evolution. Crit. Rev. Microbiol. 31, 101–
135.
Gupta, R.S., and Griffiths, E. (2002). Critical issues in bacterial phylogeny.
Theor. Popul. Biol. 61, 423–434.
Gupta, R.S., and Lorenzini, E. (2007). Phylogeny and molecular signatures (conserved
proteins and indels) that are specific for the Bacteroidetes and Chlorobi species. BMC
Evol. Biol. 7, 71.
Gupta, R.S., and Mok, A. (2007a). Phylogenomics and signature proteins for the alpha
proteobacteria and its main groups. BMC Microbiol. 7, 106.
Gupta, R.S., and Mok, A. (2007b). Phylogenomics and signature proteins for the alpha
Proteobacteria and its main groups. BMC Microbiol. 7, 106.
Hallez, R., Bellefontaine, A.-F., Letesson, J.-J., and De Bolle, X. (2004). Morphological
and functional asymmetry in alpha-proteobacteria. Trends Microbiol. 12, 361–365.
Handelsman, J. (2004). Metagenomics: application of genomics to uncultured
microorganisms. Microbiol. Mol. Biol. Rev. 68, 669–685.
Harris, J.K., Caporaso, J.G., and Walker, J.J. (2012). Phylogenetic stratigraphy in the
Guerrero Negro hypersaline microbial mat. ISME … 1–11.
Hess, M., Sczyrba, A., Egan, R., Kim, T.-W., Chokhawala, H., Schroth, G., Luo, S., Clark,
D.S., Chen, F., Zhang, T., et al. (2011). Metagenomic discovery of biomass-degrading
genes and genomes from cow rumen. Science 331, 463–467.
Holley, H.P. (1991). Successful Treatment of Cat-scratch Disease With Ciprofloxacin.
JAMA 265, 1563.
Huang, W.E., Zhou, J., Scholz, M.B., Lo, C.-C., and Chain, P.S. (2012). Next generation
sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis.
Curr. Opin. Biotechnol. 23, 9–15.
Hugenholtz, P., Goebel, B.M., and Pace, N.R. (1998). Impact of Culture-Independent
Studies on the Emerging Phylogenetic View of Bacterial Diversity. J. Bacteriol. 180,
4765–4774.
Huson, D.H., and Xie, C. (2013). A poor man’s BLASTX--high-throughput metagenomic
protein database search using PAUDA. Bioinformatics.
Kainth, P., and Gupta, R.S. (2005). Signature proteins that are distinctive of alpha
proteobacteria. BMC Genomics 6, 94.
Kalyuzhnaya, M.G., Lapidus, A., Ivanova, N., Copeland, A.C., McHardy, A.C., Szeto, E.,
Salamov, A., Grigoriev, I.V., Suciu, D., Levine, S.R., et al. (2008). High-resolution
metagenomics targets specific functional types in complex microbial communities. Nat.
Biotechnol. 26, 1029–1034.
Kang, I., Oh, H.-M., Vergin, K.L., Giovannoni, S.J., and Cho, J.-C. (2010). Genome
sequence of the marine alphaproteobacterium HTCC2150, assigned to the Roseobacter
clade. J. Bacteriol. 192, 6315–6316.
Kapley, A., De Baere, T., and Purohit, H.J. (2007). Eubacterial diversity of activated
biomass from a common effluent treatment plant. Res. Microbiol. 158, 494–500.
Kembel, S.W., Eisen, J.A., Pollard, K.S., and Green, J.L. (2011). The Phylogenetic
Diversity of Metagenomes. PLoS One 6, 9.
Kersters, K., Devos, P., Gillis, M., Swings, J., Vandamme, P., and Stackebrandt, E.
(2006). Introduction to the Proteobacteria. In The Prokaryotes: A Handbook on the
Biology of Bacteria, M. Dworkin, S. Falkow, E. Rosenberg, K.H. Schleifer, and E.
Stackebrandt, eds. (New York: Springer), pp. 3–37.
Kinross, J.M., Darzi, A.W., and Nicholson, J.K. (2011). Gut microbiome-host interactions
in health and disease. Genome Med. 3, 14.
Kisand, V., Valente, A., Lahm, A., Tanet, G., and Lettieri, T. (2012). Phylogenetic and
functional metagenomic profiling for assessing microbial biodiversity in environmental
monitoring. PLoS One 7, e43630.
Kunisawa, T. (2007). Gene arrangements characteristic of the phylum Actinobacteria.
Antonie Van Leeuwenhoek 92, 359–365.
Kuramitsu, H.K., He, X., Lux, R., Anderson, M.H., and Shi, W. (2007). Interspecies
interactions within oral microbial communities. Microbiol. Mol. Biol. Rev. 71,
653–670.
Leimena, M.M., Ramiro-Garcia, J., Davids, M., van den Bogert, B., Smidt, H., Smid, E.J.,
Boekhorst, J., Zoetendal, E.G., Schaap, P.J., and Kleerebezem, M. (2013). A
comprehensive metatranscriptome analysis pipeline and its validation using human small
intestine microbiota datasets. BMC Genomics 14, 530.
Van der Lelie, D., Taghavi, S., McCorkle, S.M., Li, L.-L., Malfatti, S.A., Monteleone,
D., Donohoe, B.S., Ding, S.-Y., Adney, W.S., Himmel, M.E., et al. (2012). The
metagenome of an anaerobic microbial community decomposing poplar wood chips.
PLoS One 7, e36740.
Lepage, P., Leclerc, M.C., Joossens, M., Mondot, S., Blottière, H.M., Raes, J., Ehrlich, D.,
and Doré, J. (2013). A metagenomic insight into our gut’s microbiome. Gut 62, 146–158.
Leung, H.C.M., Yiu, S.M., Yang, B., Peng, Y., Wang, Y., Liu, Z., Chen, J., Qin, J., Li, R.,
and Chin, F.Y.L. (2011). A robust and accurate binning algorithm for metagenomic
sequences with arbitrary species abundance ratio. Bioinformatics 27, 1489–1495.
Li, W., Fu, L., Niu, B., Wu, S., and Wooley, J. (2012). Ultrafast clustering algorithms for
metagenomic sequence analysis. Brief. Bioinform. 13, 656–668.
Lindner, M.S., Kollock, M., Zickmann, F., and Renard, B.Y. (2013). Analyzing genome
coverage profiles with applications to quality control in metagenomics. Bioinformatics 29,
1260–1267.
Lu, H.-P., Wang, Y., Huang, S.-W., Lin, C.-Y., Wu, M., Hsieh, C., and Yu, H.-T. (2012).
Metagenomic analysis reveals a functional signature for biomass degradation by cecal
microbiota in the leaf-eating flying squirrel (Petaurista alborufus lena). BMC Genomics
13, 466.
Ludwig, W., Strunk, O., Klugbauer, S., Klugbauer, N., Weizenegger, M., Neumaier, J.,
Bachleitner, M., and Schleifer, K.H. (1998). Bacterial phylogeny based on comparative
sequence analysis. Electrophoresis 19, 554–568.
Lussier, F.-X., Chambenoit, O., Côté, A., Hupé, J.-F., Denis, F., Juteau, P., Beaudet, R.,
and Shareck, F. (2011). Construction and functional screening of a metagenomic library
using a T7 RNA polymerase-based expression cosmid vector. J. Ind. Microbiol.
Biotechnol. 38, 1321–1328.
Mackelprang, R., Waldrop, M.P., DeAngelis, K.M., David, M.M., Chavarria, K.L.,
Blazewicz, S.J., Rubin, E.M., and Jansson, J.K. (2011). Metagenomic analysis of a
permafrost microbial community reveals a rapid response to thaw. Nature 480, 368–371.
Madigan, M.T., Martinko, J.M., Dunlap, P.V., and Clark, D.P. (2008). Brock Biology of
Microorganisms (12th Edition) (Benjamin Cummings).
Markowitz, V.M., Chen, I.-M.A., Chu, K., Szeto, E., Palaniappan, K., Grechkin, Y.,
Ratner, A., Jacob, B., Pati, A., Huntemann, M., et al. (2012). IMG/M: the integrated
metagenome data management and comparative analysis system. Nucleic Acids Res. 40,
D123–D129.
Matsuda, H., Nishi, N., Tsuji, K., Tanaka, K., Kakuno, T., Yamashita, J., and Horio, T.
(1984). Reconstruction of photosynthetic, cyclic electron transport system from
photoreaction unit, ubiquinone-10 protein, cytochrome c2 and polar lipids purified from
Rhodospirillum rubrum. J. Biochem. 95, 431–442.
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E.M., Kubal, M., Paczian, T.,
Rodriguez, A., Stevens, R., Wilke, A., et al. (2008). The metagenomics RAST server - a
public resource for the automatic phylogenetic and functional analysis of metagenomes.
BMC Bioinformatics 9, 386.
Mielczarek, A.T., Saunders, A.M., Larsen, P., Albertsen, M., Stevenson, M., Nielsen, J.L.,
and Nielsen, P.H. (2013). The Microbial Database for Danish wastewater treatment plants
with nutrient removal (MiDas-DK) - a tool for understanding activated sludge population
dynamics and community stability. Water Sci. Technol. 67, 2519–2526.
Mitra, S., Rupek, P., Richter, D.C., Urich, T., Gilbert, J.A., Meyer, F., Wilke, A., and
Huson, D.H. (2011). Functional analysis of metagenomes and metatranscriptomes using
SEED and KEGG. BMC Bioinformatics 12 Suppl 1, S21.
Mohammed, M.H., Ghosh, T.S., Singh, N.K., and Mande, S.S. (2011). SPHINX--an
algorithm for taxonomic binning of metagenomic sequences. Bioinformatics 27, 22–30.
Moine, H., Squires, C.L., Ehresmann, B., and Ehresmann, C. (2000). In vivo selection of
functional ribosomes with variations in the rRNA-binding site of Escherichia coli
ribosomal protein S8: evolutionary implications. Proc. Natl. Acad. Sci. U.S.A. 97, 605–610.
Moloney, R.D., Desbonnet, L., Clarke, G., Dinan, T.G., and Cryan, J.F. (2013). The
microbiome: stress, health and disease. Mamm. Genome.
Morgan, J.L., Darling, A.E., and Eisen, J.A. (2010). Metagenomic sequencing of an in
vitro-simulated microbial community. PLoS One 5, e10209.
National Research Council (US) Committee on Metagenomics: Challenges and
Functional Applications (2007). The New Science of Metagenomics: Revealing the
Secrets of Our Microbial Planet (The National Academies Press).
Nguimbi, E., Li, Y.Z., Gao, B.L., Li, Z.F., Wang, B., Wu, Z.H., Yan, B.X., Qu, Y.B., and
Gao, P.J. (2003). 16S-23S ribosomal DNA intergenic spacer regions in cellulolytic
myxobacteria and differentiation of closely related strains. Syst. Appl. Microbiol. 26, 262–
268.
Nielsen, P.H., Saunders, A.M., Hansen, A.A., Larsen, P., and Nielsen, J.L. (2012).
Microbial communities involved in enhanced biological phosphorus removal from
wastewater--a model system in environmental biotechnology. Curr. Opin. Biotechnol. 23,
452–459.
Nijkamp, J.F., Pop, M., Reinders, M.J.T., and de Ridder, D. (2013). Exploring variation-aware
contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29,
2826–2834.
Oh, J.I., and Kaplan, S. (2001). Generalized approach to the regulation and integration of
gene expression. Mol. Microbiol. 39, 1116–1123.
Olson, J.B., Harmody, D.K., and McCarthy, P.J. (2002). Alpha-proteobacteria cultivated
from marine sponges display branching rod morphology. FEMS Microbiol. Lett. 211,
169–173.
Poindexter, J.S., and Staley, J.T. (1996). Caulobacter and Asticcacaulis stalk bands as
indicators of stalk age. J. Bacteriol. 178, 3939–3948.
Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons,
N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene catalogue
established by metagenomic sequencing. Nature 464, 59–65.
Raoult, D., Fournier, P.-E., Vandenesch, F., Mainardi, J.-L., Eykyn, S.J., Nash, J., James,
E., Benoit-Lemercier, C., and Marrie, T.J. (2003). Outcome and Treatment of Bartonella
Endocarditis. Arch. Intern. Med. 163, 226.
Rascovan, N., Carbonetto, B., Revale, S., Reinert, M.D., Alvarez, R., Godeas, A.M.,
Colombo, R., Aguilar, M., Novas, M., Iannone, L., et al. (2013). The PAMPA datasets: a
metagenomic survey of microbial communities in Argentinean pampean soils.
Microbiome 1, 21.
Rathsack, K., Reitner, J., Stackebrandt, E., and Tindall, B.J. (2011). Reclassification of
Aurantimonas altamirensis (Jurado et al. 2006), Aurantimonas ureilytica (Weon et al.
2007) and Aurantimonas frigidaquae (Kim et al. 2008) as members of a new genus,
Aureimonas gen. nov., as Aureimonas altamirensis gen. nov., comb. nov. Int. J. Syst.
Evol. Microbiol. 61, 2722–2728.
Ravi P More, S.M. (2013). Mining and assessment of catabolic pathways in the
metagenome of a common effluent treatment plant to induce the degradative capacity of
biomass. Bioresour. Technol.
Riemann, L., Leitet, C., Pommier, T., Simu, K., Holmfeldt, K., Larsson, U., and
Hagström, A. (2008). The native bacterioplankton community in the central Baltic Sea is
influenced by freshwater bacterial species. Appl. Environ. Microbiol. 74, 503–515.
Roller, M., Lucić, V., Nagy, I., Perica, T., and Vlahovicek, K. (2013). Environmental
shaping of codon usage and functional adaptation across microbial communities. Nucleic
Acids Res. 41, 8842–8852.
Rosen, G.L., Sokhansanj, B.A., Polikar, R., Bruns, M.A., Russell, J., Garbarine, E.,
Essinger, S., and Yok, N. (2009). Signal Processing for Metagenomics: Extracting
Information from the Soup. Curr. Genomics 10, 493–510.
Rout, M.E., and Callaway, R.M. (2012). Interactions between exotic invasive plants and
soil microbes in the rhizosphere suggest that “everything is not everywhere”. Ann. Bot.
110, 213–222.
Ruby, J.G., Bellare, P., and Derisi, J.L. (2013). PRICE: software for the targeted
assembly of components of (Meta)genomic sequence data. G3 (Bethesda) 3, 865–880.
Sahni, S.K., and Rydkina, E. (2009). Host-cell interactions with pathogenic Rickettsia
species. Future Microbiol. 4, 323–339.
Schloss, P.D., and Handelsman, J. (2005). Metagenomics for studying unculturable
microorganisms: cutting the Gordian knot. Genome Biol. 6, 229.
Scully, E.D., Geib, S.M., Hoover, K., Tien, M., Tringe, S.G., Barry, K.W., Glavina del
Rio, T., Chovatia, M., Herr, J.R., and Carlson, J.E. (2013). Metagenomic profiling reveals
lignocellulose degrading system in a microbial community associated with a wood-feeding
beetle. PLoS One 8, e73827.
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., and Huttenhower, C.
(2012). Metagenomic microbial community profiling using unique clade-specific marker
genes. Nat. Methods 9, 811–814.
Sharon, I., Birkland, A., Chang, K., El-Yaniv, R., and Yona, G. (2005). Correcting
BLAST e-Values for Low-Complexity Segments. J. Comput. Biol. 12, 980–1003.
Siepel, A., and Haussler, D. (2004). Combining phylogenetic and hidden Markov models
in biosequence analysis. J. Comput. Biol. 11, 413–428.
Solonenko, S.A., Ignacio-Espinoza, J.C., Alberti, A., Cruaud, C., Hallam, S.,
Konstantinidis, K., Tyson, G., Wincker, P., and Sullivan, M.B. (2013). Sequencing
platform and library preparation choices impact viral metagenomes. BMC Genomics 14,
320.
Sommer, M.O.A., Church, G.M., and Dantas, G. (2010). A functional metagenomic
approach for expanding the synthetic biology toolbox for biomass conversion. Mol. Syst.
Biol. 6, 360.
Sowell, S.M., Norbeck, A.D., Lipton, M.S., Nicora, C.D., Callister, S.J., Smith, R.D.,
Barofsky, D.F., and Giovannoni, S.J. (2008). Proteomic analysis of stationary phase in the
marine bacterium “Candidatus Pelagibacter ubique”. Appl. Environ. Microbiol. 74, 4091–
4100.
Steenhoudt, O., and Vanderleyden, J. (2000). Azospirillum, a free-living nitrogen-fixing
bacterium closely associated with grasses: genetic, biochemical and ecological aspects.
FEMS Microbiol. Rev. 24, 487–506.
Strous, M., Kraft, B., Bisdorf, R., and Tegetmeyer, H.E. (2012). The binning of
metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3,
410.
Takacs-Vesbach, C., Inskeep, W.P., Jay, Z.J., Herrgard, M.J., Rusch, D.B., Tringe, S.G.,
Kozubal, M.A., Hamamura, N., Macur, R.E., Fouke, B.W., et al. (2013). Metagenome
sequence analysis of filamentous microbial communities obtained from geochemically
distinct geothermal channels reveals specialization of three Aquificales lineages. Front.
Microbiol. 4, 84.
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M., and Glöckner, F.O. (2004).
TETRA: a web-service and a stand-alone program for the analysis and comparison of
tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5, 163.
Thomas, T., Gilbert, J., and Meyer, F. (2012). Metagenomics - a guide from sampling to
data analysis. Microb. Inform. Exp. 2, 3.
Travers, S.A.A., Clewley, J.P., Glynn, J.R., Fine, P.E.M., Crampin, A.C., Sibande, F.,
Mulawa, D., McInerney, J.O., and McCormack, G.P. (2004). Timing and reconstruction
of the most recent common ancestor of the subtype C clade of human immunodeficiency
virus type 1. J. Virol. 78, 10501–10506.
Tringe, S.G., von Mering, C., Kobayashi, A., Salamov, A.A., Chen, K., Chang, H.W.,
Podar, M., Short, J.M., Mathur, E.J., Detter, J.C., et al. (2005). Comparative
metagenomics of microbial communities. Science 308, 554–557.
Ursell, L.K., Metcalf, J.L., Parfrey, L.W., and Knight, R. (2012). Defining the human
microbiome. Nutr. Rev. 70 Suppl 1, S38–44.
Vogel, T.M., Simonet, P., Jansson, J.K., Hirsch, P.R., Tiedje, J.M., van Elsas, J.D., Bailey,
M.J., Nalin, R., and Philippot, L. (2009). TerraGenome: a consortium for the sequencing
of a soil metagenome. Nat. Rev. Microbiol. 7, 252.
Walker, D.H., Valbuena, G.A., and Olano, J.P. (2003). Pathogenic mechanisms of
diseases caused by Rickettsia. Ann. N. Y. Acad. Sci. 990, 1–11.
Williams, D., Fournier, G.P., Lapierre, P., Swithers, K.S., Green, A.G., Andam, C.P., and
Gogarten, J.P. (2011). A rooted net of life. Biol. Direct 6, 45.
Williams, K.P., Sobral, B.W., and Dickerman, A.W. (2007). A Robust Species Tree for
the Alphaproteobacteria. J. Bacteriol. 189, 4578–4586.
Wommack, K.E., Bhavsar, J., and Ravel, J. (2008). Metagenomics: Read Length Matters.
Appl. Environ. Microbiol. 74, 1453–1463.
Wooley, J.C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS
Comput. Biol. 6, e1000667.
Wrighton, K.C., Thomas, B.C., Sharon, I., Miller, C.S., Castelle, C.J., VerBerkmoes,
N.C., Wilkins, M.J., Hettich, R.L., Lipton, M.S., Williams, K.H., et al. (2012).
Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla.
Science 337, 1661–1665.
Wu, Y.-W., and Ye, Y. (2011). A novel abundance-based algorithm for binning
metagenomic sequences using l-tuples. J. Comput. Biol. 18, 523–534.
Xia, L.C., Cram, J.A., Chen, T., Fuhrman, J.A., and Sun, F. (2011). Accurate genome
relative abundance estimation based on shotgun metagenomic reads. PLoS One 6, e27992.
Yabuuchi, E., and Kosako, Y. (2005). Order IV. Sphingomonadales ord. nov. In Bergey’s
Manual of Systematic Bacteriology, D.J. Brenner, N.R. Krieg, and J.T. Staley, eds. (New
York: Springer), pp. 230–258.
Yergeau, E., Sanschagrin, S., Beaumier, D., and Greer, C.W. (2012). Metagenomic
analysis of the bioremediation of diesel-contaminated Canadian high arctic soils. PLoS
One 7, e30058.
Yildiz, F.H., Gest, H., and Bauer, C.E. (1991). Attenuated effect of oxygen on
photopigment synthesis in Rhodospirillum centenum. J. Bacteriol. 173, 5502–5506.
Yurkov, V.V., and Beatty, J.T. (1998). Aerobic anoxygenic phototrophic bacteria.
Microbiol. Mol. Biol. Rev. 62, 695–724.
Zhang, W., Wang, Y., Lee, O.O., Tian, R., Cao, H., Gao, Z., Li, Y., Yu, L., Xu, Y., and
Qian, P.-Y. (2013). Adaptation of intertidal biofilm communities is driven by metal ion
and oxidative stresses. Sci. Rep. 3, 3180.
Zomorodipour, A., and Andersson, S.G. (1999). Obligate intracellular parasites:
Rickettsia prowazekii and Chlamydia trachomatis. FEBS Lett. 452, 11–15.