A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea

Vol 462 | 24/31 December 2009 | doi:10.1038/nature08656
LETTERS
A phylogeny-driven genomic encyclopaedia of
Bacteria and Archaea
Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1, Rüdiger Pukall3, Eileen Dalin1, Natalia N. Ivanova1,
Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1,
Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1,
Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3,
Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3
& Jonathan A. Eisen1,2
Sequencing of bacterial and archaeal genomes has revolutionized our
understanding of the many roles played by microorganisms1. There
are now nearly 1,000 completed bacterial and archaeal genomes
available2, most of which were chosen for sequencing on the basis
of their physiology. As a result, the perspective provided by
the currently available genomes is limited by a highly biased phylogenetic distribution3–5. To explore the value added by choosing
microbial genomes for sequencing on the basis of their evolutionary
relationships, we have sequenced and analysed the genomes of 56
culturable species of Bacteria and Archaea selected to maximize
phylogenetic coverage. Analysis of these genomes demonstrated
pronounced benefits (compared to an equivalent set of genomes
randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of
new protein families and biological properties, and the prediction
of functions for known genes from other organisms. Our results
strongly support the need for systematic ‘phylogenomic’ efforts to
compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria
and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to
come.
Since the publication of the first complete bacterial genome,
sequencing of the microbial world has accelerated beyond expectations. The inventory of bacterial and archaeal isolates with complete or
draft sequences is approaching the two thousand mark2. Most of these
genome sequences are the product of studies in which one or a few
isolates were targeted because of an interest in a specific characteristic
of the organism. Although large-scale multi-isolate genome sequencing studies have been performed, they have tended to be focused on
particular habitats or on the relatives of specific organisms. This overall lack of broad phylogenetic considerations in the selection of microbial genomes for sequencing, combined with a cultivation bottleneck6,
has led to a strongly biased representation of recognized microbial
phylogenetic diversity3–5. Although some projects have attempted to
correct this (for example, see ref. 5), they have all been small in scope.
To evaluate the potential benefits of a more systematic effort, we
embarked on a pilot project to sequence approximately 100 genomes
selected solely for their phylogenetic novelty: the ‘Genomic
Encyclopedia of Bacteria and Archaea’ (GEBA).
Organisms were selected on the basis of their position in a phylogenetic tree of small subunit (SSU) ribosomal RNA, the best sampled
gene from across the tree of life7. Working from the root to the tips of
the tree, we identified the most divergent lineages that lacked representatives with sequenced genomes (completed or in progress)8 and
for which a species has been formally described9 and a type strain
designated and deposited in a publicly accessible culture collection10.
From hundreds of candidates, 200 type strains were selected both to
obtain broad coverage across Bacteria and Archaea and to perform
in-depth sampling of a single phylum. The Gram-positive bacterial
phylum Actinobacteria was chosen for the latter purpose because of
the availability of many phylogenetically and phenotypically diverse
cultured strains, and because it had the lowest percentage of sequenced
isolates of any phylum (1% versus an average of 2.3%)11. Of the 200
targeted isolates, 159 were designated as ‘high’ priority primarily on the
basis of phylum-level novelty and the ability to obtain microgram quantities of high quality DNA. The genomes of these 159 are being
sequenced, assembled, annotated (including recommended metadata12)
and finished, and relevant data are being released through a dedicated
Integrated Microbial Genomes database portal13 and deposited into
GenBank. Currently, data from 106 genomes (62 of which are finished)
are available.
To assess the ramifications of this tree-based selection of organisms,
we focused our analyses on the first 56 genomes for which the shotgun
phase of sequencing was completed. The 53 bacteria and 3 archaea
(Supplementary Table 1) represent both a broad sampling of bacterial
diversity and a deeper sampling of the phylum Actinobacteria (26
GEBA genomes). An initial question we addressed was whether selection on the basis of phylogenetic novelty of SSU rRNA genes reliably
identifies genomes that are phylogenetically novel on the basis of other
criteria. This question arises because it is known that single genes, even
SSU rRNA genes, do not perfectly predict genome-wide phylogenetic
patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of
completed bacterial genomes (Fig. 1) and then measured the relative
contribution of the GEBA project using the phylogenetic diversity
metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4
times more phylogenetic diversity than randomly sampled subsets of
53 non-GEBA bacterial genomes. A similar degree of improvement in
phylogenetic diversity was seen for the more intensively sampled actinobacteria (Table 1). These analyses indicate that although SSU rRNA
genes are not a perfect indicator of organismal evolution, their phylogenetic relationships are a sound predictor of phylogenetic novelty
within the universal gene core present in bacterial genomes.
1
DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and
Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, California 87545, USA. 5University of Virginia,
Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA.
1056
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS
Clostridium botulinum B str Eklund 17B
Clostridium perfringens str 13
555
Clostridium kluyveri DSM
Maree
linum A3 str Loch ni E88
Clostridium botu
Clostridium teta C 824
ATC
acetobutylicum
m novyi NT
Clostridium
Clostridiu ntans ISDg
ferme
MF
QY
s
um phyto
As
Clostridi s metalliredigen
dii OhIL 0
ilu
Alkaliph hilus oremlandifficile 63 ii
Alkalip ostridium us prevot
Cl
cocc C 29328
Anaero
03
TC
A
89
agna
DSMs MB4
ldia m
si
ticus
05
Finego ccharolygcongen C 274
ns
ATC xida
r ten
tor sa
sirup robacte cellum aceto s MI 1 I
o
ellulo
n
S
ae
lum
Caldic hermoan m thermmacu reducenicum 4C
T
pio P10 1
ridiu ulfoto
lum
0
M
Clost Des tomacu rmopro
tor Z 29 51
Y
the xvia ans
ulfo
Des culum is audaoform iense Ice13
a
hafn ldum 907 n
tom orud gen
Pelo Desulf hydro terium estica CC 3ttinge 3
6
s
T
d
c
e
atus thermu toba m mo tica Atr Go 148 LF
did
lfi
M
Can oxydo Desu cteriu oaceolfei s m IA WNvula
a
u NM par G2
rm
e
Carb
p w hil
liob
He rella thsubs rmop s JWnella tiae Pe 53
i
e
o
c ia IP
ilu
Mo wolfe m th oph Veillo gala ynov B CT 3 1
a s A 8L K
s
riu rm
ona bacte s the
ma a U 15 63 8
om
las sm nis is e 1 44
ph mbio robiu
op pla o id il 7 37
tro
yc co ulm rit ob e G 9
y
e
2
ia
h
S
M
na
My a p art a m on lium M1 R
Syn
tra
sm ma sm um a e um 70
Na
pla s pla ne nit nia ic 9 2
co opla co op ge o pt 00 F 1
My yc My a hy sma eum llise C 7 ns Hr PG L1 li
M
t
n a C
sm pla a p a g AT tra s um ma B
pla co m m tr ne SC or a W A
co My plas plas 3 s a pe es a fl sm AY 8
id m la a G
My
o o r
yc yc rova lasmyco las top sm ii P
M
M
a
se cop p m sop Phy opl dlaw
um My ubs Me tus hyt lai
rv
s
a p a
pa
id m sm
es
a
d
nd roo la
oi
sm
Ca s b olep
yc
pla
m
ea
he ch
a
Ur
m
itc A
as
w
s
w
lo
el
ry
te
As
yc
M
l
op
Fu
so
ba
ct
er
iu
m
nu
cle
5
1
37
3 30
P1
84 6
6
CM
S C
98
IE C
rC
P1
N sP
1 s st
M
1
a
7
u
CC
92 inu
os at 0
in ng C3 02
IT ar
str
ug elo RC C99 311 tr M p m L2A ris
r
ae s sp C 9 s bs T sto
is cu s p C us su NA pa
st oc cu s s p C rin s tr p 41
cy c c u s a inu s s bs 99
ro ho co cc us m ar u su
ic ec ho co cc us m rin s SM
M yn ec ho co occ cus ma arinu us D
S yn ec ho oc c us m hil
S yn ec lor oco cc us op
S yn ch lor co cc lan sei
S ro ch ro co xy oe m ns
P ro chlo ro ter r w ulu uce
P ro hlo ac cte arv ed
P roc rob iba p inir
s
m
P ub ex ium liotr ta urtu xidan 27
R on ob he len m c roo 08 2705
C top ia lla riu fer TW CC 1
09
A lack rthe cte m plei m N 41 M 71 129
S gge oba robiu hip ngu m K DS C 13
E rypt ic a w lo ikeiu cum NCT
C cidimerym erium m je alyti riae 14
A ph act teriu ure hthe YS 3 0216
Troifidobebac teriumm dip iens SRS3
B oryn bac teriu effic rans
C ryne bac rium tole
Co ryne bacte radio ernae
Co ryne ccus cav ena
Co eoco bergia flavig silytica
Kineuten onas ellulo
B llulom nas c ddieii
Ce nimo cter ke ans
is
Xyla guiba nitrific
rius
ganens
San esia de sedenta ium
michi
Jon ococcus rium faec nsis subsp
Kyt chybacte ichigane str CTCB07
Bra bacter m bsp xyli
33209
Clavi onia xyli su ila DC2201
ATCC
Leifs ria rhizoph lmoninarum
Kocu acterium sa 24
Renibobacter sp FB nes KPA171202
ac
Arthr
ium
ter
ac
Propionibflavida
Kribbella es sp JS614
Nocardioid acidiphila
Catenulispora coelicolor A3 2
Streptomyces
11B
Acidothermus cellulolyticus
Thermobispora bispora
Streptosporangium roseum
Thermomonospora curvata
Thermobifida fusca YX
Nocardiopsis dassonvillei
Frankia sp CcI3
Salinispora areni
cola CNS 205
Stackebrand
Geodermatoptia nassauensis
hilus obscur
Nakamure
us
Actinosyn lla multipartita
ne
ma mirum
Saccha
ro
Saccha monospora vir
ropolys
idis
Mycob
po
Myc acterium ra erythraea
NRRL
Myc obacteriu abscessu
2338
Rho obacteriu m gilvum s
No dococcu m leprae PYR GCK
Gordcardia fa s sp RH TN
Tsukaonia b rcinica A1
ro
IF
Lep mure nchia M 101
52
Le tospir lla pa lis
Bra ptospir a bifle urometa
Tre chysp a inte xa sero bola
Tre ponemira mu rrogan var Pa
s
p
to
rd
o
s
a
Bo
oc
erov c str
n
d
ain
ar C
Bo rreli ema entic hii
ope Patoc
R rreli a he pallid ola A
nha
Plahodo a garirmsii D um s TCC
gen 1 Ames
i str
M ncto pirell nii P AH ubsp 35405
Fioc
pall
Ak ethyla myc ula b Bi
ruz
id
um
L1
Op kerm cid es li altic
s
m
a
ip
tr
it
C
n
S
u a
Nic
h
hols
Chandidtus tnsia ilum ophil H 1
Ch lam at err muc infe us
a
u
r
la
in
C
y
n
s e
Ch hlammyddia t Pro PB9 iphila orum
C lo y o ra to 0
V
AT
C hlo ro do phil cho chla 1
CC 4
C hlo ro her ph a a ma m
BA
P hlo ro biu pe ila bo tis ydia
A8
Ch ros rob bac m p ton pne rtus A H am
35
lo the ium ulu ha tha umo S2 AR oeb
R
6
S h r c
m e la
13 op
C al od ob o p p ob ss nia 3
hil
aU
C an inib oth ium chlo hae arv act ium e TW
yt d a e c ris o um er
WE
A
b
oid T 1
op id ct rm h
25
ha atu er us loro vibr acte NC es CC 83
ga s A rub m ch iof ro IB BS 35
hu mo er arin rom orm ides 832 1 110
D
7
tc eb S us at is D
hi o M
DS SM
i
i
ns p 1
Ca M 2
on hil 38
D3 2 66
65
ii us 55
AT as
C iat
C ic
3 3 us
40 5a
6 2
Rh
izo
M
es
o
2
48 3
ns
I5 0
5
ta
VP C 8
en
on C
m
s
icr AT
er ale u
r f u rin sis m is 83 SS
86
te ing pa en tao on W W
02
ac a l he pin aio tas alis ri G
JIP
ob m ter a et dis giv lle a
ad oso ac hag s th es gin ue ce ilum
Dy pir ob op ide roid as ia mchra oph
1
S ed itin ro te on ulc o hr 3 ei19 OP1
P h te ac m S ga syc 80 P 3A
C c b ro s a p 0 m
Ba ara phy atu oph m KT utu p YO
P or did cyt eriu etii min s
0
74 9
P an no ct ors m ium
C ap oba a f biu nib F5 5 2 SM 1 5144
C lav ell ro oge us V 15 s D CC
T
F am ic dr lic SB ne
Gr lusimrihy aeo r sp oge us A 5
E ulfu ex pto ccin patic 669
S uif u su e i 2 1
251
Aqitratir ella cter hpylor C37 4018 SM 1
N olin ba ter NB i RM ns D
381 7
W elico bac sp tzler fica m
9
AA
H elico ovumr bu enitri yianu CC B lei 269
H lfur acte as d dele is AT doy 82 40
s
Su ob on
m
min bsp
tu
Arc lfurim pirillu r ho ni su sp fe
Su uros bacte r jeju s sub 3826
Sulfmpylo bacte r fetu cisus 1
Ca mpylo bacte r con ilus
345
Ca mpylo bacte cetiph um Ellin
Ca mpylo rio a acteri
76
Ca nitrovib teria b s Ellin60 HD100
s
De
bac
tatu
voru
o
si
Acid acter u bacterios DK 1622 5
Solib lovibrio s xanthu Fw109
Bdelococcu acter sp
Myx eromyxob aceum
hr
ce 56
Ana
ium oc
80
sum So
Haliang ium cellulo licus DSM 23
Sorangacter carbino ucens Rf4
Pelob ter uraniired cens PCA
Geobac ter sulfurredu
Geobac
lovleyi SZ
9
Geobacter propionicus DSM 237
Pelobacter
itrophicus SB
Syntrophus acidr fumaroxidans MPOB
Syntrophobacte
rans Hxd3
Desulfococcus oleovo
Desulfotalea psychrophila LSv54
Desulfohalobium retbaense
Desulfomicrobium baculatum
N
Ba itro Br
rt ba ad
Ba one cte yrh
rto lla r w izo
n q in b
biu
rh B ella uin og ium
Pa m
izo ru b ta ra s
Hy rv le
ph iba gu M biu cel acil na s dsk p O
om cu min es m la lifo tr yi R
on lum os orh lot me rm To Nb S2
as
7
a iz i li i u
ne lav rum ob MA ten s K lou 25 8
Ma ptu ame bv ium FF3 sis C58 se 5
Ro
ric niu ntiv vic sp 03 16 3
se Sil
au m o ia B 09 M
ob ic
ac iba Ca lis AT ran e NC 9
ter ct ulo m C s 38 1
D
de er p ba aris C 1 DS 41
Pa inor
No
ra os
J nitr om cte M 54 1
vos
Rh cocc eoba anna ifica ero r sp CS1 44
phin
gob Eryth odob us d cter schians O yi DS K3 0
ium rob acte enit shib s Ch S 1
a
Zym Sp arom cter r sphrifica ae p CC 114 3
li
S
a n D
om hing
a
ona opy ticivotorali eroids PD FL 1 1
s m xis
ran s H es 122 2
Sp obilis alask s DSTCC 2 4 2
Glu
con
Glu hingo subs ensis M 1225941
Gra aceto conob mona p mo RB2 444
b
nuli
bac acter acter s witti bilis Z256
ter b diaz oxy chii M4
ethe otro dan RW
Rho
ph s 6
1
sd
A
Mag dospirill cidiphili ensis Cicus P 21H
neto
u
GDN Al 5
sp m ru um c
Anap irillum brum A ryptum IH1
Anap lasma magneti TCC 1 JF 5
1
p
lasm
a mar hagocytocum AM170
B
ginale
p
Ehrlich
str hilum H 1
Eh
ia rum
inantiurlichia cani St Marie Z
Wolbac
s
m str
s
hia endo
Welgestr Jake
symbion
vond
Wol
t
Neorick strain TRS ofbachia pipien en
ettsia se
Brugia
nn
m tis
Rickettsia etsu str Miya alayi
yama
typhi str
Wilmington
Rickettsia
bellii RML36
Orie
9C
Candidatus Pelantia tsutsugamushi Bor
yong
gibacter ubiq
ue HTC
Magnetococcus C1062
Desulfovibrio desulfuricans
MC 1
subsp desulfuricanssp
str
G20
Desulfovibrio vulgaris subsp vulgaris
DP4
Lawsonia intracellularis PHE MN100
at
St
um
re
pt
s
ob
er De ubs
m th p
Le aci
an io n
ae su uc Se pto llus
Fe
ro lfo lea ba tri m
T
v v t l c o
rv
Th ido The herm ibrio ibrioum dell hia nili
erm ba rm o a p A a t bu fo
os cter oto tog cida eptTCCerm cc rmi
iph ium ga a le m id 2 it ali s
o
m t in ov 5 id s
Pe me nod arit ting ovo ora 586is
tro lan os im ae ra ns
Bu
n
e u
chn
T
M tog sie m a M TM s
e
Bu
De herm Meio eio a m nsis Rt1 SB O
chn ra ap
t
ino us th he ob B 7 B 8
Wig
co th erm rm ilis I4 1
Bu era a hidico
gles
cc er
chn
phid la
us us SJ 29
wort
u
m
s
s
r
e
9
o
ra a ico tr A
rad ph silv ube 5
hia
la
P
p
glo
iod ilus anu r
ssin Buc hidic str S S Ac
ura H s
ola
yr
g
idia hne
ns B8
end ra ap str B Schiz thosip
R1
Bau Cand
osy hid
p B aph ho
id
man
m
nia atus B Cand bion icola s aizon is gra n pis
cica
u
g
t
lo
dell chma idatus of Glotr Cc C ia pis minumm
inic
ola nnia peBloch ssina inara taciae
str H
m
b
n
c
c H nsylva annia revipaedri
oma
lpis
n
fl
S
lo icu orid
Acti
nob higella disca s str B anus
PEN
Haemacillus suflexneri coagula
ta
ophilu ccin 2a s
s du ogene tr 301
c
s1
re
Vibri
Aerom
o ch yi 35000 30Z
P
o
onas
salmonhotobacterVibrio fischlerae O3 HP
ium pr
icida
eri E 95
Shew subsp sa ofundu S114
Shewan anella sedi lmonicidam SS9
minis HA A449
ella frigid
Psychro imarina NCIMW EB3
monas
B 400
Co
Pseudoalte lwellia psychreingrahamii 37
ryt
romonas
haloplanktishraea 34H
TAC125
Idiomarina
ioih
iensis L2T
Pseudoalteromo
nas atlantica R
T6c
Kangiella koreensis
Marinomonas
MWYL1
Chromohalobacter salexigens sp
DSM 3043
Hahella chejuensis KCTC 2396
Marinobacter aquaeolei VT8
Cellvibrio japonicus Ueda107
Saccharophagus degradans 2 40
e pv phaseolicola 1448A4
Pseudomonas syringa
us 273
Psychrobacter arctic
r sp PRwf 1
Psychrobacte r sp ADP1
2
Acinetobacte
mensis SK 1
rku
bo
ax
RSA 33ns
Alcanivor
burnetii
Coxiella mophila str LeHA
lla pneu cius okutanii L 2
Legione
XC
so
a
yo
ogen larctica
sicom
ira crun
atus Ve
ho
Candid Thiomicrosp is subsp S1703A
ula1
larens
sus VC
sella tu cter nodo sa TemecK279a
Franci
idio
th
loba
ilia
Diche ylella fast maltoph s str BSaL1
X
nas psulatu phila E 1
o
m
ca a halo MLH 07
ho
cus
i
7
otrop
Sten thylococ odospir hrliche CC 19 H 1
Me Halorh ola e ni AT s SP 1 2
nic ocea oran EF0 J2
v
lilim
e
Alka coccus acidoisenia rans C118
T
o
e
tia
6
so
Delfbacter alenivucens ii SP 1
Nitro
ro aphth ired lodn m PM 1
h
p
ine nas n x ferr x cho hilu STIR is
rm
ip s
e
ra
ri
o
ns
V
rom odofe ptoth etrole sariu ane 383s
la
s
o
e
L m p ece taiw sp an
P
Rh
ia d N
libiu r n vidus lder oxy 197 1
o ic
thy cte
N
Me leoba upria urkh rsen viump Eb CB 1
C
B sa aa s aR 9
uc
na etell rcus atic ha C 196
lyn
o
o
P
iim ord zoa rom rop 25 259 T
in
t
5 K
C
A
B
a
u
rm
as e TC C 2 s 5
He
on nas is A TC llatu 194 72 1
rom mo rm s A ge P1 124 83 6
hlo roso ltifo can s fla CC C 2 4 4 9
c
De Nit mu itrifi cillu ae N ATC JCM sp 903 y2
P 18
s
ira en a e
sp s d ylob rrho eum ran riumTCCcus isB
i
e
so
ro cillu eth no lac tol cte A h B
Nit ba M go vio dio oba dica trop tris
s
ia
io
l
er ium ra hy in uto lu
Th
iss ter um et sp r a pa
Ne bac teri M sub cte as
n
c
o
ca ba o
m ba
di o m
ro ylo
h
in th do
h
C et
ia an eu
M
ck X ps
o
rin
d
e
o
ij
Rh
Be
Th
Clostridium beijerinckii NCIMB 8052
Alicyclobacillus acidocaldarius
Exiguobacterium sibiricum 255 15
Bacillus halodurans
Bacillus clausii KSM C 125
K16
Oceanobac
Geobacillus illus iheyensis HTE831
kaustophilus
Bacillus
HTA
Bacillus pumilus SAFR 032 426
we
ihenstep
Lysiniba
ha
Staphy cillus sphaeri nensis KBAB4
Listeria lococcus ha cus C3 41
emolytic
welshim
Ente
Lact rococcus eri serova us JCSC1435
Strepococcus faecalis V5 r 6b str SLCC
83
5334
Strep tococc lactis su
Stre tococc us suis 05bsp crem
ZYH33 oris SK
Lac ptococ us mu
11
Lac tobac cus pyo tans UA
Lac tobacillillus sake genes 159
i sub MGA
La tobac us c
La ctoba illus d asei A sp sake S10750
i 23K
La ctoba cillus elbru TCC 3
Le ctob cillus helve eckii s 34
Oe ucon acillus gasse ticus D ubsp b
Lac noco ostoc saliv ri AT PC 4 ulgari
a
C
5
c
cus
to
ri
71
La
C
u
b cu cit
ATC
Pe ctob acil s oe reum s UC 3332
CB
La dio acil lus re ni P KM C118 3
AA
365
L cto coc lus ute SU 1 20
Deacto bacil cus pferme ri F27
Sp halobacil lus p ento ntum 5
s
c lu
la
T h
IF
Hehermaero occo s bre ntaruaceus O 3
9
R rp o ba id v
m
W ATC 56
C os et ba cte es is A
G hlo eifl osip culu r th sp TC CFS1 C 257
S loe ro exu ho m erm BA C 3
45
A yn ob fle s n te o V1 67
Th car ech act xus cast aura rren philu
N e y o e a e n um s
T o rm oc co r v ura nh tia
Sy richstoc os hlor ccu iola ntia olzii cus
y
A
c
D
S n
c
n is s
C yn ec ode pun ec m sp eus us J SM TCC
ya ec ho sm c ho ar JA P
no ho co iu tifo co ina 2 CC 10 1394 237
th cy cc m rm cc M 3B 74 fl
1 79
ec st u e e us BI
2
e is s s ryt PC el C1 a 2 1
sp sp p hr C on 10 13
AT PC PC aeu 73 gat 17
C C C 7 m 10 us
C 6 0 IM 2
BP
51 80 02 S
1
10
14 3
1
2
NATURE | Vol 462 | 24/31 December 2009
Aquificae
Actinobacteria
Synergistetes
Bacteroidetes
Cyanobacteria
Thermotogae
Alphaproteobacteria
Chlorobi
Chloroflexi
Deinococcus/Thermus
Deltaproteobacteria
Chlamydiae/Verrucomicrobia
Firmicutes
Epsilonproteobacteria
Planctomycetes
Tenericutes
Acidobacteria
Spirochaetes
Fusobacteria
Gammaproteobacteria
Betaproteobacteria
Figure 1 | Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding
genes16. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names.
The discovery and characterization of new gene families and their
associated novel functions provide one incentive for sequencing
additional genomes, analysis of which has helped to redefine the
protein family universe18. We explored the quantitative effect of
tree-based genome selection on the pace of discovery of novel proteins
and functions. Specifically, we compared the rate of discovery of
novel protein families when progressively adding more closely related
genomes versus when adding more distantly related ones (Fig. 2).
Granted, many factors contribute to protein family diversity, such
as ecological niche; nevertheless, higher rates of novel protein family
discovery were found in the more phylogenetically diverse taxa
(Fig. 2). In addition, of the 16,797 families identified in the 56
GEBA genomes, 1,768 showed no significant sequence similarity to
any proteins, indicating the presence of novel functional diversity.
These results highlight the utility of tree-based genome selection as
a means to maximize the identification of novel protein families and
argues against lateral gene transfer significantly redistributing genetic
novelty between distantly related lineages.
1057
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS
NATURE | Vol 462 | 24/31 December 2009
11.0
4.3
3.2 6 0.7 (100)
1.4 6 0.3 (100)
2.8–4.4
2.5–3.9
New protein family links
46
3 6 4 (5)
6.6 to .15.3
Genes in new chromosomal cassettes 71,579 16,579 6 5,523 (20) 3.2–6.5
New gene fusions
433
65 6 31 (20)
4.5–12.7
GEBA genomes were compared to equivalently sized random sets of reference genomes to
quantify the effect of phylogenetic selection.
Novel proteins also can serve to link distantly related homologues
whose relatedness would otherwise go undetected. Forty-six such
links were identified in the 56 GEBA genomes compared to an average
of only three new links in equivalent sets of randomly sampled nonGEBA genomes (Table 1). A useful complement to homology-based
predictions of gene function are ‘non-homology methods’ (ref. 19)
such as gene context-based inference that relies on the conserved
clustering of functionally related genes across multiple genomes, often
in operons or as gene fusions20. We identified over 70,000 genes in new
chromosomal cassettes of two or more genes in the GEBA genomes.
This represents a three- to sixfold increase over equivalent sets of nonGEBA genomes (Table 1). Similarly, the number of new gene fusions
identified in the GEBA genomes is 4 to ,13 times greater than in
randomly selected genome sets (Table 1). Because the GEBA data
set produced a several-fold improvement over random sets for all
metrics examined (Table 1), we predict that other aspects of
sequence-based biological discovery will similarly benefit from treebased genome sequencing.
The GEBA genomes also show significant phylogenetic expansions
within known protein families. For example, although only two of the
56 GEBA organisms are known cellulose degraders, we identified in the
set of genomes a variety of glycoside hydrolase (GH) genes that may
participate in the breakdown of cellulose and hemicelluloses. Among
these are 28 and 7 phylogenetically divergent members of the endoglucanase- and processive exoglucanase-containing GH6 and GH48
Protein family number
(including families with single members)
b
Peptidase M4
thermolysin
Hypothetical protein
1.6 kb
Hypothetical protein
1.0 kb
Hypothetical protein
0.5 kb
BARP
1 kb
Phosphatase
c
Oxidoreductase
Conserved hypothetical protein
70,000
Hypothetical protein
Methyltransferase
Bacteria from GEBA project
(1,060.6 new families/genome)
60,000
d
50,000
40,000
Actinobacteria
(649.6 new families/genome)
30,000
Enterobacteriaceae
(307.7 new families/genome)
20,000
10,000
Streptococcus agalactiae
(121.3 new families/genome)
0
a
RNase treated
Genome tree phylogenetic diversity17
Bacteria (domain)
Actinobacteria (phylum)
Random sets
Fold
(number of resamplings) improvement
RNA
GEBA set
DNA
Comparative genomic metric
families, respectively. Halorhabdus utahensis, a halophilic archaeon
known to have b-xylanase and b-xylosidase activities21, has a chromosomal cluster including two GH10 family b-xylanases and six novel
GH5 family proteins of unknown specificity.
The enrichment of genetic diversity is also seen within families of
non-coding RNAs, transposable elements, and other cellular components. For example, the genome of the marine myxobacterium
Haliangium ochraceum contains 807 CRISPR (clustered regularly
interspaced short palindromic repeats) units including the largest
single CRISPR array known, comprising 382 spacer/repeat units.
CRISPR is a newly recognized, but ancient and widespread, system
in bacteria and archaea that confers resistance to viruses and other
invading foreign DNAs22.
Results from the GEBA pilot project challenge our current understanding for the taxonomic distribution of known gene families. The
most striking example of which is the discovery of an actin homologue in H. ochraceum. Actin and its close relatives are structural
components of the eukaryotic cytoskeleton that are found in every
eukaryote and only in eukaryotes. Bacteria and archaea encode
instead the shape-determining protein MreB. Although MreBs have
some functional and structural similarities to eukaryotic actins, they
are regarded, at best, distantly related homologues23 and possibly not
Markers
Table 1 | Effect of SSU rRNA tree-based selection of organisms on comparative genomic metrics
0
10
20
30
40
50
Genome number
60
70
80
Figure 2 | Rate of discovery of protein families as a function of
phylogenetic breadth of genomes. For each of four groupings (species,
different strains of Streptococcus agalactiae; family, Enterobacteriaceae;
phylum, Actinobacteria; domain, GEBA bacteria), all proteins from that
group were compared to each other to identify protein families. Then the
total number of protein families was calculated as genomes were
progressively sampled from the group (starting with one genome until all
were sampled). This was done multiple times for each of the four groups
using random starting seeds; the average and standard deviation were then
plotted.
1C0F_A
BARP
D G E D V Q A L V I D N G S G M C K A G F A GD D A P R A V F P S I VG R P R H T G K DS Y V . . .
S S . . . . P I I I H P G S D T L Q A G L A D E E H P G S I F P N I VG R H K L A G L M E W V D Q R
1C0F_A
BARP
. . . . G D E A Q S K R G I L T L K Y P I E HG I V T N W D D M E K IW H H T F Y N E LR V A P E E
V L C V G Q E A I D Q S A T V L L R H P V W SG I V G D W E A F A A VL R H T F Y R A LW V A P E E
1C0F_A
BARP
H P V L L T E A P L . . . N P K A N R E K M TQ I M F E T F N T P A MY V A I Q A V L SL Y A S G R
H P I V V T E S P H V Y R S F Q L R R E Q L TR L L F E T F H A P Q VA V C S E A A M SL Y A C G L
1C0F_A
BARP
T T G I V M D S G D G V S H T V P I Y E G Y AL P H A I L R L D L A GR D L T D Y M M KI L T E R G
D T G L V V S L G D F V S Y V A P V H R G A IV D A G L T F L E P D GR S I T E Y L S RL L L E R G
1C0F_A
BARP
Y S F T T T A E R E I V R D I K E K L A Y V AL D F E A E M Q T A A SS S A L E K S Y EL P D G Q V
H V F T S P E A L R L V R D I K E T L C Y V AD D V A K E . . A A R NA D S V E A T Y LL P N G E T
1C0F_A
BARP
I T I G N E R F R C P E A L F Q P S F L G M ES A G I H E T T Y N S IM K C D V D I R KD L Y G N V
L V L G N E R F R C P E V L F H P D L L G W ES P G L T D A V C N A IM K C D P S L Q AE L F G N I
1C0F_A
BARP
V L S G G T T M F P G I A D R M N K E L T A LA P S T M K I K I I A PP E R K Y S V W IG G S I L A
V V T G G G S L F P G L S E R L Q R E L E Q RA P A E A P V H L L T RD D R R H L P W KG A A R F A
1C0F_A
BARP
S L S T F Q Q M W I S K E E Y D E S G P S I VH R K C F
R D A Q F A G F A L T R Q A Y E R H G A E L IY Q M . .
Figure 3 | A bacterial homologue of actin. a, Genomic context of the bacterial
actin-related protein (BARP) gene within the genome of the marine
Deltaproteobacterium H. ochraceum. Red, gene encoding BARP; white, genes
encoding hypothetical proteins; black, genes with functional annotations.
b, RT–PCR demonstration of expression of the gene encoding BARP in
H. ochraceum. c, Ribbon plot of the putative structure of BARP. d, Alignment
of BARP with actin from Dictyostelium discoideum29 with similarities in black
shaded text. Secondary structure elements (arrows, beta-strands; bars, alphahelices) are colour-coded as in c. A phylogenetic tree including this protein is in
Supplementary Figure 1.
1058
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS
NATURE | Vol 462 | 24/31 December 2009
even homologous. Like other bacteria, H. ochraceum encodes a bona
fide MreB protein, but in addition, it encodes a protein that is clearly
a member of the actin family, which we have named BARP (bacterial
actin-related protein; Fig. 3). Although we do not yet have evidence
for its precise function, BARP is expressed in H. ochraceum (Fig. 3b).
Assuming that the H. ochraceum mreB orthologue performs the same
function as in other bacteria, and given that the myxobacteria, to
which this species belongs, are known to synthesize actin-targeting
toxins24, we propose that this BARP may be a dominant-negative
inhibitor of eukaryotic actin polymerization. Regardless of its precise
function, this first—and so far only—discovery of an expressed
homologue of eukaryotic actin in a member of the Bacteria highlights
the potential for novel and surprising biological discoveries given a
wider genomic sampling of the tree of life.
We conclude that targeting microorganisms for genome sequencing
solely on the basis of phylogenetic considerations offers significant farreaching benefits in diverse areas. Furthermore, the benefits of phylogenetically driven genome sequencing show no sign of saturating with
these first 56 genomes. A key question then lies in determining how
much bacterial and archaeal diversity remains to be sampled. Using
SSU rRNA gene sequences as a proxy for organismal diversity (Fig. 4),
we estimate that sequencing the genomes of only 1,520 phylogenetically
selected isolates could encompass half of the phylogenetic diversity
represented by known cultured bacteria and archaea. Given the continuing reductions in both the cost and difficulty in sequencing genomes25,
this is certainly a tractable target in the next few years.
However, the great majority of recognized bacterial and archaeal
diversity is not represented by pure cultures and an additional 9,218
genome sequences from currently uncultured species would be
required to capture 50% of this recognized diversity (Fig. 4). Such
an undertaking will require new approaches to culturing or processing of multi-species samples using methods such as metagenomics26
or physical isolation of cells from mixed populations followed by
whole genome amplification methods27. Obtaining reference genomes for the uncultured microbial majority will be a natural extension of the GEBA project, the ultimate goal of which is to provide a
phylogenetically balanced genomic representation of the microbial
tree of life. The pilot study presented here is a dedicated first step in
this direction.
METHODS SUMMARY
Starting with a phylogenetic tree of SSU rRNA genes7, we identified major
branches that had no available genome sequences but for which cultured isolates
were available in the DSMZ or ATCC culture collections. Selected isolates
(Supplementary Table 1a, b) from these branches were grown and DNA isolated
(Supplementary Table 1c) and quality checked. DNA was then used for shotgun
genome sequencing by Sanger/ABI, Roche/454 and/or Illumina/Solexa technologies (Supplementary Table 2). Sequence reads were assembled separately with
different assembly methods and the best draft assembly was used for annotation
and as a starting point for genome completion (current genome status is in
Supplementary Table 2). Annotation (gene identification, functional prediction,
etc.) was performed using the IMG system (http://img.jgi.doe.gov/geba); this was
done both after shotgun sequencing and again after genome completion. For ‘whole
genome tree’ analysis, a PHYML maximum likelihood phylogenetic tree of a concatenated alignment of 31 marker genes was built using AMPHORA16. Phylogenetic
diversity was calculated as the sum of branch lengths in this and other trees. Protein
families were built for various genome sets by using the Markov clustering algorithm (MCL)28 to group proteins on the basis of ‘all versus all’ blastp searches. For
analysis of phylogenetic diversity of organisms, a phylogenetic tree was built for a
combined alignment of SSU rRNA sequences from published genomes and a nonredundant subset of greengenes SSU rRNA7. Further analysis of the genomes was
done using IMG database queries and new computational analyses as described in
the main text, legends and Supplementary Methods.
Received 3 June; accepted 30 October 2009.
1.
2.
3.
4.
5.
6.
7.
1,200
1,000
Phylogenetic diversity
8.
GEBA genomes
Pre-GEBA genomes
Organisms from the greengenes database
Organisms from the greengenes database
(excluding environmental samples)
9.
10.
11.
800
120
12.
100
600
13.
80
60
400
14.
40
15.
20
200
0
0
0
400
800 1,200
16.
17.
0
5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000
Number of organisms
Figure 4 | Phylogenetic diversity of bacteria and archaea on the basis
of SSU rRNA genes. Using a phylogenetic tree of unique SSU rRNA gene
sequences7, phylogenetic diversity was measured for four subsets of this tree:
organisms with sequenced genomes pre-GEBA (blue), the GEBA organisms
(red), all cultured organisms (dark grey), and all available SSU rRNA genes
(light grey). For each subtree, taxa were sorted by their contribution to the
subtree phylogenetic diversity30 and the cumulative phylogenetic diversity
was plotted from maximal (left) to the least (right). The inset magnifies the
first 1,500 organisms. Comparison of the plots shows the phylogenetic ‘dark
matter’ left to be sampled.
18.
19.
20.
21.
22.
23.
Fraser, C. M., Eisen, J. A. & Salzberg, S. L. Microbial genome sequencing. Nature
406, 799–803 (2000).
Liolios, K., Mavromatis, K., Tavernarakis, N. & Kyrpides, N. C. The Genomes On
Line Database (GOLD) in 2007: status of genomic and metagenomic projects and
their associated metadata. Nucleic Acids Res. 36 (database issue), D475–D479
(2008).
Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3,
REVIEWS0003.1–REVIEWS0003.8 (2002).
Eisen, J. A. Assessing evolutionary relationships among microbes from wholegenome analysis. Curr. Opin. Microbiol. 3, 475–480 (2000).
Wu, D. et al. Complete genome sequence of the aerobic CO-oxidizing
thermophile Thermomicrobium roseum. PLoS One 4, e4207 (2009).
Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276,
734–740 (1997).
DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and
workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
Bernal, A., Ear, U. & Kyrpides, N. Genomes OnLine Database (GOLD): a monitor of
genome projects world-wide. Nucleic Acids Res. 29, 126–127 (2001).
Lapage, S. P. et al., International Code of Nomenclature of Bacteria, 1990 Revision.
(American Society for Microbiology, 1992).
Ward, N., Eisen, J., Fraser, C. & Stackebrandt, E. Sequenced strains must be saved
from extinction. Nature 414, 148 (2001).
Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11,
551–553 (2009).
Field, D. et al. The minimum information about a genome sequence (MIGS)
specification. Nature Biotechnol. 26, 541–547 (2008).
Markowitz, V. M. et al. The integrated microbial genomes (IMG) system in 2007:
data content and analysis tool extensions. Nucleic Acids Res. 36 (database issue),
D528–D533 (2008).
Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of
microbial species. Nature Rev. Microbiol. 6, 431–440 (2008).
Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution
on genome phylogeny. Syst. Biol. 57, 844–856 (2008).
Wu, M. & Eisen, J. A. A simple, fast, and accurate method of phylogenomic
inference. Genome Biol. 9, R151 (2008).
Pardi, F. & Goldman, N. Resource-aware taxon selection for maximizing
phylogenetic diversity. Syst. Biol. 56, 431–444 (2007).
Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. & Ouzounis, C. A. Myriads of
protein families, and still counting. Genome Biol. 4, 401 (2003).
Marcotte, E. M. et al. Detecting protein function and protein-protein interactions
from genome sequences. Science 285, 751–753 (1999).
Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction
maps for complete genomes based on gene fusion events. Nature 402, 86–90
(1999).
Wainø, M. & Ingvorsen, K. Production of b-xylanase and b-xylosidase by the
extremely halophilic archaeon Halorhabdus utahensis. Extremophiles 7, 87–93 (2003).
Barrangou, R. et al. CRISPR provides acquired resistance against viruses in
prokaryotes. Science 315, 1709–1712 (2007).
Doolittle, R. F. & York, A. L. Bacterial actins? An evolutionary perspective.
Bioessays 24, 293–296 (2002).
1059
©2009 Macmillan Publishers Limited. All rights reserved
LETTERS
NATURE | Vol 462 | 24/31 December 2009
24. Sasse, F., Kunze, B., Gronewold, T. M. & Reichenbach, H. The chondramides:
cytostatic agents from myxobacteria acting on the actin cytoskeleton. J. Natl.
Cancer Inst. 90, 1559–1563 (1998).
25. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26,
1135–1145 (2008).
26. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A
bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578
(2008).
27. Ishoey, T., Woyke, T., Stepanauskas, R., Novotny, M. & Lasken, R. S. Genomic
sequencing of single microbial cells from environmental samples. Curr. Opin.
Microbiol. 11, 198–204 (2008).
28. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for
large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002).
29. Matsuura, Y. et al. Structural basis for the higher Ca21-activation of the regulated
actin-activated myosin ATPase observed with Dictyostelium/Tetrahymena actin
chimeras. J. Mol. Biol. 296, 579–595 (2000).
30. Moulton, V., Semple, C. & Steel, M. Optimizing phylogenetic diversity under
constraints. J. Theor. Biol. 246, 186–194 (2007).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements . We thank the following people for assistance in aspects of the
project including planning and discussions (R. Stevens, G. Olsen, R. Edwards,
J. Bristow, N. Ward, S. Baker, T. Lowe, J. Tiedje, G. Garrity, A. Darling, S. Giovannoni),
analysis of genomes whose work could not be included in this report (B. Henrissat,
G. Xie, J. Kinney, I. Paulsen, N. Rawlings, M. Huntemann), project management
(M. Miller, M. Fenner, M. McGowen, A. Greiner), sequencing and finishing (K. Ikeda,
M. Chovatia, P. Richardson, T. Glavinadelrio, C. Detter), culture growth, DNA
extraction, and metadata (D. Gleim, E. Brambilla, S. Schneider, M. Schröder,
M. Jando, G. Gehrich-Schröter, C. Wahrenburg, K. Steenblock, S. Welnitz, M. Kopitz,
R. Fähnrich, H. Pomrenke, A. Schütze, M. Rohde, M. Göker), and manuscript editing
(M. Youle). This work was performed under the auspices of the US Department of
Energy’s Office of Science, Biological and Environmental Research Program, and by
the University of California, Lawrence Berkeley National Laboratory under contract
no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract
no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no.
DE-AC02-06NA25396. Support for J.A.E., D.W. and M.W. was provided by the
Gordon and Betty Moore Foundation Grant no. 1660 to J.A.E. Support for work at
DSMZ was provided under DFG INST 599/1-1.
Author Contributions D.W. (rRNA analysis, gene families, actin tree, manuscript
preparation), P.H. (selection of strains, analysis, manuscript preparation, project
coordination), L.G. and D.B. (project management), R.P., B.J.T., E.L., S.G., S.S. (strain
curation and growth), K.M., N.N.I., I.J.A., S.D.H., A.P., A.Ly. (annotation, genome
analysis), V.K. (CRISPRs, actin), M.W. (whole genome tree), P.D., C.K., A.Z. and
M.S. (actin studies), M.N., S.L., J.-F.C., F.C. and E.D. (sequencing), C.H., A.La., M.N.
and A.C. (finishing), P.C. (analysis), E.M.R. (manuscript preparation), N.C.K.
(selection of strains, annotation, analysis), H.-P.K. (strain selection and growth,
DNA preparation, manuscript preparation), J.A.E. (project lead and coordination,
analysis, manuscript preparation).
Author Information Genome sequence and annotation data is available at the JGI
IMG-GEBA page http://img.jgi.doe.gov/geba and has been submitted to GenBank
with accessions ABSZ00000000, ABTA00000000, ABTB00000000,
ABTC00000000, ABTD00000000, CP001618, ABTF00000000,
ABTG00000000, ABTH00000000, ABTI00000000, ABTJ00000000,
ABTK00000000, ABTM00000000, NZ_ABTN00000000,
NZ_ABTO00000000, ABTP00000000, ABTQ00000000,
NZ_ABTR00000000, NZ_ABTS00000000, NZ_ABTT00000000,
NZ_ABTU00000000, NZ_ABTV00000000, NZ_ABTW00000000,
NZ_ABTX00000000, NZ_ABTY00000000, NZ_ABTZ00000000,
NZ_ABUA00000000, NZ_ABUB00000000, NZ_ABUC00000000,
NZ_ABUD00000000., NZ_ABUE00000000, NZ_ABUF00000000,
NZ_ABUG00000000, NZ_ABUH00000000, NZ_ABUI00000000,
NZ_ABUJ00000000, NZ_ABUK00000000, ABUL00000000,
NZ_ABUM00000000, NZ_ABUO00000000, NZ_ABUP00000000,
ABUQ00000000, NZ_ABUR00000000, ABUS00000000,
NZ_ABUT00000000, NZ_ABUU00000000, NZ_ABUV00000000,
NZ_ABUW00000000, NZ_ABUX00000000, NZ_ABUZ00000000,
NZ_ABVA00000000, NZ_ABVB00000000 and NZ_ABVC00000000. All
strains that have been sequenced are available from the DSMZ culture collection
and culture accessions are available in Supplementary Information. This paper is
distributed under the terms of the Creative Commons
Attribution-Non-Commercial-Share Alike licence, and is freely available to all
readers at www.nature.com/nature. Reprints and permissions information is
available at www.nature.com/reprints. Further details on sequencing and genome
properties of each organism are being published in the journal Standards in
Genomic Sciences (SIGS) (http://standardsingenomics.org/). Correspondence
and requests for materials should be addressed to J.A.E. ([email protected]).
1060
©2009 Macmillan Publishers Limited. All rights reserved