Vol 462 | 24/31 December 2009 | doi:10.1038/nature08656 LETTERS A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1, Rüdiger Pukall3, Eileen Dalin1, Natalia N. Ivanova1, Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1, Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1, Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3, Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3 & Jonathan A. Eisen1,2 Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms1. There are now nearly 1,000 completed bacterial and archaeal genomes available2, most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution3–5. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come. Since the publication of the first complete bacterial genome, sequencing of the microbial world has accelerated beyond expectations. The inventory of bacterial and archaeal isolates with complete or draft sequences is approaching the two thousand mark2. Most of these genome sequences are the product of studies in which one or a few isolates were targeted because of an interest in a specific characteristic of the organism. Although large-scale multi-isolate genome sequencing studies have been performed, they have tended to be focused on particular habitats or on the relatives of specific organisms. This overall lack of broad phylogenetic considerations in the selection of microbial genomes for sequencing, combined with a cultivation bottleneck6, has led to a strongly biased representation of recognized microbial phylogenetic diversity3–5. Although some projects have attempted to correct this (for example, see ref. 5), they have all been small in scope. To evaluate the potential benefits of a more systematic effort, we embarked on a pilot project to sequence approximately 100 genomes selected solely for their phylogenetic novelty: the ‘Genomic Encyclopedia of Bacteria and Archaea’ (GEBA). Organisms were selected on the basis of their position in a phylogenetic tree of small subunit (SSU) ribosomal RNA, the best sampled gene from across the tree of life7. Working from the root to the tips of the tree, we identified the most divergent lineages that lacked representatives with sequenced genomes (completed or in progress)8 and for which a species has been formally described9 and a type strain designated and deposited in a publicly accessible culture collection10. From hundreds of candidates, 200 type strains were selected both to obtain broad coverage across Bacteria and Archaea and to perform in-depth sampling of a single phylum. The Gram-positive bacterial phylum Actinobacteria was chosen for the latter purpose because of the availability of many phylogenetically and phenotypically diverse cultured strains, and because it had the lowest percentage of sequenced isolates of any phylum (1% versus an average of 2.3%)11. Of the 200 targeted isolates, 159 were designated as ‘high’ priority primarily on the basis of phylum-level novelty and the ability to obtain microgram quantities of high quality DNA. The genomes of these 159 are being sequenced, assembled, annotated (including recommended metadata12) and finished, and relevant data are being released through a dedicated Integrated Microbial Genomes database portal13 and deposited into GenBank. Currently, data from 106 genomes (62 of which are finished) are available. To assess the ramifications of this tree-based selection of organisms, we focused our analyses on the first 56 genomes for which the shotgun phase of sequencing was completed. The 53 bacteria and 3 archaea (Supplementary Table 1) represent both a broad sampling of bacterial diversity and a deeper sampling of the phylum Actinobacteria (26 GEBA genomes). An initial question we addressed was whether selection on the basis of phylogenetic novelty of SSU rRNA genes reliably identifies genomes that are phylogenetically novel on the basis of other criteria. This question arises because it is known that single genes, even SSU rRNA genes, do not perfectly predict genome-wide phylogenetic patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of completed bacterial genomes (Fig. 1) and then measured the relative contribution of the GEBA project using the phylogenetic diversity metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4 times more phylogenetic diversity than randomly sampled subsets of 53 non-GEBA bacterial genomes. A similar degree of improvement in phylogenetic diversity was seen for the more intensively sampled actinobacteria (Table 1). These analyses indicate that although SSU rRNA genes are not a perfect indicator of organismal evolution, their phylogenetic relationships are a sound predictor of phylogenetic novelty within the universal gene core present in bacterial genomes. 1 DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, California 87545, USA. 5University of Virginia, Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA. 1056 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS Clostridium botulinum B str Eklund 17B Clostridium perfringens str 13 555 Clostridium kluyveri DSM Maree linum A3 str Loch ni E88 Clostridium botu Clostridium teta C 824 ATC acetobutylicum m novyi NT Clostridium Clostridiu ntans ISDg ferme MF QY s um phyto As Clostridi s metalliredigen dii OhIL 0 ilu Alkaliph hilus oremlandifficile 63 ii Alkalip ostridium us prevot Cl cocc C 29328 Anaero 03 TC A 89 agna DSMs MB4 ldia m si ticus 05 Finego ccharolygcongen C 274 ns ATC xida r ten tor sa sirup robacte cellum aceto s MI 1 I o ellulo n S ae lum Caldic hermoan m thermmacu reducenicum 4C T pio P10 1 ridiu ulfoto lum 0 M Clost Des tomacu rmopro tor Z 29 51 Y the xvia ans ulfo Des culum is audaoform iense Ice13 a hafn ldum 907 n tom orud gen Pelo Desulf hydro terium estica CC 3ttinge 3 6 s T d c e atus thermu toba m mo tica Atr Go 148 LF did lfi M Can oxydo Desu cteriu oaceolfei s m IA WNvula a u NM par G2 rm e Carb p w hil liob He rella thsubs rmop s JWnella tiae Pe 53 i e o c ia IP ilu Mo wolfe m th oph Veillo gala ynov B CT 3 1 a s A 8L K s riu rm ona bacte s the ma a U 15 63 8 om las sm nis is e 1 44 ph mbio robiu op pla o id il 7 37 tro yc co ulm rit ob e G 9 y e 2 ia h S M na My a p art a m on lium M1 R Syn tra sm ma sm um a e um 70 Na pla s pla ne nit nia ic 9 2 co opla co op ge o pt 00 F 1 My yc My a hy sma eum llise C 7 ns Hr PG L1 li M t n a C sm pla a p a g AT tra s um ma B pla co m m tr ne SC or a W A co My plas plas 3 s a pe es a fl sm AY 8 id m la a G My o o r yc yc rova lasmyco las top sm ii P M M a se cop p m sop Phy opl dlaw um My ubs Me tus hyt lai rv s a p a pa id m sm es a d nd roo la oi sm Ca s b olep yc pla m ea he ch a Ur m itc A as w s w lo el ry te As yc M l op Fu so ba ct er iu m nu cle 5 1 37 3 30 P1 84 6 6 CM S C 98 IE C rC P1 N sP 1 s st M 1 a 7 u CC 92 inu os at 0 in ng C3 02 IT ar str ug elo RC C99 311 tr M p m L2A ris r ae s sp C 9 s bs T sto is cu s p C us su NA pa st oc cu s s p C rin s tr p 41 cy c c u s a inu s s bs 99 ro ho co cc us m ar u su ic ec ho co cc us m rin s SM M yn ec ho co occ cus ma arinu us D S yn ec ho oc c us m hil S yn ec lor oco cc us op S yn ch lor co cc lan sei S ro ch ro co xy oe m ns P ro chlo ro ter r w ulu uce P ro hlo ac cte arv ed P roc rob iba p inir s m P ub ex ium liotr ta urtu xidan 27 R on ob he len m c roo 08 2705 C top ia lla riu fer TW CC 1 09 A lack rthe cte m plei m N 41 M 71 129 S gge oba robiu hip ngu m K DS C 13 E rypt ic a w lo ikeiu cum NCT C cidimerym erium m je alyti riae 14 A ph act teriu ure hthe YS 3 0216 Troifidobebac teriumm dip iens SRS3 B oryn bac teriu effic rans C ryne bac rium tole Co ryne bacte radio ernae Co ryne ccus cav ena Co eoco bergia flavig silytica Kineuten onas ellulo B llulom nas c ddieii Ce nimo cter ke ans is Xyla guiba nitrific rius ganens San esia de sedenta ium michi Jon ococcus rium faec nsis subsp Kyt chybacte ichigane str CTCB07 Bra bacter m bsp xyli 33209 Clavi onia xyli su ila DC2201 ATCC Leifs ria rhizoph lmoninarum Kocu acterium sa 24 Renibobacter sp FB nes KPA171202 ac Arthr ium ter ac Propionibflavida Kribbella es sp JS614 Nocardioid acidiphila Catenulispora coelicolor A3 2 Streptomyces 11B Acidothermus cellulolyticus Thermobispora bispora Streptosporangium roseum Thermomonospora curvata Thermobifida fusca YX Nocardiopsis dassonvillei Frankia sp CcI3 Salinispora areni cola CNS 205 Stackebrand Geodermatoptia nassauensis hilus obscur Nakamure us Actinosyn lla multipartita ne ma mirum Saccha ro Saccha monospora vir ropolys idis Mycob po Myc acterium ra erythraea NRRL Myc obacteriu abscessu 2338 Rho obacteriu m gilvum s No dococcu m leprae PYR GCK Gordcardia fa s sp RH TN Tsukaonia b rcinica A1 ro IF Lep mure nchia M 101 52 Le tospir lla pa lis Bra ptospir a bifle urometa Tre chysp a inte xa sero bola Tre ponemira mu rrogan var Pa s p to rd o s a Bo oc erov c str n d ain ar C Bo rreli ema entic hii ope Patoc R rreli a he pallid ola A nha Plahodo a garirmsii D um s TCC gen 1 Ames i str M ncto pirell nii P AH ubsp 35405 Fioc pall Ak ethyla myc ula b Bi ruz id um L1 Op kerm cid es li altic s m a ip tr it C n S u a Nic h hols Chandidtus tnsia ilum ophil H 1 Ch lam at err muc infe us a u r la in C y n s e Ch hlammyddia t Pro PB9 iphila orum C lo y o ra to 0 V AT C hlo ro do phil cho chla 1 CC 4 C hlo ro her ph a a ma m BA P hlo ro biu pe ila bo tis ydia A8 Ch ros rob bac m p ton pne rtus A H am 35 lo the ium ulu ha tha umo S2 AR oeb R 6 S h r c m e la 13 op C al od ob o p p ob ss nia 3 hil aU C an inib oth ium chlo hae arv act ium e TW yt d a e c ris o um er WE A b oid T 1 op id ct rm h 25 ha atu er us loro vibr acte NC es CC 83 ga s A rub m ch iof ro IB BS 35 hu mo er arin rom orm ides 832 1 110 D 7 tc eb S us at is D hi o M DS SM i i ns p 1 Ca M 2 on hil 38 D3 2 66 65 ii us 55 AT as C iat C ic 3 3 us 40 5a 6 2 Rh izo M es o 2 48 3 ns I5 0 5 ta VP C 8 en on C m s icr AT er ale u r f u rin sis m is 83 SS 86 te ing pa en tao on W W 02 ac a l he pin aio tas alis ri G JIP ob m ter a et dis giv lle a ad oso ac hag s th es gin ue ce ilum Dy pir ob op ide roid as ia mchra oph 1 S ed itin ro te on ulc o hr 3 ei19 OP1 P h te ac m S ga syc 80 P 3A C c b ro s a p 0 m Ba ara phy atu oph m KT utu p YO P or did cyt eriu etii min s 0 74 9 P an no ct ors m ium C ap oba a f biu nib F5 5 2 SM 1 5144 C lav ell ro oge us V 15 s D CC T F am ic dr lic SB ne Gr lusimrihy aeo r sp oge us A 5 E ulfu ex pto ccin patic 669 S uif u su e i 2 1 251 Aqitratir ella cter hpylor C37 4018 SM 1 N olin ba ter NB i RM ns D 381 7 W elico bac sp tzler fica m 9 AA H elico ovumr bu enitri yianu CC B lei 269 H lfur acte as d dele is AT doy 82 40 s Su ob on m min bsp tu Arc lfurim pirillu r ho ni su sp fe Su uros bacte r jeju s sub 3826 Sulfmpylo bacte r fetu cisus 1 Ca mpylo bacte r con ilus 345 Ca mpylo bacte cetiph um Ellin Ca mpylo rio a acteri 76 Ca nitrovib teria b s Ellin60 HD100 s De bac tatu voru o si Acid acter u bacterios DK 1622 5 Solib lovibrio s xanthu Fw109 Bdelococcu acter sp Myx eromyxob aceum hr ce 56 Ana ium oc 80 sum So Haliang ium cellulo licus DSM 23 Sorangacter carbino ucens Rf4 Pelob ter uraniired cens PCA Geobac ter sulfurredu Geobac lovleyi SZ 9 Geobacter propionicus DSM 237 Pelobacter itrophicus SB Syntrophus acidr fumaroxidans MPOB Syntrophobacte rans Hxd3 Desulfococcus oleovo Desulfotalea psychrophila LSv54 Desulfohalobium retbaense Desulfomicrobium baculatum N Ba itro Br rt ba ad Ba one cte yrh rto lla r w izo n q in b biu rh B ella uin og ium Pa m izo ru b ta ra s Hy rv le ph iba gu M biu cel acil na s dsk p O om cu min es m la lifo tr yi R on lum os orh lot me rm To Nb S2 as 7 a iz i li i u ne lav rum ob MA ten s K lou 25 8 Ma ptu ame bv ium FF3 sis C58 se 5 Ro ric niu ntiv vic sp 03 16 3 se Sil au m o ia B 09 M ob ic ac iba Ca lis AT ran e NC 9 ter ct ulo m C s 38 1 D de er p ba aris C 1 DS 41 Pa inor No ra os J nitr om cte M 54 1 vos Rh cocc eoba anna ifica ero r sp CS1 44 phin gob Eryth odob us d cter schians O yi DS K3 0 ium rob acte enit shib s Ch S 1 a Zym Sp arom cter r sphrifica ae p CC 114 3 li S a n D om hing a ona opy ticivotorali eroids PD FL 1 1 s m xis ran s H es 122 2 Sp obilis alask s DSTCC 2 4 2 Glu con Glu hingo subs ensis M 1225941 Gra aceto conob mona p mo RB2 444 b nuli bac acter acter s witti bilis Z256 ter b diaz oxy chii M4 ethe otro dan RW Rho ph s 6 1 sd A Mag dospirill cidiphili ensis Cicus P 21H neto u GDN Al 5 sp m ru um c Anap irillum brum A ryptum IH1 Anap lasma magneti TCC 1 JF 5 1 p lasm a mar hagocytocum AM170 B ginale p Ehrlich str hilum H 1 Eh ia rum inantiurlichia cani St Marie Z Wolbac s m str s hia endo Welgestr Jake symbion vond Wol t Neorick strain TRS ofbachia pipien en ettsia se Brugia nn m tis Rickettsia etsu str Miya alayi yama typhi str Wilmington Rickettsia bellii RML36 Orie 9C Candidatus Pelantia tsutsugamushi Bor yong gibacter ubiq ue HTC Magnetococcus C1062 Desulfovibrio desulfuricans MC 1 subsp desulfuricanssp str G20 Desulfovibrio vulgaris subsp vulgaris DP4 Lawsonia intracellularis PHE MN100 at St um re pt s ob er De ubs m th p Le aci an io n ae su uc Se pto llus Fe ro lfo lea ba tri m T v v t l c o rv Th ido The herm ibrio ibrioum dell hia nili erm ba rm o a p A a t bu fo os cter oto tog cida eptTCCerm cc rmi iph ium ga a le m id 2 it ali s o m t in ov 5 id s Pe me nod arit ting ovo ora 586is tro lan os im ae ra ns Bu n e u chn T M tog sie m a M TM s e Bu De herm Meio eio a m nsis Rt1 SB O chn ra ap t ino us th he ob B 7 B 8 Wig co th erm rm ilis I4 1 Bu era a hidico gles cc er chn phid la us us SJ 29 wort u m s s r e 9 o ra a ico tr A rad ph silv ube 5 hia la P p glo iod ilus anu r ssin Buc hidic str S S Ac ura H s ola yr g idia hne ns B8 end ra ap str B Schiz thosip R1 Bau Cand osy hid p B aph ho id man m nia atus B Cand bion icola s aizon is gra n pis cica u g t lo dell chma idatus of Glotr Cc C ia pis minumm inic ola nnia peBloch ssina inara taciae str H m b n c c H nsylva annia revipaedri oma lpis n fl S lo icu orid Acti nob higella disca s str B anus PEN Haemacillus suflexneri coagula ta ophilu ccin 2a s s du ogene tr 301 c s1 re Vibri Aerom o ch yi 35000 30Z P o onas salmonhotobacterVibrio fischlerae O3 HP ium pr icida eri E 95 Shew subsp sa ofundu S114 Shewan anella sedi lmonicidam SS9 minis HA A449 ella frigid Psychro imarina NCIMW EB3 monas B 400 Co Pseudoalte lwellia psychreingrahamii 37 ryt romonas haloplanktishraea 34H TAC125 Idiomarina ioih iensis L2T Pseudoalteromo nas atlantica R T6c Kangiella koreensis Marinomonas MWYL1 Chromohalobacter salexigens sp DSM 3043 Hahella chejuensis KCTC 2396 Marinobacter aquaeolei VT8 Cellvibrio japonicus Ueda107 Saccharophagus degradans 2 40 e pv phaseolicola 1448A4 Pseudomonas syringa us 273 Psychrobacter arctic r sp PRwf 1 Psychrobacte r sp ADP1 2 Acinetobacte mensis SK 1 rku bo ax RSA 33ns Alcanivor burnetii Coxiella mophila str LeHA lla pneu cius okutanii L 2 Legione XC so a yo ogen larctica sicom ira crun atus Ve ho Candid Thiomicrosp is subsp S1703A ula1 larens sus VC sella tu cter nodo sa TemecK279a Franci idio th loba ilia Diche ylella fast maltoph s str BSaL1 X nas psulatu phila E 1 o m ca a halo MLH 07 ho cus i 7 otrop Sten thylococ odospir hrliche CC 19 H 1 Me Halorh ola e ni AT s SP 1 2 nic ocea oran EF0 J2 v lilim e Alka coccus acidoisenia rans C118 T o e tia 6 so Delfbacter alenivucens ii SP 1 Nitro ro aphth ired lodn m PM 1 h p ine nas n x ferr x cho hilu STIR is rm ip s e ra ri o ns V rom odofe ptoth etrole sariu ane 383s la s o e L m p ece taiw sp an P Rh ia d N libiu r n vidus lder oxy 197 1 o ic thy cte N Me leoba upria urkh rsen viump Eb CB 1 C B sa aa s aR 9 uc na etell rcus atic ha C 196 lyn o o P iim ord zoa rom rop 25 259 T in t 5 K C A B a u rm as e TC C 2 s 5 He on nas is A TC llatu 194 72 1 rom mo rm s A ge P1 124 83 6 hlo roso ltifo can s fla CC C 2 4 4 9 c De Nit mu itrifi cillu ae N ATC JCM sp 903 y2 P 18 s ira en a e sp s d ylob rrho eum ran riumTCCcus isB i e so ro cillu eth no lac tol cte A h B Nit ba M go vio dio oba dica trop tris s ia io l er ium ra hy in uto lu Th iss ter um et sp r a pa Ne bac teri M sub cte as n c o ca ba o m ba di o m ro ylo h in th do h C et ia an eu M ck X ps o rin d e o ij Rh Be Th Clostridium beijerinckii NCIMB 8052 Alicyclobacillus acidocaldarius Exiguobacterium sibiricum 255 15 Bacillus halodurans Bacillus clausii KSM C 125 K16 Oceanobac Geobacillus illus iheyensis HTE831 kaustophilus Bacillus HTA Bacillus pumilus SAFR 032 426 we ihenstep Lysiniba ha Staphy cillus sphaeri nensis KBAB4 Listeria lococcus ha cus C3 41 emolytic welshim Ente Lact rococcus eri serova us JCSC1435 Strepococcus faecalis V5 r 6b str SLCC 83 5334 Strep tococc lactis su Stre tococc us suis 05bsp crem ZYH33 oris SK Lac ptococ us mu 11 Lac tobac cus pyo tans UA Lac tobacillillus sake genes 159 i sub MGA La tobac us c La ctoba illus d asei A sp sake S10750 i 23K La ctoba cillus elbru TCC 3 Le ctob cillus helve eckii s 34 Oe ucon acillus gasse ticus D ubsp b Lac noco ostoc saliv ri AT PC 4 ulgari a C 5 c cus to ri 71 La C u b cu cit ATC Pe ctob acil s oe reum s UC 3332 CB La dio acil lus re ni P KM C118 3 AA 365 L cto coc lus ute SU 1 20 Deacto bacil cus pferme ri F27 Sp halobacil lus p ento ntum 5 s c lu la T h IF Hehermaero occo s bre ntaruaceus O 3 9 R rp o ba id v m W ATC 56 C os et ba cte es is A G hlo eifl osip culu r th sp TC CFS1 C 257 S loe ro exu ho m erm BA C 3 45 A yn ob fle s n te o V1 67 Th car ech act xus cast aura rren philu N e y o e a e n um s T o rm oc co r v ura nh tia Sy richstoc os hlor ccu iola ntia olzii cus y A c D S n c n is s C yn ec ode pun ec m sp eus us J SM TCC ya ec ho sm c ho ar JA P no ho co iu tifo co ina 2 CC 10 1394 237 th cy cc m rm cc M 3B 74 fl 1 79 ec st u e e us BI 2 e is s s ryt PC el C1 a 2 1 sp sp p hr C on 10 13 AT PC PC aeu 73 gat 17 C C C 7 m 10 us C 6 0 IM 2 BP 51 80 02 S 1 10 14 3 1 2 NATURE | Vol 462 | 24/31 December 2009 Aquificae Actinobacteria Synergistetes Bacteroidetes Cyanobacteria Thermotogae Alphaproteobacteria Chlorobi Chloroflexi Deinococcus/Thermus Deltaproteobacteria Chlamydiae/Verrucomicrobia Firmicutes Epsilonproteobacteria Planctomycetes Tenericutes Acidobacteria Spirochaetes Fusobacteria Gammaproteobacteria Betaproteobacteria Figure 1 | Maximum-likelihood phylogenetic tree of the bacterial domain based on a concatenated alignment of 31 broadly conserved protein-coding genes16. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names. The discovery and characterization of new gene families and their associated novel functions provide one incentive for sequencing additional genomes, analysis of which has helped to redefine the protein family universe18. We explored the quantitative effect of tree-based genome selection on the pace of discovery of novel proteins and functions. Specifically, we compared the rate of discovery of novel protein families when progressively adding more closely related genomes versus when adding more distantly related ones (Fig. 2). Granted, many factors contribute to protein family diversity, such as ecological niche; nevertheless, higher rates of novel protein family discovery were found in the more phylogenetically diverse taxa (Fig. 2). In addition, of the 16,797 families identified in the 56 GEBA genomes, 1,768 showed no significant sequence similarity to any proteins, indicating the presence of novel functional diversity. These results highlight the utility of tree-based genome selection as a means to maximize the identification of novel protein families and argues against lateral gene transfer significantly redistributing genetic novelty between distantly related lineages. 1057 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 462 | 24/31 December 2009 11.0 4.3 3.2 6 0.7 (100) 1.4 6 0.3 (100) 2.8–4.4 2.5–3.9 New protein family links 46 3 6 4 (5) 6.6 to .15.3 Genes in new chromosomal cassettes 71,579 16,579 6 5,523 (20) 3.2–6.5 New gene fusions 433 65 6 31 (20) 4.5–12.7 GEBA genomes were compared to equivalently sized random sets of reference genomes to quantify the effect of phylogenetic selection. Novel proteins also can serve to link distantly related homologues whose relatedness would otherwise go undetected. Forty-six such links were identified in the 56 GEBA genomes compared to an average of only three new links in equivalent sets of randomly sampled nonGEBA genomes (Table 1). A useful complement to homology-based predictions of gene function are ‘non-homology methods’ (ref. 19) such as gene context-based inference that relies on the conserved clustering of functionally related genes across multiple genomes, often in operons or as gene fusions20. We identified over 70,000 genes in new chromosomal cassettes of two or more genes in the GEBA genomes. This represents a three- to sixfold increase over equivalent sets of nonGEBA genomes (Table 1). Similarly, the number of new gene fusions identified in the GEBA genomes is 4 to ,13 times greater than in randomly selected genome sets (Table 1). Because the GEBA data set produced a several-fold improvement over random sets for all metrics examined (Table 1), we predict that other aspects of sequence-based biological discovery will similarly benefit from treebased genome sequencing. The GEBA genomes also show significant phylogenetic expansions within known protein families. For example, although only two of the 56 GEBA organisms are known cellulose degraders, we identified in the set of genomes a variety of glycoside hydrolase (GH) genes that may participate in the breakdown of cellulose and hemicelluloses. Among these are 28 and 7 phylogenetically divergent members of the endoglucanase- and processive exoglucanase-containing GH6 and GH48 Protein family number (including families with single members) b Peptidase M4 thermolysin Hypothetical protein 1.6 kb Hypothetical protein 1.0 kb Hypothetical protein 0.5 kb BARP 1 kb Phosphatase c Oxidoreductase Conserved hypothetical protein 70,000 Hypothetical protein Methyltransferase Bacteria from GEBA project (1,060.6 new families/genome) 60,000 d 50,000 40,000 Actinobacteria (649.6 new families/genome) 30,000 Enterobacteriaceae (307.7 new families/genome) 20,000 10,000 Streptococcus agalactiae (121.3 new families/genome) 0 a RNase treated Genome tree phylogenetic diversity17 Bacteria (domain) Actinobacteria (phylum) Random sets Fold (number of resamplings) improvement RNA GEBA set DNA Comparative genomic metric families, respectively. Halorhabdus utahensis, a halophilic archaeon known to have b-xylanase and b-xylosidase activities21, has a chromosomal cluster including two GH10 family b-xylanases and six novel GH5 family proteins of unknown specificity. The enrichment of genetic diversity is also seen within families of non-coding RNAs, transposable elements, and other cellular components. For example, the genome of the marine myxobacterium Haliangium ochraceum contains 807 CRISPR (clustered regularly interspaced short palindromic repeats) units including the largest single CRISPR array known, comprising 382 spacer/repeat units. CRISPR is a newly recognized, but ancient and widespread, system in bacteria and archaea that confers resistance to viruses and other invading foreign DNAs22. Results from the GEBA pilot project challenge our current understanding for the taxonomic distribution of known gene families. The most striking example of which is the discovery of an actin homologue in H. ochraceum. Actin and its close relatives are structural components of the eukaryotic cytoskeleton that are found in every eukaryote and only in eukaryotes. Bacteria and archaea encode instead the shape-determining protein MreB. Although MreBs have some functional and structural similarities to eukaryotic actins, they are regarded, at best, distantly related homologues23 and possibly not Markers Table 1 | Effect of SSU rRNA tree-based selection of organisms on comparative genomic metrics 0 10 20 30 40 50 Genome number 60 70 80 Figure 2 | Rate of discovery of protein families as a function of phylogenetic breadth of genomes. For each of four groupings (species, different strains of Streptococcus agalactiae; family, Enterobacteriaceae; phylum, Actinobacteria; domain, GEBA bacteria), all proteins from that group were compared to each other to identify protein families. Then the total number of protein families was calculated as genomes were progressively sampled from the group (starting with one genome until all were sampled). This was done multiple times for each of the four groups using random starting seeds; the average and standard deviation were then plotted. 1C0F_A BARP D G E D V Q A L V I D N G S G M C K A G F A GD D A P R A V F P S I VG R P R H T G K DS Y V . . . S S . . . . P I I I H P G S D T L Q A G L A D E E H P G S I F P N I VG R H K L A G L M E W V D Q R 1C0F_A BARP . . . . G D E A Q S K R G I L T L K Y P I E HG I V T N W D D M E K IW H H T F Y N E LR V A P E E V L C V G Q E A I D Q S A T V L L R H P V W SG I V G D W E A F A A VL R H T F Y R A LW V A P E E 1C0F_A BARP H P V L L T E A P L . . . N P K A N R E K M TQ I M F E T F N T P A MY V A I Q A V L SL Y A S G R H P I V V T E S P H V Y R S F Q L R R E Q L TR L L F E T F H A P Q VA V C S E A A M SL Y A C G L 1C0F_A BARP T T G I V M D S G D G V S H T V P I Y E G Y AL P H A I L R L D L A GR D L T D Y M M KI L T E R G D T G L V V S L G D F V S Y V A P V H R G A IV D A G L T F L E P D GR S I T E Y L S RL L L E R G 1C0F_A BARP Y S F T T T A E R E I V R D I K E K L A Y V AL D F E A E M Q T A A SS S A L E K S Y EL P D G Q V H V F T S P E A L R L V R D I K E T L C Y V AD D V A K E . . A A R NA D S V E A T Y LL P N G E T 1C0F_A BARP I T I G N E R F R C P E A L F Q P S F L G M ES A G I H E T T Y N S IM K C D V D I R KD L Y G N V L V L G N E R F R C P E V L F H P D L L G W ES P G L T D A V C N A IM K C D P S L Q AE L F G N I 1C0F_A BARP V L S G G T T M F P G I A D R M N K E L T A LA P S T M K I K I I A PP E R K Y S V W IG G S I L A V V T G G G S L F P G L S E R L Q R E L E Q RA P A E A P V H L L T RD D R R H L P W KG A A R F A 1C0F_A BARP S L S T F Q Q M W I S K E E Y D E S G P S I VH R K C F R D A Q F A G F A L T R Q A Y E R H G A E L IY Q M . . Figure 3 | A bacterial homologue of actin. a, Genomic context of the bacterial actin-related protein (BARP) gene within the genome of the marine Deltaproteobacterium H. ochraceum. Red, gene encoding BARP; white, genes encoding hypothetical proteins; black, genes with functional annotations. b, RT–PCR demonstration of expression of the gene encoding BARP in H. ochraceum. c, Ribbon plot of the putative structure of BARP. d, Alignment of BARP with actin from Dictyostelium discoideum29 with similarities in black shaded text. Secondary structure elements (arrows, beta-strands; bars, alphahelices) are colour-coded as in c. A phylogenetic tree including this protein is in Supplementary Figure 1. 1058 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 462 | 24/31 December 2009 even homologous. Like other bacteria, H. ochraceum encodes a bona fide MreB protein, but in addition, it encodes a protein that is clearly a member of the actin family, which we have named BARP (bacterial actin-related protein; Fig. 3). Although we do not yet have evidence for its precise function, BARP is expressed in H. ochraceum (Fig. 3b). Assuming that the H. ochraceum mreB orthologue performs the same function as in other bacteria, and given that the myxobacteria, to which this species belongs, are known to synthesize actin-targeting toxins24, we propose that this BARP may be a dominant-negative inhibitor of eukaryotic actin polymerization. Regardless of its precise function, this first—and so far only—discovery of an expressed homologue of eukaryotic actin in a member of the Bacteria highlights the potential for novel and surprising biological discoveries given a wider genomic sampling of the tree of life. We conclude that targeting microorganisms for genome sequencing solely on the basis of phylogenetic considerations offers significant farreaching benefits in diverse areas. Furthermore, the benefits of phylogenetically driven genome sequencing show no sign of saturating with these first 56 genomes. A key question then lies in determining how much bacterial and archaeal diversity remains to be sampled. Using SSU rRNA gene sequences as a proxy for organismal diversity (Fig. 4), we estimate that sequencing the genomes of only 1,520 phylogenetically selected isolates could encompass half of the phylogenetic diversity represented by known cultured bacteria and archaea. Given the continuing reductions in both the cost and difficulty in sequencing genomes25, this is certainly a tractable target in the next few years. However, the great majority of recognized bacterial and archaeal diversity is not represented by pure cultures and an additional 9,218 genome sequences from currently uncultured species would be required to capture 50% of this recognized diversity (Fig. 4). Such an undertaking will require new approaches to culturing or processing of multi-species samples using methods such as metagenomics26 or physical isolation of cells from mixed populations followed by whole genome amplification methods27. Obtaining reference genomes for the uncultured microbial majority will be a natural extension of the GEBA project, the ultimate goal of which is to provide a phylogenetically balanced genomic representation of the microbial tree of life. The pilot study presented here is a dedicated first step in this direction. METHODS SUMMARY Starting with a phylogenetic tree of SSU rRNA genes7, we identified major branches that had no available genome sequences but for which cultured isolates were available in the DSMZ or ATCC culture collections. Selected isolates (Supplementary Table 1a, b) from these branches were grown and DNA isolated (Supplementary Table 1c) and quality checked. DNA was then used for shotgun genome sequencing by Sanger/ABI, Roche/454 and/or Illumina/Solexa technologies (Supplementary Table 2). Sequence reads were assembled separately with different assembly methods and the best draft assembly was used for annotation and as a starting point for genome completion (current genome status is in Supplementary Table 2). Annotation (gene identification, functional prediction, etc.) was performed using the IMG system (http://img.jgi.doe.gov/geba); this was done both after shotgun sequencing and again after genome completion. For ‘whole genome tree’ analysis, a PHYML maximum likelihood phylogenetic tree of a concatenated alignment of 31 marker genes was built using AMPHORA16. Phylogenetic diversity was calculated as the sum of branch lengths in this and other trees. Protein families were built for various genome sets by using the Markov clustering algorithm (MCL)28 to group proteins on the basis of ‘all versus all’ blastp searches. For analysis of phylogenetic diversity of organisms, a phylogenetic tree was built for a combined alignment of SSU rRNA sequences from published genomes and a nonredundant subset of greengenes SSU rRNA7. Further analysis of the genomes was done using IMG database queries and new computational analyses as described in the main text, legends and Supplementary Methods. Received 3 June; accepted 30 October 2009. 1. 2. 3. 4. 5. 6. 7. 1,200 1,000 Phylogenetic diversity 8. GEBA genomes Pre-GEBA genomes Organisms from the greengenes database Organisms from the greengenes database (excluding environmental samples) 9. 10. 11. 800 120 12. 100 600 13. 80 60 400 14. 40 15. 20 200 0 0 0 400 800 1,200 16. 17. 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 Number of organisms Figure 4 | Phylogenetic diversity of bacteria and archaea on the basis of SSU rRNA genes. Using a phylogenetic tree of unique SSU rRNA gene sequences7, phylogenetic diversity was measured for four subsets of this tree: organisms with sequenced genomes pre-GEBA (blue), the GEBA organisms (red), all cultured organisms (dark grey), and all available SSU rRNA genes (light grey). For each subtree, taxa were sorted by their contribution to the subtree phylogenetic diversity30 and the cumulative phylogenetic diversity was plotted from maximal (left) to the least (right). The inset magnifies the first 1,500 organisms. Comparison of the plots shows the phylogenetic ‘dark matter’ left to be sampled. 18. 19. 20. 21. 22. 23. Fraser, C. M., Eisen, J. A. & Salzberg, S. L. Microbial genome sequencing. Nature 406, 799–803 (2000). Liolios, K., Mavromatis, K., Tavernarakis, N. & Kyrpides, N. C. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 36 (database issue), D475–D479 (2008). Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, REVIEWS0003.1–REVIEWS0003.8 (2002). Eisen, J. A. Assessing evolutionary relationships among microbes from wholegenome analysis. Curr. Opin. Microbiol. 3, 475–480 (2000). Wu, D. et al. Complete genome sequence of the aerobic CO-oxidizing thermophile Thermomicrobium roseum. PLoS One 4, e4207 (2009). Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276, 734–740 (1997). DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006). Bernal, A., Ear, U. & Kyrpides, N. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 29, 126–127 (2001). Lapage, S. P. et al., International Code of Nomenclature of Bacteria, 1990 Revision. (American Society for Microbiology, 1992). Ward, N., Eisen, J., Fraser, C. & Stackebrandt, E. Sequenced strains must be saved from extinction. Nature 414, 148 (2001). Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 551–553 (2009). Field, D. et al. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnol. 26, 541–547 (2008). Markowitz, V. M. et al. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. 36 (database issue), D528–D533 (2008). Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of microbial species. Nature Rev. Microbiol. 6, 431–440 (2008). Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution on genome phylogeny. Syst. Biol. 57, 844–856 (2008). Wu, M. & Eisen, J. A. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 9, R151 (2008). Pardi, F. & Goldman, N. Resource-aware taxon selection for maximizing phylogenetic diversity. Syst. Biol. 56, 431–444 (2007). Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. & Ouzounis, C. A. Myriads of protein families, and still counting. Genome Biol. 4, 401 (2003). Marcotte, E. M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–753 (1999). Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999). Wainø, M. & Ingvorsen, K. Production of b-xylanase and b-xylosidase by the extremely halophilic archaeon Halorhabdus utahensis. Extremophiles 7, 87–93 (2003). Barrangou, R. et al. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712 (2007). Doolittle, R. F. & York, A. L. Bacterial actins? An evolutionary perspective. Bioessays 24, 293–296 (2002). 1059 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 462 | 24/31 December 2009 24. Sasse, F., Kunze, B., Gronewold, T. M. & Reichenbach, H. The chondramides: cytostatic agents from myxobacteria acting on the actin cytoskeleton. J. Natl. Cancer Inst. 90, 1559–1563 (1998). 25. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26, 1135–1145 (2008). 26. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 (2008). 27. Ishoey, T., Woyke, T., Stepanauskas, R., Novotny, M. & Lasken, R. S. Genomic sequencing of single microbial cells from environmental samples. Curr. Opin. Microbiol. 11, 198–204 (2008). 28. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). 29. Matsuura, Y. et al. Structural basis for the higher Ca21-activation of the regulated actin-activated myosin ATPase observed with Dictyostelium/Tetrahymena actin chimeras. J. Mol. Biol. 296, 579–595 (2000). 30. Moulton, V., Semple, C. & Steel, M. Optimizing phylogenetic diversity under constraints. J. Theor. Biol. 246, 186–194 (2007). Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Acknowledgements . We thank the following people for assistance in aspects of the project including planning and discussions (R. Stevens, G. Olsen, R. Edwards, J. Bristow, N. Ward, S. Baker, T. Lowe, J. Tiedje, G. Garrity, A. Darling, S. Giovannoni), analysis of genomes whose work could not be included in this report (B. Henrissat, G. Xie, J. Kinney, I. Paulsen, N. Rawlings, M. Huntemann), project management (M. Miller, M. Fenner, M. McGowen, A. Greiner), sequencing and finishing (K. Ikeda, M. Chovatia, P. Richardson, T. Glavinadelrio, C. Detter), culture growth, DNA extraction, and metadata (D. Gleim, E. Brambilla, S. Schneider, M. Schröder, M. Jando, G. Gehrich-Schröter, C. Wahrenburg, K. Steenblock, S. Welnitz, M. Kopitz, R. Fähnrich, H. Pomrenke, A. Schütze, M. Rohde, M. Göker), and manuscript editing (M. Youle). This work was performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no. DE-AC02-06NA25396. Support for J.A.E., D.W. and M.W. was provided by the Gordon and Betty Moore Foundation Grant no. 1660 to J.A.E. Support for work at DSMZ was provided under DFG INST 599/1-1. Author Contributions D.W. (rRNA analysis, gene families, actin tree, manuscript preparation), P.H. (selection of strains, analysis, manuscript preparation, project coordination), L.G. and D.B. (project management), R.P., B.J.T., E.L., S.G., S.S. (strain curation and growth), K.M., N.N.I., I.J.A., S.D.H., A.P., A.Ly. (annotation, genome analysis), V.K. (CRISPRs, actin), M.W. (whole genome tree), P.D., C.K., A.Z. and M.S. (actin studies), M.N., S.L., J.-F.C., F.C. and E.D. (sequencing), C.H., A.La., M.N. and A.C. (finishing), P.C. (analysis), E.M.R. (manuscript preparation), N.C.K. (selection of strains, annotation, analysis), H.-P.K. (strain selection and growth, DNA preparation, manuscript preparation), J.A.E. (project lead and coordination, analysis, manuscript preparation). Author Information Genome sequence and annotation data is available at the JGI IMG-GEBA page http://img.jgi.doe.gov/geba and has been submitted to GenBank with accessions ABSZ00000000, ABTA00000000, ABTB00000000, ABTC00000000, ABTD00000000, CP001618, ABTF00000000, ABTG00000000, ABTH00000000, ABTI00000000, ABTJ00000000, ABTK00000000, ABTM00000000, NZ_ABTN00000000, NZ_ABTO00000000, ABTP00000000, ABTQ00000000, NZ_ABTR00000000, NZ_ABTS00000000, NZ_ABTT00000000, NZ_ABTU00000000, NZ_ABTV00000000, NZ_ABTW00000000, NZ_ABTX00000000, NZ_ABTY00000000, NZ_ABTZ00000000, NZ_ABUA00000000, NZ_ABUB00000000, NZ_ABUC00000000, NZ_ABUD00000000., NZ_ABUE00000000, NZ_ABUF00000000, NZ_ABUG00000000, NZ_ABUH00000000, NZ_ABUI00000000, NZ_ABUJ00000000, NZ_ABUK00000000, ABUL00000000, NZ_ABUM00000000, NZ_ABUO00000000, NZ_ABUP00000000, ABUQ00000000, NZ_ABUR00000000, ABUS00000000, NZ_ABUT00000000, NZ_ABUU00000000, NZ_ABUV00000000, NZ_ABUW00000000, NZ_ABUX00000000, NZ_ABUZ00000000, NZ_ABVA00000000, NZ_ABVB00000000 and NZ_ABVC00000000. All strains that have been sequenced are available from the DSMZ culture collection and culture accessions are available in Supplementary Information. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Reprints and permissions information is available at www.nature.com/reprints. Further details on sequencing and genome properties of each organism are being published in the journal Standards in Genomic Sciences (SIGS) (http://standardsingenomics.org/). Correspondence and requests for materials should be addressed to J.A.E. ([email protected]). 1060 ©2009 Macmillan Publishers Limited. All rights reserved
© Copyright 2026 Paperzz