Annotation and integrated retrieval of microbial genome sequences
Hideaki Sugawara
National Institute of Genetics, Japan
[email protected]
September 5th, 2014, WDCM training course, Beijing

Part I INTRODUCTION

Cost per Human-sized Genome
Data from the NHGRI Genome Sequencing Program (GSP)
http://www.genome.gov/sequencingcosts/

Cost per Raw Megabase of DNA Sequence
Data from the NHGRI Genome Sequencing Program (GSP)
http://www.genome.gov/sequencingcosts/

• The following "sequence coverage" values were used in calculating the cost per genome:
  – Sanger-based sequencing (average read length = 500-600 bases): 6-fold coverage
  – 454 sequencing (average read length = 300-400 bases): 10-fold coverage
  – Illumina and SOLiD sequencing (average read length = 50-100 bases): 30-fold coverage
• Production costs
  – Labor, administration, management, utilities, reagents, and consumables
  – Sequencing instruments and other large equipment (amortized over three years)
  – Informatics activities directly related to sequence production (e.g., laboratory information management systems and initial data processing)
  – Shotgun library construction (required for preparing DNA to be sequenced)
  – Submission of data to a public database
  – Indirect costs
• Non-production costs
  – Quality assessment/control for sequencing projects
  – Technology development to improve sequencing pipelines
  – Development of bioinformatics/computational tools to improve sequencing pipelines or to improve downstream sequence analysis
  – Management of individual sequencing projects
  – Informatics equipment
  – Data analysis downstream of initial data processing (e.g., sequence assembly, sequence alignments, identifying variants, and interpretation of results)

Statistics of Microbial Genome Sequencing
• Studies
  – Metagenomic: 472
  – Non-Metagenomic: 18,766
• Biosamples
  – Host-associated: 1,704
  – Engineered: 293
  – Environmental: 3,140
• Organisms
  – Archaea: 926
  – Bacteria: 36,962
  – Eukarya: 8,624
• Projects
  – Complete Projects: 6,555
  – Permanent Drafts: 22,551
  – Incomplete Projects: 20,794
  – Targeted (not yet started) Projects: 925
http://www.genomesonline.org/

The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce these documents, or allow others to do so, for U.S. Government purposes. All documents available from this server may be protected under the U.S. and Foreign Copyright Laws and permission to reproduce them may be required. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies. JGI is not responsible for the contents of any off-site pages referenced.

Biosample Distribution Map
http://www.genomesonline.org/genomemap

Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains
Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Göker M, et al. (2014). PLoS Biol 12(8): e1001920. doi:10.1371/journal.pbio.1001920
64 authors/57 institutions

Box 1. The Value of Type and Reference Strains
Figure 1. From a total of approximately 11,000 bacterial and archaeal type strains, 3,285 (30%) have a publicly known genome project.
Box 2. Global Data Standards
"... Accurate estimates of diversity will require not only standards for data but also standard operating procedures for all phases of data generation and collection ..."

Box 3. Creating a Comprehensive Microbial Genomic Framework
"... Currently recognized genome projects have mapped ~2.8% of that known microbial diversity [13]. Sequencing all of the remaining type strains will increase the phylogenetic coverage encompassed and will then approach 15% of the known bacterial and archaeal diversity, ..."

An integrated catalog of reference genes in the human gut microbiome
Nat Biotechnol. 2014 Aug;32(8):834-41. doi: 10.1038/nbt.2942. Epub 2014 Jul 6.
Li J. et al. / MetaHIT Consortium (http://www.metahit.eu/)
Sources:
249 newly sequenced samples + 1,018 previously sequenced samples (Human Intestinal Tract: MetaHIT); a cohort from three continents (Europe, China and USA)
Results:
• The integrated gene catalog (IGC) comprises 9,879,896 genes.
• The catalog includes close-to-complete sets of higher quality genes for most gut microbes.
• Analyses of a group of samples from Chinese and Danish individuals using the catalog revealed country-specific gut microbial signatures.
• This expanded catalog should facilitate quantitative characterization of metagenomic, metatranscriptomic and metaproteomic data from the gut microbiome to understand its variation across populations in human health and disease.
• Web site: http://meta.genomics.cn

Marine metagenomics projects
435 projects
Ref: NCBI BioProject DB (http://www.ncbi.nlm.nih.gov/bioproject/)
Query: ((metagenom*) AND marine) NOT marine[Submitter Organization]

Marine Metagenomics: New Tools for the Study and Exploitation of Marine Microbial Metabolism
Kennedy, J. et al. Marine Drugs. 2010; 8(3): 608-628. Published online Mar 15, 2010. doi: 10.3390/md8030608
Figure 1. Enzyme discovery from metagenomes: functional and sequence-based approaches.
Table 1. Marine enzymes discovered from microbial and metagenomic sources.
Activity: Esterase; Lipase; Cellulase; Chitinase; Amidase; Amylase; Phytase; Protease; Alkane hydroxylase; Xylanase
Habitat: Antarctic ice; Antarctic seawater; Arctic sediment; Baltic Sea sediment; Coastal solfataric vent; Deep-sea basin; Deep-sea hydrothermal vent; Deep-sea sediment; Estuary; Fish gut; Hydrocarbon seep; Marine hot spring; Marine sediment/sludges; Marine sponge; Sea hare eggs; Sea saltern; Shipworm; Surface seawater; Tidal flat

Scientific Data
Data matters!!
Credit, Reuse, Quality, Discovery, Open, Service
Scientific Data is an open-access, peer-reviewed publication for descriptions of scientifically valuable datasets. Our primary article-type, the Data Descriptor, is designed to make your data more discoverable, interpretable and reusable.
http://www.nature.com/sdata/
http://www.nature.com/sdata/about/principles

The rise of the data-centric research and publication enterprises
Susanna-Assunta Sansone, PhD (Data Consultant, Honorary Academic Editor)
http://www.slideshare.net/SusannaSansone

Part 2 Annotation

MiGAP: MICROBIAL GENOME ANNOTATION PIPELINE

Contents
• Concept and usage of MiGAP
• Start MiGAP
• Basic operations of MiGAP
• User levels (b-, s- and g-)
• Pipeline (workflow) editor of g-MiGAP

Concept and usage of MiGAP
MiGAP (Microbial Genome Annotation Pipeline)
De novo annotation of nucleotide sequences of prokaryotic and eukaryotic microbes
Sugawara H, Ohyama A, Mori H and Kurokawa K. Microbial Genome Annotation Pipeline (MiGAP) for diverse users. 20th Int. Conf. Genome Informatics (Kanagawa, Japan) 2009: S-001, p 1-2.

Statistics of MiGAP usage
44 publications had used MiGAP by 2013.
Sequences are annotated by prediction programs and by BLASTing ORFs against reference databases
Results are stored as "Features and qualifiers"
[Figure: prediction programs produce ORFs, which are searched against several reference databases]
• CDS
  – /EC_number
  – /function
  – /gene
  – /product
• RBS
  – /note
• rRNA
  – /note
• tRNA
  – /note

MiGAP: Microbial Genome Annotation Pipeline
• De novo annotation of nucleotide sequences of prokaryotic and eukaryotic microbes
• Data items in the annotation
  – ORF (CDS) and RBS
    · de novo prediction of ORFs (CDS) by MetaGeneAnnotator (MGA), Glimmer or Augustus
    · de novo prediction of the Ribosome Binding Site in the case of MGA
    · start codon
  – rRNA
    · de novo prediction by RNAmmer
    · homology search for 16S rRNA
  – tRNA
    · de novo prediction by tRNAscan-SE
  – Translation of nucleotide sequences to amino acid sequences
  – Inherit the annotation of known amino acid sequences by NCBI BLAST
    · refer to the top hit
• Pipeline
  – a chain of data-processing processes or other software entities (ref http://en.wikipedia.org/wiki/Pipeline)
  – MiGAP is a branched parallel pipeline.
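The "features and qualifiers" storage and the "refer to the top hit" rule described above can be sketched in a few lines of Python. This is a minimal illustration, not MiGAP's code: the class names, the choice of bit score as the ranking key, the "hypothetical protein" fallback and the two example hits are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One BLAST hit against a reference database (illustrative fields only)."""
    subject_id: str
    bit_score: float
    qualifiers: dict


@dataclass
class Feature:
    """An annotated region stored as a feature key plus qualifiers."""
    key: str      # e.g. "CDS", "RBS", "rRNA", "tRNA"
    start: int
    end: int
    strand: str
    qualifiers: dict = field(default_factory=dict)


def inherit_from_top_hit(feature, hits):
    """Copy the qualifiers of the best-scoring hit onto the feature.

    Ranking by bit score is an assumption of this sketch. A feature
    without hits receives a placeholder product name instead.
    """
    if not hits:
        feature.qualifiers["product"] = "hypothetical protein"
        return feature
    top = max(hits, key=lambda h: h.bit_score)
    feature.qualifiers.update(top.qualifiers)
    feature.qualifiers["note"] = f"similar to {top.subject_id}"
    return feature


# Made-up example: one predicted CDS and two database hits.
cds = Feature("CDS", 101, 1000, "+")
hits = [
    Hit("ref_A", 250.0, {"product": "DNA gyrase subunit A", "gene": "gyrA"}),
    Hit("ref_B", 410.5, {"product": "DNA gyrase subunit B", "gene": "gyrB"}),
]
inherit_from_top_hit(cds, hits)
print(cds.key, cds.qualifiers)
```

Keeping qualifiers in a plain dictionary mirrors the /gene, /product and /note layout of the feature list, so writing the result out in a flat-file format becomes a simple loop over key-value pairs.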
[Figure: branched pipeline from input to output]

Reference databases in MiGAP
• RNA
  – prokaryote: 5S, 16S, 23S
  – eukaryote: 5.8S, 18S, 28S
• CDS (amino acid sequence DB)
  – Ortholog DB: COG, KOG, eggNOG
  – Non-redundant DB: NRAA (daily update), TrEMBL (monthly update), RefSeq (bimonthly update)

Input → MiGAP → Output
• Input
  – Genomic nucleotide sequences
  – Single or multiple FASTA, or simple text
  – Multiple contigs: 10,000 or fewer
  – Please refrain from submitting singletons from short reads of NGS (Next Generation Sequencers)
• Output
  – Links to the result files
  – Log file: pipeline.log
  – Nucleotide and amino acid sequences: -na, -aa
  – Sequence input: *.fasta
  – Feature definition: *.csv, *.annt
  – Annotation: *.ddbj, *.embl, *.gbk
  – Compressed file of multiple contigs: .tar.gz
  – Result after ORF prediction: result.*
  – Result after annotation: result-a.*

Start MiGAP
Login: user registration at the DDBJ site (http://www.ddbj.nig.ac.jp/)

Startup screen of MiGAP
• Horizontal menu
  – Logout, Help, Contact Us (the MiGAP admin)
  – On/off switch for the vertical menu
• Vertical menu
  – Pipeline: input
  – Pipeline history: retrieve results
  – Change User Level: upgrade from b-MiGAP to s-MiGAP and g-MiGAP
  – Current Process: check running processes

Basic operation of MiGAP
• Input
• Current process
• Cancel jobs in "Pipeline history"
• Pipeline history: list of results of your jobs
• Click one of your jobs
• Click a contig
• Click a feature and browse the details
• Display alignment

Downloadable output files for post-processing
• Log File
  – pipeline.log
• N.A.
  – result-na.fasta: nucleotide sequences of ORFs
• A.A.
  – result-aa.fasta: amino acid sequences of ORFs
• CSV
  – result.csv: features in a CSV file by contigs (before the annotation)
  – result-a.csv: features in a CSV file by contigs (after the annotation)
• GenBank
  – result.gbk: nucleotide sequence for the submission to GenBank
  – result-a.gbk: annotation data for the submission to GenBank
• EMBL
  – result.embl: nucleotide sequence for the submission to EMBL
  – result-a.embl: annotation data for the submission to EMBL
• DDBJ
  – result.fasta: nucleotide sequences by contigs (ORF)
  – result.annt: features table for the submission to DDBJ (ORF)
  – result.ddbj: DDBJ format (ORF)
  – result.ddbj: multiple DDBJ files (ORF)
  – result-a.fasta: nucleotide sequences by contigs (Annotation)
  – result-a.annt: features table for the submission to DDBJ (Annotation)
  – result-a.ddbj: nucleotide sequences by contigs (Annotation)
  – result-a.ddbj: features table for the submission to DDBJ (Annotation)

pipeline.log
Parameters of tools and versions of databases are recorded.
  Read Parameter File=Done[2012/05/14 09:51:13]
  Pipe Line Name=Sample data
  Sequence Filename=direct
  Read Sequence File=Done[2012/05/14 09:51:14]
  Number of Contig=1
  Total Length of Sequence=10530
  Write Genbank File=Done[2012/05/14 09:58:12]
  Write EMBL File=Done[2012/05/14 09:58:12]
  Write DDBJ File=Done[2012/05/14 09:58:12]
  Write Information File=Done[2012/05/14 09:58:12]
  Write Feature File=Done[2012/05/14 09:58:12]
  Create Genome Map=Done[2012/05/14 09:51:25]
  Create Feature Map=Done[2012/05/14 09:51:25]
  Start Time=1336956674148
  End Time=1336956685833
  Process ID=53067
  Waiting List=0
  Unexpected Error=
  Memory Status=37MB / 15271MB
  Detail=End Annotation
  A Start Time=1336956685834
  A End Time=1336957092535
  Metagene=Done[2012/05/14 09:51:20]
  Metagene Version=MetaGeneAnnotator 1.0
  Metagene Parameter=-m
  tRNAscan=Done[2012/05/14 09:51:20]
  tRNAscan Version=tRNAscan-SE 1.23
  tRNAscan Parameter=
  RNAmmer=Done[2012/05/14 09:51:20]
  RNAmmer Version=RNAmmer 1.2
  RNAmmer Parameter=-S bac -m tsu,lsu
  Blast Version=NCBI BLAST 2.2.18
  1st DB Name=RefSeq
  1st DB Version=20120308
  1st DB Count=3
  1st DB Revision=release52
  Phase-1=Done[3/3]
  2nd DB Name=TrEMBL
  2nd DB Version=20120222
  2nd DB Count=3
  2nd DB Revision=release2012_02
  Phase-2=Done[3/3]
  3rd DB Name=COG
  3rd DB Version=20030417
  3rd DB Count=3
  3rd DB Revision=
  Phase-3=Done[3/3]
  16S rRNA=Done[2012/05/14 09:51:22]
  16S rRNA Parameter=-F F -a 4
  16S rRNA Name=16S rRNA
  16S rRNA Version=20090220
  16S rRNA Count=1
  16S rRNA Revision=
  A.A. Mapping Name=bbgbk
  Annotation=Phase-3 Done[3/3][2012/05/14 09:58:12]
  Annotation Parameter=-F F

User levels (b-, s-, g-)
b-MiGAP, s-MiGAP, g-MiGAP
• b-MiGAP: bronze level for novices (up to 10 jobs)
  – default
• s-MiGAP: silver level for experienced users (HTML)
  – parameter setting
  – database selection
  – workflow branch
• g-MiGAP: gold level for advanced users (Applet)
  – parameter setting
  – database selection
  – workflow branch
[Screenshots: b-MiGAP, s-MiGAP and g-MiGAP]

Pipeline (workflow) editor of g-MiGAP
• Start
• Drag and drop AUGUSTUS
• Set 5.8S rRNA
• Set 18S rRNA
• Set 28S rRNA
• Include RNAmmer into the pipeline
• Set the 1st BLAST
• Set the databases that the 3rd BLAST will use
• Insert a branch after the 1st BLAST
• Insert a branch after the 2nd BLAST

[Figure: b-MiGAP workflow (2012/06/05). A multi-FASTA input (>Seq1, >seq2) goes to MGA (Archaea/Bacteria) for ORFs, tRNAscan-SE for tRNA, RNAmmer for rRNA, and a BLAST search against a 16S rRNA DB. ORFs then pass through three successive BLAST searches, each followed by a "Hit? No" branch to the next search; the databases shown are RefSeq microbial DB, TrEMBL DB and COG. Two searches are labeled HM: 60%, OV: 60% and one HM: 30%, OV: 30%. Outputs: maps, HTML, alignments, annotation, and DDBJ files for download.]

[Figure: s-MiGAP workflow. Same layout with a single-sequence input, MGA or Glimmer (Archaea/Bacteria/Eukaryote; start codon, stop codon), user-set HM and OV values, and a choice of RefSeq microbial DB, TrEMBL DB, NR DB, COG DB, KOG DB and eggNOG DB for each BLAST step. The g-MiGAP editor (Applet) offers the same database choices.]

Part 3 Integrated Retrieval of Microbial Genome Sequences

MICROBEDB.JP: AGGREGATION OF ENVIRONMENTAL, PHENOTYPIC AND GENOMIC DATA FOR THE STUDY AND UTILIZATION OF MICROBES

Contents
• Background and concept
• Information technologies behind the scenes
• Please try it!

Background and concept
Many microbial databases (DBs) exist: Ortholog, Taxonomy, Culture Collection, Genome, Pathogen, Gene Function, Metagenome, ...
Which DBs should we use?

From the National Research Council (USA):
Microbes inhabit almost everywhere on Earth and interact with their environments.
Knowledge of microbes has high potential for scientific and commercial applications.
Promoting the Integrated Use of Life Science Databases in Japan
• FY 2007-2010: "Integrated Database Project" → Database Center for Life Science (DBCLS)
• FY 2011- : → National Bioscience Database Center (NBDC)

About NBDC
• Established in April 2011
• Part of the Japan Science and Technology Agency (JST), a funding agency supported by MEXT
URL: http://biosciencedbc.jp/?lng=en

Activities of NBDC
1. Formulation of strategies related to coordination and integration of DBs, and international cooperation
2. Creation and management of a portal website from existing life science DBs: http://biosciencedbc.jp/?lng=en
3. Funding of R&D of new technology necessary for organizing and linking life science DBs
4. Funding of R&D that coordinates existing and emerging DBs in specific research fields
   Includes microbes (PI: Ken KUROKAWA)

Aim of MicrobeDB.jp: to integrate several kinds of microbial data (including omics, taxonomy/cultures, habitats) using Semantic Web technology

We integrate the microbial data that can be linked to genomes.
http://microbedb.jp/
How to aggregate diverse data sources to find hidden relationships among them?
[Figure: data sources arranged along three axes (Gene, Taxon, Environment)]
• Ortholog: MBGD
• Taxonomy: NCBI Taxonomy
• Genome: GTPS/RefSeq
• Annotation: TogoAnnotation
• Culture Collection: NBRC/JCM
• Metadata: INSDC SRA
• Metagenome: INSDC SRA
Information technologies behind the scenes

RDF is a standard data model of Semantic Web technology
RDF (Resource Description Framework): a data model that uses triples (Subject – Predicate – Object)
  S: <URI>   P: <URI>   O: <URI> or Literal
  Example: gtps:Gene1 rdfs:label "16S rRNA gene"
A URI node can be linked to other nodes: the object (O) of one triple can be the subject (S) of another.
[Figure: example graph with triples such as]
  Gene1 – has Function – GO:0003700
  Genome1 – organism – Escherichia coli
  Organism1 – has Genome – Genome1
  Organism1 – inhabit – Lake
  (the graph also contains a KO:03043 node)
Ontology × Triple store → SPARQL search
Once data are prepared in RDF, the database management system automatically recognizes identical resources.

How to integrate the data from two different DBs?
[Figure: DB 1 holds the triples above (Gene1, Genome1, Organism1, GO:0003700, Escherichia coli, Lake). DB 2 describes Organism 1, Enzyme 1, Compound 1 and Medium 1 with predicates such as has Function, can Use, can Produce and can Grow. Matching resources in the two DBs are connected with owl:sameAs.]
1. When two DBs use the same URI, their data are already integrated.
2. If not, you can integrate the data of the two DBs by adding one triple (db1:A owl:sameAs db2:B).
You don't need to place all of these data in one DB management system.
How can we tell whether resources in two DBs are the same or not?
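The two integration rules above can be tried out without any RDF library by treating each database as a set of (subject, predicate, object) tuples. The sketch below is only an illustration under stated assumptions: the resource names, the predicates and the union-find merge are invented for the example, and a real triple store would normally handle owl:sameAs through its own reasoning or query rewriting rather than by renaming nodes.

```python
def merge_with_same_as(triples):
    """Rename every resource linked by owl:sameAs to one canonical name.

    Uses a small union-find structure; the lexicographically smallest
    name in each equivalence class becomes the canonical one.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                parent[max(root_s, root_o)] = min(root_s, root_o)

    # Drop the sameAs links themselves and rewrite everything else.
    return {(find(s), p, find(o)) for s, p, o in triples if p != "owl:sameAs"}


# Hypothetical content of two independent databases.
db1 = {
    ("db1:Organism1", "has_genome", "db1:Genome1"),
    ("db1:Gene1", "has_function", "GO:0003700"),
}
db2 = {
    ("db2:OrgA", "inhabits", "Lake"),
    ("db2:OrgA", "can_grow_on", "db2:Medium1"),
}
# Rule 2: one extra triple is enough to connect the two graphs.
link = {("db1:Organism1", "owl:sameAs", "db2:OrgA")}

merged = merge_with_same_as(db1 | db2 | link)
facts = sorted((p, o) for s, p, o in merged if s == "db1:Organism1")
print(facts)
# [('can_grow_on', 'db2:Medium1'), ('has_genome', 'db1:Genome1'), ('inhabits', 'Lake')]
```

After the merge, a single lookup on db1:Organism1 returns the genome recorded in the first database together with the habitat and medium recorded in the second, which is the effect the owl:sameAs triple is meant to achieve.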
You should describe your resources by using ontologies
An ontology is a structured controlled vocabulary for describing the properties and types of resources.
For example, it helps to answer: What is soil? What is the relationship between soil and sand?
• MEO (Microbes Environmental Ontology)
• PDO (Pathogenic Disease Ontology)
• MCCV (Microbial Culture Collection Vocabulary)
• MSV (Metagenome Sample Vocabulary)
• MPO (Microbial Phenotype Ontology)
• MBGD Ortholog Ontology
Most of them can be obtained from [source shown as an image on the slide].

[Figure: samples from sea water, soil and human gut give Metagenome (Environment) data; a sequence similarity search links them to Genome (Taxon) data; gene clustering using sequence similarity links genomes to Ortholog (Gene) data.]

We have converted most of our data to RDF, developed many ontologies, and built an RDFized microbial DB.
http://microbedb.jp/
More than 1 billion triples!
[Figure: data sources along the Gene, Taxon and Environment axes. Ortholog: MBGD; Taxonomy: NCBI Taxonomy; Genome: GTPS/RefSeq; Annotation: TogoAnnotation; Culture Collection: NBRC/JCM; Metadata: INSDC SRA; Metagenome: INSDC SRA. Red color indicates our collaborators.]

RDF conversion example
JCM/NBRC Culture Collection data
1. Strain_Number
2. Other_Collection_Numbers
3. Name
4. Organism_Type
5. History_of_Deposit
6. Date_of_Isolation
7. Isolated_from
8. Geographic_Origin
9. Status
10. Optimum_Temperature_for_Growth
11. Maximum_Temperature_for_Growth
12. Minimum_Temperature_for_Growth
13. Medium
14. Application
15. Literature

Example of NBRC Culture Collection RDF data
[Figure: RDF graph for the strain nbrc:NBRC_12841. Recoverable nodes and literals:]
• rdf:type; :MCCV_000001 (Culture)
• "Strain Number"; dc:identifier
• Other collection: <http://www.dsmz.de/catalogues/details/culture/DSM-40226.html>, "DSM 40226"
• Medium: nbrcmedium:NBRC_227
• "Application": "Thienamycins production; Vitamin B12 (Cyanocobalamine) production; Steroid conversion"
• "Optimal growth temperature": "28"^^<http://www.w3.org/2001/XMLSchema#integer>
• "Type Strain": "false"^^xsd:boolean
• "History of deposit": "IFO 12841 <-- SAJ <-- OWU (ISP 5226) <-- Squibb & Sons (F. Arnow, MD 2428, ETH 24234, NIHJ 501)"
• "Isolated from": meo:MEO_0000007, rdfs:label "Soil"
• Taxonomy: <http://identifiers.org/taxonomy/67274>, <http://www.ncbi.nlm.nih.gov/taxonomy/67274>, <http://purl.uniprot.org/taxonomy/67274>
• Taxonomy: <http://identifiers.org/taxonomy/67263>, <http://www.ncbi.nlm.nih.gov/taxonomy/67263>, <http://purl.uniprot.org/taxonomy/67263>
• Name: "Streptomyces griseus subsp. griseus (Krainsky 1914) Waksman and Henrici 1948"
• Property identifiers on the edges: :MCCV_000012, :MCCV_000014, :MCCV_000017, :MCCV_00018, :MCCV_00022, :MCCV_00023, :MCCV_000025, :MCCV_000026, :MCCV_000027, :MCCV_000028, :MCCV_000033

Overall data structure of MicrobeDB.jp
[Figure]

Stanza Development
To obtain biological knowledge from raw data (sequences and metadata), we developed a variety of "Stanzas". A Stanza is a compact, modular, and reusable application for data analysis.
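Describing the isolation source of a strain with an ontology term, as in the NBRC record above, pays off at query time: a search for a broad environment can also reach strains annotated with a narrower term. The following is a minimal sketch under loud assumptions: the is_a links and the strain records are toy data invented for illustration (they are not taken from MEO or MCCV), and each term is given at most one parent to keep the traversal short.

```python
# Toy is_a hierarchy: child term -> parent term (not real MEO content).
IS_A = {
    "pond": "lake",
    "lake": "freshwater",
    "river": "freshwater",
    "sand": "soil",
}

# Hypothetical strains and the environment term each was isolated from.
ISOLATION_SOURCE = {
    "Strain_A": "pond",
    "Strain_B": "river",
    "Strain_C": "sand",
}


def ancestors(term):
    """Return the term itself followed by every broader term above it."""
    chain = [term]
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def strains_from(environment):
    """List strains whose isolation source is the environment or a narrower term."""
    return sorted(
        strain
        for strain, source in ISOLATION_SOURCE.items()
        if environment in ancestors(source)
    )


print(strains_from("lake"))        # ['Strain_A']
print(strains_from("freshwater"))  # ['Strain_A', 'Strain_B']
print(strains_from("soil"))        # ['Strain_C']
```

In a triple store the same expansion would typically be written as a SPARQL property path over the subclass relation; the loop above only makes the idea visible.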
[Figure: 16S amplicon workflow. fastq → UCLUST (identity > 97%, cov > 90%) → OTUs → UCHIME (reference mode and de novo mode) to remove chimeras → clean OTUs → taxonomic assignment using RDP Classifier. The data are then analyzed with Stanzas, e.g. correlation analysis between gene abundance and metadata, and comparison of taxonomic composition.]

Stanza categories in MicrobeDB.jp
Categories: Genes, Taxon, Environment
Stanzas: Gene Definition; Gene Publication; Ortholog Definition; Gene Annotation; Ortholog Group Members; Ortholog Cluster; Genome Information; GTPS Gene/Genome Feature; RefSeq Gene/Genome Feature; GTPS Genome; GTPS Genome Definition; Other Collection Numbers; Pathogen Information; Phenotype Information; RefSeq Genome; RefSeq Genome Definition; Strain Definition; Strain Genome; Strain Reference; Taxon Definition; Taxon Hierarchy; Sample Function; Mapping to Environment (Chromosome); Mapping to Environment (Plasmid); Ortholog Abundance among Environments; Ortholog Abundance in Environment; Disease Definition; Environment Definition; MEO Hierarchy; MEO Ontology View; Meta16S Sample List; Metagenome Sample List; Numeric Metadata Histogram; Sample Definition; Sample Metadata; SRS Cross Reference; Genome-Sequenced Strains; Symptom List; Sequenced Genome List; Strain List; Taxonomic Composition of Genomes; Taxonomic Composition of Meta 16S; Human Meta Body Mapping; Strain Metadata

Stanza Example
• Gene Annotation
• Ortholog list
• Genome Information

Stanza Example
• Taxonomic composition from 16S rRNA gene amplicon sequencing analysis
• Functional and taxonomic composition of a metagenome sample

Stanza Example
You can see the distribution pattern of a taxon in the human body.
http://microbedb.jp/
Keyword example: lake
Because meo:pond is_a meo:lake, a strain recorded as "Strain_A mccv:isolation_source meo:pond" is also found (→ Strain_A) by searching for lake.
Results returned for the keyword "lake":
• Genome-sequenced strains isolated from lake
• JCM/NBRC strains isolated from lake
• Abundant orthologs in metagenome samples obtained from lake
• Metagenome samples obtained from lake
• Taxonomic composition of 16S amplicon sequencing of samples from lake
• MEO hierarchical structure

MicrobeDB.jp will facilitate the exploration of the existing scattered information on microbes.

Plan of integration between public genome data and user genome data in MicrobeDB.jp
[Figure: a microbial genome sequence data producer runs automatic microbial genome annotation, inputs metadata related to the genome, and converts the genome data to the RDF format; the user genome data are then integrated with public microbial genome sequence data (RefSeq, GTPS).]

Please try it!

Acknowledgements
MiGAP
Dr. Akira OHYAMA
In silico biology
http://www.insilicobiology.jp/

MicrobeDB.jp
Assistant Professor Hiroshi MORI and Professor Ken KUROKAWA
Tokyo Institute of Technology, Graduate School of Bioscience and Biotechnology, Department of Biological Information
http://microbedb.jp/

Supported by:
WFCC-MIRCEN World Data Center for Microorganisms
http://www.wdcm.org/
Bureau of International Cooperation, Chinese Academy of Sciences
China National Committee, the Committee on Data for Science and Technology (CODATA)