The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO

Nicky Mulder
European Bioinformatics Institute

Contents
• Introduction to GOA
• Manual GOA annotation
• Electronic annotation:
  – InterPro2GO
• GOA data flow
• Uses of GOA
• Future plans

What is GO annotation?
An annotation is a statement that a gene product
• has a particular molecular function
• is involved in a particular biological process
• is located within a certain cellular component
(GO Term ID)
…as determined by a particular method (Evidence Code)
…as described in a particular reference (Reference)

Gene Ontology Annotation (GOA) Database
• GOA’s priority is to annotate the human, mouse and rat proteomes
• Largest open-source contributor of annotations to GO
• Provides 10 million annotations for more than 111,000 species
• Share and integrate GO annotation

How do we annotate GO terms?
• Manual annotation
• Electronic annotation
All annotations must:
• be attributed to a source
• indicate what evidence was found to support the GO term-gene/protein association

Manual annotation
• High quality
• Specific gene or gene product associations made using:
  – Peer reviewed papers
  – Evidence codes
• BUT:
  – Time-consuming
  – Requires trained biologists

Manual GO annotation
[Workflow diagram. Steps: Read papers; Find GO term; Annotate to protein (PubMed ID, evidence code). Components: Protein2GO tool, Oracle RDBMS, GOA-association file, GO and EBI ftp sites, Online.]

Information captured by GOA
[Screenshot. Columns: Source, GOID, Term, Evid, RefDB, RefID, With DB, With ID, Qualifier]

How successful is manual-GOA?

Source              No. of annotations   No. of distinct proteins
Proteome Inc.       22054                6568
UniProt             67910                13697
IntAct              22002                11013
MGI                 124919               29837
SGD                 21761                5076
FlyBase             52386                8775
RGD                 8036                 3369
HGNC                3699                 798
GeneDB              5502                 1384
TAIR/TIGR           3367                 1895
ZFIN                1012                 334
Roslin Institute    14                   6
AgBase              889                  173
Reactome            15                   12
WormBase            893                  443
TIGR                139                  79
Gramene             139                  2812
GDB                 165                  103
TOTAL MANUAL        336237               70728

111740 taxa, July 2006

Electronic Annotation
• Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings
• Get IEA evidence code
[Diagram. Sources: UniProt, InterPro, Keyword, HAMAP, EC. Curated or electronic rule-based mappings to GO give high-quality electronic protein-to-GO associations.]
Curated mapping, e.g.
EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022
www.uniprot.org/

Mappings of external concepts to GO
http://www.geneontology.org/GO.indices.shtml

InterPro2GO mapping
• InterPro is a resource that integrates protein signature databases, e.g. Pfam, Prints, Prosite, ProDom, SMART, TIGRFAMs etc.
• It provides a means of classifying proteins into families and identifying domains.
• Each InterPro entry groups proteins belonging to the same family and potentially having the same function

InterPro2GO mapping
• Done manually, but using tools
• Look at InterPro and protein annotation
• For all Swiss-Prot proteins that are true matches to the entry:
  – Get stats on DE lines, keywords, comments
  – Check how conserved common annotation is
  – Find appropriate GO term at most specific level that applies to all proteins (not necessarily domains)

Tools used – “SQUID”
Statistics options: keyword, description, gene name, organism, comments, etc.
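The "check how conserved common annotation is" step can be illustrated with a small sketch. This is not SQUID itself, whose internals the slides do not describe; it only assumes that each protein matching an InterPro entry comes with a list of keywords, and it reports the fraction of proteins that carry each keyword. The protein names and keyword lists below are made up for illustration.

```python
from collections import Counter

def annotation_conservation(proteins):
    """Return {keyword: fraction of proteins carrying it}, most common first.

    `proteins` maps a protein identifier to its list of keywords
    (a hypothetical input shape, not the format SQUID uses).
    """
    counts = Counter()
    for keywords in proteins.values():
        counts.update(set(keywords))  # count each keyword once per protein
    total = len(proteins)
    return {kw: n / total for kw, n in counts.most_common()}

# Made-up keyword lists for three hypothetical proteins matching one entry.
matches = {
    "PROT_A": ["Oxidoreductase", "NAD", "Zinc"],
    "PROT_B": ["Oxidoreductase", "NAD"],
    "PROT_C": ["Oxidoreductase", "Zinc"],
}

stats = annotation_conservation(matches)

# Keywords shared by every matching protein are candidates for a GO mapping
# that "applies to all proteins".
fully_conserved = [kw for kw, frac in stats.items() if frac == 1.0]
print(stats)
print(fully_conserved)
```

In this toy input only the keyword present in all three proteins would be considered when choosing a GO term for the entry; the partially conserved ones would argue for a less specific term or none.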
SQUID statistics output
[Screenshots]

InterPro2GO mapping in entry
[Screenshot]

InterProScan output with GO terms
[Screenshot]

InterPro2GO sanity checks
• Run weekly
• Reports:
  – Obsolete GO terms
  – Obsolete (deleted) IPRs
  – Secondary IPRs

Quality of GO mapping
• BioCreAtIvE test set: 635 GO annotations through InterPro2GO

Category                     No.    %
Exact term                   151    24%
Same lineage < granularity   273    43%
Same lineage > granularity   24     4%
New lineage                  187    29%
Minimal correct              424    67%
Potentially incorrect        211    33%
Precision                           67-100%

Manually checked 44 proteins, 107 predictions:
• 97 correct (90%):
  – 40 exact
  – 57 same lineage
• 10 new lineage (unknown)
• 0 incorrect
Camon et al., 2005, BMC Bioinformatics

InterPro2GO mapping statistics

Total no. IPRs mapped to GO                        7126
% of IPRs mapped to at least 1 GO term             54%
No. IPRs mapped to molecular function              5741
No. IPRs mapped to biological process              5543
No. IPRs mapped to cellular component              3426
No. GO terms mapped                                2811
No. UniProt proteins mapped through interpro2go    2006489 (61%)
% UniProt covered by InterPro                      77.6%

How successful is IEA-GOA in general?
• Provides large coverage
• High quality
• However, these annotations often use high-level GO terms and provide little detail.

IEA Method       No. of annotations   No. of distinct proteins
InterPro2GO      6281916              2006489
HAMAP2GO         199904               85814
SP Keyword2GO    3613883              1287830
EC2GO            207540               202657
TOTAL            10303243             2167001
Manual ones:     336237               70728

Jun 2006

Total GO statistics

Total no. GO annotations                      10639480
% GO associations manual                      3.16%
% GO associations electronic                  96.84%
% GO associations interpro2GO                 59%
Total no. proteins annotated to GO            2168717
% UniProt GO annotated in total               68.2%
% UniProt GO annotated manually               2.2%
% UniProt GO annotated electronically         66%
% UniProt GO annotated through interpro2go    61%

GOA data flow
[Diagram: Gene association files]

Gene Association file format
http://www.geneontology.org/GO.annotation.shtml

Example GOA cow file
[Screenshot]

Output from the GOA database
[Diagram labels: Non-Redundant: based on IPI; New GOA Cow; Redundant; GA slim for UniProt + GO slims]
Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.

GA Files for Non-redundant species
• Non-redundant complete protein set for each proteome is identified (>25% GO coverage)
• Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc.
• Xref files available with identifiers from: UniProt, IPI, RefSeq, Ensembl, UniGene etc.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
ftp://ftp.ebi.ac.uk/pub/databases/integr8

Uses of GOA data
• Access protein functional information
• Look at relationships between proteins, e.g. IntAct
• Connect biological information to gene expression data
• Determine functional composition of a proteome using GO slim

Uses of GOA
Find functional information on proteins
http://www.ebi.ac.uk/ego

Uses of GOA
Find functional information on interaction proteins (IntAct)
http://www.ebi.ac.uk/intact

Uses of GOA
Overview proteome with GO Slim
http://www.ebi.ac.uk/integr8

Uses of GOA
Analysis of high-throughput data according to GO
• Microarray data analysis, GO classification: Larkin JE et al, Physiol Genomics, 2004
• Proteomics data analysis, GO classification: Kislinger T et al, Mol Cell Proteomics, 2003
• Cunliffe HE et al, Cancer Res, 2003

Future plans
• Continue deep level annotation of human, mouse and rat
• Manually annotate splice variants
• Outreach and inclusion of new datasets, e.g. grape
• New electronic mappings, e.g. unipathway2go
• Ortholog prediction for electronic GO annotation
• Develop tools for annotation training

Acknowledgements
Rolf Apweiler: Head of sequence database group
Evelyn Camon: GOA Coordinator
Daniel Barrell: GOA Programmer
Emily Dimmer: GOA Curator
Rachael Huntley: GOA Curator
David Binns & John Maslen: QuickGO, GOA tools
All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office @ EBI
All GO Consortium & associate members
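Supplementary note: the percentages on the "Total GO statistics" slide can be re-derived from the annotation counts on the manual-GOA and IEA-GOA slides. The short sketch below does only that arithmetic; the variable and function names are chosen here for illustration, and the counts are copied from the slides.

```python
# Annotation counts as given on the slides (Jun/July 2006).
manual = 336237
iea = {
    "InterPro2GO": 6281916,
    "HAMAP2GO": 199904,
    "SP Keyword2GO": 3613883,
    "EC2GO": 207540,
}

electronic = sum(iea.values())  # slide reports 10303243
total = manual + electronic     # slide reports 10639480

def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)

print("electronic annotations:", electronic)
print("total annotations:", total)
print("% manual:", pct(manual, total))               # slide reports 3.16%
print("% electronic:", pct(electronic, total))       # slide reports 96.84%
print("% InterPro2GO:", pct(iea["InterPro2GO"], total))  # slide reports 59%
```

The four IEA methods sum to the electronic total on the slide, and adding the manual count gives the overall total, so the reported shares are internally consistent.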