Mulder - Gene Ontology Consortium

The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO
Nicky Mulder
[email protected]
European Bioinformatics Institute
Contents
• Introduction to GOA
• Manual GOA annotation
• Electronic annotation:
– InterPro2GO
• GOA data flow
• Uses of GOA
• Future plans
What is GO annotation?
An annotation is a statement that a gene product
• has a particular molecular function
• is involved in a particular biological process
• is located within a certain cellular component
(recorded as a GO term ID)
…as determined by a particular method (recorded as an evidence code)
…as described in a particular reference (recorded as a reference).
Gene Ontology Annotation (GOA) Database
• GOA’s priority is to annotate the human,
mouse and rat proteomes
• Largest open-source contributor of
annotations to GO
• Provides 10 million annotations for more than
111,000 species
• Share and integrate GO annotation
How do we annotate GO terms?
• Manual annotation
• Electronic annotation
All annotations must:
• be attributed to a source
• indicate what evidence was found to support the GO term-gene/protein association
Manual annotation
• High quality
• Specific gene or gene product associations
made using:
– Peer reviewed papers
– Evidence codes
• BUT:
– Time-consuming
– Requires trained biologists
Manual GO annotation
Workflow (labels recovered from the slide diagram):
• Read papers
• Find GO term
• Annotate to protein (PubMed ID, evidence code) using the online Protein2GO tool
• Oracle RDBMS
• GOA association file
• GO and EBI ftp sites
Information captured by GOA
Columns shown: Source, GOID, Term, Evid, RefDB, RefID, With DB, With ID, Qualifier
How successful is manual-GOA?
MANUAL; 111740 taxa; July 2006

Source             No. of annotations   No. of distinct proteins
Proteome Inc.                   22054                       6568
UniProt                         67910                      13697
IntAct                          22002                      11013
MGI                            124919                      29837
SGD                             21761                       5076
FlyBase                         52386                       8775
RGD                              8036                       3369
HGNC                             3699                        798
GeneDB                           5502                       1384
TAIR/TIGR                        3367                       1895
ZFIN                             1012                        334
Roslin Institute                   14                          6
AgBase                            889                        173
Reactome                           15                         12
WormBase                          893                        443
TIGR                              139                         79
Gramene                           139                       2812
GDB                               165                        103
TOTAL                          336237                      70728
Electronic Annotation
• Large-scale assignment of GO terms to
UniProtKB entries using existing information
within database entries and manual mappings
• Annotations get the IEA (Inferred from Electronic Annotation) evidence code
[Diagram: UniProt entries receive high quality electronic protein to GO associations through curated or electronic rule based mappings from InterPro, Keyword, HAMAP and EC to GO]
Curated mapping, e.g. EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022
www.uniprot.org/
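The mapping line quoted on this slide can be read mechanically. The following is a minimal sketch, assuming one mapping per line in the form shown above and assuming comment lines start with "!"; the function names and the accession "P_EXAMPLE" are made up for illustration and are not part of GOA.

```python
import re

# Line format as quoted on the slide: "<external id> > GO:<term name> ; GO:<7-digit id>"
MAPPING_RE = re.compile(r"^(?P<ext>\S+) > GO:(?P<name>.+?) ; (?P<go_id>GO:\d{7})$")


def parse_mapping_line(line):
    """Return (external_id, go_term_name, go_id), or None for blank/comment lines."""
    line = line.strip()
    if not line or line.startswith("!"):  # assumption: "!" marks a comment line
        return None
    match = MAPPING_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised mapping line: {line!r}")
    return match.group("ext"), match.group("name"), match.group("go_id")


def electronic_annotations(protein_to_ec, mapping):
    """Yield (protein, go_id, evidence_code) for every protein whose EC number is mapped."""
    for protein, ec in protein_to_ec.items():
        for go_id in mapping.get(ec, []):
            yield protein, go_id, "IEA"  # electronic annotations carry the IEA code


line = "EC:1.1.1.1 > GO:alcohol dehydrogenase activity ; GO:0004022"
ext, name, go_id = parse_mapping_line(line)
mapping = {ext: [go_id]}

# "P_EXAMPLE" is a hypothetical accession used only for this illustration.
annotations = list(electronic_annotations({"P_EXAMPLE": "EC:1.1.1.1"}, mapping))
```

The same two-step shape (parse the mapping, then propagate it to every entry carrying the external identifier) would apply to the other mapping sources named on the slide, under the same assumptions.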
Mappings of external concepts to GO
http://www.geneontology.org/GO.indices.shtml
InterPro2GO mapping
• InterPro is a resource that integrates protein signature databases, e.g. Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs etc.
• It provides a means of classifying proteins into families and identifying domains.
• Each InterPro entry groups proteins belonging to the same family and potentially having the same function
InterPro2GO mapping
• Done manually, but using tools
• Look at InterPro and protein annotation
• For all Swiss-Prot proteins that are true matches to the entry:
– Get stats on DE lines, keywords, comments
– Check how conserved the common annotation is
– Find the appropriate GO term at the most specific level that applies to all proteins (not necessarily domains)
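The last step, choosing the most specific term that still applies to every matching protein, can be sketched as a set operation over the ontology graph. This is an illustration only: the mapping itself is done manually by curators, and the hierarchy below is a simplified toy, not a GO release. All names in the code are hypothetical.

```python
# Toy is_a hierarchy (child -> parents), simplified for illustration only.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "lipid kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def most_specific_common_terms(protein_terms):
    """Terms shared by every protein (directly or via ancestors), minus their own ancestors.

    Expects at least one protein in protein_terms.
    """
    closures = [with_ancestors(terms) for terms in protein_terms.values()]
    common = set.intersection(*closures)
    redundant = set()
    for term in common:
        # Proper ancestors of a common term are less specific, so drop them.
        redundant |= with_ancestors(PARENTS[term])
    return common - redundant


# Hypothetical proteins matching one InterPro entry.
matches = {
    "protein_A": ["protein kinase activity"],
    "protein_B": ["lipid kinase activity"],
    "protein_C": ["kinase activity"],
}
result = most_specific_common_terms(matches)
```

In this toy case the two more specific kinase terms are each supported by only one protein, so the term retained for the whole entry is the shared parent.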
Tools used – "SQUID"
Statistics options:
keyword
description
Gene name
Organism
Comments, etc.
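The kind of per-field statistic listed here can be approximated with a simple frequency count. The sketch below is not SQUID itself; the entry layout, field names and protein records are assumptions made for illustration.

```python
from collections import Counter


def annotation_stats(entries, field):
    """Return [(value, protein_count, fraction_of_proteins)], most frequent first."""
    if not entries:
        return []
    counts = Counter()
    for entry in entries:
        # Count each value at most once per protein.
        counts.update(set(entry.get(field, [])))
    total = len(entries)
    return [(value, n, n / total) for value, n in counts.most_common()]


# Hypothetical Swiss-Prot-like records for proteins matching one InterPro entry.
entries = [
    {"id": "protein_A", "keyword": ["Kinase", "Transferase"]},
    {"id": "protein_B", "keyword": ["Kinase"]},
    {"id": "protein_C", "keyword": ["Kinase", "Transferase", "Kinase"]},
]
stats = annotation_stats(entries, "keyword")
```

A value carried by every protein (fraction 1.0) is a candidate for a shared annotation; values with a low fraction suggest the entry is functionally mixed.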
SQUID statistics output
InterPro2GO mapping in entry
InterProScan output with GO terms
InterPro2GO sanity checks
• Run weekly
• Reports:
– Obsolete GO terms
– Obsolete (deleted) IPRs
– Secondary IPRs
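The three reports could be produced by a check along the following lines. This is a sketch under assumptions: the input structures, the function name and all identifiers are hypothetical, and the actual GOA checks are not described on the slide.

```python
def sanity_report(mappings, obsolete_go, deleted_iprs, secondary_iprs):
    """Group InterPro2GO mappings that need curator attention.

    mappings        iterable of (ipr_id, go_id) pairs
    obsolete_go     set of GO IDs flagged obsolete
    deleted_iprs    set of InterPro accessions that no longer exist
    secondary_iprs  dict mapping a secondary accession to its primary accession
    """
    report = {"obsolete_go": [], "deleted_ipr": [], "secondary_ipr": []}
    for ipr, go_id in mappings:
        if go_id in obsolete_go:
            report["obsolete_go"].append((ipr, go_id))
        if ipr in deleted_iprs:
            report["deleted_ipr"].append((ipr, go_id))
        if ipr in secondary_iprs:
            # Record the primary accession the mapping should move to.
            report["secondary_ipr"].append((ipr, secondary_iprs[ipr], go_id))
    return report


# Placeholder identifiers, not real accessions.
mappings = [("IPR_A", "GO_1"), ("IPR_B", "GO_2"), ("IPR_C", "GO_3"), ("IPR_D", "GO_1")]
report = sanity_report(
    mappings,
    obsolete_go={"GO_2"},
    deleted_iprs={"IPR_C"},
    secondary_iprs={"IPR_D": "IPR_A"},
)
```

Running such a check on a schedule keeps the mapping file in step with changes in both GO and InterPro.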
Quality of GO mapping
• BioCreAtIvE test set: 635 GO annotations through InterPro2GO

Category                      No.    %
Exact term                    151    24%
Same lineage < granularity    273    43%
Same lineage > granularity     24     4%
New lineage                   187    29%
Minimal correct               424    67%
Potentially incorrect         211    33%
Precision                            67-100%

Manually checked 44 proteins, 107 predictions:
• 97 correct (90%):
– 40 exact
– 57 same lineage
• 10 new lineage (unknown)
• 0 incorrect
Camon et al., 2005, BMC Bioinformatics
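The summary rows of the BioCreAtIvE comparison follow from the four category counts. The grouping used below (which categories count as "minimal correct" and which as "potentially incorrect") is inferred from the arithmetic on the slide, not stated there, so treat it as an assumption.

```python
# Category counts as given on the slide.
counts = {
    "exact term": 151,
    "same lineage < granularity": 273,
    "same lineage > granularity": 24,
    "new lineage": 187,
}
total = sum(counts.values())


def pct(n):
    """Percentage of the test set, rounded to a whole number."""
    return round(100 * n / total)


# Assumed grouping, chosen because it reproduces the slide's totals.
minimal_correct = counts["exact term"] + counts["same lineage < granularity"]
potentially_incorrect = counts["same lineage > granularity"] + counts["new lineage"]

# Lower bound of the precision range: only the "minimal correct" predictions count.
precision_lower = pct(minimal_correct)
```

The upper end of the quoted range would correspond to every "potentially incorrect" prediction turning out to be valid on inspection, which is one reading of the manual check reported on the same slide.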
InterPro2GO mapping statistics
Total no. IPRs mapped to GO                         7126
% of IPRs mapped to at least 1 GO term              54%
No. IPRs mapped to molecular function               5741
No. IPRs mapped to biological process               5543
No. IPRs mapped to cellular component               3426
No. GO terms mapped                                 2811
No. UniProt proteins mapped through interpro2go     2006489 (61%)
% UniProt covered by InterPro                       77.6%
How successful is IEA-GOA in general?
• Provides large coverage
• High Quality
• However these annotations often use high-level GO terms
and provide little detail.
IEA Method         No. of annotations   No. of distinct proteins
InterPro2GO                   6281916                    2006489
HAMAP2GO                       199904                      85814
SP Keyword2GO                 3613883                    1287830
EC2GO                          207540                     202657
TOTAL                        10303243                    2167001
Manual ones                    336237                      70728
Jun 2006
Total GO statistics
Total no. GO annotations                       10639480
% GO associations manual                       3.16%
% GO associations electronic                   96.84%
% GO associations interpro2GO                  59%
Total no. proteins annotated to GO             2168717
% UniProt GO annotated in total                68.2%
% UniProt GO annotated manually                2.2%
% UniProt GO annotated electronically          66%
% UniProt GO annotated through interpro2go     61%
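The annotation-level percentages can be re-derived from the manual and electronic counts given on the preceding slides. A short worked check, with variable names chosen for illustration:

```python
# Annotation counts taken from the manual and electronic (IEA) tables.
manual = 336237
electronic = {
    "InterPro2GO": 6281916,
    "HAMAP2GO": 199904,
    "SP Keyword2GO": 3613883,
    "EC2GO": 207540,
}

total_electronic = sum(electronic.values())
total = manual + total_electronic

pct_manual = round(100 * manual / total, 2)
pct_electronic = round(100 * total_electronic / total, 2)
pct_interpro2go = round(100 * electronic["InterPro2GO"] / total)
```

The protein-level percentages on the same slide cannot be checked this way, because they depend on the total number of UniProt entries, which is not given.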
GOA data flow
Gene association files
Gene Association file format
http://www.geneontology.org/GO.annotation.shtml
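A gene association file is tab-delimited, one annotation per line. The sketch below assumes the 15-column layout of the GO annotation file format (GAF 1.0) and "!" comment lines; the column names are shorthand chosen here, and the sample record is fabricated for illustration rather than taken from a GOA release.

```python
import io

# Shorthand names for the 15 tab-separated columns (assumed GAF 1.0 layout).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def read_gaf(handle):
    """Yield one dict per annotation line, skipping comments and blank lines."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Pad short lines so every record has all columns.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


# Fabricated record: accession, symbol, reference and date are placeholders.
sample_fields = [
    "UniProt", "X00000", "EXAMPLE", "", "GO:0004022", "PMID:0000000", "IDA",
    "", "F", "Example protein", "", "protein", "taxon:9913", "20060701", "UniProt",
]
sample = "!comment line\n" + "\t".join(sample_fields) + "\n"
records = list(read_gaf(io.StringIO(sample)))
```

Each record carries the elements of an annotation described earlier: the GO term ID, the evidence code and the reference, attributed to a source in the last column.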
Example GOA cow file
Output from the GOA database
[Diagram labels: Non-redundant, based on IPI; New: GOA Cow; Redundant; GA slim for UniProt + GO slims]
Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.
GA Files for Non-redundant species
• Non-redundant complete protein set for each
proteome is identified (>25% GO coverage)
• Includes UniProt, IPI and MOD-specific IDs,
e.g. mouse (MGI), rat (RGD), zebrafish
(ZFIN) etc.
• Xref files available with identifiers from:
UniProt, IPI, RefSeq, Ensembl, UniGene etc.
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
ftp://ftp.ebi.ac.uk/pub/databases/integr8
Uses of GOA data
• Access protein functional information
• Look at relationships between proteins, e.g.
IntAct
• Connect biological information to gene
expression data
• Determine functional composition of a proteome using GO slim
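Summarising a proteome with a GO slim amounts to mapping each annotated term up to the nearest slim term and counting proteins per slim term. The sketch below uses a toy hierarchy and a made-up slim; it illustrates the idea only and is not the procedure used by GOA.

```python
from collections import Counter

# Toy is_a hierarchy (child -> parents) and a made-up slim, for illustration only.
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "lipid kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "ATP binding": ["binding"],
    "catalytic activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "molecular_function": [],
}
SLIM = {"catalytic activity", "binding"}


def slim_terms(term):
    """Return the nearest slim terms reachable by walking up from term."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            found.add(current)  # stop climbing this path at the first slim term
        else:
            stack.extend(PARENTS.get(current, []))
    return found


def slim_composition(protein_terms):
    """Count how many proteins fall under each slim term."""
    counts = Counter()
    for terms in protein_terms.values():
        hits = set()
        for term in terms:
            hits |= slim_terms(term)
        counts.update(hits)  # each protein counts once per slim term
    return counts


# Hypothetical proteome annotations.
proteome = {
    "protein_1": ["protein kinase activity"],
    "protein_2": ["lipid kinase activity", "ATP binding"],
    "protein_3": ["ATP binding"],
}
composition = slim_composition(proteome)
```

Because a protein can map to more than one slim term, the counts need not sum to the number of proteins.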
Uses of GOA
Find functional information on proteins
http://www.ebi.ac.uk/ego
Uses of GOA
Find functional information on interaction proteins (IntAct)
http://www.ebi.ac.uk/intact
Uses of GOA
Overview proteome with GO Slim
http://www.ebi.ac.uk/integr8
Uses of GOA
Analysis of high-throughput data according to GO
Microarray data analysis
GO classification
Larkin JE et al, Physiol Genomics, 2004
Proteomics data analysis
GO classification
Kislinger T et al, Mol Cell Proteomics, 2003
Cunliffe HE et al, Cancer Res, 2003
Future plans
• Continue deep level annotation of human, mouse
and rat
• Manually annotate splice variants
• Outreach and inclusion of new datasets e.g. grape
• New electronic mappings, e.g. unipathway2go
• Ortholog prediction for electronic GO annotation
• Develop tools for annotation training
Acknowledgements
Rolf Apweiler: Head of sequence database group
Evelyn Camon: GOA Coordinator
Daniel Barrell: GOA Programmer
Emily Dimmer: GOA Curator
Rachael Huntley: GOA Curator
David Binns & John Maslen: QuickGO, GOA tools
All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial Office @ EBI
All GO Consortium & associate members