The Repertoires of Ubiquitinating and Deubiquitinating
Enzymes in Eukaryotic Genomes
Andrew Paul Hutchins,†,1 Shaq Liu,†,1 Diego Diez,1 and Diego Miranda-Saavedra1,*
1Bioinformatics and Genomics Laboratory, World Premier International (WPI) Immunology Frontier Research Center (IFReC),
Osaka University, Osaka, Japan
†These authors contributed equally to this work.
*Corresponding author: E-mail: [email protected].
Associate editor: Kenneth Wolfe
Abstract
Reversible protein ubiquitination regulates virtually all known cellular activities. Here, we present a quantitatively
evaluated and broadly applicable method to predict eukaryotic ubiquitinating enzymes (UBE) and deubiquitinating
enzymes (DUB) and its application to 50 distinct genomes belonging to four of the five major phylogenetic supergroups of
eukaryotes: unikonts (including metazoans, fungi, choanozoa, and amoebozoa), excavates, chromalveolates, and
plants. Our method relies on a collection of profile hidden Markov models, and we demonstrate its superior performance
(coverage and classification accuracy >99%) by identifying approximately 25% and approximately 35% additional UBE
and DUB genes in yeast and human, respectively, that had not been reported before. In yeast, we predict 85 UBE and 24 DUB genes,
compared with 814 UBE and 107 DUB genes in the human genome. Most UBE and DUB families are present in all eukaryotic lineages,
with plants and animals harboring massively enlarged repertoires of ubiquitin ligases. Unicellular organisms, on the
other hand, typically harbor fewer than 300 UBEs and fewer than 40 DUBs per genome. Ninety-one UBE/DUB genes are
orthologous across all four eukaryotic supergroups, and these likely represent a primordial core of enzymes of
the ubiquitination system probably dating back to the first eukaryotes approximately 2 billion years ago. Our
genome-wide predictions are available through the Database of Ubiquitinating and Deubiquitinating Enzymes
(www.DUDE-db.org), where users can also perform advanced sequence and phylogenetic analyses and submit their own
predictions.
Key words: ubiquitination, functional prediction, profile hidden Markov model.
Introduction
Protein ubiquitination is a reversible post-translational modification that regulates virtually all known cellular activities
(Grabbe et al. 2011). In a hierarchical cascade, ubiquitinating
enzymes (UBEs) catalyze the covalent addition of the 76
amino acid molecule ubiquitin to target proteins by the
sequential action of a ubiquitin-activating enzyme (E1) and
a ubiquitin-conjugating enzyme (E2) that selectively
interacts with a ubiquitin ligase (E3) to control substrate
specificity. Protein ubiquitination can be reversed by the
action of deubiquitinating enzymes (DUBs). The ubiquitin
signaling system is both functionally diverse and tightly regulated, and its dysregulation is responsible for a large number
of conditions including many types of cancer, neurodegenerative, metabolic, and muscle wasting disorders (reviewed
in Ciechanover and Schwartz 2004; Petroski 2008;
Cohen and Tcherpakov 2010; Kirkin and Dikic 2011;
Fulda et al. 2012). Moreover, viruses are known to employ a
variety of strategies to manipulate the ubiquitin system to
facilitate their replication and also encode their own viral
UBEs and DUBs to evade the immune response (Randow
and Lehner 2009).
The UBEs consist of three structurally and functionally
different classes of enzymes (E1, E2, and E3): humans possess
only two E1 genes and yeast only one. The E2s are a relatively
small class of enzymes with 41 genes reported in humans and
14 in yeast and are characterized by the highly conserved
ubiquitin-conjugating (UBC) domain (van Wijk and
Timmers 2010). On the other hand, E3s are a large and structurally very diverse class of enzymes and have been classified
into three main families (reviewed in Ardley and Robinson
2005): HECT (homologous to E6-associated protein
C-terminus), RING (really interesting new gene), and U-box
(a type of modified RING motif). HECT E3s have
catalytic activity because the HECT domain contains a central cysteine residue that acts as an acceptor for ubiquitin. Neither
RING nor U-box E3s have a direct role in catalysis; instead, they function as adaptor-like molecules that facilitate protein ubiquitination and require the presence of E2s and other proteins.
RING finger E3s are by far the largest family of E3s and are
characterized by the presence of a RING finger domain (a cysteine/histidine-rich, zinc-chelating domain that promotes
both protein–protein and protein–DNA interactions).
Besides the simple E3s containing a RING finger domain
(“RNF domain"), two classes of complex RING E3s have
been described: 1) the RBR (RING in between RING–RING),
which contain an IBR domain (“In Between Ring fingers") and
two RING domains and are capable of E2 binding and
© The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://
creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Open Access
Mol. Biol. Evol. 30(5):1172–1187 doi:10.1093/molbev/mst022 Advance Access publication February 7, 2013
substrate recognition. A recently proposed mechanism for
RBR ubiquitin transfer suggests that RBR ligases combine
features of both RING- and HECT-type ligases, with
independent catalytic activity (Wenzel and Klevit 2012). 2)
The CRL (Cullin–RING ligases), large multiprotein complexes
comprising at least a cullin protein and a RING protein. The
cullin protein provides the central scaffolding component for
each of the complexes. The main CRL families are the F-box,
the BTB (Broad complex, Tramtrak, Bric-a-brac), and the
SOCS (Suppressor Of Cytokine Signaling). The third family
of E3s is the U-box, characterized by their approximately 75
amino acid (U-box) domain. The U-box is structurally similar
to the RING finger domain but lacks the full complement of
zinc-binding ligands required for metal chelation. Initially, all
U-box E3s were termed “E4" as they were thought to be
proteins auxiliary to E2–E3 complexes and involved in ubiquitin chain elongation. However, it was later recognized that
other U-box E3s actually interact with E2s and present E3
ligase activity independently of other E3s. Their function as
E3s or E4s may depend not only on the nature of the substrate but also on the presence of other proteins in the
multiprotein complex. Finally, there is a very small set of
E3s that cannot be classified as HECT, RING finger, or
U-box E3s, and these comprise the ZnF A20 and DDB1-like
families.
The DUBs are a smaller enzyme set than the UBEs, and
mechanistically all DUBs are either cysteine proteases or metalloproteases. DUBs have been classified into five distinct families: the
JAMM motif proteases (JAMM), the Machado–Joseph
Disease protein domain proteases (MJD), the ovarian tumor
proteases (OTU), the ubiquitin C-terminal hydrolases (UCH),
and the ubiquitin-specific proteases (USP) (Nijman et al.
2005).
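The class and family hierarchy described above can be written down as a small lookup structure. The following is a minimal sketch for orientation only: the names `UBE_DUB_TAXONOMY`, `families_of`, and `class_of` are hypothetical, the family labels follow those used in this article, and the RBR class is left out because the text does not describe it as a separately predicted family.

```python
# Illustrative taxonomy of the enzyme classes and families named in the text.
# All identifiers here are hypothetical and not part of the published method.
UBE_DUB_TAXONOMY = {
    "E1": ["E1"],
    "E2": ["E2"],
    "E3": [
        "HECT",
        "RING/RNF domain",
        "RING/BTB",
        "RING/SOCS box",
        "RING/F-box",
        "U-box",
        "Other/ZnF A20",
        "Other/DDB1-like",
    ],
    "DUB": ["JAMM", "MJD", "OTU", "UCH", "USP"],
}


def families_of(enzyme_class):
    """Return the families listed under an enzyme class (E1, E2, E3, or DUB)."""
    return list(UBE_DUB_TAXONOMY[enzyme_class])


def class_of(family):
    """Return the enzyme class that a family belongs to."""
    for enzyme_class, families in UBE_DUB_TAXONOMY.items():
        if family in families:
            return enzyme_class
    raise KeyError(f"unknown family: {family}")


print(class_of("OTU"))          # DUB
print(len(families_of("E3")))   # 8
```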
Despite the central role that reversible protein ubiquitination plays in cellular signaling, the UBE and DUB repertoires of almost all genomes remain uncharted. Here, we
describe a highly sensitive and specific sequence analysis
method for the automatic classification of UBEs and DUBs
into their corresponding families (N.B. our method does not
attempt to predict large suites of cullin adapters, such as
F-box-LRR or MATH-BTB). First, we describe the method in
detail and a series of tests to evaluate its performance. Second,
we also provide proof of concept of the general applicability
of the method by scanning 50 completely sequenced eukaryotic genomes. General aspects of UBE/DUB genomic content
and conservation of function are discussed across four
eukaryotic supergroups. We chose a wide spectrum of species
covering most of the sequenced eukaryotic phylogenetic
diversity, including unikonts (metazoans, fungi, choanozoa,
and amoebozoa), plants, excavates, and chromalveolates.
Finally, we present the Database of Ubiquitinating and
Deubiquitinating Enzymes (DUDE-db), a comprehensive repository of UBEs and DUBs. DUDE-db is publicly available at
http://www.DUDE-db.org and features an annotated database of predicted UBEs and DUBs in 50 eukaryotic genomes,
and a web server interface where users can scan a set of
proteins for the presence of sequence domains diagnostic
for any of the UBE and DUB families.
Results
A Novel Method for the Automatic Classification of
UBEs and DUBs into Families
We have developed an automatic and generally applicable
method for the systematic retrieval and classification of UBEs
and DUBs into families (fig. 1). Briefly, our method is based on
a collection of 33 specifically selected profile hidden Markov
models (HMMs) from the Pfam (Punta et al. 2012), SMART
(Letunic et al. 2012), and SUPERFAMILY (Wilson et al. 2009)
databases. The selection of HMMs was modeled on the members of the various UBE and DUB families and subfamilies of
the human genome. The evaluation for sensitivity was done
on the manually annotated yeast UBE/DUB repertoire and on
a phylogenetically diverse set of experimentally validated
UBEs and DUBs from the UniProt database. These tests reported a coverage of more than 99% for the “UBE/DUB HMM
Library." Additionally, the correct classification rate on the
family level was 100% for both the yeast and UniProt data
sets. Therefore, we expect the UBE/DUB HMM Library to
retrieve more than 99% of proteins encoding UBEs and
DUBs, which will likely be correctly assigned to specific
UBE/DUB protein families in all cases. It must be noted
that our method relies on the “trusted cut-offs" of the distinct
HMMs as implemented in InterProScan (Zdobnov and
Apweiler 2001). These cutoffs are those recommended by
the individual InterPro member databases and therefore are
believed to report relevant matches. Other than this, no test
for false positives was implemented as no gold standard exists
yet for the correct classification of UBEs and DUBs.
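The family assignment step can be illustrated with a short sketch. It assumes that domain matches have already been filtered by the trusted cut-offs (for example, by InterProScan) and are available as (protein, signature) pairs. The signature labels in `DIAGNOSTIC` are placeholders chosen for illustration; they are not the 33 HMMs of the actual library, which are not listed in this section. Proteins carrying domains diagnostic for more than one family are flagged as "Mixed-family", the category used later in the Results.

```python
from collections import defaultdict

# Hypothetical mapping from a domain signature to the family it is diagnostic for.
# These labels are placeholders, not the actual contents of the UBE/DUB HMM Library.
DIAGNOSTIC = {
    "HECT_domain": "HECT",
    "RING_finger": "RNF domain",
    "U_box": "U-box",
    "OTU_domain": "OTU",
    "A20_zinc_finger": "ZnF A20",
    "UBC_domain": "E2",
}


def classify(hits):
    """Assign each protein to a family from its (protein_id, signature) hits.

    `hits` is assumed to contain only matches that already passed the
    trusted cut-off of the corresponding HMM.
    """
    families = defaultdict(set)
    for protein_id, signature in hits:
        family = DIAGNOSTIC.get(signature)
        if family is not None:  # accessory, non-diagnostic domains are ignored
            families[protein_id].add(family)

    result = {}
    for protein_id, fams in families.items():
        # One diagnostic family: assign it. Several: flag as mixed-family.
        result[protein_id] = next(iter(fams)) if len(fams) == 1 else "Mixed-family"
    return result


example_hits = [
    ("P1", "HECT_domain"),
    ("P2", "OTU_domain"),
    ("P2", "A20_zinc_finger"),
    ("P3", "PABP_C"),  # accessory domain only, so P3 is not reported
]
print(classify(example_hits))  # {'P1': 'HECT', 'P2': 'Mixed-family'}
```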
To show the wide applicability of the UBE/DUB HMM
Library, we first scanned the yeast and human genomes
and present an extended repertoire of UBE and DUB genes
that have remained uncharacterized in these model organisms. We then extend our analyses to 50 eukaryotic genomes,
which are made available through the DUDE-db.
The Enlarged Repertoire of UBE and DUB in Human
and Yeast
We applied the UBE/DUB HMM Library to scan the yeast and
human genomes for UBEs and DUBs, the only two species for
which the genomic UBE/DUB complements have been reported. Analysis of the yeast genome (Ensembl 67) predicted
85 UBE genes (of which 72 had been characterized previously)
and 24 DUB genes (of which 17 were already known) (table 1),
thus increasing the yeast repertoire of UBEs and DUBs by
approximately 25% (fig. 2A). Most importantly, we describe
two and four new members of the OTU and JAMM families
of DUBs, respectively, that were not reported in the yeast catalog. At 109
genes, the yeast complement of UBEs and DUBs is one-ninth
the size of the human complement by gene number, but
more than two-thirds of the yeast enzymes have an identifiable human ortholog. Moreover, of the 22 yeast genes whose
genetic deletion produces an inviable phenotype, 20 are conserved in human (supplementary table S1, Supplementary
Material online). This suggests that yeast might be a very
[Figure 1: flow diagram with seven numbered steps, from the data set of human UBEs and DUBs, through InterProScan and the evaluation of the UBE/DUB HMM Library, to the genome-wide annotation of UBEs and DUBs. Families covered by the UBE/DUB HMM Library: E1; E2; E3 (HECT; RING finger: RNF domain, BTB, SOCS, F-box; U-box; Other: ZnF A20, DDB1-like); DUBs (JAMM, MJD, OTU, UCH, USP).]
FIG. 1. Flow diagram illustrating the construction and evaluation of the UBE/DUB HMM Library. The reported sets of human UBEs and DUBs (Step 1)
were run through InterProScan to identify the protein domain signatures diagnostic for the various enzyme families (Steps 2 and 3). This set of HMMs
was first tested on the human data set to identify potential cross-hits to other families and also to uncover any false positives in the original annotation
(Step 4). The final library of HMMs (the “UBE/DUB HMM Library”), specific for the identification of UBEs and DUBs (Step 5), was evaluated on the
characterized set of UBEs and DUBs from yeast and also on a phylogenetically diverse set of proteins from the UniProt database (Step 6). Finally, the
UBE/DUB HMM Library was applied to the characterization of UBEs and DUBs in 50 completely sequenced eukaryotic genomes (Step 7).
useful model organism with which to investigate the functions of a number of human UBEs and DUBs.
Similarly, we predicted 2,593 UBEs and 392 DUBs in the
human genome, encoded by 814 UBE and 107 DUB genes.
This figure represents an approximately 35% increase in the
number of UBE and DUB genes in the human genome in
comparison to the study by Li et al. (2008) (fig. 2A) and
brings the full human UBE/DUB complement to a new
total of 921 genes (table 1). Although yeast has an approximately similar number of UBEs and protein kinases
(Miranda-Saavedra et al. 2007), humans possess 65% more
genes encoding UBEs than protein kinases. A comparison of
the yeast and human UBE/DUB complements shows that
enzymes harboring an RNF domain make up approximately
50% of the entire UBE/DUB repertoire and that genes of
the families HECT, RNF domain, and F-box are present in
approximately equal proportions in both organisms (fig. 2B).
Finally, humans possess three families that are absent
from yeast: SOCS, ZnF A20, and the DUB family MJD
(fig. 2B).
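The human figures quoted above can be checked with a few lines of arithmetic. This is only a worked check using the totals reported in the text and in table 1; the variable names are ours.

```python
# Gene and protein totals for human, as reported in the text and in table 1.
previous_genes = 684          # previously annotated UBE/DUB genes
library_genes = 814 + 107     # UBE + DUB genes predicted by the UBE/DUB HMM Library
library_proteins = 2593 + 392 # predicted UBE + DUB proteins (including isoforms)

increase = (library_genes - previous_genes) / previous_genes
proteins_per_gene = library_proteins / library_genes

print(f"total genes: {library_genes}")                # total genes: 921
print(f"increase: {increase:.1%}")                    # increase: 34.6%
print(f"proteins per gene: {proteins_per_gene:.2f}")  # proteins per gene: 3.24
```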
We report seven genes in the human genome that have
multiple distinctive domains and so fall into a new category
Table 1. The New, Enlarged Complement of UBE and DUB Genes of Yeast and Human as Characterized by
the UBE/DUB HMM Library in Comparison with the Previously Annotated Data Sets.

Yeast
Previous annotation: E1s and E2s, SCUD Database (2008); E3s, Li et al. (2008); DUBs, Amerik et al. (2000).

Class   Family             Previous Annotation   UBE/DUB HMM Library
                           (Genes)               (Genes)
E1      E1                 1                     3
E2      E2                 11                    14
E3      HECT               5                     5
E3      RING/RNF domain    39                    43
E3      RING/BTB           3                     6
E3      RING/SOCS box      0                     0
E3      RING/F-box         10                    10
E3      U-box              2                     2
E3      Other/ZnF A20      0                     0
E3      Other/DDB1-like    2                     2
E3      Mixed family       0                     0
DUB     JAMM               0                     4
DUB     MJD                0                     0
DUB     OTU                0                     2
DUB     UCH                1                     1
DUB     USP                16                    17
Total                      90                    109

Human
Previous annotation: E1s and E2s, van Wijk and Timmers (2010); E3s, Li et al. (2008); DUBs, Nijman et al. (2005).

Class   Family             Previous Annotation   UBE/DUB HMM Library
                           (Genes)               Genes (Proteins)
E1      E1                 2                     5* (12)
E2      E2                 32                    43 (178)
E3      HECT               26                    28 (104)
E3      RING/RNF domain    271                   407 (1,319)
E3      RING/BTB           162                   199 (592)
E3      RING/SOCS box      35                    39 (94)
E3      RING/F-box         56                    73 (209)
E3      U-box              7                     5 (20)
E3      Other/ZnF A20      6                     4 (35)
E3      Other/DDB1-like    3                     4 (12)
E3      Mixed family       0                     7 (18)
DUB     JAMM               12                    12 (45)
DUB     MJD                4                     4 (45)
DUB     OTU                13                    13 (41)
DUB     UCH                4                     4 (25)
DUB     USP                51                    74 (236)
Total                      684                   921 (2,985)

NOTE: UCH, ubiquitin C-terminal hydrolases. *Our count of five and three E1s in human and yeast, respectively, does not reflect an
increase in the number of ubiquitin-activating genes, rather it reflects the sequence identity shared with the three other ubiquitin-like
activating enzymes, namely (in human) UBEA2 (SUMO pathway), UBEA3 (NEDD8 pathway), and UBEA7 (ISG15 pathway).
that we term “Mixed-family” (fig. 2C). These enzymes harbor
various combinations of domains diagnostic for distinct
UBE/DUB families, including OTU/ZnF A20, RNF/HECT,
U-box/RNF, and Cullin/RNF (fig. 2C). Mixed-family enzymes
are found in limited numbers in many of the species surveyed but are particularly abundant in metazoans and
absent in yeasts and excavates. Even though mixed-family
enzymes represent an almost negligible portion of any
genome, they illustrate the flexibility of the ubiquitination
system to produce protein machines that combine specific
functions in new and creative ways.
The UBE/DUB Complements of 50 Eukaryotic
Genomes
We applied our method to 50 completely sequenced eukaryotic genomes belonging to four eukaryotic supergroups as defined by Keeling et al. (2005): unikonts (including metazoans,
[Figure 2. Panel A: previously annotated versus novel genes in yeast (85 UBE genes, 24 DUB genes, 109 in total) and human (814 UBE genes, 107 DUB genes, 921 in total). Panel B: percentage of the entire UBE/DUB complement contributed by each family (E1, E2, HECT, RNF, BTB, SOCS, F-box, U-box, ZnF A20, DDB1-like, Mixed-family, JAMM, MJD, OTU, UCH, USP) in yeast and human. Panel C: domain architectures of the seven human mixed-family enzymes, plotted against protein size (amino acids): CEZANNE2 (OTUD7A), CEZANNE (OTUD7B), and A20 (TNFAIP3), OTU/ZnF A20; G2E3 and EDD (UBR5; 2799 aa), RNF/HECT; UBOX5, U box/RNF; CUL9 (2517 aa), Cullin/RNF.]
FIG. 2. The yeast and human UBE and DUB complements. (A) The UBE/DUB HMM Library unveiled 25% and 35% additional UBE and DUB genes
in yeast and human that had not been characterized hitherto. This brings the yeast and human data sets to 109 and 921 genes, respectively.
(B) Contributions of the various UBE/DUB families as a percentage of the entire UBE/DUB gene complements of yeast and human. (C) Mixed-family
enzymes harbor domains diagnostic for more than one UBE or DUB family. Seven such genes are found in the human genome but none in yeast.
CUL9 is the only human cullin protein that contains an E3 (RNF) domain. In this plot, the family-diagnostic domains are shown in black, whereas the
protein accessory domains are color coded. All enzymes and their domains are drawn to scale, except for EDD and CUL9.
choanozoa, fungi, and amoebozoa), excavates, chromalveolates, and plants. These genomes display a wide range of
genome sizes and adaptations to their environments, including important human pathogens such as Trypanosoma brucei
and Leishmania major. If we accept the classification as presented here, an enormous variation in the sizes of eukaryotic
UBE/DUB complements exists, ranging from 37 UBEs and
7 DUBs for the intracellular microsporidian parasite
Encephalitozoon cuniculi to 2,593 UBEs and 392 DUBs in
human (supplementary table S2, Supplementary Material
online). These numbers, however, represent predicted proteins (including splice isoforms) and not necessarily individual
genes. Therefore, the number of genes encoding UBEs and
DUBs per genome is likely to be smaller, especially for metazoan species. Examination of the phylogenetic distribution of
UBEs and DUBs (fig. 3A and supplementary table S2,
Supplementary Material online) shows that all E1, E2, and
E3 families, with the exception of SOCS and ZnF A20, are
universally distributed across the four eukaryotic supergroups.
SOCS family E3s are found in all metazoans, and their absence
from other taxa, especially the choanoflagellate Monosiga
brevicollis (the closest outgroup to metazoa), suggests that
SOCSs are exclusively a metazoan innovation. Similarly, all
DUB families are universally found in all phylogenetic lineages,
except for the MJD proteases, which are absent from
excavates.
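The phylogenetic distribution just described amounts to a presence/absence table at the supergroup level. A minimal sketch follows; it encodes only a handful of families, with presence values as read from the text and from the caption of figure 3, and the function names are ours.

```python
SUPERGROUPS = {"unikonts", "excavates", "plants", "chromalveolates"}

# Supergroups in which each family is reported as present (subset of families only).
presence = {
    "HECT": set(SUPERGROUPS),
    "SOCS": {"unikonts"},                    # metazoan specific
    "ZnF A20": SUPERGROUPS - {"excavates"},  # also absent from fungi within unikonts
    "MJD": SUPERGROUPS - {"excavates"},
    "USP": set(SUPERGROUPS),
}


def universal(table):
    """Families present in all four supergroups."""
    return sorted(f for f, groups in table.items() if groups == SUPERGROUPS)


def missing_from(table):
    """For each non-universal family, the supergroups in which it is absent."""
    return {f: sorted(SUPERGROUPS - g) for f, g in table.items() if g != SUPERGROUPS}


print(universal(presence))            # ['HECT', 'USP']
print(missing_from(presence)["MJD"])  # ['excavates']
```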
A linear correlation exists between the number of proteins
encoded in a genome and the size of its UBE/DUB complement (fig. 3C), with the proportion of genes encoding UBEs/
DUBs ranging from 5.3% (Arabidopsis thaliana) to 1.2%
(Phytophthora sojae). Only 1.7% of human genes encode
UBE/DUB, a figure similar to that of protein kinases (Martin
et al. 2009). Unicellular organisms typically have fewer than 300
UBEs and fewer than 40 DUBs, whereas the more complex
(plant and animal) multicellular organisms possess massively
enlarged repertoires of E3s and to some extent larger E2 and
[Figure 3. Panel A: presence/absence matrix of the ubiquitinating enzyme families (E1, E2, HECT, RING finger: RNF, BTB, SOCS, F-box; U-box; Other: ZnF A20, DDB1-like) and deubiquitinating enzyme families (JAMM, MJD, OTU, UCH, USP) across unikonts (fungi, amoebozoa/choanozoa, metazoans), excavates, plants, and chromalveolates. Panel B: evolutionary tree of the eukaryotic groups. Panel C: number of enzymes (E1, E2, E3; DUBs) plotted against the total number of proteins in the genome.]
FIG. 3. Evolutionary conservation of the distinct UBE and DUB families in four eukaryotic supergroups. (A) All UBE families are present in all eukaryotic
genomes surveyed, except for SOCS (metazoan specific) and ZnF A20 (absent in excavates and fungi). All DUB families are universally present in
eukaryotes, except for the MJD DUBs, which appear to have been lost from excavates. (B) Evolutionary tree displaying the relationships among the
various eukaryotic groups included in this study, with the number of species indicated in parentheses. The area of the circles is proportional to the total
number of sequences identified for each group, which are also shown. (C) In eukaryotic genomes a linear relationship exists between the number of
UBE/DUB and the total number of predicted proteins.
DUB complements too (fig. 4 and supplementary table S2,
Supplementary Material online). Therefore, it appears that
organismal complexity roughly correlates with an increase
in the complexity of the ubiquitination system.
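A linear relationship of this kind is typically assessed with an ordinary least-squares fit and a Pearson correlation coefficient. The sketch below shows the calculation in plain Python. The input values are synthetic placeholders constructed to lie exactly on a line; they are not the data behind figure 3C, and the function name `linear_fit` is ours.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Synthetic placeholder values (NOT the data plotted in fig. 3C).
proteome_sizes = [2000, 6000, 20000, 40000]  # total proteins per genome
ube_dub_counts = [50, 130, 410, 810]         # UBE/DUB proteins per genome

slope, intercept, r = linear_fit(proteome_sizes, ube_dub_counts)
print(f"slope={slope:.3f} intercept={intercept:.1f} r={r:.3f}")
```

With real data, the slope estimates the fraction of each additional protein-coding capacity devoted to UBEs and DUBs, and r measures how well a straight line describes the relationship.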
Because UBEs and DUBs are found in all eukaryotic lineages
explored, we interrogated the OrthoMCL database (Chen
et al. 2006) for orthologous genes across the four eukaryotic
supergroups. OrthoMCL is a sensitive genome-scale algorithm
for grouping orthologous protein sequences, with release 5
featuring approximately 1.4 million sequences from 150
genomes (including 57 unikonts, 13 excavates, 11 plants, 17
chromalveolates, and 52 bacteria). We found that 91 genes
are orthologous across all four eukaryotic supergroups (i.e.,
orthologs are present in at least one genome from each supergroup) (fig. 5). These 91 genes represent an essential core
of UBEs and DUBs that probably dates back to the first eukaryotes approximately 2 billion years ago. However, some of
these genes appear to have been lost secondarily in specific
lineages, especially excavates and chromalveolates, two lineages that include many parasitic species. This suggests that
their likely essential functions have been taken over by other
enzymes or that the original enzymes have evolved beyond
recognition. If we select only those genes that are present in at
least 75% of the genomes of each eukaryotic supergroup, a
[Figure 4: horizontal bar plot; x-axis, number of proteins (0 to 3,000); bars split into E1, E2, E3, and DUB. Species shown, by supergroup:
Unikonts: Homo sapiens, Pongo abelii, Pan troglodytes, Macaca mulatta, Rattus norvegicus, Mus musculus, Monodelphis domestica, Ornithorhynchus anatinus, Gallus gallus, Xenopus tropicalis, Takifugu rubripes, Danio rerio, Ciona intestinalis, Branchiostoma floridae, Strongylocentrotus purpuratus, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Nematostella vectensis, Trichoplax adhaerens, Monosiga brevicollis, Dictyostelium discoideum, Yarrowia lipolytica, Ustilago maydis, Phanerochaete chrysosporium, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Neurospora crassa, Kluyveromyces lactis, Encephalitozoon cuniculi, Cryptococcus neoformans, Candida glabrata.
Excavates: Trypanosoma cruzi, Trypanosoma brucei, Leishmania major, Leishmania infantum.
Plants: Zea mays, Populus trichocarpa, Physcomitrella patens, Oryza sativa, Arabidopsis thaliana, Volvox carteri, Ostreococcus tauri, Ostreococcus lucimarinus, Cyanidioschyzon merolae, Chlamydomonas reinhardtii.
Chromalveolates: Thalassiosira pseudonana, Phytophthora ramorum, Phytophthora sojae, Phaeodactylum tricornutum.]
FIG. 4. Bar plot of the UBE and DUB complements of 50 eukaryotic genomes split into E1, E2, E3, and DUB.
[Figure 5: circle plots with one column per supergroup (unikonts, excavates, plants, chromalveolates); genes are grouped by family (E1, E2, HECT, RNF, BTB, U-box, DDB1-like, Cullin, JAMM, OTU, UCH, USP). Inset key: % of species in each eukaryotic supergroup in which an ortholog is present (0%, 50%, 100%).]
FIG. 5. Circle plots showing the presence of the 91 orthologous UBE/DUB genes found in all four eukaryotic supergroups. A fully colored circle indicates
that 100% of the species included in that eukaryotic supergroup possess an identifiable ortholog (see inset key). Four genes encoding cullins also appear
to be largely conserved in all four eukaryotic supergroups.
subset of 18 genes is identified, including 12 UBE and 6 DUB
genes. Figure 5 lists the human genes associated with these 91
orthologous groups (with the 18 “ultra conserved" groups
underlined). Examination of the mouse orthologs in the
Mouse Genome Database (Eppig et al. 2012) shows that
the genetic deletion of the majority of these genes produces
embryonic lethal phenotypes or leads to serious conditions
later in life (e.g., abnormal function and development of the
nervous system, inflammatory diseases, infertility, and major
nuclear organization defects leading to abnormal chromosome condensation and segregation).
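The two conservation criteria used above (an ortholog in at least one genome of every supergroup, and an ortholog in at least 75% of the genomes of every supergroup) can be expressed as a single filter. The following is a minimal sketch on toy data: the group identifiers `OG_a` to `OG_c`, the species sample, and the function name are hypothetical, and the sketch does not query OrthoMCL.

```python
def conserved_groups(ortholog_groups, supergroup_of, min_fraction=0.0):
    """Return the ortholog groups that are present in every supergroup.

    A group qualifies when, in each supergroup, it occurs in at least one
    genome and in at least `min_fraction` of that supergroup's genomes.
    """
    genomes_per_supergroup = {}
    for species, supergroup in supergroup_of.items():
        genomes_per_supergroup.setdefault(supergroup, set()).add(species)

    kept = []
    for group_id, species_with_ortholog in ortholog_groups.items():
        ok = True
        for genomes in genomes_per_supergroup.values():
            present = len(genomes & species_with_ortholog)
            if present == 0 or present / len(genomes) < min_fraction:
                ok = False
                break
        if ok:
            kept.append(group_id)
    return sorted(kept)


# Toy data: two genomes per supergroup, three hypothetical ortholog groups.
supergroup_of = {
    "Homo sapiens": "unikonts",
    "Saccharomyces cerevisiae": "unikonts",
    "Trypanosoma brucei": "excavates",
    "Leishmania major": "excavates",
    "Arabidopsis thaliana": "plants",
    "Oryza sativa": "plants",
    "Phytophthora sojae": "chromalveolates",
    "Thalassiosira pseudonana": "chromalveolates",
}
ortholog_groups = {
    "OG_a": set(supergroup_of),  # present in every genome
    "OG_b": {"Homo sapiens", "Trypanosoma brucei",
             "Arabidopsis thaliana", "Phytophthora sojae"},
    "OG_c": {"Homo sapiens", "Saccharomyces cerevisiae", "Arabidopsis thaliana"},
}

print(conserved_groups(ortholog_groups, supergroup_of))        # ['OG_a', 'OG_b']
print(conserved_groups(ortholog_groups, supergroup_of, 0.75))  # ['OG_a']
```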
The DUDE-db
Our method and predictions are accessible via an online web
resource (http://www.DUDE-db.org/). DUDE-db contains the
precomputed predictions for 50 eukaryotic genomes, including 35,228 distinct UBE and DUB proteins, which have been
complemented by including proteins encoding the important
scaffolding proteins of the “Cullin" and “VHL" families in the
same 50 genomes. Cullin and VHL proteins were identified
by the presence of specific protein domains as provided by
high-quality models from the Pfam, SMART, and
SUPERFAMILY databases (Cullin: PF00888 [IPR001373] and
SM00182/SSF75632 [IPR016158]; VHL: PF01847/SSF49468
[IPR022772]). This brings the total number of proteins in
DUDE-db to 35,778. UBEs constitute approximately 87% of
all database entries and are mainly composed of E3s (81%),
followed by E2s (6%) and E1s (<1%). Among the E3s, the
most abundant family is the RNF domain, making up approximately 50% of all E3s, followed by BTB enzymes (23%) and
F-box enzymes (16%). DUBs represent approximately 11%
of all entries, with USPs being the main DUB family (57% of
all DUBs). The scaffolding proteins of the Cullin and VHL
families represent a mere 1.5% of all entries. Finally, enzymes
belonging to more than one UBE/DUB family (the
“Mixed-family" category) account for only 0.4% of the data
set. The annotation of the proteins in DUDE-db was enriched with the information provided in the original
genome releases, including not only the original protein ID
but also the gene ID each protein maps to, the annotated gene names and gene descriptions, and the protein
sequence.
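The share of scaffolding proteins follows directly from the two totals given above. The short check below uses only those two numbers from the text; the variable names are ours.

```python
ube_dub_proteins = 35_228  # predicted UBE and DUB proteins in the 50 genomes
total_entries = 35_778     # after adding the Cullin and VHL scaffolding proteins

scaffold_proteins = total_entries - ube_dub_proteins
scaffold_share = scaffold_proteins / total_entries

print(scaffold_proteins)        # 550
print(f"{scaffold_share:.1%}")  # 1.5%
```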
DUDE-db provides a comprehensive interface for accessing
the database: The “Search database" tab allows the user to
browse the database by keyword (that will match sequence
and gene IDs, gene names, and gene descriptions) or by any
combination of UBE/DUB family and species (fig. 6). The results are tabulated in an output HTML page, detailing not
only the original protein IDs and their family classification but
also their associated gene IDs, gene names and gene descriptions, and a mini-plot of the protein domain architecture. By
clicking on the mini-plot, a full-size image of the protein
domain architecture will be displayed (fig. 6). The user is
given the option to download the data sets, either in the
“text-only" version (fully annotated protein sequences in
FASTA format) or also including full-size protein domain architecture plots for each protein. The “Start Jalview" button
allows the user to launch a Java applet of the multiple sequence analysis tool Jalview (Waterhouse et al. 2009) to visualize the query results. Using Jalview, the user can easily edit
and color the sequences by conservation, protein secondary
structural properties, or amino acid chemical characteristics
and perform on-the-fly calculations of phylogenetic trees
(Neighbor-Joining and average distance). The full Jalview application can be readily launched via the “File → View in Full
Application” option. This provides access to various multiple
sequence alignment algorithms (MAFFT, Muscle, ClustalW,
T-Coffee, and Probcons), secondary structure prediction by
JNet, and structure display with Jmol. Furthermore, multiple
alignments in Jalview can now be exported to TOPALi v2
(Milne et al. 2009) via a synchronized interface where the
user can access more sophisticated methods of sequence
evolutionary analysis. Through the “Peptide scan" interface,
users can submit their own sequences for prediction, either by
“copying and pasting" the sequences in a text box or uploading them from a local file. The input sequences are first subjected to basic quality assurance checks before submission to
a multinode Linux cluster. In the meantime, the user can also
check the job’s running status and even submit a second set
of sequences for scanning, while the first job is still in progress.
The user is asked to provide a job name and e-mail address
where a hyperlink with the results will be sent upon job submission. The results page reports those proteins that are predicted to be UBEs, DUBs, or scaffolding proteins (Cullin/VHL),
and their corresponding domain architecture plots.
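The quality-assurance checks applied by the server are not specified in the text. As an illustration of what such checks could look like before sequences are submitted for scanning, the sketch below parses FASTA input and flags empty records and non-amino-acid characters. The accepted alphabet and the function names are assumptions made for this example, not a description of the DUDE-db implementation.

```python
# Assumed alphabet: the 20 standard residues plus common ambiguity/extra codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        else:
            if header is None:
                raise ValueError("sequence data before the first header")
            records[header] += line.upper()
    return records


def check_sequences(records):
    """Return {header: problem} for records that fail the basic checks."""
    problems = {}
    for header, seq in records.items():
        if not seq:
            problems[header] = "empty sequence"
        elif set(seq) - VALID_RESIDUES:
            problems[header] = "non-amino-acid characters"
    return problems


example = ">p1\nMKT\nAYI\n>p2\nMK1T\n>p3\n"
records = parse_fasta(example)
print(records["p1"])             # MKTAYI
print(check_sequences(records))  # {'p2': 'non-amino-acid characters', 'p3': 'empty sequence'}
```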
Discussion
Protein ubiquitination is a fundamental mechanism controlling virtually all known cellular functions in complex ways that
we are only beginning to understand (Grabbe et al. 2011).
Despite the importance of ubiquitination in normal physiological and disease states, partial catalogs of these important
enzymes only existed for a few select species. We report here a
broadly applicable computational method to predict UBEs
and DUBs on the basis of identifying protein domains diagnostic of the various UBE/DUB families. Our method relies on
HMMs specifically selected from three distinct protein
domain databases (Pfam, SMART, and SUPERFAMILY), and
upon evaluation and manual examination of the results, we
report a coverage and a family-level classification accuracy
greater than 99%. This means that using the UBE/DUB
HMM Library, we expect not only to retrieve more than
99% of these enzymes from any genome but also that these
enzymes will be correctly assigned to a specific protein family
in all cases. HMMs are known to be both especially sensitive
for database searches and highly specific for the classification
of protein sequences into specific groups, as we previously
showed for the protein kinase superfamily (Miranda-Saavedra
and Barton 2007). The application of our method on the
previously reported UBE/DUB sets of yeast and human allowed us to identify proteins that most likely are annotation
mistakes and most importantly to find new genes encoding
UBEs and DUBs that had not been reported before. In human,
we report an additional 35% genes encoding UBEs and DUBs,
and in the well-studied Baker’s yeast an additional 25%,
FIG. 6. DUDE-db user interface. (A) Screenshot of the web interface to DUDE-db. Users can select any combination of UBE/DUB families and species for
display and download. Results are displayed in an HTML table (B) with each enzyme’s protein architecture displayed as a mini-plot. Clicking on the
mini-plot generates a full-size image. Results can be downloaded as fully annotated sequences in FASTA format or in the “deluxe” download version
including all full-size protein architecture images too.
thus highlighting the power of our method for characterizing
the complements of UBEs and DUBs genome wide. Our predictions on 50 eukaryotic genomes cover four of the five major phylogenetic supergroups of eukaryotes and are now available
through DUDE-db, a comprehensive database of UBEs and
DUBs. DUDE-db also features an easy-to-use prediction server
where users can scan their protein data sets. Therefore, our
prediction method and its web interface, DUDE-db, represent
a significant advance in the field and fill an important gap for
researchers studying protein ubiquitination. DUDE-db will be
updated periodically to incorporate additional predicted and
manually annotated UBE/DUB complements, as well as
predictions of proteins with ubiquitin-binding domains
(Dikic et al. 2009). Recently, a database of UBEs and DUBs
covering a limited portion of the eukaryotic tree was published (Gao et al. 2013). The authors used the sequences of
keyword-annotated UBEs and DUBs to build HMMs with
which multiple genomes were interrogated. Our method captures a richer diversity of sequences in
its HMMs, and as an illustrative result, we report more than
three times as many UBE and DUB proteins in
human (2,985 vs. 886).
Protein ubiquitination and phosphorylation are similar in
that both types of chemical modification can be reversed
(by deubiquitinases and phosphatases, respectively) and
specifically recognized (by phosphoprotein- and ubiquitin-binding domains), both can be autoregulated (by autoubiquitination and autophosphorylation), and both can be
induced by various stimuli. Moreover, both systems are tightly
interlinked because ubiquitination is often dependent on protein phosphorylation (e.g., to create binding sites for E3s or to
modulate E3 activity), and many protein kinases are known to
be ubiquitinated as exquisitely illustrated for the EGF-receptor
signaling network (Woelk et al. 2007; Argenzio et al. 2011).
The ubiquitination system is larger by number of enzymes
than the phosphorylation system, with the added complexity
that all seven lysines of the ubiquitin molecule can be ubiquitinated, giving rise to poly-ubiquitinated and branched
poly-ubiquitinated chains. All these chemical modifications
introduce topologies that encode functionally distinct signals
that are recognized by cells and affect not only the half-lives of
target proteins but also their structure and activity, localization, and interaction partners (Woelk et al. 2007). The sheer
size of ubiquitin chains has hindered their identification by
quantitative proteomics, but improvements in novel controlled protein fragmentation techniques, together with
more sensitive and accurate mass spectrometers, will eventually enable the site-specific identification of ubiquitin modifications with the same degree of detail as for acetylation and
phosphorylation, on a proteome-wide scale (Vertegaal 2011).
Coupling mass spectrometry to experiments in genetically
modified cells will lead to understanding the involvement
of specific E2/E3 pairs in regulating specific targets and cellular
processes in the dynamic context of other post-translational
modifications (Nagy and Dikic 2010) and how the subversion
of the system contributes to disease. Abnormal ubiquitination has been implicated in dozens of diseases (Ciechanover
and Schwartz 2004; Petroski 2008; Cohen and Tcherpakov
2010; Kirkin and Dikic 2011; Fulda et al. 2012), and UBEs
and DUBs could become the next large set of drug targets
after G protein-coupled receptors and protein kinases (Cohen
2002; Cohen and Tcherpakov 2010). Protein kinases are such a
successful group of targets because we have a thorough understanding of their druggability principles as most kinase
inhibitors target the ATP-binding pocket. As a result, chemical
libraries can be synthesized and tested directly against panels
of kinases. On the other hand, ubiquitination is more complex
and versatile than phosphorylation as a control mechanism,
and as such, the potential for therapeutic intervention is
enormous. The only commercially available inhibitor of the
ubiquitin system (Bortezomib) targets the 26S proteasome,
but recently some E3 inhibitors have entered clinical trials
(reviewed in Cohen and Tcherpakov 2010). The development
of general principles for identifying inhibitors against E3s,
E2/E3 pairs, or DUBs will give birth to chemical libraries
that can be tested against enzymes of the ubiquitin system.
In this context, DUDE-db provides not only a data set of
predicted UBEs and DUBs in humans, but also an essential
toolkit for pharmacophylogenomic analysis (Searls 2003).
Phylogenetic reconstructions including a large number of species make more reliable ortholog and paralog calls. These in
turn can help identify functional shifts (orthologs) and
broader issues of coevolution, pleiotropy, and functional redundancy, all of which are essential considerations when reasoning about species differences and identifying appropriate
drug targets and disease models for drug development
pipelines.
Materials and Methods
Systematic Prediction of UBEs and DUBs
The method presented here relies on distinguishing among
the various eukaryotic UBE and DUB families by virtue of
diagnostic protein domain signatures. We aimed to characterize the specific combination of protein domain signatures
from the smallest number of protein domain databases that
allows the identification and classification of characterized
sets of UBEs and DUBs into their correct families, without
cross-hitting other families (fig. 1).
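The classification principle described above can be sketched as a lookup from diagnostic domain signatures to enzyme families. This is only an illustration: the signature identifiers and the mapping below are hypothetical placeholders, not the contents of the UBE/DUB HMM Library (which are listed in supplementary table S4). The "Mixed family" label follows the category used in table 3 for proteins carrying signatures of more than one family.

```python
from typing import Dict, Iterable, Optional

# Hypothetical signature-to-family map (placeholder identifiers only;
# the real library is given in supplementary table S4).
SIGNATURE_TO_FAMILY: Dict[str, str] = {
    "sig_E1_activating": "E1",
    "sig_UBC": "E2",
    "sig_HECT": "E3:HECT",
    "sig_RNF": "E3:RING/RNF domain",
    "sig_Fbox": "E3:RING/F-box",
    "sig_Ubox": "E3:U-box",
    "sig_USP": "DUB:USP",
    "sig_UCH": "DUB:UCH",
}


def classify(domain_hits: Iterable[str]) -> Optional[str]:
    """Assign a protein to a family from the domain signatures it matches.

    Returns None when no diagnostic signature is present, the family name
    when exactly one family is implied, and "Mixed family" otherwise.
    """
    families = {
        SIGNATURE_TO_FAMILY[hit]
        for hit in domain_hits
        if hit in SIGNATURE_TO_FAMILY
    }
    if not families:
        return None
    if len(families) == 1:
        return next(iter(families))
    return "Mixed family"


print(classify(["sig_HECT", "unrelated_domain"]))
print(classify(["sig_RNF", "sig_USP"]))
print(classify(["unrelated_domain"]))
```

Because non-diagnostic domains are ignored, a protein is only ever assigned to a family whose signature it actually carries, which is the sense in which cross-hits between families are avoided in this sketch.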
Working Data Sets
A data set of 2 human E1 and 32 E2 genes was previously
described by van Wijk and Timmers (2010). E1s and E2s are
characterized by ubiquitin-activating and UBC domains, respectively. E3s constitute a much larger and more diverse group of
enzymes: In 2008, Li et al. reported a catalog of 616 distinct
human E3s representing individual human genes and classified into the HECT, RING finger, U-box, ZnF A20, and
DDB1-like families. Most of these sequences had been predicted by sequence similarity to other characterized ligases.
The RefSeq protein IDs provided for these E3s were mapped
to the human RefSeq database (release of July 2012) (Pruitt
et al. 2009). The original proteins could be retrieved for
598/616 RefSeq IDs. This E3 data set was curated by reannotating duplicate entries (NP_055763 as “RING finger” and
NP_694578 as “BTB”). Next, to obtain a final, nonredundant
gene data set, RefSeq IDs were mapped to the human
Ensembl gene set (release 67) (Flicek et al. 2012).
Nonmatches and double annotations to human Ensembl
genes were curated to produce a set of 2 E1, 32 E2, and 592
E3 genes in the human genome.
The human complement of DUBs was previously reported
to consist of 95 genes (Nijman et al. 2005). Examination
of this data set led to the removal of nine DUBs that were
either discontinued genes (USP17-like, USP51, DUB-3,
ENSG00000197767, IFP38, and ENSG00000198817) or pseudogenes (TL132, TL132-like, and PARP11). This produced a set
of 86 genes encoding DUBs in human.
Characterization of Protein Domain Signatures Specific
to UBE and DUB
The sets of human UBEs and DUBs described above
were analyzed with a local installation of InterProScan
(release 30.0) (Zdobnov and Apweiler 2001) run with default
parameters. This was done both to verify the identities of
these proteins as bona fide UBEs and DUBs and to identify
specific protein domain signatures diagnostic for the enzymes
in each family. InterProScan is the working tool behind
InterPro, the most complete and integrated database of protein domains, regions, and sites, widely used for the automatic
annotation of proteins and entire genomes. InterPro combines a number of member databases that use distinct underlying methodologies, which as a result cover different
regions of the protein space. Thus, by combining various protein signature databases, InterPro makes the most of their
individual strengths with the added advantage that InterPro
automatically integrates trusted cutoffs for each protein
domain model. We found that 26/592 (4.4%) of the E3s
reported by Li et al. (2008) harbor no protein domain
characteristic of any E3 family and have no reported experimental evidence of
catalytic activity; these were therefore disregarded. For these same reasons, 2/86 DUBs from the Nijman
et al. data set (Nijman et al. 2005) were also removed, producing a final superset (“the working data set”) of 600 UBEs (2
E1, 32 E2, and 566 E3 genes) and 84 DUB genes in the human
genome (supplementary table S3, Supplementary Material
online).
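The curation steps above amount to simple bookkeeping, which can be rechecked in a few lines. The counts are those reported in the text; the variable names are ours and purely illustrative.

```python
# Counts as reported in the text.
e1_genes, e2_genes = 2, 32
e3_after_ensembl_mapping = 592   # E3 genes after RefSeq-to-Ensembl curation
e3_without_evidence = 26         # no diagnostic domain, no reported activity
dub_reported = 95                # Nijman et al. (2005)
dub_discontinued, dub_pseudogenes = 6, 3
dub_without_evidence = 2

# Derived sizes of the working data set.
e3_final = e3_after_ensembl_mapping - e3_without_evidence
dub_curated = dub_reported - dub_discontinued - dub_pseudogenes
dub_final = dub_curated - dub_without_evidence
ube_final = e1_genes + e2_genes + e3_final

print(f"E3 genes kept: {e3_final}")
print(f"UBE genes in working data set: {ube_final}")
print(f"DUB genes in working data set: {dub_final}")
```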
Analysis of the protein domain signatures of the working
data set led to the identification of a minimal group of 33
distinct protein domain signatures that allows the automatic
retrieval and correct family-level classification of all the enzymes in the working data set (supplementary table S4,
Supplementary Material online). We call this collection of
HMMs the “UBE/DUB HMM Library." We chose to combine
HMMs from Pfam (Punta et al. 2012), SMART (Letunic et al.
2012), and SUPERFAMILY (Wilson et al. 2009) as these databases differ in their contents and therefore their coverage.
Pfam (release 26.0) is a large collection of more than 13,000
protein family alignments, whereas SMART (version 7) contains models for more than 1,000 protein domains. Although
Pfam and SMART are built from manually curated alignments
of multiple protein sequences, SUPERFAMILY is based on the
sequences of domains with known three-dimensional structure as contained in the SCOP database (Andreeva et al.
2008). Therefore, the integration of HMMs from the three
databases leads to an improvement in the predictive power of
the method when compared with using any individual database in isolation. In InterPro, each HMM is usually integrated
into an “InterPro entry" that typically includes protein and
domain models not only from the Pfam, SMART, and
SUPERFAMILY databases but also from other InterPro
member databases. In fact, many of the 33 HMMs selected
above are part of InterPro entries that feature additional
models from other InterPro member databases, such as
Gene3D, PANTHER, PRINTS, and PROSITE. We verified empirically that including any other protein domain model from
other InterPro member databases in addition to the 33
HMMs of the UBE/DUB HMM Library (supplementary
table S4, Supplementary Material online) did not improve
the coverage on the annotated set of UniProt proteins associated with each InterPro entry. Therefore, the select
combination of HMMs from the Pfam, SMART, and
SUPERFAMILY databases that makes up the UBE/DUB
HMM Library represents both a minimal and comprehensive
collection of protein domain models that cover all the protein
space associated with each one of the InterPro entries they are
featured in.
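The redundancy check described above can be expressed as a set comparison: for a given InterPro entry, the proteins matched by the library HMMs are compared with the proteins matched by any member-database model of that entry. The following is a minimal sketch with made-up model names and protein identifiers; it is not the authors' actual procedure.

```python
from typing import Dict, Set


def uncovered_by_library(
    hits_by_model: Dict[str, Set[str]], library_models: Set[str]
) -> Set[str]:
    """Return proteins matched by some model of the entry but by no library model."""
    covered_by_library: Set[str] = set()
    covered_by_any: Set[str] = set()
    for model, proteins in hits_by_model.items():
        covered_by_any |= proteins
        if model in library_models:
            covered_by_library |= proteins
    return covered_by_any - covered_by_library


# Hypothetical hits for one InterPro entry (identifiers are placeholders).
entry_hits = {
    "pfam_model": {"P1", "P2", "P3"},
    "smart_model": {"P2", "P4"},
    "prosite_model": {"P1", "P4"},
}

# With both library models, the extra model contributes nothing new.
missing = uncovered_by_library(entry_hits, {"pfam_model", "smart_model"})
print(sorted(missing))
```

An empty result for every entry corresponds to the situation reported in the text, in which additional member-database models do not improve coverage.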
Evaluation
To evaluate the coverage and accuracy of our method, we
performed a series of tests on sequences that had been manually annotated as UBEs and DUBs, for many of which
experimental evidence is available. In a first exercise, we attempted to classify the manually annotated sets of UBEs and
DUBs from the model yeast Saccharomyces cerevisiae. The set
of E1s and E2s was retrieved from the S. cerevisiae
Ubiquitination Database (SCUD) (Lee et al. 2008). Yeast E3s
had previously been characterized by Li et al. (2008), and the
DUBs were those reported by Amerik et al. (2000). According
to these annotated data sets, the yeast UBE/DUB complement consists of 109 genes, including E1 (1 gene), E2 (11), E3
(80), and DUB (17). Our method retrieved 89/109 proteins
(81.6%), with a correct classification rate of 100% on the
family level (table 2). A closer inspection of the proteins
that we failed to identify unveiled the incorrect annotation
of 19 proteins. These include the following eight RNF genes:
ASI1 (YMR119W) and ASI3 (YNL008C) are inner nuclear
membrane proteins, and although the authors suggested
the presence of nucleoplasmically oriented C-terminal RING
domains, they could not show auto- or transubiquitination
activity (Zargari et al. 2007). Moreover, not a single protein
domain could be identified in these proteins using
InterProScan. Similarly, SLX5 (YDL013W) has no identifiable
Table 2. Coverage and Family-Level Classification Accuracy of the UBE/DUB HMM Library on the Yeast UBE/DUB Complement.

Species  Class  Family           Genes  Coverage (%)  Family-Level Classification Accuracy on Data Set Covered (%)
Yeast    E1     E1               1      1 (100)       1 (100)
         E2     E2               11     11 (100)      11 (100)
         E3     HECT             5      5 (100)       5 (100)
                RING/RNF domain  47     39 (83.0)     39 (100)
                RING/BTB         3      3 (100)       3 (100)
                RING/SOCS box    0      NA            NA
                RING/F-box       21     9 (42.8)      9 (100)
                U-box            2      2 (100)       2 (100)
                Other/ZnF A20    0      NA            NA
                Other/DDB1-like  2      2 (100)       2 (100)
                Mixed family     0      NA            NA
         DUB    JAMM             0      NA            NA
                MJD              0      NA            NA
                OTU              0      NA            NA
                UCH              1      1 (100)       1 (100)
                USP              16     16 (100)      16 (100)
         Total                   109    89 (81.6)     89 (100)

NOTE: UCH, ubiquitin C-terminal hydrolases; NA, not applicable.

protein domains, and although it was shown that the
Slx5p-Slx8p dimer has robust substrate-specific E3 ligase activity, such activity is ascribed to the SLX8 gene, with SLX5
working to enhance the catalytic process only (Xie et al. 2007).
The genes SNT2 (YGL131C) and PEP3 (YLR148W) have not been
reported to have E3 activity, nor do they encode any protein domains that would suggest it. The gene STE5 (YDR103W) is not
an E3 but a MAPK scaffold protein that controls the mating
decision, and the ubiquitination of Ste5p appears to be controlled by the Cdc4p E3 (an F-box) and requires Cdk1p-mediated phosphorylation (Garrenton et al. 2009). PEX2
(YJL210W) does not encode an identifiable RNF domain,
and neither does PEX12, unlike PEX10, which may be the E3
responsible for the ubiquitination reactions reported for this
peroxisomal protein import complex (Platta et al. 2009). The
SSL1 gene (YLR005W), part of the TFIIH complex, has been
reported to display E3 activity (Takagi et al. 2005). Analysis of
its domain structure showed that the C-terminal domain is a
C2H2 zinc finger domain (a very common DNA-binding
domain found in many transcription factors) and not a
C3HC4 zinc finger domain (which is diagnostic for RNF
domain E3s). As the authors noted, Ssl1p might depend
upon another subunit for its activity. Among the proteins
annotated as F-box proteins, the following lack an identifiable
F-box domain, and no specific E3 activity has been conclusively ascribed: COS111 (YBR203W), SAF1 (YBR280C), RCY1
(YJL204C), HRT3 (YLR097C), YLR224W, YLR352W, AMN1
(YBR158W), CTF13/CBF3 (YMR094W), RAD7 (YJR052W),
YDR306C, and ROY1 (YMR258C). Discounting these 19
false positives from the original annotation brings the yeast
complement to 73 UBEs and 17 DUBs. Therefore, because our
method identified 89 of these 90 enzymes, we estimate a
coverage of approximately 99% (89/90) and a family-level
classification accuracy of 100% on this curated data set. The
only gene that we failed to identify is Elongin A (ELA1 or
YNL230C), an E3 containing an F-box domain that displays
marginal sequence similarity to canonical F boxes. This ELA1
F-box domain can only be identified with a PROSITE matrix
(and with a very poor score).
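The two evaluation measures used above can be written out explicitly. The sketch below assumes the definitions implied by the text: coverage is the fraction of bona fide enzymes that are retrieved, and family-level accuracy is the fraction of retrieved enzymes assigned to the correct family. The function and variable names are ours.

```python
def coverage(retrieved: int, total: int) -> float:
    """Percentage of the reference set that was retrieved."""
    return 100.0 * retrieved / total


def family_accuracy(correct: int, retrieved: int) -> float:
    """Percentage of retrieved proteins assigned to the correct family."""
    return 100.0 * correct / retrieved


# Yeast counts as reported in the text.
annotated = 109            # genes in the original annotated sets
misannotated = 19          # entries judged to be annotation errors
retrieved = 89             # genes recovered by the HMM library
correctly_classified = 89  # all retrieved genes fell into the right family

curated = annotated - misannotated
print(f"curated reference set: {curated} genes")
print(f"coverage on curated set: {coverage(retrieved, curated):.1f}%")
print(f"family-level accuracy: {family_accuracy(correctly_classified, retrieved):.1f}%")
```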
In a second test on a more phylogenetically diverse data
set, we examined the coverage and classification accuracy of
our method on a select set of proteins from the UniProt
database (Uniprot Consortium 2010), the largest public
repository of protein sequences. First, we examined the
Gene Ontology (GO) resource (Gene Ontology Consortium
2010) for the GO terms under the “Molecular Function” category that would help categorize UBEs and DUBs. These GO
terms (supplementary table S5, Supplementary Material
online) were used to select a set of protein sequences
(UniProt release of 16 May 2012; http://www.ebi.ac.uk/uniprot/) as our benchmark. Moreover, these proteins were required to
carry one of the experimental evidence
codes IDA (“Inferred from Direct Assay”), IPI (“Inferred from
Physical Interaction”), IMP (“Inferred from Mutant
Phenotype”), IGI (“Inferred from Genetic Interaction”), or
EXP (“Inferred from Experiment”). This selection resulted in
a set of 572 UniProt proteins from a variety of phylogenetic
lineages including chordates (human, mouse, rat, chicken,
Xenopus laevis, and zebrafish), nonchordate animals
(Drosophila melanogaster and Caenorhabditis elegans),
plants (A. thaliana, Oryza sativa, and Capsicum annuum),
fungi (Schizosaccharomyces pombe, S. cerevisiae, Candida albicans, and Emericella nidulans), a number of (mainly Gram-negative) bacteria (Chlamydia trachomatis, Legionella
pneumophila, Mycobacterium tuberculosis, Salmonella typhimurium, and Shigella flexneri), and a virus (Human herpesvirus). The UBE/DUB HMM Library retrieved 514/572 proteins
(89.9%). Manual inspection of the set of 58 proteins that
our method failed to identify showed that 16 of these were
fragments instead of full proteins. Moreover, 24 of the
proteins were misannotated as UBEs or DUBs, 10 proteins
did not harbor any identifiable protein domains at all, and 7
proteins were of bacterial origin. Bacterial enzymes of the
ubiquitination pathway differ substantially from the eukaryotic enzymes sequence wise, and the UBE/DUB HMM Library
was not expected to find them, although we successfully
identified and classified a bacterial E3 ligase from Leg. pneumophila (LubX, UniProt identifier LUBX_LEGPH). We also
failed to identify human ZFP91 (UniProt: ZFP91_HUMAN),
an atypical E3 reported to mediate the activatory Lys(63)-linked ubiquitination of MAP3K14/NIK (Jin et al. 2010).
ZFP91 cannot possibly be identified as an E3 on the sequence
level as it harbors C2H2 zinc finger domains (which typically
bind DNA), and as the name implies, it is “atypical." Therefore,
if we take into account the presence of these false positives in
the original UniProt annotation, we report a coverage of 514/
515 (99.8%) on this data set. Among the 514 proteins that
were retrieved are representatives of all UBE and DUB families
except DDB1-like, and the correct classification rate of all
identified proteins was 100% on the family level (table 3).
Collectively, the results from these two tests indicate a coverage of more than 99% for the retrieval of UBEs and DUBs from
eukaryotic genomes, with a correct classification rate of 100%
on the family level.
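The benchmark selection step used in the second test, which keeps only proteins whose GO annotation carries an experimental evidence code, can be sketched as a simple filter. The record layout, protein names, and GO identifiers below are hypothetical placeholders; only the five evidence codes are taken from the text.

```python
from typing import Iterable, List, NamedTuple, Set

# Experimental evidence codes named in the text.
EXPERIMENTAL_CODES = frozenset({"IDA", "IPI", "IMP", "IGI", "EXP"})


class GoAnnotation(NamedTuple):
    protein: str
    go_term: str
    evidence: str


def select_benchmark(
    annotations: Iterable[GoAnnotation], target_terms: Set[str]
) -> List[str]:
    """Proteins annotated with a target GO term under an experimental code."""
    selected = {
        a.protein
        for a in annotations
        if a.go_term in target_terms and a.evidence in EXPERIMENTAL_CODES
    }
    return sorted(selected)


# Placeholder annotations (not real UniProt or GO identifiers).
annotations = [
    GoAnnotation("PROT_A", "GO:TERM1", "IDA"),
    GoAnnotation("PROT_B", "GO:TERM1", "IEA"),  # electronic annotation: excluded
    GoAnnotation("PROT_C", "GO:TERM2", "IMP"),
    GoAnnotation("PROT_D", "GO:OTHER", "IDA"),  # term not of interest: excluded
]

print(select_benchmark(annotations, {"GO:TERM1", "GO:TERM2"}))
```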
The DUDE-db: Contents and Sequence Data Sources
The UBE/DUB HMM Library was used to scan the predicted
protein data sets of 50 completely sequenced and published
eukaryotic genomes. The genomes analyzed belong to four of
the five eukaryotic supergroups, as defined by Keeling et al.
(2005). These include the unikonts: Anopheles gambiae (Holt
et al. 2002), Branchiostoma floridae (Putnam et al. 2008),
C. elegans (C. elegans Sequencing Consortium 1998), Can.
glabrata (Dujon et al. 2004), Ciona intestinalis (Dehal et al.
2002), Cryptococcus neoformans (Loftus et al. 2005), Danio
rerio (Flicek et al. 2012), Dictyostelium discoideum (Eichinger
et al. 2005), D. melanogaster (Adams et al. 2000), E. cuniculi
(Katinka et al. 2001), Gallus gallus (International Chicken
Genome Sequencing Consortium 2004), Homo sapiens
(Lander et al. 2001), Kluyveromyces lactis (Dujon et al.
2004), Macaca mulatta (Gibbs et al. 2007), Monodelphis
domestica (Mikkelsen et al. 2007), Monosiga brevicollis (King
et al. 2008), Mus musculus (Waterston et al. 2002),
Nematostella vectensis (Putnam et al. 2007), Neurospora
crassa (Galagan et al. 2003), Ornithorhynchus anatinus
Table 3. Family-Level Classification Accuracy of the UBE/DUB HMM Library on a Manually Annotated and Experimentally Verified Set of UBEs and DUBs from the UniProt Database.

Class     Family           Proteins  Coverage (%)  Family-Level Classification Accuracy on the Data Set Covered (%)
E1        E1               3         3 (100)       3 (100)
E2        E2               79        79 (100)      79 (100)
E3        HECT             33        33 (100)      33 (100)
          RING/RNF domain  244       244 (100)     244 (100)
          RING/BTB         2         2 (100)       2 (100)
          RING/SOCS box    1         1 (100)       1 (100)
          RING/F-box       14        14 (100)      14 (100)
          U-box            37        37 (100)      37 (100)
          Other/ZnF A20    1         1 (100)       1 (100)
          Other/DDB1-like  0         NA            NA
          Mixed family     6         6 (100)       6 (100)
DUB       UCH              6         6 (100)       6 (100)
          USP              69        69 (100)      69 (100)
          MJD              4         4 (100)       4 (100)
          OTU              10        10 (100)      10 (100)
          JAMM             5         5 (100)       5 (100)
Atypical                   1         0 (0)         NA
Total                      515       514 (99.8)    514 (100)

NOTE: The UBE/DUB HMM Library recovered 514/515 (99.8%) of all bona fide UBEs
and DUBs from the UniProt database. This near-perfect coverage was mirrored by a
perfect family-level classification across all enzyme families. The “Mixed family” category includes proteins that harbor domains diagnostic for more than one UBE or DUB
family. UCH, ubiquitin C-terminal hydrolases; NA, not applicable.

(Warren et al. 2008), Pan troglodytes (Chimpanzee Sequencing and Analysis Consortium 2005), Phanerochaete chrysosporium (Martinez et al. 2004), Pongo pygmaeus (Locke et al.
2011), Rattus norvegicus (Gibbs et al. 2004), S. cerevisiae
(Goffeau et al. 1996), Sch. pombe (Wood et al. 2002),
Strongylocentrotus purpuratus (Sodergren et al. 2006),
Takifugu rubripes (Aparicio et al. 2002), Trichoplax adhaerens
(Srivastava et al. 2008), Ustilago maydis (Kamper et al. 2006),
X. tropicalis (Hellsten et al. 2010), and Yarrowia lipolytica
(Dujon et al. 2004); the excavates: L. infantum (Peacock
et al. 2007), L. major (Ivens et al. 2005), T. brucei (Berriman
et al. 2005) and T. cruzi (El-Sayed et al. 2005); the plants:
A. thaliana (Arabidopsis Genome Initiative 2000),
Chlamydomonas reinhardtii (Merchant et al. 2007), Cyanidioschyzon merolae (Matsuzaki et al. 2004), O. sativa
(International Rice Genome Sequencing Project 2005),
Ostreococcus lucimarinus (Palenik et al. 2007), Ost. tauri
(Derelle et al. 2006), Physcomitrella patens (Rensing et al.
2008), Populus trichocarpa (Tuskan et al. 2006), Volvox carteri
(Prochnik et al. 2010), and Zea mays (Wei et al. 2009); and the
chromalveolates: Phaeodactylum tricornutum (Bowler et al.
2008), P. ramorum (Tyler et al. 2006), P. sojae (Tyler et al.
2006), and Thalassiosira pseudonana (Armbrust et al. 2004).
Supplementary table S2, Supplementary Material online, lists
the UBE and DUB protein contents of these species and the
corresponding database releases.
DUDE-db was designed to store the UBE/DUB complements of an unlimited number of genomes and is freely available at http://www.DUDE-db.org/. DUDE-db is stored as a
MySQL relational database (http://www.mysql.com/). The
server is implemented as a set of Python and Perl CGI scripts
running under Apache (http://www.apache.org/).
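To illustrate how a relational layout can support the family-by-species selections and FASTA downloads offered by the web interface, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names, and example records are hypothetical and do not reflect the actual DUDE-db schema.

```python
import sqlite3
from typing import List

# In-memory database standing in for the MySQL back end (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE enzyme (
           protein_id TEXT PRIMARY KEY,
           species    TEXT NOT NULL,
           family     TEXT NOT NULL,
           sequence   TEXT NOT NULL
       )"""
)
# Made-up records for illustration only.
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?, ?, ?)",
    [
        ("prot1", "Homo sapiens", "HECT", "MSEQ"),
        ("prot2", "Homo sapiens", "USP", "MAAA"),
        ("prot3", "Mus musculus", "HECT", "MCCC"),
    ],
)


def fetch_fasta(db: sqlite3.Connection, species: List[str], families: List[str]) -> str:
    """Return FASTA text for any combination of species and families."""
    species_marks = ",".join("?" * len(species))
    family_marks = ",".join("?" * len(families))
    rows = db.execute(
        "SELECT protein_id, species, family, sequence FROM enzyme "
        f"WHERE species IN ({species_marks}) AND family IN ({family_marks}) "
        "ORDER BY protein_id",
        [*species, *families],
    )
    return "".join(f">{pid} {sp} {fam}\n{seq}\n" for pid, sp, fam, seq in rows)


print(fetch_fasta(conn, ["Homo sapiens"], ["HECT", "USP"]))
```

Only the placeholder markers are interpolated into the SQL string; the user-supplied values are always bound as query parameters.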
Supplementary Material
Supplementary tables S1–S5 are available at Molecular Biology
and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors thank Dr. David Martin (Dundee) for help with
Jalview, Dr. Shandar Ahmad (Osaka) for reading the manuscript, and Ms. Mineko Tanimoto for secretarial support.
This work was supported by the Japan Society for the Promotion of Science (JSPS) through
the WPI-IFReC Research Program and a Kakenhi grant; the
Kishimoto Foundation; and the ETHZ-JST Japanese-Swiss
Cooperative Program.
References
Adams MD, Celniker SE, Holt RA, et al. (195 co-authors). 2000. The
genome sequence of Drosophila melanogaster. Science 287:
2185–2195.
Amerik AY, Li SJ, Hochstrasser M. 2000. Analysis of the deubiquitinating
enzymes of the yeast Saccharomyces cerevisiae. Biol Chem. 381:
981–992.
Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ,
Chothia C, Murzin AG. 2008. Data growth and its impact on the
SCOP database: new developments. Nucleic Acids Res. 36:
D419–D425.
Aparicio S, Chapman J, Stupka E, et al. (41 co-authors). 2002.
Whole-genome shotgun assembly and analysis of the genome of
Fugu rubripes. Science 297:1301–1310.
Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence
of the flowering plant Arabidopsis thaliana. Nature 408:796–815.
Ardley HC, Robinson PA. 2005. E3 ubiquitin ligases. Essays Biochem. 41:
15–30.
Argenzio E, Bange T, Oldrini B, Bianchi F, Peesari R, Mari S, Di Fiore PP,
Mann M, Polo S. 2011. Proteomic snapshot of the EGF-induced
ubiquitin network. Mol Syst Biol. 7:462.
Armbrust EV, Berges JA, Bowler C, et al. (45 co-authors). 2004. The
genome of the diatom Thalassiosira pseudonana: ecology, evolution,
and metabolism. Science 306:79–86.
Berriman M, Ghedin E, Hertz-Fowler C, et al. (102 co-authors). 2005. The
genome of the African trypanosome Trypanosoma brucei. Science
309:416–422.
Bowler C, Allen AE, Badger JH, et al. (80 co-authors). 2008. The
Phaeodactylum genome reveals the evolutionary history of diatom
genomes. Nature 456:239–244.
C. elegans Sequencing Consortium. 1998. Genome sequence of the
nematode C. elegans: a platform for investigating biology. Science
282:2012–2018.
Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS. 2006. OrthoMCL-DB:
querying a comprehensive multi-species collection of ortholog
groups. Nucleic Acids Res. 34:D363–D368.
Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the
human genome. Nature 437:69–87.
Ciechanover A, Schwartz AL. 2004. The ubiquitin system: pathogenesis
of human diseases and drug targeting. Biochim Biophys Acta 1695:
3–17.
Cohen P. 2002. Protein kinases—the major drug targets of the
twenty-first century? Nat Rev Drug Discov. 1:309–315.
Cohen P, Tcherpakov M. 2010. Will the ubiquitin system furnish as
many drug targets as protein kinases? Cell 143:686–693.
Dehal P, Satou Y, Campbell RK, et al. (88 co-authors). 2002. The draft
genome of Ciona intestinalis: insights into chordate and vertebrate
origins. Science 298:2157–2167.
Derelle E, Ferraz C, Rombauts S, et al. (27 co-authors). 2006. Genome
analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc Natl Acad Sci U S A. 103:
11647–11652.
Dikic I, Wakatsuki S, Walters KJ. 2009. Ubiquitin-binding domains—from
structures to functions. Nat Rev Mol Cell Biol. 10:659–671.
Dujon B, Sherman D, Fischer G, et al. (68 co-authors). 2004. Genome
evolution in yeasts. Nature 430:35–44.
Eichinger L, Pachebat JA, Glockner G, et al. (98 co-authors). 2005. The
genome of the social amoeba Dictyostelium discoideum. Nature 435:
43–57.
El-Sayed NM, Myler PJ, Bartholomeu DC, et al. (83 co-authors). 2005.
The genome sequence of Trypanosoma cruzi, etiologic agent of
Chagas disease. Science 309:409–415.
Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE. 2012. The Mouse
Genome Database (MGD): comprehensive resource for genetics and
genomics of the laboratory mouse. Nucleic Acids Res. 40:D881–D886.
Flicek P, Amode MR, Barrell D, et al. (57 co-authors). 2012. Ensembl
2012. Nucleic Acids Res. 40:D84–D90.
Fulda S, Rajalingam K, Dikic I. 2012. Ubiquitylation in immune disorders
and cancer: from molecular mechanisms to therapeutic implications. EMBO Mol Med. 4:545–556.
Galagan JE, Calvo SE, Borkovich KA, et al. (77 co-authors). 2003. The
genome sequence of the filamentous fungus Neurospora crassa.
Nature 422:859–868.
Gao T, Liu Z, Wang Y, Cheng H, Yang Q, Guo A, Ren J, Xue Y. 2013.
UUCD: a family-based database of ubiquitin and ubiquitin-like conjugation. Nucleic Acids Res. 41:D445–D451.
Garrenton LS, Braunwarth A, Irniger S, Hurt E, Künzler M, Thorner J.
2009. Nucleus-specific and cell cycle-regulated degradation of mitogen-activated protein kinase scaffold protein Ste5 contributes to the
control of signaling competence. Mol Cell Biol. 29:582–601.
Gene Ontology Consortium. 2010. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res. 38:D331–D335.
Gibbs RA, Rogers J, Katze MG, et al. (176 co-authors). 2007. Evolutionary
and biomedical insights from the rhesus macaque genome. Science
316:222–234.
Gibbs RA, Weinstock GM, Metzker ML, et al. (228 co-authors). 2004.
Genome sequence of the Brown Norway rat yields insights into
mammalian evolution. Nature 428:493–521.
Goffeau A, Barrell BG, Bussey H, et al. (16 co-authors). 1996. Life with
6000 genes. Science 274:546, 563–567.
Grabbe C, Husnjak K, Dikic I. 2011. The spatial and temporal organization of ubiquitin networks. Nat Rev Mol Cell Biol. 12:295–307.
Hellsten U, Harland RM, Gilchrist MJ, et al. (48 co-authors). 2010. The
genome of the Western clawed frog Xenopus tropicalis. Science 328:
633–636.
Holt RA, Subramanian GM, Halpern A, et al. (124 co-authors). 2002. The
genome sequence of the malaria mosquito Anopheles gambiae.
Science 298:129–149.
International Chicken Genome Sequencing Consortium. 2004. Sequence
and comparative analysis of the chicken genome provide unique
perspectives on vertebrate evolution. Nature 432:695–716.
International Rice Genome Sequencing Project. 2005. The map-based
sequence of the rice genome. Nature 436:793–800.
Ivens AC, Peacock CS, Worthey EA, et al. (101 co-authors). 2005. The
genome of the kinetoplastid parasite, Leishmania major. Science 309:
436–442.
Jin X, Jin HR, Jung HS, Lee SJ, Lee JH, Lee JJ. 2010. An atypical E3 ligase zinc
finger protein 91 stabilizes and activates NF-kappaB-inducing kinase
via Lys63-linked ubiquitination. J Biol Chem. 285:30539–30547.
Kamper J, Kahmann R, Bolker M, et al. (80 co-authors). 2006. Insights
from the genome of the biotrophic fungal plant pathogen Ustilago
maydis. Nature 444:97–101.
Katinka MD, Duprat S, Cornillot E, et al. (17 co-authors). 2001. Genome
sequence and gene compaction of the eukaryote parasite
Encephalitozoon cuniculi. Nature 414:450–453.
Keeling PJ, Burger G, Durnford DG, Lang BF, Lee RW, Pearlman RE, Roger
AJ, Gray MW. 2005. The tree of eukaryotes. Trends Ecol Evol. 20:
670–676.
King N, Westbrook MJ, Young SL, et al. (36 co-authors). 2008. The
genome of the choanoflagellate Monosiga brevicollis and the
origin of metazoans. Nature 451:783–788.
Kirkin V, Dikic I. 2011. Ubiquitin networks in cancer. Curr Opin Genet
Dev. 21:21–28.
Lander ES, Linton LM, Birren B, et al. (255 co-authors). 2001.
Initial sequencing and analysis of the human genome. Nature 409:
860–921.
Lee WC, Lee M, Jung JW, Kim KP, Kim D. 2008. SCUD: Saccharomyces
cerevisiae ubiquitination database. BMC Genomics 9:440.
Letunic I, Doerks T, Bork P. 2012. SMART 7: recent updates to the
protein domain annotation resource. Nucleic Acids Res. 40:
D302–D305.
Li W, Bengtson MH, Ulbrich A, Matsuda A, Reddy VA, Orth A, Chanda
SK, Batalov S, Joazeiro CA. 2008. Genome-wide and functional annotation of human E3 ubiquitin ligases identifies MULAN, a mitochondrial E3 that regulates the organelle’s dynamics and signaling. PLoS
One 3:e1487.
Locke DP, Hillier LW, Warren WC, et al. (102 co-authors). 2011.
Comparative and demographic analysis of orang-utan genomes.
Nature 469:529–533.
Loftus BJ, Fung E, Roncaglia P, et al. (54 co-authors). 2005. The genome
of the basidiomycetous yeast and human pathogen Cryptococcus
neoformans. Science 307:1321–1324.
Martin DM, Miranda-Saavedra D, Barton GJ. 2009. Kinomer v. 1.0: a
database of systematically classified eukaryotic protein kinases.
Nucleic Acids Res. 37:D244–D250.
Martinez D, Larrondo LF, Putnam N, et al. (15 co-authors). 2004.
Genome sequence of the lignocellulose degrading fungus
Phanerochaete chrysosporium strain RP78. Nat Biotechnol. 22:
695–700.
Matsuzaki M, Misumi O, Shin IT, et al. (41 co-authors). 2004. Genome
sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D. Nature 428:653–657.
Merchant SS, Prochnik SE, Vallon O, et al. (117 co-authors). 2007. The
Chlamydomonas genome reveals the evolution of key animal and
plant functions. Science 318:245–250.
Mikkelsen TS, Wakefield MJ, Aken B, et al. (62 co-authors). 2007.
Genome of the marsupial Monodelphis domestica reveals innovation
in non-coding sequences. Nature 447:167–177.
Milne I, Lindner D, Bayer M, Husmeier D, McGuire G, Marshall DF,
Wright F. 2009. TOPALi v2: a rich graphical interface for evolutionary
analyses of multiple alignments on HPC clusters and multi-core
desktops. Bioinformatics 25:126–127.
Miranda-Saavedra D, Barton GJ. 2007. Classification and functional
annotation of eukaryotic protein kinases. Proteins 68:893–914.
Miranda-Saavedra D, Stark MJ, Packer JC, Vivares CP, Doerig C, Barton
GJ. 2007. The complement of protein kinases of the
microsporidium Encephalitozoon cuniculi in relation to those of
Saccharomyces cerevisiae and Schizosaccharomyces pombe. BMC
Genomics 8:309.
Nagy V, Dikic I. 2010. Ubiquitin ligase complexes: from substrate selectivity to conjugational specificity. Biol Chem. 391:163–169.
Nijman SM, Luna-Vargas MP, Velds A, Brummelkamp TR, Dirac AM,
Sixma TK, Bernards R. 2005. A genomic and functional inventory of
deubiquitinating enzymes. Cell 123:773–786.
Palenik B, Grimwood J, Aerts A, et al. (39 co-authors). 2007. The tiny
eukaryote Ostreococcus provides genomic insights into the
paradox of plankton speciation. Proc Natl Acad Sci U S A. 104:
7705–7710.
Peacock CS, Seeger K, Harris D, et al. (42 co-authors). 2007. Comparative
genomic analysis of three Leishmania species that cause diverse
human disease. Nat Genet. 39:839–847.
Petroski MD. 2008. The ubiquitin system, disease, and drug discovery.
BMC Biochem. 9(1 Suppl):S7.
Platta HW, El Magraoui F, Baumer BE, Schlee D, Girzalsky W, Erdmann R.
2009. Pex2 and pex12 function as protein-ubiquitin ligases in peroxisomal protein import. Mol Cell Biol. 29:5505–5516.
Prochnik SE, Umen J, Nedelcu AM, et al. (28 co-authors). 2010. Genomic
analysis of organismal complexity in the multicellular green alga
Volvox carteri. Science 329:223–226.
Pruitt KD, Tatusova T, Klimke W, Maglott DR. 2009. NCBI Reference
Sequences: current status, policy and new initiatives. Nucleic Acids
Res. 37:D32–D36.
Punta M, Coggill PC, Eberhardt RY, et al. (16 co-authors). 2012. The
Pfam protein families database. Nucleic Acids Res. 40:D290–D301.
Putnam NH, Butts T, Ferrier DE, et al. (37 co-authors). 2008. The amphioxus genome and the evolution of the chordate karyotype.
Nature 453:1064–1071.
Putnam NH, Srivastava M, Hellsten U, et al. (19 co-authors). 2007. Sea
anemone genome reveals ancestral eumetazoan gene repertoire and
genomic organization. Science 317:86–94.
Randow F, Lehner PJ. 2009. Viral avoidance and exploitation of the
ubiquitin system. Nat Cell Biol. 11:527–534.
Rensing SA, Lang D, Zimmer AD, et al. (71 co-authors). 2008. The
Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319:64–69.
Searls DB. 2003. Pharmacophylogenomics: genes, evolution and drug
targets. Nat Rev Drug Discov. 2:613–623.
Sodergren E, Weinstock GM, Davidson EH, et al. (226 co-authors). 2006.
The genome of the sea urchin Strongylocentrotus purpuratus. Science
314:941–952.
Srivastava M, Begovic E, Chapman J, et al. (21 co-authors). 2008. The
Trichoplax genome and the nature of placozoans. Nature 454:
955–960.
Takagi Y, Masuda CA, Chang WH, Komori H, Wang D, Hunter T, Joazeiro
CA, Kornberg RD. 2005. Ubiquitin ligase activity of TFIIH and the
transcriptional response to DNA damage. Mol Cell. 18:237–243.
Tuskan GA, Difazio S, Jansson S, et al. (111 co-authors). 2006. The
genome of black cottonwood, Populus trichocarpa (Torr. & Gray).
Science 313:1596–1604.
Tyler BM, Tripathy S, Zhang X, et al. (53 co-authors). 2006. Phytophthora
genome sequences uncover evolutionary origins and mechanisms of
pathogenesis. Science 313:1261–1266.
UniProt Consortium. 2010. The Universal Protein Resource (UniProt) in
2010. Nucleic Acids Res. 38:D142–D148.
van Wijk SJ, Timmers HT. 2010. The family of ubiquitin-conjugating
enzymes (E2s): deciding between life and death of proteins. FASEB
J. 24:981–993.
Vertegaal AC. 2011. Uncovering ubiquitin and ubiquitin-like signaling
networks. Chem Rev. 111:7923–7940.
Warren WC, Hillier LW, Marshall Graves JA, et al. (105 co-authors). 2008.
Genome analysis of the platypus reveals unique signatures of evolution. Nature 453:175–183.
Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. 2009.
Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189–1191.
Waterston RH, Lindblad-Toh K, Birney E, et al. (222 co-authors). 2002.
Initial sequencing and comparative analysis of the mouse genome.
Nature 420:520–562.
Wei F, Zhang J, Zhou S, et al. (20 co-authors). 2009. The physical and
genetic framework of the maize B73 genome. PLoS Genet. 5:
e1000715.
Wenzel DM, Klevit RE. 2012. Following Ariadne’s thread: a new perspective on RBR ubiquitin ligases. BMC Biol. 10:24.
Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M, Chothia C,
Gough J. 2009. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res.
37:D380–D386.
Woelk T, Sigismund S, Penengo L, Polo S. 2007. The ubiquitination code:
a signalling problem. Cell Div. 2:11.
Wood V, Gwilliam R, Rajandream MA, et al. (134 co-authors). 2002. The
genome sequence of Schizosaccharomyces pombe. Nature 415:
871–880.
Xie Y, Kerscher O, Kroetz MB, McConchie HF, Sung P, Hochstrasser
M. 2007. The yeast Hex3·Slx8 heterodimer is a ubiquitin ligase
stimulated by substrate sumoylation. J Biol Chem. 282:
34176–34184.
Zargari A, Boban M, Heessen S, Andréasson C, Thyberg J, Ljungdahl PO.
2007. Inner nuclear membrane proteins Asi1, Asi2, and Asi3 function in concert to maintain the latent properties of transcription
factors Stp1 and Stp2. J Biol Chem. 282:594–605.
Zdobnov EM, Apweiler R. 2001. InterProScan—an integration platform
for the signature-recognition methods in InterPro. Bioinformatics 17:
847–848.