Human Molecular Genetics, 2006, Vol. 15, Review Issue 1
doi:10.1093/hmg/ddl048
R31–R43
From genome to proteome: developing expression
clone resources for the human genome
Gary Temple1,*, Philippe Lamesch3,4, Stuart Milstein3, David E. Hill3, Lukas Wagner2,
Troy Moore5 and Marc Vidal4
1Mammalian Gene Collection, National Human Genome Research Institute and 2National Center for Biotechnology
Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20892, USA, 3Center for Cancer
Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute and Department of
Genetics, Harvard Medical School, Boston, MA 02115, USA, 4Unité de Recherche en Biologie Moléculaire, Facultés
Universitaires Notre-Dame de la Paix, 5000 Namur, Belgium and 5Open Biosystems, Huntsville, AL 35806, USA
Received February 20, 2006; Revised and Accepted March 2, 2006
cDNA clones have long been valuable reagents for studying the structure and function of proteins. With
recent access to the entire human genome sequence, it has become possible and highly productive to
compare the sequences of mRNAs to their genes, in order to validate the sequences and protein-coding
annotations of each (1,2). Thus, well-characterized collections of human cDNAs are now playing an essential
role in defining the structure and function of human genes and proteins. In this review, we will summarize
the major collections of human cDNA clones, discuss some limitations common to most of these collections
and describe several noteworthy proteomics applications, focusing on the detection and analysis of
protein–protein interactions (PPI). These human cDNA collections contain principally two types of cDNA
clones. The largest collections comprise cDNAs with full-length protein coding sequences (FL-CDS).
Some but not all of these cDNA clones may represent the entire mRNA sequence, but many are missing
considerable non-coding UTR sequence, usually at the 5′ end. A second type of cDNA clone, a ‘full-ORF’
(F-ORF) expression clone, is one where the annotated protein-coding sequence, excised of 5′ UTR and
3′ UTR sequence, has been transferred to a vector designed to facilitate transfer to other vectors for protein
expression.
MAJOR HUMAN FL-CDS CLONING PROGRAMS
During the past decade, several large-scale government and
academic programs have collected and characterized human
FL-CDS cDNA clones. The major programs include the
NEDO (FLJ) Project, the Kazusa cDNA Project, the Mammalian Gene Collection (MGC), the German Human cDNA
Project, the Harvard Institute of Proteomics (HIP) and the
ORFeome program of the Center for Cancer Systems
Biology (CCSB) at the Dana-Farber Cancer Institute. These
programs distribute their sequence information through
GenBank, EMBL/EBI or the DNA Database of Japan
(DDBJ), all of which exchange data daily through the
International Nucleotide Sequence Database.
The following descriptions address only human cDNA
clones, although most of these programs offer cDNA clones
for additional organisms, as well. The utility of cDNA
collections for proteomics studies has been described in
several recent reviews (3–7). Further details on each
program are provided in Table 1.
The human full-length sequencing project of Japan NEDO (FLJ)
The largest human FL-CDS cDNA collection is the NEDO
(FLJ) project (8,9), a joint program of the Institute of
Medical Sciences of the University of Tokyo, the Helix
Research Institute and the Kazusa DNA Research Institute
(KDRI) (http://www.nedo.go.jp/bio-e/) (10–12). The NEDO
(FLJ) collection (http://fldb.hri.co.jp/cgi-bin/cDNA3/public/publication/index.cgi)
contains 30 000 human FL-CDS
clones representing 18 500 loci obtained from 130
libraries (10) (S. Sugano, personal communication). Most
of the clones in this collection were obtained using the
‘oligo-capping’ method that enriches for cDNAs extending
to the 5′ end of the mRNA (13–15). Recently, the 18 500
non-redundant FL-CDS clones have been configured as
F-ORF Gateway expression clones (N. Goshima, manuscript
in preparation; N. Nomura, personal communication). Clones
may be requested from NITE-BRC (see Table 1 for URL)
and require a signed material transfer agreement (MTA).
*To whom correspondence should be addressed. Tel: +1 3015945951; Fax: +1 3014802770; Email: [email protected]
Published by Oxford University Press 2006.
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of
this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the
original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a
derivative work this must be clearly indicated. For commercial re-use, please contact: [email protected]
Table 1. Government and academic programs to build collections of human FL-CDS clones and F-ORF expression clones
| Source | No. of genes [a] | Sequence validation | Stop codon | UTR sequences | Type of clone | Vector system | Clones distributed (MTA req'd) | Miscellaneous (References) |
|---|---|---|---|---|---|---|---|---|
| NEDO (FLJ) | 18 500 | Full-length | + | + | FL-CDS | pME18S-FL [b] | NITE-BRC (Yes [c]) | >30 000 clones from oligo-capping cDNA libraries (S. Sugano, personal communication; N. Nomura, personal communication; 14,15) [e] |
| NEDO (FLJ) | 18 500 | Full-length | ± [g] | − | F-ORF | Gateway™, pENTR201 | TBD | (N. Goshima, in preparation; N. Nomura, personal communication) |
| MGC | >13 000 | Full-length | + | + | FL-CDS | Various types | IMAGE (No) | >22 000 clones (78) [f] |
| Kazusa | >5 500 | Full-length | + | + | FL-CDS | Various types | KDRI [c] (Yes) | ORFs >4 kb |
| Kazusa | >1 000 | Full-length | ± | − | F-ORF | Gateway™, Flexi® [d] | KDRI [c,d] (Yes) | (79) |
| DKFZ | >10 000 | Full-length | + | + | Many as FL-CDS | | RZPD (No) | (20,22,80,81) |
| DKFZ | 1 200 | Full-length | ± | − | F-ORF | Gateway™, pENTR221 | RZPD (No) | |
| CCSB | >10 000 | End sequences | − | − | F-ORF | Gateway™, pENTR223 | Open Biosystems (No) | (23,57b) |
| HIP | 4 000 | Full-length | Most ± | − | F-ORF | Gateway™, pENTR201, pENTR221 | HIP (No) | (5) |
| HIP | 3 000 | Full-length | Most ± | − | F-ORF | Creator™, pDNR-dual | HIP (No) | |
[a] Most of these collections also include mRNA isoforms for a small percentage of these genes.
[b] The pME18S-FL vector is a mammalian expression vector, containing modified SV40 promoter, SV40 small t splice donor and acceptor 5′ to the
cDNA insert; proteins expressed from this vector contain short N-terminal fusion peptide from SV40 (13).
[c] Clones available without license only to academic researchers for research use (requires MTA).
[d] A matching set of ORF expression clones in the Flexi® Vector system is under construction.
[e] Clones originating from the University of Tokyo and the Helix Research Institute are distributed through the NITE Biological Resource Center
(http://www.nbrc.nite.go.jp/e/hflcdna-e.html), and clones originating from KDRI are distributed through KDRI. Kazusa clones distributed by
KDRI: (http://www.kazusa.or.jp/NEDO/clone.req/index.html).
[f] MGC clones are distributed through the IMAGE Consortium distributors (see Table 2).
[g] ± denotes expression ORFs are available with and without stop codons.
The KDRI HUGE collection of long FL-CDS clones
This collection contains FL-CDS cDNAs ranging in size from 3.3
to over 10 kb, representing more than 2000 novel, non-redundant
genes (11,12). A major focus of the HUGE project
(http://www.kazusa.or.jp/huge/index.html) is to characterize the
function of proteins >50 kD. A significant part of the collection has been manually curated with additional information on
possible protein function (11). Nearly 1000 of these clones are
also available as F-ORF Gateway™ expression clones, and
the same set of F-ORF clones is being constructed in the
Flexi® Vector system (O. Ohara, personal communication).
This collection has recently been expanded to include an
additional 3000 large cDNA clones, giving a total of
5500 clones. All clones are fully sequenced, and are available from the KDRI through a signed MTA.
Mammalian gene collection (MGC)
The MGC (16,17) is an FL-CDS cDNA cloning and sequencing program sponsored and managed by the US National
Institutes of Health (NIH) (http://mgc.nci.nih.gov/). Its goal
for human clones is to achieve, by 2007, at least one FL-CDS
cDNA clone for each of the well-defined 18 368 human
RefSeq loci, discussed subsequently (18) (L. Wagner, personal
communication) (Note: See Supplementary Material for a list
of these genes.) The first 12 000 non-redundant human
clones were obtained by MGC from >180 cDNA libraries
prepared from a wide variety of tissues. Most new human
clones deposited over the past 2 years have been obtained
by a directed RT-PCR cloning strategy. MGC full-length
cDNAs are sequenced to a high standard of quality, and all
sequences are compared with the reference genome to identify
mismatches, with differences annotated in the GenBank
entries (1,16). Clones are distributed, without restriction,
through the IMAGE Consortium.
The German human genome project
Since 1997, a consortium of German scientific institutes has
cloned and characterized cDNAs for transcripts then missing
from other collections, and provided functional information
on these transcripts (19,20). Descriptions of more than
10 000 cDNA clones obtained from this program (19,21,22)
are provided at http://mips.gsf.de/proj/cdna/Sites/HU_cDNA_Database.htm. The cDNAs are fully sequenced to a high
quality (less than one error per 10 000 nt), and many contain
the full protein-coding sequence. A set of 1200 F-ORF
Gateway™ expression clones has recently been constructed
and will be released in the near future (S. Wiemann, personal
communication). Clones are available, without restriction,
through the consortium’s commercial distributor, RZPD.
Dana-Farber Cancer Institute Center for Cancer
Systems Biology (CCSB)
This program (23,57b) is building a steadily expanding collection of human F-ORF expression clones (http://horfdb.dfci.harvard.edu/).
Its goal is to generate a complete set of
F-ORF ORFeome clones representing all protein-coding
sequences in the human genome. The full coding sequence
is PCR-amplified from MGC human FL-CDS clones and
transferred into Gateway™ Entry vectors (23). The absence
of a stop codon in these clones allows users to create both
C- and N-terminal fusion proteins in appropriate expression
vectors. This collection presently comprises non-redundant
F-ORF expression clones for 10 000 human genes. Clones
are all end-sequenced and distributed without restriction
through Open Biosystems.
Harvard Institute of Proteomics (HIP)
HIP offers an ever-growing collection of human F-ORF
expression clones in two different expression vector formats
(5,6): Gateway™ (4000 F-ORFs, most with and without
stop codons) and Creator™ (5500 F-ORFs, representing
3000 genes, most with and without stop codons). All are
fully sequenced. These clones are distributed without
restriction through HIP: http://www.hip.harvard.edu/.
HIP also provides a distribution service to research laboratories, called the Shared Plasmid Resource, whereby HIP
distributes clones donated to HIP by any outside research
laboratory. Shared Plasmid Resource clones are available
without restriction to non-commercial requestors, and to
commercial investigators with the permission of the donating
laboratory.
Commercial sources of cDNA clones. Sizeable collections of
human FL-CDS and F-ORF expression clones are also
available from commercial vendors, the largest of which are
listed in Table 2, together with some properties of their
human clone collections.
Useful databases of human cDNA clone sequences and related
information.
(a) The H-Invitational Database (H-Inv): This database is the
product of two international H-Inv workshops, held in
2002 and 2004 (10), which convened a diverse group of
bioscientists to annotate and manually curate 41 118
FL-CDS, representing up to 21 037 loci, derived from the
largest published collections of human cDNA sequences.
The H-Inv database browser (http://www.h-invitational.jp/)
describes where each cDNA maps to the genome,
together with extensive functional information.
(b) UCSC Genome Browser Database: The UCSC Browser
(http://genome.ucsc.edu/) displays extensive sequence
and annotation information on the sequence of human
and 11 other vertebrate genomes, and for several model
organisms. The browser supports rapid visualization and
querying of genes, gene predictions, mRNAs, ESTs,
expression and variation data.
(c) The Vertebrate Genome Annotation (Vega) database
(http://vega.sanger.ac.uk) provides a resource for browsing manually annotated finished sequences of human,
mouse, zebrafish and dog genomes. The Vega browser
includes detailed genome maps of genes, transcripts, proteins and protein domains.
LIMITATIONS TO CURRENT cDNA COLLECTIONS
Errors in mRNA sequence and annotation
Reliable gene counting, discussed below, relies primarily on
mapping cDNA sequences to the human genome, which in
turn often reveals variations between the cDNA and genome
sequence (1,2). Variation in cDNA sequences can arise
either from experimental artifacts introduced into the cDNA
sequence or from naturally occurring sequence variation. Experimental artifacts arise during cDNA synthesis by reverse transcriptase, during the ligation of the cDNA to the vector and
during subsequent cloning steps, as well as from errors in
the DNA sequence analysis of the cDNA insert. When
RT-PCR is used to clone cDNAs, errors can arise during the
DNA amplification process, as well as from the synthetic
DNA primers used for PCR.
An accepted standard for mRNA sequences for human,
mouse, and more than 2400 other species is the RefSeq
program at NCBI (24). RefSeq provides a carefully curated,
non-redundant set of full-length mRNA sequences (including
5′ and 3′ UTR), based on genomic, mRNA and protein sequence
evidence. A second set of high-quality, full-length human,
mouse and rat mRNA sequences is available through the
MGC program at the NIH. Launched in early 2000, the MGC
exploited recent advances in gene sequencing technology to set
high cDNA sequencing standards (<1 error per 50 000 nt). The
MGC sequences also are thoroughly curated to identify and
eliminate clones with frameshifts or chimeras, and to annotate
non-synonymous variation in the cDNA sequences that could
be due to experimental artifact (1,16).
Table 2. Commercial sources of full-length human ORF clones
| Commercial provider | Clone collection name | No. of unique genes represented [b] | Sequence validation | Stop codon | UTR sequences present | Vector system | Miscellaneous (References) |
|---|---|---|---|---|---|---|---|
| IMAGE distributors [a] | MGC | >13 000 | Full-length | + | + | Mixed types | (78) |
| Genecopoeia | ORF express | 13 500 | >70% Full-length | ± | − | Gateway™ entry | Genecopoeia cDNA templates |
| Genecopoeia | OmicsLink | 13 500 | ORF transferred from ORF express [c] | ± | − | OmicsLink expression vectors (38 types) | Ready-to-use expression clones |
| Invitrogen | Ultimate™ ORF collection | >12 000 | Full-length | + | − | Gateway™ entry | MGC cDNA templates |
| | Freedom™ ORF collection | 1 100 | Full-length | ± | − | Creator™, pDNR-dual | cDNA templates (82) |
| Open Biosystems | Incyte gene collection | >10 000 | Full-length | + | + | Mixed types | (82) |
| Open Biosystems | ORFeome collection | >10 000 | Ends only | − | − | Gateway™ entry | MGC cDNA templates |
| Origene | TrueClone™ collection | >17 000 | 75% Ends, 25% F-L | + | + | pCMV6-XL4 | Isolated from cDNA libraries (83) |
| | FlexClones | >150 | Full-length | ± | − | Flexi® Vector | |
| RZPD | Full ORF shuttle | 3 700 | Full-length | Most ± | − | Gateway™ entry | |
| RZPD | Full-ORF expression | 550 | Full-length | ± | − | Gateway™ expression (15 types) | Ready-to-use attB expression clones |
[a] IMAGE distributors: ATCC www.atcc.org; GeneService www.geneservice.co.uk; Invitrogen www.invitrogen.com; Open Biosystems www.openbiosystems.com; RZPD www.rzpd.de.
[b] Most of these collections include a small percentage of other mRNA isoforms.
[c] Transfer uses non-PCR method (RecJoin™) with low likelihood of introducing mutations.
Taken together, the RefSeq and MGC sequences provide the
most thoroughly curated collection of human cDNAs. Nevertheless, some errors undoubtedly remain in a fraction of these
sequences. Furey et al. (1) recently aligned the combined
mRNA sequences of MGC and RefSeq to the reference
human genome sequence (July 2003 release); they found that
EST and other mRNA sequences support natural variation
in the genome sequence about four times more often than in
the mRNA sequence, implying that the genome sequence is
considerably more accurate than the mRNA sequences in
these collections (1). After excluding known and probable
polymorphisms the authors estimated about one difference per
2500 nt, representing sequencing errors or other experimental
artifacts. A second study of MGC sequences (16) estimated
that non-synonymous sequence changes (resulting in altered
amino acids) because of experimental artifacts may populate
about 10% of non-redundant human MGC clones.
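To give a feel for what a rate of about one difference per 2500 nt means for an individual clone, the following minimal sketch converts a per-nucleotide rate into an expected count per cDNA. It assumes that artifacts occur independently and uniformly along the sequence (a Poisson model), which is our simplifying assumption and not part of the cited analyses; the function names are illustrative only. It does not attempt to reproduce the separate 10% estimate for non-synonymous changes.

```python
import math


def expected_artifacts(length_nt: int, nt_per_difference: float = 2500.0) -> float:
    """Expected number of artifact differences in a cDNA of the given length.

    The default of one difference per 2500 nt is the estimate quoted in the
    text; treating it as a uniform per-nucleotide rate is an assumption.
    """
    return length_nt / nt_per_difference


def prob_at_least_one(length_nt: int, nt_per_difference: float = 2500.0) -> float:
    """Probability that a clone carries at least one artifact difference,
    under the (assumed) Poisson model of independent events."""
    return 1.0 - math.exp(-expected_artifacts(length_nt, nt_per_difference))


if __name__ == "__main__":
    for length in (1000, 2500, 5000):
        print(length, round(expected_artifacts(length), 2),
              round(prob_at_least_one(length), 3))
```

Under this model, longer inserts are proportionally more likely to carry at least one artifact, which is one reason full-length sequence validation matters most for large cDNAs.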
Another kind of error can arise in the sequence record from
an incorrect annotation of the protein-coding sequence (ORF)
within an mRNA. Identifying whether a start codon in a
particular transcript is or is not the initiating codon of a
protein-coding transcript is difficult when transcripts lack an
upstream stop codon within the 5′ UTR in frame with the
initiating ATG, coupled with an absence of definitive protein
support for the N-terminal portion of the predicted protein.
This is a common circumstance, as 39% of human RefSeq
mRNAs (24) have no in-frame upstream stop codon
(L. Wagner, personal communication). Choosing the starting
ATG in these situations often must rely on gene prediction
programs, such as GenScan and NSCAN (2,25), and on
algorithms that look for evidence of a transition from noncoding to coding sequences around the ATG (17).
Likewise when several ATG codons are in phase with a
predicted ORF, the identification of the predominant initiating
ATG may be ambiguous. The ribosome scanning model
(26,27) predicts that the ATG furthest 5′ on the mRNA is preferred by ribosomes as the initiating ATG, unless it is hidden
by RNA secondary structure or is competing with a nearby
ATG surrounded by a more favorable Kozak consensus
sequence (26,27). In some eukaryotic mRNAs, however,
protein synthesis appears to initiate at different levels at
two or more ATG codons and, rarely, at non-ATG codons
(28–30). Unconventional start codons such as these are
likely to be missed during the annotation process.
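The upstream in-frame stop codon criterion described above can be illustrated with a short sketch. It only checks whether any stop codon lies upstream of, and in frame with, a candidate ATG; it is a simplified illustration, not the procedure used by RefSeq, MGC or any gene prediction program, and the function name and coordinate convention (0-based index of the A in the ATG) are our own assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(mrna: str, atg_index: int) -> bool:
    """Return True if a stop codon occurs upstream of the candidate ATG
    and in the same reading frame.

    mrna      -- transcript sequence (DNA or RNA alphabet, any case)
    atg_index -- 0-based position of the 'A' of the candidate start codon
    """
    seq = mrna.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("atg_index does not point at an ATG codon")
    # Step backwards through the 5' UTR one codon at a time, staying in frame.
    pos = atg_index - 3
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return True
        pos -= 3
    return False


if __name__ == "__main__":
    print(has_upstream_inframe_stop("TAAGCCATGAAA", 6))  # stop is in frame
    print(has_upstream_inframe_stop("TAAGCATGAAA", 5))   # stop is out of frame
```

A transcript for which this check returns False corresponds to the ambiguous case discussed in the text, where the 5′ end of the cDNA may simply be incomplete and the true start codon may lie further upstream.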
The gold standard for annotated human protein-coding
sequences is the Consensus CDS (CCDS) set of 13 142
genes (31) agreed to at every coding nucleotide by the
CCDS group, comprising the NIH, National Center for
Biotechnology Information (NCBI), European Bioinformatics
Institute (EBI), Wellcome Trust Sanger Institute (WTSI) and
University of California, Santa Cruz (UCSC).
Incomplete sequence representation in cDNA libraries
An optimally useful cDNA collection should contain at
least one representative transcript for each protein-coding
human gene. The total number of human genes, however,
is still uncertain. The essentially completed human
euchromatic genome sequence, published in 2004, suggested
that the human genome encodes 20 000–25 000 genes (32).
Recent Ensembl and NCBI gene models predict
22 000 human genes (L. Wagner, personal communication)
[Ensembl Human (2005), http://www.ensembl.org/Homo_sapiens/index.html], similar in number to recent
estimates for gene numbers in mouse (33) and rat (16,34). A
more conservative and more recent human gene estimate,
based on orthologous genes of mouse and dog (35), proposes
19 600 human genes (M. Clamp, manuscript in preparation).
These sets of annotated protein-coding genes include some
genes whose coding sequence is represented partly or entirely
by gene predictions, rather than by the sequence of FL-CDS
cDNAs.
Based on NCBI genome build 36.1, the number of human
protein-coding genes corresponding only to FL-CDS cDNAs
annotated on the genome (i.e. excluding gene models) totals
18 368 (as of March 25, 2006). (Note: these genes are
identified with the following Entrez query against the Gene
database at http://www.ncbi.nlm.nih.gov/:
homo_sapiens [ORGANISM] AND protein_coding NOT srcdb_refseq_model [PROPERTIES].)
This gene set includes all 13 142
CCDS genes. A complete listing of this set of genes is provided in the Supplementary Material. Currently, the MGC
contains one or more FL-CDS clones for 75% of the 18 368
MGC gene set and for 80% of the CCDS genes (L. Wagner,
personal communication).
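For readers who wish to run the Entrez query above programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch service against the Gene database. The query string is copied from the text as published in 2006; whether the current Entrez Gene index still accepts these exact field tags is not something we have verified, and the resulting count will in any case differ from the 18 368 reported here. The function name is illustrative, and no network request is made.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query exactly as given in the text (2006 syntax; may need updating today).
FL_CDS_GENE_QUERY = (
    "homo_sapiens [ORGANISM] AND protein_coding "
    "NOT srcdb_refseq_model [PROPERTIES]"
)


def build_gene_search_url(term: str = FL_CDS_GENE_QUERY, retmax: int = 0) -> str:
    """Build an ESearch URL for the Gene database.

    With retmax=0 the response carries the hit count but no gene IDs,
    which is enough for gene counting.
    """
    params = {"db": "gene", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_gene_search_url())
```

The URL can be fetched with any HTTP client; the XML response contains a Count element holding the number of matching Gene records.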
Until recently, MGC and most other large cDNA collections
have been built from clones isolated by random screening of
multiple cDNA libraries. RNA from a wide variety of
tissues is typically used to promote transcript diversity. Nevertheless, random screening approaches preferentially clone
cDNAs for the RNAs in highest abundance, which generally
derive from a relatively small number of genes. The vast
majority of genes, however, are represented at low abundance,
with about 1–10 copies of mRNA per cell (37–39). These
mRNA frequency distributions likely contribute to the commonly observed drop in yield as libraries are progressively
screened for new clones representing unique genes. To
improve the yield of clones for new genes, libraries can be
treated to normalize or equalize the abundance classes, and
libraries can be pre-subtracted of RNA sequences previously
isolated (40–42). An alternative approach that is less sensitive
to mRNA abundance is to clone individual RT-PCR products,
targeting specific mRNAs, using primers based on RefSeq
mRNA sequences (43,44).
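The drop in yield during random screening can be illustrated with a toy simulation. The sketch below picks clones at random from a library in which transcript abundance is highly skewed and tracks how many distinct genes have been recovered. The abundance model (weight proportional to 1/rank), the library size and the number of picks are arbitrary assumptions chosen for illustration; they are not measurements from any of the collections described here.

```python
import random


def simulate_random_screening(n_genes: int = 1000,
                              n_picks: int = 2000,
                              seed: int = 0) -> list:
    """Simulate random clone picking from a non-normalized cDNA library.

    Returns a list whose i-th entry is the number of distinct genes
    recovered after i + 1 picks. Abundance weights follow 1/rank, an
    assumed stand-in for the skewed mRNA abundance classes in a cell.
    """
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, n_genes + 1)]
    picks = rng.choices(range(n_genes), weights=weights, k=n_picks)
    seen = set()
    cumulative = []
    for gene in picks:
        seen.add(gene)
        cumulative.append(len(seen))
    return cumulative


if __name__ == "__main__":
    curve = simulate_random_screening()
    half = len(curve) // 2
    print("new genes, first half of picks: ", curve[half - 1])
    print("new genes, second half of picks:", curve[-1] - curve[half - 1])
```

In this toy model the second half of the picks recovers far fewer new genes than the first half, which is the behavior that library normalization, subtraction and directed RT-PCR cloning are intended to counteract.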
Other factors can also contribute to the under-representation
of genes and their transcripts in cDNA collections, including
the low abundance of certain mRNAs that are unique to one
or a few tissues, and therefore difficult to obtain in substantial
quantity; cDNAs that encode products toxic to the bacterial
cells used for cloning; cDNAs that contain inverted or direct
repeats that are unstable during the cloning and propagation
steps; and cDNAs greater than 4–6 kb in size, which are
less efficiently cloned.
Bioinformatic methods suggest that 35–74% of human genes
may utilize alternative splicing (45–48), and additional isoforms
can result from alternative transcript initiation sites (49) and
alternative poly-A addition sites (50,51). The total number of
physiologically relevant RNA isoforms is unknown, but specific
isoforms are known to play important roles in cells, such
as isoforms encoding proteins functioning in ion channels
of nerve, muscle and cardiac cells (46,52–54). The variety of
RNA isoforms in different human cells is almost certainly
under-represented in most current cDNA clone and sequence
collections. For example, less than 10% of the 22 400
FL-CDS clones present in MGC appear to represent splice
variants (L. Wagner, personal communication). Whatever
deficiency of isoforms exists in today’s FL-CDS collections
will need to be remedied for future proteomics studies that
attempt to sample the entire ORFeome.
Finally, single-exon genes and multi-exon genes encoding
small proteins also are generally under-represented in
current collections, in part by design. Although physiologically relevant proteins are encoded by single-exon genes
(55,56), to avoid artifactually short cDNAs, MGC and other
programs have purposely excluded transcripts encoding proteins of fewer than 100 amino acids, except where there is
strong protein evidence for their natural occurrence (16).
Practical limitations of FL-CDS clones
The FL-CDS clones available from some of the largest cDNA
collections (Table 1) are generally unsuitable for immediate
use in expression studies, often because they lack an appropriate promoter. Variable lengths of 5′ UTR sequence also can
potentially encode unwanted amino acids and complicate
the design of N-terminal fusion constructs where it is important to maintain the proper reading frame into the CDS. To
properly position the coding sequences next to a promoter
of choice or next to sequences encoding N-terminal fusion
proteins, such as reporter proteins or epitope tags, the 5′
UTR sequences must be removed. Likewise, to prepare
C-terminal fusion proteins, the natural stop codon must be
removed and variable lengths of 3′ UTR sequence preferably
excised.
The protein-coding sequence, with or without its stop
codon, can be excised from cDNAs by PCR or occasionally
by restriction digestion, provided suitable restriction sites are
available. Both methods are time-consuming and can introduce mutations into the subclone. To address this problem,
several groups have transferred the protein-coding sequences
from these FL-CDS clone collections into specialized
expression vector systems.
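The excision step described above amounts to trimming an FL-CDS sequence to its annotated coding region, with or without the stop codon. The following minimal sketch performs that trimming in silico and checks the reading frame. The function name and the coordinate convention (0-based start, end exclusive and including the stop codon) are assumptions made for this example, and the requirement for an ATG start is a simplification, since non-ATG initiation does occur in rare cases.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def excise_orf(cdna: str, cds_start: int, cds_end: int,
               keep_stop: bool = True) -> str:
    """Return the coding sequence of a cDNA, stripped of 5' and 3' UTR.

    cds_start -- 0-based index of the 'A' of the annotated start codon
    cds_end   -- index one past the last base of the stop codon
    keep_stop -- False yields a stop-less ORF suitable for C-terminal fusions
    """
    cds = cdna[cds_start:cds_end].upper()
    if len(cds) % 3 != 0:
        raise ValueError("annotated CDS length is not a multiple of three")
    if not cds.startswith("ATG"):
        raise ValueError("annotated CDS does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("annotated CDS does not end with a stop codon")
    return cds if keep_stop else cds[:-3]


if __name__ == "__main__":
    # Toy cDNA: 6 nt of 5' UTR, a 9 nt CDS (ATG GCT TAA), 4 nt of 3' UTR.
    cdna = "GGCACC" + "ATGGCTTAA" + "TTTT"
    print(excise_orf(cdna, 6, 15))                   # with stop codon
    print(excise_orf(cdna, 6, 15, keep_stop=False))  # without stop codon
```

In practice the same coordinates would guide primer design for PCR-based transfer; the in silico check simply catches annotation errors before any cloning is attempted.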
To conform to popular convention, the full-length protein-coding region (± stop codon), excised of UTR sequence,
will hereafter be referred to as an ‘ORF’. [Note: mRNAs
with multiple ATG codons near the 5′ end and in different
phases of reading frame can potentially encode more than
one open reading frame (ORF) of significant length. The annotated CDS of an mRNA is the ORF that is judged to be the
most likely protein coding sequence for that mRNA, based
on the bioinformatic criteria discussed earlier.] Clones
configured in this manner will be called ‘ORF clones’,
and ORF clones in expression vectors will be called
‘Expression ORFs’. The set of human ORF clones representing all protein-coding sequences is referred to as the
‘human ORFeome’.
Figure 1. ORF transfer from a donor vector to multiple acceptor vectors. The ORF sequence in one or more donor vectors can be transferred in a single experiment to multiple acceptor vectors, using a single reaction protocol. The ORF in the donor vector is flanked by recombination sites (Gateway, Creator, Magic) or
by rare-cutting restriction enzyme sites (Flexi Vector) that permit directional transfer and maintain the desired translational reading frame between the ORF and
sequences in the acceptor vectors encoding N-terminal or C-terminal protein tags. Adapted from Hartley et al. (57a), Walhout et al. (57b).
Some of the most commonly used systems for large-scale
cloning and expression of F-ORF clones are listed in
Table 3. As shown in Figure 1, these vector systems permit
the transfer of one or more ORFs from a ‘donor’ vector to
one or multiple different ‘acceptor’ vectors, potentially all
in a single experiment (57a,b). (In most cases, acceptor
vectors are expression vectors with sequences flanking the
ORF that promote its transfer from the donor vector.) Moreover, these transfers are performed using a single protocol
that positions each ORF into each new acceptor vector in a
configuration suitable for native or fusion protein expression,
using reactions that maintain the orientation and proper
reading frame of the coding sequence and that rarely introduce mutations. To prepare C-terminal fusion proteins, a
separate collection of F-ORF clones lacking a stop codon is
generally prepared.
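The donor-to-acceptor logic described above can be sketched as a simple planning step that pairs each ORF clone with every acceptor vector it is suited to. The only rule encoded is the one stated in the text: C-terminal fusions require an ORF lacking its stop codon. Allowing stop-less ORFs in native and N-terminal vectors assumes that the acceptor vector supplies a downstream stop codon; that assumption, along with all class, clone and vector names, is ours and purely illustrative.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class OrfClone:
    name: str
    has_stop: bool  # True if the native stop codon was retained


@dataclass(frozen=True)
class AcceptorVector:
    name: str
    tag: str  # "native", "N-terminal" or "C-terminal"


def compatible(orf: OrfClone, vector: AcceptorVector) -> bool:
    """A C-terminal tag can only be read through if the ORF has no stop codon."""
    if vector.tag == "C-terminal":
        return not orf.has_stop
    # Assumption: for stop-less ORFs the acceptor vector provides a stop codon.
    return True


def plan_transfers(orfs, vectors):
    """List every (ORF, vector) pair that can be set up in one experiment."""
    return [(o.name, v.name) for o, v in product(orfs, vectors)
            if compatible(o, v)]


if __name__ == "__main__":
    orfs = [OrfClone("ORF1_with_stop", True), OrfClone("ORF1_no_stop", False)]
    vectors = [AcceptorVector("vecN", "N-terminal"),
               AcceptorVector("vecC", "C-terminal")]
    for pair in plan_transfers(orfs, vectors):
        print(pair)
```

Because every transfer uses the same reaction protocol, a plan of this kind scales from a handful of ORFs to an entire collection simply by lengthening the two input lists.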
All of the systems listed in Table 3, except for the Univector
system, use sites flanking both ends of the ORF in a donor
vector that are recognized by rare-cutting enzymes (restriction
enzymes or recombinases), virtually eliminating inappropriate
cleavage within the cDNA and vector backbone. These systems
use positive (antibiotic) selection to obtain the desired product,
together with counter-selection to reduce background colonies
resulting from acceptor vector lacking an insert; together
these constraints lead to extremely low backgrounds of
unwanted constructs. Some of the features of these systems
are listed in Table 3. A large number of human cDNA
clones are available from commercial sources, in some cases
as ready-to-use F-ORF expression clones in a variety of
expression vectors (Tables 2 and 3).
Because these systems are suitable for the transfer of anywhere from a few ORFs to thousands of ORFs within a
short period of time—enabling researchers to create rapidly
large numbers of F-ORF expression clones—they are being
used increasingly for large-scale proteomics studies, as discussed below.
ORFeome Collaboration. Though several collections of ORF
expression clones are available (Tables 1 and 2), no single
public collection contains ORFs representing all 18 000
well-defined RefSeq human genes, and clones for many of
these genes are absent from all of the current collections. To
address this need, in 2005, MGC, WTSI, CCSB, HIP, DKFZ
and the RIKEN Yokohama Institute organized an effort,
named the ORFeome Collaboration, to share resources and
new human FL-CDS clones, with the aim of building a complete collection of F-ORF expression clones for all well-defined human genes, configured as Gateway Entry clones.
These fully sequenced F-ORF clones will be distributed
worldwide, without restriction, to academic, government and
commercial researchers.
RECENT APPLICATIONS OF HUMAN ORFeome
COLLECTIONS
ORFeome collections can be used at every scale of research
from experiments on single ORFs to studies of entire
ORFeomes. For example, ORFs can be used one at a time to
study protein localization or for structural experiments. ORF
collections also lend themselves to module-scale experiments,
where a particular pathway or biological function can be
examined in its entirety.
But the greatest value can be extracted from ORFeome
collections when the entire resource is used to carry out
large-scale experiments. Until recently, such studies would
have been impossible to carry out because of low numbers
of cloned ORFs, the lack of a central repository for those
ORFs, and because the ORFs that were available were not
in the same vector or were not expression ORFs, but rather
cloned cDNAs containing 5′ and/or 3′ UTR. Three notable
approaches that have begun to flourish with the availability
of ORFeome collections are structural genomics (58),
proteome-wide mapping of PPI primarily using the yeast
2-hybrid (Y2H) system (59a,b,60), and genome-scale cell-based assays including high-content screening using automated imaging analysis (61).
Table 3. Expression-convenient vector systems for ORF expression: design, strengths and weaknesses
| System | Source | ORF transfer by | Strengths | Weaknesses | References |
|---|---|---|---|---|---|
| Gateway™ | Invitrogen | λ-att recombination | High efficiency, ORF transfer bi-directional; largest number of ORF clones available in this format. Compatible with Multisite™ Gateway, allowing exchange of additional elements, besides ORF. The system has been adapted for large-scale insert transfer by bacterial mating. | Residual 23–25 bp attB sites; most Gateway clone collections lack rare-cutting restriction enzyme sites flanking ORF; Clonase enzyme costs; license required for commercial use. | (57a,b,84a,b) |
| Creator™ | Clontech-Takara | Cre-lox recombination | Low cost for enzyme and no licensing costs for commercial uses. | Residual 34 bp lox-P sites; modest collections of human ORF clones in this format. | (85) |
| MAGIC™ | Steve Elledge (Harvard) and Open Biosystems | E. coli homologous recombination | Flexible homology arms; transfer by mating avoids plasmid prep; low cost; no license required for commercial uses. | Recombination within vector regions that are homologous; no standardized format for recombination sites and no sizeable collections of human clones in this format. Requires flanking recombination sites of 50 bp. | (86) |
| Flexi® | Promega | Rare RE + ligation (SgfI/PmeI) | RE sites add only three residual amino acids; low cost; donor libraries can be created in the expression vector; ORF transfer bi-directional for native and N-terminal, but not C-terminal fusions. | Enzymatic digest and ligation required; adds Val to C-terminus of native proteins. Few human ORF clones available in Flexi-compatible vectors. License required for some commercial applications. | (87) |
| Univector/Echo™ | Steve Elledge (Harvard) and Invitrogen | Cre-lox recombination | Low cost | Requires plasmid fusion, with inefficient transfer of large inserts; residual 34 bp lox-P site. Few human ORF available in this format. | (88) |
Structural genomics
In the emerging field of structural genomics, the aim is to
lower the cost and expand the coverage of identified protein
folds (62). To reach this goal, there have been several
large-scale initiatives, such as the Protein Structure Initiative
(http://www.structuralgenomics.org/), that aim to generate
structures based on available protein-encoding ORFs.
Although these centers have not yet taken full advantage of
complete mammalian ORFeome collections, they have had
some success with earlier ORFeome collections, such as that
for Caenorhabditis elegans (58).
Protein interaction mapping
Two-hybrid (2-H) and one-hybrid (1-H) systems. High-throughput Y2H approaches generally consist of testing all
available combinations of proteins as DNA-binding domain
(DB-X) and activation domain (AD-Y) fusion proteins (63)
(Fig. 2). Although early versions of this system were criticized
for having a high false positive rate, the implementation of
more stringent Y2H systems and the rigorous retesting of
interactions have in large part eliminated this concern. In
current versions of the Y2H system, low-copy centromeric
vectors are used to reduce the expression level of the
fusions to avoid spurious interactions. In many cases, auto-activating baits have an intrinsic trans-activating activity
and can easily be eliminated, before starting the screen, by
testing for reporter gene activity in yeast cells containing
only the DB-X vector. Other DB-X auto-activators arise de
novo during the screen and are eliminated using a plasmid
shuffling counter-selection. In this method, a counter-selection
relying on cycloheximide sensitivity is used to eliminate
AD-Y from yeast cells to ensure that the Y2H reporter
genes are activated only in the presence of both DB-X and
AD-Y (64).
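
The pre-screen for auto-activators described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the published protocol: the bait names, the reporter readings and the `threshold` cutoff are all hypothetical.

```python
def split_autoactivators(reporter_without_prey, threshold=1.0):
    """Partition DB-X baits by reporter activity measured with no AD-Y present.

    reporter_without_prey maps a bait name to the reporter signal seen in
    yeast carrying only the DB-X vector (hypothetical units).
    """
    usable, autoactivators = [], []
    for bait, signal in sorted(reporter_without_prey.items()):
        # A bait that switches the reporter on by itself would score as a
        # positive with every prey, so it is set aside before screening.
        if signal > threshold:
            autoactivators.append(bait)
        else:
            usable.append(bait)
    return usable, autoactivators


# Hypothetical readings for three baits tested without any prey.
readings = {"DB-A": 0.1, "DB-B": 3.2, "DB-C": 0.4}
usable, autoactivators = split_autoactivators(readings)
```

Auto-activators that arise de novo during a screen are not caught by such a pre-screen, which is why the counter-selection step remains necessary.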
The Y2H system has been used for high-throughput PPI
mapping for several model systems including yeast, Drosophila melanogaster and C. elegans. Owing to the availability
of extensive ORF collections, similar module-scale and
proteome-scale PPI maps have recently been generated for
human. Maps have been generated for the Smad TGF-beta
signaling pathway (65), mRNA degradation factors (66) and
proteins linked to Huntington’s disease (67).
Figure 2. Five types of protein interaction assays. This figure summarizes the most common one- and two-hybrid assays used for PPI screens. (A) A variety of
one- and two-hybrid assays have been developed to test PPIs in different cell types and subcellular locations. (1) The MAPPIT assay takes place in the cytosolic
sub-membrane space of mammalian cells (72). (2) LUMIER is the only system with an extracellular interaction readout and can detect interactions taking place in
any subcellular location (71). (3) The Y1H system screens for protein–DNA interactions and relies on the recruitment of the AD-Y protein to the yeast nucleus (68).
(4) The Y2H system requires both fusion proteins to be located inside the yeast nucleus to detect PPIs (63). (5) The split ubiquitin assay is designed to detect PPIs
between integral membrane proteins that take place in the yeast membrane (70). (B) In each of the assays presented here, baits and preys are represented in red
and dark blue, respectively. (1) In the MAPPIT system, a ligand (L) binds the receptor’s ligand binding domain activating the receptor-associated JAKs. The
mutation Tyr1138→Phe on the cytosolic domain of the receptor prevents the recruitment and activation of STAT. Two additional mutations, Tyr985→Phe and
Tyr1077→Phe, eliminate adaptor and/or negative feedback mechanisms. Upon phosphorylation of these binding sites, STATs are recruited and activated
leading to the formation of STAT complexes that subsequently induce luciferase activity or puromycin resistance under the control of the rPAP1 promoter.
(2) In the LUMIER assay, RL-tagged baits and FLAG-tagged preys are immunoprecipitated from mammalian cells. Interactions are detected enzymatically in
the form of light emission. (3) The Y1H system detects protein–DNA interactions using a single hybrid protein AD-Y. A positive interaction activates a reporter
gene. (4) In the Y2H system, the interaction of two fusion proteins, DB-X and AD-Y, reconstitutes a transcription factor that activates a reporter gene. (5) In the
split ubiquitin system, one integral membrane protein is fused to one half of the ubiquitin protein (NubG), whereas the second membrane protein is fused to the
other half of the ubiquitin protein and a transcription factor (Cub-PLV). Interacting proteins bring the two halves of ubiquitin into proximity thereby reconstituting that protein which is then cleaved by an ubiquitin-specific protease releasing the transcription factor.

Figure 3. Integrating interaction maps with other large-scale datasets. This PPI network represents a sub-network of the human interaction map generated by
Rual et al. All interactions, or edges, represented here have been confirmed by one or more additional functional links. Proteins are depicted as yellow nodes with
adjoining gene symbols. Combined physical interactions and functional links between gene- or protein-pairs are depicted as magenta edges (for gene pairs that
are co-expressed or share a common conserved upstream motif), green edges (for protein pairs that share a common GO term) or orange edges (for protein pairs
that have mouse orthologs that share a common phenotype). Figure adapted from Rual et al. (60).

Two large-scale human PPI maps have recently been published (59b,60). Both groups screened a significant portion
of the ORFeome by testing all pair-wise combinations for
interaction. Stelzl et al. (59b) used 3500 cDNAs from a
human fetal brain expression library in addition to 2000
MGC ORFs; in contrast, Rual et al. screened a matrix of
8000 ORFs obtained from MGC cDNAs. Stelzl et al. generated a map of 911 high-confidence interactions (‘edges’)
among 401 proteins (‘nodes’), whereas Rual et al. constructed
a map of 2754 core interactions between 1549 proteins
(Fig. 3). Though Y2H has proven to be a powerful and scalable tool for PPI mapping, it results in a high level of false
negatives. Therefore, complementary approaches are needed
to generate a more complete map of the human interactome.
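
For a sense of scale, the figures quoted above can be turned into two rough quantities: the number of DB-X/AD-Y combinations in a full matrix, and the average number of interaction partners per protein in each map. The sketch below is only arithmetic on the numbers given in the text. It assumes that every ORF is tested as both bait and prey and that each edge joins two distinct proteins, both of which are simplifications, and the function names are illustrative.

```python
def matrix_tests(n_orfs):
    """DB-X x AD-Y combinations if every ORF is used as both bait and prey."""
    return n_orfs * n_orfs


def average_degree(n_edges, n_nodes):
    """Mean number of partners per protein (each edge has two ends)."""
    return 2 * n_edges / n_nodes


rual_tests = matrix_tests(8000)            # 64,000,000 combinations
stelzl_degree = average_degree(911, 401)   # about 4.5 partners per protein
rual_degree = average_degree(2754, 1549)   # about 3.6 partners per protein
```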
In contrast to the Y2H, the yeast 1-hybrid (Y1H) system is designed to detect protein–DNA interactions. Y1H protein–DNA
interactions are defined using a single hybrid protein, AD-Y,
where Y is a known or putative DNA-binding protein (DB).
Though the Y1H system has not yet been applied to a large
collection of human ORFs, its use and scalability have recently
been demonstrated in C. elegans (68). As in the Y2H system,
reporter gene expression is used as a readout when AD-Y can
bind to a sequence of DNA that has been cloned upstream of a
reporter gene.
Other techniques used to detect interactions. In contrast to the
Y2H system, where interactions occur in the nucleus, the
following techniques have been developed to test interactions in other cell types and cellular compartments.
Split ubiquitin. Recently, a two-hybrid system called ‘split
ubiquitin membrane Y2H’ has been adapted for large-scale
screening (69). This variant of the 2-hybrid system is currently
the only one that allows large-scale investigation of integral
membrane proteins, a class that cannot be screened using the
traditional Y2H system. A yeast interaction map of 2000
interactions among 500 membrane proteins has been built
using this technique (70).
LUMIER. A luminescence-based mammalian protein–protein
interactome mapping system (LUMIER) has recently been
described. This method was used in a high-throughput
manner to study the transforming growth factor-beta
(TGF-β) pathway (71). In this mammalian cell-based assay,
bait proteins are fused to Renilla luciferase (RL) and the
prey proteins are tagged with the FLAG epitope. Interactions
are determined by performing an RL enzymatic assay on
immunoprecipitates obtained with an anti-FLAG antibody. Though this technology has not yet been applied on a high-throughput scale,
the system could easily be adapted for such studies.
MAPPIT. MAPPIT (mammalian protein–protein interaction
trap) uses a cytokine-receptor-based interaction trap to detect
protein–protein interactions (72). The interaction between
bait and prey reconstitutes the receptor by bringing the activated JAKs into proximity of functional STAT recruitment
sites. This recruitment allows for the activation and dimerization of STATs, which then act as transcription factors to
drive a reporter gene. Because assayed interactions occur in
the cytosolic sub-membrane space, MAPPIT does not rely on
nuclear translocation of bait and prey proteins. Another advantage of this technique is that the readout is ligand-dependent,
adding a unique level of control to monitor interactions. The
usage of heterologous receptors fused to different bait proteins
can also allow for detection of modification-dependent PPI
such as phosphorylation-dependent interactions that might be
too transient to be detected by the standard Y2H. Although
MAPPIT can be used for module-scale screens, this procedure
has not yet been adapted for proteome-scale screens.
Disrupting interactions. Disrupting PPI, with small biomolecules and chemical compounds, can reveal a great deal
about the protein features that generate interactions. For
example, interaction interfaces and interacting domains can
be mapped using reverse-Y2H assays. Such systems can also
be exploited for drug discovery, where small molecules are
identified that can disrupt an interaction of medical relevance.
Reverse-Y2H (73) and MAPPIT (72) systems have been
developed successfully to screen for disruptions of PPI.
These disruptions can be caused by mutations in cis, within
the interacting protein molecules, or in trans, by compounds
that prevent the interactions from taking place. In the
reverse-Y2H system, the interaction between DB-X and
AD-Y can be used to drive the expression of URA3.
Expression of this reporter leads to the conversion of 5-FOA
to 5-FU, a toxin. This counter-selection is used to identify
DB-X and AD-Y pairs that can no longer interact, based on
their survival on media containing 5-FOA. In the reverse
MAPPIT system, a disrupted protein interaction is identified
based on the loss of an interaction between a protein fused
to an inhibitor of JAK/STAT signaling and a bait fused to a
functional cytokine-receptor, thus allowing for a restoration
of reporter gene activity.
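
The reverse-Y2H counter-selection can be read as a simple logical rule: an intact interaction drives URA3, URA3 makes 5-FOA toxic, and so only cells with a disrupted interaction grow. The following sketch encodes that rule. The allele names and their binding outcomes are hypothetical, and real screens are of course noisier than a boolean model.

```python
def grows_on_5foa(db_x_binds_ad_y):
    """Reverse-Y2H logic: an intact DB-X/AD-Y interaction drives URA3, and
    URA3 converts 5-FOA into toxic 5-FU, so only disrupted pairs survive."""
    ura3_expressed = db_x_binds_ad_y
    return not ura3_expressed


def select_disrupted(variants):
    """variants maps an allele or compound name to True when the
    DB-X/AD-Y interaction still occurs (hypothetical observations)."""
    return sorted(name for name, binds in variants.items()
                  if grows_on_5foa(binds))


alleles = {"wild-type": True, "mutant-1": False, "mutant-2": True}
survivors = select_disrupted(alleles)
```

The same rule applies whether the disruption acts in cis (a mutation in X or Y) or in trans (a compound added to the medium).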
Cell-based assays
ORFs can also be used to perform high-throughput cell-based
assays. Typically, expression ORFs are transfected into mammalian cells using a high-throughput transfection technique. A
wide variety of assays have been performed using such
methods. Often these assays depend upon technology that
allows for ‘high-content screening’ of cells, so that changes
in cell-shape and/or protein localization can be detected and
analyzed in an automated fashion. For example, Harada et al.
(61) used high-content screening in a live cell assay to identify
proteins which, when over-expressed, increase proliferation.
In this study they identified more than 86 cDNAs that gave
rise to increased proliferation in a cell line. Other studies
have taken similar strategies to study localization of proteins
in the cell (20,74).
Integration of Y2H data with other large-scale datasets
Interaction maps can be used as a scaffold to integrate different large-scale datasets. One example of this approach, done in
yeast, combined PPI data and mRNA expression data to determine the biological role of topological ‘hubs’, defined as
proteins with many interaction partners (75). By coupling
high-quality interaction data with expression data the authors
were able to show that hubs can be split into two categories:
‘date hubs’ which have relatively low correlation over a
large number of conditions as revealed by their expression
profiles and therefore interact with their partners at different
times or locations; and ‘party hubs’, which have a higher correlation and bind their partners simultaneously. These results
support a model of organized modularity where date hubs represent ‘higher level’ connectors between modules, whereas
party hubs function inside modules.
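
The distinction between date and party hubs rests on how strongly a hub is co-expressed with its interaction partners. A minimal sketch of such a classification is given below, using the mean Pearson correlation between the hub's expression profile and each partner's profile. The expression values, the protein names and the `cutoff` of 0.5 are assumptions chosen for illustration, not values taken from the cited study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify_hub(hub, partners, expression, cutoff=0.5):
    """Label a hub by its mean co-expression with its interaction partners."""
    total = sum(pearson(expression[hub], expression[p]) for p in partners)
    mean_r = total / len(partners)
    return "party" if mean_r >= cutoff else "date"


# Hypothetical expression profiles over four conditions.
expression = {
    "H1": [1.0, 2.0, 3.0, 4.0],
    "P1": [2.0, 4.0, 6.0, 8.0],   # rises with H1
    "P2": [1.5, 2.5, 3.5, 4.5],   # rises with H1
    "H2": [1.0, 2.0, 3.0, 4.0],
    "P3": [4.0, 3.0, 2.0, 1.0],   # falls as H2 rises
    "P4": [1.0, 3.0, 1.0, 3.0],   # only weakly related to H2
}
label_h1 = classify_hub("H1", ["P1", "P2"], expression)
label_h2 = classify_hub("H2", ["P3", "P4"], expression)
```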
Another example of data integration was carried out for
C. elegans by combining phenotypic profiling and expression
profiling data with PPI data (76). By deleting interactions with
fewer than two types of functional evidence, a ‘multiple support
network’ of more than 300 proteins and 1000 edges was built.
This network was shown to harbor two types of models:
protein complexes that constitute discrete molecular machines
are represented by clusters of nodes whose edges were supported by both PPI and phenotypic correlation data; proteins
involved in the same cellular processes without participating
in the same biological pathways correspond to nodes whose
edges are supported by phenotypic and expression correlation
but lack support of Y2H data. Functions of previously
unknown proteins were predicted using ‘guilt by association’
and were consistent with the localization patterns of
GFP-tagged proteins.
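
The filtering step behind such a multiple support network can be sketched as follows. The evidence labels ("PPI", "phenotype", "expression"), the toy edges and the function names are illustrative assumptions; the sketch only shows the idea of keeping edges supported by at least two kinds of evidence and then reading each edge according to the two models described above.

```python
def multiple_support_network(edge_evidence, min_types=2):
    """Keep edges backed by at least `min_types` distinct evidence types.

    edge_evidence maps a (protein_a, protein_b) pair to a set of evidence
    labels such as "PPI", "phenotype" or "expression".
    """
    return {edge: kinds for edge, kinds in edge_evidence.items()
            if len(kinds) >= min_types}


def edge_model(kinds):
    """Rough reading of an edge, following the two models in the text."""
    if {"PPI", "phenotype"} <= kinds:
        return "molecular machine"
    if {"phenotype", "expression"} <= kinds:
        return "shared process"
    return "other"


# Hypothetical edges with their supporting evidence.
edges = {
    ("a", "b"): {"PPI", "phenotype"},
    ("a", "c"): {"phenotype", "expression"},
    ("b", "d"): {"PPI"},
}
network = multiple_support_network(edges)
models = {edge: edge_model(kinds) for edge, kinds in network.items()}
```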
The combination of PPI data with other large-scale datasets
was also undertaken in both human proteome-wide Y2H
studies, described earlier. To validate their interaction data,
Rual et al. (60) correlated PPI data with expression studies
in human and mouse tissues, conserved upstream motifs, GO
term annotations and phenotype data of orthologous genes in
the mouse (Fig. 3). Furthermore, they functionally annotated
uncharacterized proteins in the interaction map by integrating
PPI data with data from the Online Mendelian Inheritance in
Man (OMIM) database. This resulted in the identification of
424 interacting protein pairs for which at least one partner
was associated with a disease.
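
Selecting interacting pairs in which at least one partner carries a disease annotation is, in essence, a set-membership test. The sketch below illustrates it with made-up gene names and a made-up disease gene set; it does not query OMIM or reproduce the published analysis.

```python
def disease_linked_pairs(interactions, disease_genes):
    """Return interacting pairs with at least one disease-associated partner."""
    return [(a, b) for a, b in interactions
            if a in disease_genes or b in disease_genes]


# Hypothetical interaction list and disease-associated gene set.
ppi = [("GENE1", "GENE2"), ("GENE2", "GENE3"), ("GENE4", "GENE5")]
omim_like = {"GENE1", "GENE3"}
linked = disease_linked_pairs(ppi, omim_like)
```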
In a similar fashion, Stelzl et al. (59b) evaluated their
dataset by comparing it with GO annotation and interaction
maps in other species. They also compared their PPIs to the
Kyoto Encyclopedia of Genes and Genomes (KEGG), which
allowed them to identify proteins that link two or more proteins annotated to act in the same pathway.
FUTURE APPLICATIONS OF ORFeome CLONES
The integration of PPIs with other large-scale data has proven
very useful to build better network models, evaluate Y2H
interactions and infer function for previously uncharacterized
genes. This integration process, however, is still far from
reaching its full potential. In order to gain a more profound
understanding of PPIs in cellular networks, the integration of
more complete datasets is required. Although most current
large-scale studies only use parts of the available proteome,
improved large-scale approaches should take advantage of
the ORFeome to generate truly proteome-wide datasets. Furthermore, most current studies do not take into consideration
alternative splice forms, but rather collapse alternatively
spliced transcripts of single genes into a single ORF.
Because splice variants and other RNA isoforms commonly
follow different expression patterns in time and space, often
related to different biological functions (77), their individual
interactions should be treated as individual data points.
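
One way to picture what is lost when splice forms are collapsed is to record interactions at the isoform level and then merge them by gene. In the sketch below, the gene and isoform labels are hypothetical and the data structure is only one possible choice.

```python
from collections import defaultdict


def collapse_to_genes(isoform_interactions):
    """Merge isoform-level interactions into gene-level ones, recording
    which isoform pairs support each gene pair."""
    merged = defaultdict(set)
    for (gene_a, iso_a), (gene_b, iso_b) in isoform_interactions:
        merged[(gene_a, gene_b)].add((iso_a, iso_b))
    return dict(merged)


# Hypothetical data: isoform "a" of G1 binds G2, isoform "b" of G1 binds G3.
observed = [
    (("G1", "a"), ("G2", "a")),
    (("G1", "b"), ("G3", "a")),
]
collapsed = collapse_to_genes(observed)
```

At the gene level, G1 appears to bind both G2 and G3; only the isoform-level record shows that the two interactions involve different G1 products.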
SUPPLEMENTARY MATERIAL
Supplementary Material is available at HMG Online.
ACKNOWLEDGEMENTS
P.L., D.E.H., S.M. and M.V. were supported by grants from
the Ellison Foundation, the NCI, NHGRI and NIGMS
awarded to M.V.
Conflict of Interest statement. None declared.
REFERENCES
1. Furey, T.S., Diekhans, M., Lu, Y., Graves, T.A., Oddy, L.,
Randall-Maher, J., Hillier, L.W., Wilson, R.K. and Haussler, D. (2004)
Analysis of human mRNAs with the reference genome sequence reveals
potential errors, polymorphisms, and RNA editing. Genome Res., 14,
2034–2040.
2. Brent, M.R. (2005) Genome annotation past, present, and future: how to
define an ORF at each locus. Genome Res., 15, 1777–1786.
3. Rual, J.F., Hill, D.E. and Vidal, M. (2004) ORFeome projects: gateway
between genomics and omics. Curr. Opin. Chem. Biol., 8, 20–25.
4. Pearlberg, J. and LaBaer, J. (2004) Protein expression clone repositories
for functional proteomics. Curr. Opin. Chem. Biol., 8, 98– 102.
5. Brizuela, L., Braun, P., LaBaer, J., Marsischky, G. and LaBaer, J. (2001)
FLEXGene repository: from sequenced genomes to gene repositories for
high-throughput functional biology and proteomics. Mol. Biochem.
Parasitol., 118, 155–165.
6. Marsischky, G. and LaBaer, J. (2004) Many paths to many clones: a
comparative look at high-throughput cloning methods. Genome Res., 14,
2020–2028.
7. Weaver, T., Maurer, J. and Hayashizaki, Y. (2004) Sharing genomes: an
integrated approach to funding, managing and distributing genomic
clone resources. Nat. Rev. Genet., 5, 861–866.
8. Ota, T., Suzuki, Y., Nishikawa, T., Otsuki, T., Sugiyama, T., Irie, R.,
Wakamatsu, A., Hayashi, K., Sato, H., Nagai, K. et al. (2004) Complete
sequencing and characterization of 21 243 full-length human cDNAs.
Nat. Genet., 36, 40–45.
9. Yudate, H.T., Suwa, M., Irie, R., Matsui, H., Nishikawa, T., Nakamura,
Y., Yamaguchi, D., Peng, Z.Z., Yamamoto, T., Nagai, K. et al. (2001)
HUNT: launch of a full-length cDNA database from the Helix Research
Institute. Nucleic Acids Res., 29, 185–188.
10. Imanishi, T., Itoh, T., Suzuki, Y., O’Donovan, C., Fukuchi, S., Koyanagi,
K.O., Barrero, R.A., Tamura, T., Yamaguchi-Kabata, Y., Tanino,
M. et al. (2004) Integrative annotation of 21 037 human genes validated
by full-length cDNA clones. PLoS Biol., 2, e162.
11. Kikuno, R., Nagase, T., Nakayama, M., Koga, H., Okazaki, N.,
Nakajima, D. and Ohara, O. (2004) HUGE: a database for human KIAA
proteins, a 2004 update integrating HUGEppi and ROUGE. Nucleic
Acids Res., 32, D502–D504.
12. Ohara, O., Nagase, T., Ishikawa, K., Nakajima, D., Ohira, M., Seki, N.
and Nomura, N. (1997) Construction and characterization of human
brain cDNA libraries suitable for analysis of cDNA clones encoding
relatively large proteins. DNA Res., 4, 53–59.
13. Suzuki, Y., Yoshitomo-Nakagawa, K., Maruyama, K., Suyama, A.
and Sugano, S. (1997) Construction and characterization of a full
length-enriched and a 5′-end-enriched cDNA library. Gene, 200, 149–156.
14. Suzuki, Y. and Sugano, S. (2001) Construction of full-length-enriched
cDNA libraries. The oligo-capping method. Meth. Mol. Biol., 175,
143–153.
15. Suzuki, Y. and Sugano, S. (2003) Construction of a full-length enriched
and a 5′-end enriched cDNA library using the oligo-capping method.
Meth. Mol. Biol., 221, 73–91.
16. Gerhard, D.S., Wagner, L., Feingold, E.A., Shenmen, C.M., Grouse,
L.H., Schuler, G., Klein, S.L., Old, S., Rasooly, R., Good, P. et al. (2004)
The status, quality, and expansion of the NIH full-length cDNA project:
the mammalian gene collection (MGC). Genome Res., 14, 2121–2127.
17. Strausberg, R.L., Feingold, E.A., Grouse, L.H., Derge, J.G.,
Klausner, R.D., Collins, F.S., Wagner, L., Shenmen, C.M., Schuler,
G.D., Altschul, S.F. et al. (2002) Generation and initial analysis of more
than 15 000 full-length human and mouse cDNA sequences. Proc. Natl
Acad. Sci. USA, 99, 16899–16903.
18. MGC-NIH (2006) http://mgc.nci.nih.gov/.
19. Wiemann, S., Weil, B., Wellenreuther, R., Gassenhuber, J., Glassl, S.,
Ansorge, W., Bocher, M., Blocker, H., Bauersachs, S., Blum, H. et al.
(2001) Toward a catalog of human genes and proteins: sequencing and
analysis of 500 novel complete protein coding human cDNAs. Genome
Res., 11, 422–435.
20. Simpson, J.C., Wellenreuther, R., Poustka, A., Pepperkok, R. and
Wiemann, S. (2000) Systematic subcellular localization of novel proteins
identified by large-scale cDNA sequencing. EMBO Rep., 1, 287–292.
21. Wiemann, S., Arlt, D., Huber, W., Wellenreuther, R., Schleeger, S.,
Mehrle, A., Bechtel, S., Sauermann, M., Korf, U., Pepperkok, R. et al.
(2004) From ORFeome to biology: a functional genomics pipeline.
Genome Res., 14, 2136–2144.
22. Wiemann, S., Bechtel, S., Bannasch, D., Pepperkok, R. and
Poustka, A. (2003) The German cDNA network: cDNAs, functional
genomics and proteomics. J. Struct. Func. Genomics, 4, 87–96.
23. Rual, J.F., Hirozane-Kishikawa, T., Hao, T., Bertin, N., Li, S., Dricot, A.,
Li, N., Rosenberg, J., Lamesch, P., Vidalain, P.O. et al. (2004) Human
ORFeome version 1.1: a platform for reverse proteomics. Genome Res.,
14, 2128–2135.
24. Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2005) NCBI Reference
Sequence (RefSeq): a curated non-redundant sequence database of
genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.
25. Brent, M.R. and Guigo, R. (2004) Recent advances in gene structure
prediction. Curr. Opin. Struct. Biol., 14, 264–272.
26. Kozak, M. (2002) Pushing the limits of the scanning mechanism for
initiation of translation. Gene, 299, 1–34.
27. Kozak, M. (2005) Regulation of translation via mRNA structure in
prokaryotes and eukaryotes. Gene, 361, 13–37.
28. Hann, S.R. (1994) Regulation and function of non-AUG-initiated
proto-oncogenes. Biochimie, 76, 880–886.
29. Kochetov, A.V., Sarai, A., Rogozin, I.B., Shumny, V.K. and Kolchanov,
N.A. (2005) The role of alternative translation start sites in the
generation of human protein diversity. Mol. Genet. Genomics, 273,
491–496.
30. Touriol, C., Bornes, S., Bonnal, S., Audigier, S., Prats, H., Prats, A.C.
and Vagner, S. (2003) Generation of protein isoform diversity by
alternative initiation of translation at non-AUG codons. Biol. Cell, 95,
169–178.
31. NCBI (2005) CCDS Database. NLM-NCBI.
32. International Human Genome Sequencing Consortium (2004) Finishing
the euchromatic sequence of the human genome. Nature, 431, 931–945.
33. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F.,
Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M.,
An, P. et al. (2002) Initial sequencing and comparative analysis of the
mouse genome. Nature, 420, 520–562.
34. Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren,
E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E. et al.
(2004) Genome sequence of the Brown Norway rat yields insights into
mammalian evolution. Nature, 428, 493–521.
35. Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe,
D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas, E.J., III, Zody,
M.C. et al. (2005) Genome sequence, comparative analysis and
haplotype structure of the domestic dog. Nature, 438, 803–819.
36. NCBI (2003) Gnomon description. NLM-NCBI.
37. Axel, R., Feigelson, P. and Schutz, G. (1976) Analysis of the complexity
and diversity of mRNA from chicken liver and oviduct. Cell, 7,
247–254.
38. Ryffel, G.U. and McCarthy, B.J. (1975) Complexity of cytoplasmic RNA
in different mouse tissues measured by hybridization of polyadenylated
RNA to complementary DNA. Biochemistry, 14, 1379–1385.
39. Bishop, J.O., Morton, J.G., Rosbash, M. and Richardson, M. (1974)
Three abundance classes in HeLa cell messenger RNA. Nature, 250,
199–204.
40. Hirozane-Kishikawa, T., Shiraki, T., Waki, K., Nakamura, M., Arakawa,
T., Kawai, J., Fagiolini, M., Hensch, T.K., Hayashizaki, Y. and Carninci,
P. (2003) Subtraction of cap-trapped full-length cDNA libraries to select
rare transcripts. Biotechniques, 35, 510–516, 518.
41. Bonaldo, M.F., Lennon, G. and Soares, M.B. (1996) Normalization and
subtraction: two approaches to facilitate gene discovery. Genome Res., 6,
791–806.
42. Carninci, P., Shibata, Y., Hayatsu, N., Sugahara, Y., Shibata, K., Itoh,
M., Konno, H., Okazaki, Y., Muramatsu, M. and Hayashizaki, Y. (2000)
Normalization and subtraction of cap-trapper-selected cDNAs to prepare
full-length cDNA libraries for rapid discovery of new genes. Genome
Res., 10, 1617–1630.
43. Baross, A., Butterfield, Y.S., Coughlin, S.M., Zeng, T., Griffith, M.,
Griffith, O.L., Petrescu, A.S., Smailus, D.E., Khattra, J., McDonald, H.L.
et al. (2004) Systematic recovery and analysis of full-ORF human cDNA
clones. Genome Res., 14, 2083–2092.
44. Wu, J.Q., Garcia, A.M., Hulyk, S., Sneed, A., Kowis, C., Yuan, Y.,
Steffen, D., McPherson, J.D., Gunaratne, P.H. and Gibbs, R.A. (2004)
Large-scale RT-PCR recovery of full-length cDNA clones.
Biotechniques, 36, 690–696, 698–700.
45. Mironov, A.A., Fickett, J.W. and Gelfand, M.S. (1999) Frequent
alternative splicing of human genes. Genome Res., 9, 1288–1293.
46. Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing.
Nat. Genet., 30, 13–19.
47. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C.,
Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al.
(2001) Initial sequencing and analysis of the human genome. Nature,
409, 860–921.
48. Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M.,
Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R. and
Shoemaker, D.D. (2003) Genome-wide survey of human alternative
pre-mRNA splicing with exon junction microarrays. Science, 302,
2141–2144.
49. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N.,
Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al. (2005) The transcriptional
landscape of the mammalian genome. Science, 309, 1559–1563.
50. Beaudoing, E., Freier, S., Wyatt, J.R., Claverie, J.-M. and Gautheret, D.
(2000) Patterns of variant polyadenylation signal usage in human genes.
Genome Res., 10, 1001–1010.
51. Edwalds-Gilbert, G., Veraldi, K.L. and Milcarek, C. (1997) Alternative
poly(A) site selection in complex transcription units: means to an end?
Nucleic Acids Res., 25, 2547–2561.
52. Lu, R., Alioua, A., Kumar, Y., Eghbali, M., Stefani, E. and
Toro, L. (2006) MaxiK channel partners: physiological impact. J.
Physiol. (Lond.), 570, 65–72.
53. Kelley, C.A. and Adelstein, R.S. (1994) Characterization of isoform
diversity in smooth muscle myosin heavy chains. Can. J. Physiol.
Pharmacol., 72, 1351–1360.
54. Diss, J.K., Fraser, S.P. and Djamgoz, M.B. (2004) Voltage-gated Na+
channels: multiplicity of expression, plasticity, functional implications
and pathophysiological aspects. Eur. Biophys. J., 33, 180–193.
55. Sakharkar, K.R., Sakharkar, M.K., Culiat, C.T., Chow, V.T.K. and
Pervaiz, S. (2006) Functional and evolutionary analyses on expressed
intronless genes in the mouse genome. FEBS Lett., 580, 1472–1478.
56. Sakharkar, M.K., Chow, V.T., Ghosh, K., Chaturvedi, I., Lee, P.C.,
Bagavathi, S.P., Shapshak, P., Subbiah, S. and Kangueane, P. (2005)
Computational prediction of SEG (single exon gene) function in
humans. Front Biosci., 10, 1382–1395.
57a. Hartley, J.L., Temple, G.F. and Brasch, M.A. (2000) DNA cloning using
in vitro site-specific recombination. Genome Res., 10, 1788–1795.
57b. Walhout, A.J., Temple, G.F. et al. (2000) GATEWAY recombinational
cloning: application to the cloning of large numbers of open reading
frames or ORFeomes. Methods Enzymol., 328, 575–592.
58. Luan, C.H., Qiu, S., Finley, J.B., Carson, M., Gray, R.J., Huang, W.,
Johnson, D., Tsao, J., Reboul, J., Vaglio, P. et al. (2004)
High-throughput expression of C. elegans proteins. Genome Res., 14,
2102–2110.
59a. Walhout, A.J., Sordella, R. et al. (2000) Protein interaction mapping
in C. elegans using proteins involved in vulval development. Science,
287, 116–122.
59b. Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H.,
Goehler, H., Stroedicke, M., Zenkner, M., Schoenherr, A., Koeppen,
S. et al. (2005) A human protein–protein interaction network: a resource
for annotating the proteome. Cell, 122, 957–968.
60. Rual, J.F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A.,
Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., Ayivi-Guedehoussou,
N. et al. (2005) Towards a proteome-scale map of the human protein–
protein interaction network. Nature, 437, 1173–1178.
61. Harada, J.N., Bower, K.E., Orth, A.P., Callaway, S., Nelson, C.G., Laris,
C., Hogenesch, J.B., Vogt, P.K. and Chanda, S.K. (2005) Identification of
novel mammalian growth regulatory factors by genome-scale quantitative
image analysis. Genome Res., 15, 1136–1144.
62. Chandonia, J.-M. and Brenner, S.E. (2006) The impact of structural
genomics: expectations and outcomes. Science, 311, 347–351.
63. Fields, S. and Song, O. (1989) A novel genetic system to detect
protein–protein interactions. Nature, 340, 245–246.
64. Vidalain, P.O., Boxem, M., Ge, H., Li, S. and Vidal, M. (2004)
Increasing specificity in high-throughput yeast two-hybrid
experiments. Methods, 32, 363–370.
65. Colland, F., Jacq, X., Trouplin, V., Mougin, C., Groizeleau, C.,
Hamburger, A., Meil, A., Wojcik, J., Legrain, P. and Gauthier, J.M.
(2004) Functional proteomics mapping of a human signaling pathway.
Genome Res., 14, 1324–1332.
66. Lehner, B. and Sanderson, C.M. (2004) A protein interaction framework
for human mRNA degradation. Genome Res., 14, 1315–1323.
67. Goehler, H., Lalowski, M., Stelzl, U., Waelter, S., Stroedicke, M., Worm,
U., Droege, A., Lindenberg, K.S., Knoblich, M., Haenig, C. et al. (2004)
A protein interaction network links GIT1, an enhancer of huntingtin
aggregation, to Huntington’s disease. Mol. Cell, 15, 853–865.
68. Deplancke, B., Dupuy, D., Vidal, M. and Walhout, A.J. (2004) A gateway-compatible
yeast one-hybrid system. Genome Res., 14, 2093–2101.
69. Obrdlik, P., El-Bakkoury, M., Hamacher, T., Cappellaro, C., Vilarino,
C., Fleischer, C., Ellerbrok, H., Kamuzinzi, R., Ledent, V., Blaudez,
D. et al. (2004) K+ channel interactions detected by a genetic system
optimized for systematic studies of membrane protein interactions. Proc.
Natl Acad. Sci. USA, 101, 12242–12247.
70. Miller, J.P., Lo, R.S., Ben-Hur, A., Desmarais, C., Stagljar, I., Noble, W.S.
and Fields, S. (2005) Large-scale identification of yeast integral membrane
protein interactions. Proc. Natl Acad. Sci. USA, 102, 12123–12128.
71. Barrios-Rodiles, M., Brown, K.R., Ozdamar, B., Bose, R., Liu, Z.,
Donovan, R.S., Shinjo, F., Liu, Y., Dembowy, J., Taylor, I.W. et al.
(2005) High-throughput mapping of a dynamic signaling network in
mammalian cells. Science, 307, 1621–1625.
72. Eyckerman, S., Verhee, A., der Heyden, J.V., Lemmens, I., Ostade, X.V.,
Vandekerckhove, J. and Tavernier, J. (2001) Design and application of a
cytokine-receptor-based interaction trap. Nat. Cell Biol., 3, 1114–1119.
73. Endoh, H., Walhout, A.J. and Vidal, M. (2000) A green fluorescent
protein-based reverse two-hybrid system: application to the
characterization of large numbers of potential protein–protein
interactions. Meth. Enzymol., 328, 74–88.
74. Conrad, C., Erfle, H., Warnat, P., Daigle, N., Lorch, T., Ellenberg, J.,
Pepperkok, R. and Eils, R. (2004) Automatic identification of subcellular
phenotypes on human cell arrays. Genome Res., 14, 1130–1136.
75. Han, J.D., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G.F., Zhang,
L.V., Dupuy, D., Walhout, A.J., Cusick, M.E., Roth, F.P. et al. (2004)
Evidence for dynamically organized modularity in the yeast protein–
protein interaction network. Nature, 430, 88–93.
76. Gunsalus, K.C., Ge, H., Schetter, A.J., Goldberg, D.S., Han, J.D., Hao,
T., Berriz, G.F., Bertin, N., Huang, J., Chuang, L.S. et al. (2005)
Predictive models of molecular machines involved in Caenorhabditis
elegans early embryogenesis. Nature, 436, 861–865.
77. Stamm, S., Ben-Ari, S., Rafalska, I., Tang, Y., Zhang, Z., Toiber, D.,
Thanaraj, T.A. and Soreq, H. (2005) Function of alternative splicing.
Gene, 344, 1–20.
78. MGC-NIH (2006) Mammalian Gene Collection: http://mgc.nci.nih.gov/.
79. Okazaki, N., Imai, K., Kikuno, R.F., Misawa, K., Kawai, M.,
Inamoto, S., Ohara, R., Nagase, T., Ohara, O. and Koga, H. (2005)
Influence of the 3′-UTR-length of mKIAA cDNAs and their sequence
features to the mRNA expression level in the brain. DNA Res., 12,
181–189.
80. DKFZ (2006) DKFZ-Wiemann lab standard protocols: http://www.dkfz.
de/smp-cell/cell.org/groups.asp?siteID=49.
81. Arlt, D., Huber, W., Liebel, U., Schmidt, C., Majety, M., Sauermann, M.,
Rosenfelder, H., Bechtel, S., Mehrle, A., Bannasch, D. et al. (2005)
Functional profiling: from microarrays via cell-based assays to novel
tumor relevant modulators of the cell cycle. Cancer Res., 65,
7733–7742.
82. OpenBiosystems (2006) Mammalian Resources: cDNAs http://www.
openbiosystems.com/Genomics/Mammalian%20Resources/
cDNA%20Clones/.
83. Promega Flexi Vector System (2006) http://www.promega.com/catalog/
catalogproducts.asp?catalog_name=Promega_Products&category_name=Flexi+Cloning+System.
84a. House, B.L., Mortimer, M.W. and Kahn, M.L. (2004) New
recombination methods for Sinorhizobium meliloti genetics. Appl.
Environ. Microbiol., 70, 2806–2815.
84b. Cheo, D.L., Titus, S.A., Byrd, D.R., Hartley, J.L., Temple, G.F. and
Brasch, M.A. (2004) Concerted assembly and cloning of multiple DNA
segments using in vitro site-specific recombination: functional analysis
of multi-segment expression clones. Genome Res., 14, 2111–2120.
85. Clontech (2006) Creator™ Gene Expression Systems: http://www.
clontech.com/clontech/products/families/creator/index.shtml.
86. Li, M.Z. and Elledge, S.J. (2005) MAGIC, an in vivo genetic method
for the rapid construction of recombinant DNA molecules. Nat. Genet.,
37, 311–319.
87. Blommel, P.G., Martin, P.A., Wrobel, R.L., Steffen, E. and Fox, B.G.
(2005) High efficiency single step production of expression plasmids
from cDNA clones using the flexi vector cloning system. Protein Expr.
Purif., December 5 [Epub ahead of print].
88. Liu, Q., Li, M.Z., Leibham, D., Cortez, D. and Elledge, S.J. (1998)
The univector plasmid-fusion system, a method for rapid construction of
recombinant DNA without restriction enzymes. Curr. Biol., 8,
1300–1309.