Gene 417 (2008) 1–4
Review
What is a gene? An updated operational definition
Graziano Pesole ⁎
Dipartimento di Biochimica e Biologia Molecolare “E. Quagliariello”, Università di Bari, Via Orabona 4, 70126 Bari, Italy
Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D, 70125 Bari, Italy
ARTICLE INFO
Article history:
Received 17 September 2007
Received in revised form 28 February 2008
Accepted 6 March 2008
Available online 26 March 2008
Keywords:
Genomics
Bioinformatics
Alternative splicing
Alternative transcription start sites
Alternative transcription termination
ABSTRACT
A crucial pre-requisite for large-scale annotation of eukaryotic genomes is the definition of what constitutes a
gene. This issue is addressed here in the light of novel and surprising gene features that have recently
emerged from large-scale genomic and transcriptomic analyses. The updated operational definition proposed
here is: “a gene is a discrete genomic region whose transcription is regulated by one or more promoters and
distal regulatory elements and which contains the information for the synthesis of functional proteins or
non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate
products (proteins or RNAs)”. This definition is specifically designed for eukaryotic chromosomal genes and
emphasizes the commonality of the genetic material that gives rise to final, functional products (ncRNAs or
proteins) derived from a single gene. It may be useful in several applications and should help in the provision
of a comprehensive inventory of the genes of a given organism, finally allowing answers to the basic question
of “how many genes” are encoded in its genome.
© 2008 Elsevier B.V. All rights reserved.
Contents
1. Problematic issues in eukaryotic chromosomal gene definition
2. An updated operational gene definition
Acknowledgments
References
1. Problematic issues in eukaryotic chromosomal gene definition
A major goal of a genome sequencing project for a specific
organism is the definition of its entire gene complement. To
accomplish this important task several fairly accurate gene prediction
tools are generally used together with large-scale production of
expression evidence (e.g. cDNA and EST sequences).
In this review I will deal with the problem of the definition of what
is a gene, a crucial pre-requisite for large-scale annotation of
eukaryotic genomes. Indeed, gene assessment in prokaryotic genomes
is much simpler owing to their higher gene density (about 80% of a
prokaryotic genome is protein coding) and the lack of introns. The
identification of significantly large Open Reading Frames (ORFs) is an
obvious solution for the identification of the majority of protein
coding prokaryotic genes. Short prokaryotic genes are more problematic but can generally be identified with suitable bioinformatics
approaches validated by transcription and translation evidence.
Abbreviations: AS, alternative splicing; miRNA, micro RNA; ncRNA, non-coding RNA;
ORF, open reading frame; TSS, transcription start site; TTS, transcription termination
site; TU, transcriptional unit; UTR, untranslated region.
⁎ Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Via Orabona,
4, 70125 Bari, Italy. Tel.: +39 080 5443588; fax: +39 080 5443317.
E-mail address: [email protected].
0378-1119/$ – see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.gene.2008.03.010
To date, 77 eukaryotic genome projects have been completed
(Liolios et al., 2006) but for none of them are we able to answer the
simple question of how many genes they contain. This is mostly due to
the presence of some gaps in the genome sequences and to the
incompleteness of gene annotation. However, even if all gaps were
closed and a full gene annotation were available and validated by
comprehensive transcriptional evidence, we could still be unable to
provide a reliable estimate of the gene number, in part because of the
lack of a clear and unambiguous definition of what a gene is.
Several definitions have been proposed — such as this, from one of
the most widely used Molecular Biology textbooks: “A gene is the
segment of DNA specifying a polypeptide chain; it includes regions
preceding and following the coding region (leader and trailer) as well
as intervening sequences (introns) between individual coding
segments (exons).” (Lewin, 2007; http://www.ergito.com/). This
exemplar definition, apart from its ambiguous use of the term
“exon”, is barely satisfactory as it does not consider some problematic
gene features recently highlighted by work carried out at the RIKEN
Institute on the transcriptional landscape of the mouse genome (Carninci
et al., 2005) and most recently by the international Encyclopedia of
DNA Elements (ENCODE) project (Gerstein et al., 2007) that strongly
challenge the conventional view of genes. Indeed, the classical “one
gene, one protein” definition is no longer acceptable and is also
impractical (Pearson, 2006).
In particular:
1) A large fraction of genes do not encode proteins. Indeed, over
50% of the transcriptional units (TUs) identified in mouse do not
appear to be coding and the majority of them are alternatively
spliced and polyadenylated.
2) The same gene locus may encode a large variety of transcripts and
proteins through alternative transcription start sites (TSS), alternative transcription termination sites (TTS) and alternative splicing
(AS). In some cases AS may generate mRNAs encoding
completely unrelated proteins using different coding frames.
3) Some genes have been found to overlap each other on the same or
opposite strands. The discontinuous structure of eukaryotic genes
potentially allows “Russian doll” gene models, where one gene can
be completely contained inside one or more introns of another
gene without sharing any exonic regions.
4) The ligation of two distinct mRNA molecules encoded by separate
gene loci through the trans-splicing mechanism is another
phenomenon widespread in some eukaryote lineages such as
nematodes and ascidians (Hastings, 2005) which may further
increase the complexity of the gene expression pattern.
5) Finally, recent computational and experimental analyses point to
the existence of chimerical transcripts produced by the co-transcription of tandem gene pairs, and potentially encoding
fusion proteins (Parra et al., 2006).
2. An updated operational gene definition
In the light of the above features one might ask whether it is still
appropriate to maintain a gene-centric view of molecular biology, or
whether it is better to just consider functional products (proteins and ncRNAs)
that may be in some way related by the molecular processes involved
in their expression, such as the sharing of a promoter (or TSS), a
transcription termination site (TTS) or one or more splice sites. Indeed,
to understand the relationships between the different cellular
components in a systems biology framework, it may be more
appropriate to consider functional products rather than genes, in the
light of their specific expression in different conditions (i.e. tissue,
developmental stage or pathological status).
However, I believe that despite the many problems that have
emerged in recent years it would be premature to announce the
death of the gene concept, mostly because the tight connection
between a functional product and its encoding genetic material
cannot be disregarded. Nevertheless, an updated operational definition is
needed to allow the unambiguous association between transcripts,
proteins, and their encoding genes.
In agreement with Gerstein et al. (2007) this updated definition
should adopt a bottom-up criterion, i.e. emphasize the ultimate
functional gene products, either ncRNAs (e.g. miRNA) or proteins, and
consider the regulatory regions involved in their expression at both
the transcriptional (i.e. promoter, enhancer, etc.) and post-transcriptional (i.e. 5′UTR and 3′UTR) level as “gene-related”. Thus, the
proposed operational definition can be summarized as: “a gene is a
discrete genomic region whose transcription is regulated by one or
more promoters and distal regulatory elements and which contains
the information for the synthesis of functional proteins or non-coding
RNAs, related by the sharing of a portion of genetic information at the
level of the ultimate products (proteins or RNAs)”. This definition does
not include cis-regulatory regions, because sequence elements controlling
the expression of a gene are not necessarily located upstream of it and
may be dispersed throughout the genome (Gerstein et al., 2007),
making the accurate definition of their boundaries unfeasible. In
addition, some of the transcriptional regulatory elements may
themselves be transcribed (Zhu et al., 2007).
Fig. 1. The discrete genomic region depicted here encodes one non-coding and eight protein coding spliced transcripts (ncRNA in yellow; 5′UTR and 3′UTR in light and dark pink,
respectively; protein coding sequence in green; dotted lines represent RNA removed or spliced out by maturation). Four different genes (numbered 1–4) can be annotated according
to the gene definition proposed here. A specific set of transcripts can be clustered and assigned to the same gene if the transcript projections on the genome sequence, limited to the
regions encoding the final products (e.g. the green and the yellow boxes for the protein coding and non-coding RNA genes, respectively), overlap each other. The clustering
procedure is iterated and may include non-overlapping transcripts in the same gene cluster. For example, in the case of gene 3, the transcript isoforms encoding products DE and
FE are clustered because they overlap through the region E, then the transcript FG is added to this cluster because the region F overlaps with one member of the cluster.
The transcript encoding the product AE can be identified as a chimerical transcript originating from the concatenation of two exons belonging to two different genes, as these two exons
are prevalently expressed by two unrelated genes (i.e. genes 2 and 3). The gene coordinates, denoted by the arrowed lines, are the leftmost and rightmost mapping positions on the
genome of all transcripts belonging to the same gene cluster. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
An example to illustrate the application of this definition is shown
in Fig. 1, where a genomic region encoding nine different transcripts,
which give rise to one ncRNA and seven functional proteins, is
described. According to the above definition, i) ABC and AC, and ii) DE, FE
and FG, form two clusters of related proteins, generated by alternatively
spliced products of genes 2 and 3, respectively. I would suggest that two (or more)
proteins are related (i.e. belong to the same gene cluster) if their
encoding genome sequences overlap each other. Indeed, products
with overlapping encoding genome sequences, like DE and FE, have a
strict genetic relationship as a mutation in the shared genomic region
(i.e. the E region) would affect both products. It should be noted that
the relationship between two products can be indirect, as DE and FG
are related through FE (see also the legend of Fig. 1).
Related proteins may also have completely different sequences, as
in the case of DE and FG, or when the expressed products use
different reading frames.
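The overlap rule described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the procedure used to draw Fig. 1: the coordinates given to the regions A–H and X are invented, and the names `overlaps` and `cluster_products` are hypothetical. Each product is represented by the genomic intervals that encode it, and two products are merged into the same cluster whenever any of their intervals overlap. Merging is transitive, so DE and FG end up together through FE.

```python
from itertools import combinations

# Invented half-open genomic intervals (start, end) for the regions of Fig. 1.
# These numbers are assumptions made only for this example.
REGIONS = {
    "A": (100, 200), "X": (220, 260), "B": (300, 400), "C": (500, 600),
    "H": (700, 800), "D": (1000, 1100), "F": (1200, 1300),
    "E": (1400, 1500), "G": (1600, 1700),
}

# Each final product is encoded by one or more regions.
PRODUCTS = {
    "ABC": ["A", "B", "C"], "AC": ["A", "C"],
    "DE": ["D", "E"], "FE": ["F", "E"], "FG": ["F", "G"],
    "H": ["H"], "X": ["X"],
}


def overlaps(a, b):
    """Return True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_products(products, regions):
    """Group products whose encoding regions overlap, directly or indirectly."""
    parent = {name: name for name in products}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(products, 2):
        if any(overlaps(regions[r], regions[s])
               for r in products[p] for s in products[q]):
            parent[find(p)] = find(q)  # merge the two clusters

    clusters = {}
    for name in products:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    for members in cluster_products(PRODUCTS, REGIONS):
        print(members)
```

With these invented coordinates the sketch yields four clusters, matching the four genes of the example: ABC and AC; DE, FE and FG; H; and X. The chimerical product AE is deliberately left out of the input, since it would otherwise bridge the clusters of genes 2 and 3.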
According to the gene definition proposed here the transcript
encoding the product H should be assigned to a different gene (4),
even if it shares the same TSS with transcripts encoding ABC and AC,
given that H and ABC (or AC) are completely unrelated proteins, i.e.
encoded by non-overlapping genomic regions. This is in line with the
recent observation that different genes may share distal 5′UTRs,
possibly providing a specific expression pattern (Denoeud et al., 2007).
Furthermore, the existence of trans-splicing — where exons from two
separate transcripts are spliced together to form a mature mRNA
molecule — has been shown in some eukaryotes (Hastings, 2005).
In the genomic region drawn in Fig. 1, we are also able to identify
an additional gene (1) encoding an ncRNA giving rise to the mature
product X. This situation accounts for miRNA genes, often expressed as
polycistronic primary miRNAs and located in the introns of coding or
non-coding RNAs (Kim and Nam, 2006).
Fig. 2. (A) Seven alternative mRNAs expressed by the human CDKN2A gene as determined by the ASPIC program (Castrignano et al., 2006) (RefSeq IDs are shown on the right of known
isoforms). (B) Alternative proteins encoded by the transcript isoforms shown in (A).
Finally, AE can be identified as a fusion protein originating from the
co-transcription of two tandem genes (2 and 3, expressing non-overlapping mature transcripts) through the formation of a chimerical
transcript. This assignment rests on the observation that the prevalent expression forms of the
genes which provide exons to this product form two unrelated transcript
clusters, i.e. with the 3′ end of the transcripts of the first cluster lying
upstream of the 5′ end of the transcripts of the second cluster, and
encode unrelated and non-overlapping functional products.
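The criterion for recognizing a chimerical transcript such as AE could be checked along the following lines. This is a hedged sketch rather than an established method: it assumes that the two gene clusters have already been built from their prevalent expression forms, that each cluster is summarized by the span of its transcripts and by its product-encoding intervals, and that all coordinates are invented. The function name `is_tandem_chimera` is hypothetical.

```python
def overlaps(a, b):
    """Return True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def is_tandem_chimera(candidate, upstream, downstream):
    """Flag a candidate transcript that joins two tandem, unrelated genes.

    candidate  : list of product-encoding intervals of the candidate transcript
    upstream   : dict with 'span' (start, end) and 'coding' (list of intervals)
    downstream : same structure for the second gene cluster
    """
    # The 3' end of the first cluster must lie upstream of the 5' end of the second.
    tandem = upstream["span"][1] <= downstream["span"][0]
    # The candidate must borrow encoding sequence from both clusters.
    uses_upstream = any(overlaps(c, r) for c in candidate for r in upstream["coding"])
    uses_downstream = any(overlaps(c, r) for c in candidate for r in downstream["coding"])
    return tandem and uses_upstream and uses_downstream


# Invented coordinates for genes 2 and 3 of Fig. 1 (assumptions for illustration).
GENE_2 = {"span": (50, 650), "coding": [(100, 200), (300, 400), (500, 600)]}
GENE_3 = {"span": (950, 1750),
          "coding": [(1000, 1100), (1200, 1300), (1400, 1500), (1600, 1700)]}

AE = [(100, 200), (1400, 1500)]    # region A from gene 2, region E from gene 3
DE = [(1000, 1100), (1400, 1500)]  # ordinary isoform of gene 3

if __name__ == "__main__":
    print("AE chimerical:", is_tandem_chimera(AE, GENE_2, GENE_3))
    print("DE chimerical:", is_tandem_chimera(DE, GENE_2, GENE_3))
```

Deciding which isoforms are the "prevalent expression forms" requires expression evidence that this sketch does not model; it only encodes the positional part of the argument.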
Once the related mature products have been defined one can easily
go back to the relevant precursor transcripts, and determine the gene
coordinates on the genome as their leftmost and rightmost mapping
positions (Fig. 1). In this way a single gene locus is defined to encode a
set of “related” products and its genomic coordinates are established by
the precursor transcripts.
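Once a cluster is known, deriving the gene coordinates is a simple computation, sketched below. The transcript names and mapping positions are invented for illustration, and `gene_coordinates` is a hypothetical helper; the only rule taken from the text is that the gene spans the leftmost to the rightmost mapping position of its precursor transcripts, UTRs included.

```python
def gene_coordinates(transcript_spans):
    """Return (leftmost start, rightmost end) over the precursor transcripts of one gene.

    transcript_spans: non-empty iterable of (start, end) genomic mapping positions,
    each covering the whole precursor transcript, UTRs included.
    """
    starts, ends = zip(*transcript_spans)
    return min(starts), max(ends)


# Invented mapping positions for the precursor transcripts of gene 3 in Fig. 1.
GENE_3_TRANSCRIPTS = {
    "DE": (950, 1550),
    "FE": (1150, 1550),
    "FG": (1150, 1750),
}

if __name__ == "__main__":
    print(gene_coordinates(GENE_3_TRANSCRIPTS.values()))
```

Because the result is a single (start, end) pair, the gene is by construction one contiguous genomic region, even when its members (here DE and FG) share no encoding sequence directly.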
The gene definition proposed here is different from the one
proposed by Gerstein et al. (2007): “A gene is a union of genomic
sequences encoding a coherent set of potentially overlapping
functional products” in that in the current proposal: i) each gene is
assigned a contiguous genomic region; ii) gene coordinates include 5′
and 3′ mRNA untranslated (UTR) sequences included in the precursor
transcript. Therefore, according to the proposed definition a genomic
tract encoding a trans-spliced leader is not included in the genomic
region assigned to a given gene, as we assume that a gene is “a
contiguous genome region” and furthermore the trans-leader corresponds to an “untranslated” region of the transcript which does not
contribute to the final product.
The definition provided in the current paper is not only simpler
but also operationally more appropriate as it unambiguously defines
the genomic region to be considered in the analysis of alternative
splicing — usually carried out by aligning gene-related transcripts
(typically a Unigene cluster) to the relevant genomic region where
alternatively spliced 5′UTRs are frequently observed.
To deal with a real example, Fig. 2 shows the splicing pattern of the
gene CDKN2A, as determined by the ASPIC program (Castrignano et al.,
2006). It should be noted that the first and second transcripts
(CDKN2A.Ref and CDKN2A.Tr2 in Fig. 2A) encode two completely
different proteins, 116 and 173 aa long respectively (Fig. 2B) and the
corresponding coding sequences use different reading frames.
CDKN2A.Tr2, .Tr3 and .Tr4 encode the same product but differ in
their 3′UTR. CDKN2A.Tr5, .Tr6 and .Tr7 encode different partially
overlapping proteins of 105, 146 and 138 residues, respectively. Note
that products of CDKN2A.Ref and CDKN2A.Tr5 are indirectly related
through the product of CDKN2A.Tr6.
This example highlights a possible problem that may arise with the
proposed definition. Indeed, in most real gene predictions we know
neither the location of the coding sequence, if any, nor the function of
the encoded protein. In fact, in this case only CDKN2A.Ref, .Tr2 and
.Tr6 correspond to known transcripts included in the RefSeq collection
(Pruitt et al., 2007). A pragmatic solution to this problem is to annotate
the longest possible open reading frame as a functional product (even
in the absence of strong supporting data). In this way all inferred
transcripts, CDKN2A.Tr1–.Tr7, will be assigned to the same gene locus.
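The pragmatic rule of annotating the longest possible open reading frame can be illustrated with a minimal sketch. It rests on several simplifying assumptions that are not part of the proposal itself: only the three forward frames of the transcript are scanned, an ORF must start with ATG and end at an in-frame stop codon of the standard genetic code, and the toy sequence is invented. The function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def longest_orf(seq):
    """Return (start, end, orf_sequence) for the longest ATG-to-stop ORF, or None.

    Coordinates are 0-based, half-open, and the stop codon is included.
    Only the three forward reading frames are examined.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for a further ORF in the same frame
    if best is None:
        return None
    return best[0], best[1], seq[best[0]:best[1]]


if __name__ == "__main__":
    toy_transcript = "GGATGAAACCCTAGTTATGTAA"  # invented sequence
    print(longest_orf(toy_transcript))
```

For the toy sequence the sketch reports the 12-nt ORF ATGAAACCCTAG in preference to the shorter ATGTAA found in another frame. In a real annotation setting the longest ORF is only a working hypothesis, as the text stresses, and should be revised when supporting data become available.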
It is now quite clear that an unequivocal and universal gene
definition is not possible and therefore it has been proposed that the
operational units of a genome could be better represented by the
different expressed transcripts as they actually relate the genome
sequence to function and phenotype (Gingeras, 2007). However, the
gene concept, with suitable revision and update, still remains a key
issue in Molecular Biology, underlining the centrality of the relationship
between genotype and phenotype. An operational definition,
such as that proposed here, may be extremely useful for the
unambiguous classification of transcripts in discrete gene loci, such
as those provided by the Unigene database (Wheeler et al., 2007), and
may be more appropriate for computational analysis involving
alignment of genome and transcript sequences. By way of contrast,
the Gerstein et al. (2007) gene definition, which includes a
discontinuous genome region with the exclusion of UTRs, cannot be
used to delineate the genome region to be considered in bioinformatics analyses for the detection of novel splicing isoforms and of
splicing events located in non-coding portions of mRNAs.
The simple operational gene definition proposed here, while not
universal — it is specifically designed for chromosomal eukaryotic
genes (e.g. genes of RNA viruses do not fit this definition) — allows
unambiguous definition of gene coordinates and of gene-related
transcripts. It may have a wide range of applicability and help in the
provision of a comprehensive inventory of the genes of a given
organism, finally allowing answers to the basic question of “how many
genes” are encoded in its genome.
Acknowledgments
This work was supported by the “Italian Ministry of University and
Research” (Fondo Italiano Ricerca di Base: “Laboratorio Internazionale
di Bioinformatica”), Associazione Italiana Ricerca sul Cancro and
Telethon. I thank David Horner (University of Milano) for stimulating
discussions and critical reading of the manuscript.
References
Carninci, P., et al., 2005. The transcriptional landscape of the mammalian genome.
Science 309, 1559–1563.
Castrignano, T., et al., 2006. ASPIC: a web resource for alternative splicing prediction and
transcript isoforms characterization. Nucleic Acids Res. 34, W440–W443.
Denoeud, F., et al., 2007. Prominent use of distal 5′ transcription start sites and
discovery of a large number of additional exons in ENCODE regions. Genome Res. 17,
746–759.
Gerstein, M.B., et al., 2007. What is a gene, post-ENCODE? History and updated
definition. Genome Res. 17, 669–681.
Gingeras, T.R., 2007. Origin of phenotypes: genes and transcripts. Genome Res. 17,
682–690.
Hastings, K.E., 2005. SL trans-splicing: easy come or easy go? Trends Genet. 21, 240–247.
Kim, V.N., Nam, J.W., 2006. Genomics of microRNA. Trends Genet. 22, 165–173.
Lewin, B., 2007. Genes IX. Jones and Bartlett, Sudbury, Massachusetts.
Liolios, K., Tavernarakis, N., Hugenholtz, P., Kyrpides, N.C., 2006. The Genomes On Line
Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res.
34, D332–D334.
Parra, G., et al., 2006. Tandem chimerism as a means to increase protein complexity in
the human genome. Genome Res. 16, 37–44.
Pearson, H., 2006. Genetics: what is a gene? Nature 441, 398–401.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2007. NCBI reference sequences (RefSeq): a
curated non-redundant sequence database of genomes, transcripts and proteins.
Nucleic Acids Res. 35, D61–D65.
Wheeler, D.L., et al., 2007. Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res. 35, D5–D12.
Zhu, X., Ling, J., Zhang, L., Pi, W., Wu, M., Tuan, D., 2007. A facilitated tracking and
transcription mechanism of long-range enhancer function. Nucleic Acids Res. 35,
5532–5544.