Genome Analysis
Transcription Factor Families Have Much Higher
Expansion Rates in Plants than in Animals1
Shin-Han Shiu, Ming-Che Shih, and Wen-Hsiung Li*
Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637 (S.-H.S., W.-H.L.);
and Department of Biological Sciences and Roy J. Carver Center for Comparative Genomics, University of
Iowa, Iowa City, Iowa 52242 (M.-C.S.)
Transcription factors (TFs), which are central to the regulation of gene expression, are usually members of multigene families.
In plants, they are involved in diverse processes such as developmental control and elicitation of defense and stress responses.
To investigate if differences exist in the expansion patterns of TF gene families between plants and other eukaryotes, we first
used Arabidopsis (Arabidopsis thaliana) TFs to identify TF DNA-binding domains. These DNA-binding domains were then
used to identify related sequences in 25 other eukaryotic genomes. Interestingly, among 19 families that are shared between
animals and plants, more than 14 are larger in plants than in animals. After examining the lineage-specific expansion of TF
families in two plants, eight animals, and two fungi, we found that TF families shared among these organisms have undergone
much more dramatic expansion in plants than in other eukaryotes. Moreover, this elevated expansion rate of plant TFs is not
simply due to higher duplication rates of plant genomes but also to a higher degree of expansion compared to other plant
genes. Further, in many Arabidopsis-rice (Oryza sativa) TF orthologous groups, the degree of lineage-specific expansion in
Arabidopsis is correlated with that in rice. This pattern of parallel expansion is much more pronounced than the whole-genome trend in rice and Arabidopsis. The high rate of expansion among plant TF genes and their propensity for parallel
expansion suggest frequent adaptive responses to selection pressure common among higher plants.
Regulation of gene expression is central to a myriad
of biological processes at the molecular level and is to
a significant extent controlled by transcription factors
(TFs). Most TFs are modular proteins consisting of
a DNA-binding domain that interacts with cis-regulatory elements of its target genes and a protein-protein
interaction domain that facilitates oligomerization between TFs or other regulators (Wray et al., 2003).
Sequence divergence in the DNA-binding domains of
related TFs may lead to differences in affinities to a set
of cis-regulatory elements. Together with the propensity for TFs to homodimerize and/or heterodimerize,
the large TF repertoire in a eukaryote genome provides
a wide range of combinatorial relationships for transcriptional regulation. TFs usually form gene families
that vary considerably in size among organisms
(Riechmann et al., 2000; Wray et al., 2003). The reasons
behind such differences are not known, although it
is suggested that organismal complexity correlates
with an increase in the absolute number and the proportion of TFs in a proteome (Levine and Tjian, 2003).
In Arabidopsis (Arabidopsis thaliana), at least 1,500
genes are TFs, and 45% of these TFs belong to families
common to Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae (Riechmann et al.,
2000). Some of these TF families are much larger in
Arabidopsis, suggesting differential expansion. For example,
the Myb family has 190 members in Arabidopsis but only six in fly, three in worm, and 10 in yeast.
These differences lead to the suggestion that they may
be involved in plant-specific regulatory functions
(Riechmann et al., 2000). Interestingly, genes involved
in signal transduction and transcription have been
preferentially retained after the most recent whole-genome
duplication event in the Arabidopsis lineage
(Blanc and Wolfe, 2004; Seoighe and Gehring, 2004).
These findings suggest important roles of TF duplicates in plant evolution. However, it remains to be
determined if the TF family expansion in plants is in
general more dramatic than that in other organisms. In
addition, it is not known if the preferential retention of
TFs has occurred at a longer time scale, such as after
the divergence between rice (Oryza sativa) and Arabidopsis but before the last whole-genome duplication
in Arabidopsis.
1 This work was supported by a National Institutes of Health
(NIH) fellowship to S.-H.S. and NIH grants to W.-H.L.
* Corresponding author; e-mail [email protected]; fax 773–702–9740.
www.plantphysiol.org/cgi/doi/10.1104/pp.105.065110.
In this study, we evaluated whether plant TFs have
a higher rate of lineage-specific expansion than that
in other eukaryotes. We first identified TF families
among 26 eukaryotes based on known Arabidopsis
TFs. We then selected five genome pairs, including
plants, animals, and fungi, that diverged approximately 100 to 250 million years ago (MYA) to evaluate
the degrees of lineage-specific expansions of TF families. To see if TFs have expanded more than other
plant genes, we compared the GeneOntology (GO;
Harris et al., 2004) annotation of Arabidopsis genes
to see if GO categories enriched in TFs have a significantly higher than average number of duplicates per
category. Finally, we examined the TF expansion at
the orthologous group (OG) level to determine if
lineage-specific expansions in the Arabidopsis and
rice lineages are correlated with each other.
Plant Physiology, September 2005, Vol. 139, pp. 18–26, www.plantphysiol.org © 2005 American Society of Plant Biologists
Downloaded from on July 31, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Table I. Plant TF gene families and their defining protein domains
–, Family name not assigned, domain functions unknown, or domain reference only.
Defining Domain (a) | Family (b) | Domain Functions | OG (c) | GainA (c) | GainO (c) | % Parallel (d) | r (e) | P Value (e) | Reference(s)
AP2 | AP2-EREBP | DNA binding | 37 | 23 (63) | 28 (76) | 64.5 | 0.6884 | 1.87e-05 | Ohme-Takagi and Shinshi (1995); Weigel (1995)
ARID | ARID | DNA binding | 2 | 0 (0) | 1 (50) | 0.0 | n.d. | n.d. | Herrscher et al. (1995)
AT_hook | EMF1 | DNA binding | 8 | 4 (50) | 1 (13) | 0.0 | -1 | 2.20e-16 | Reeves and Nissen (1990)
B3 | ABI3VP1, ARF | DNA binding | 15 | 5 (34) | 8 (54) | 62.5 | -0.1324 | 0.755 | Suzuki et al. (1997); Ulmasov et al. (1997)
bZIP | bZIP | DNA binding and protein-protein interaction | 27 | 15 (56) | 13 (49) | 33.3 | -0.2346 | 0.306 | Landschulz et al. (1988)
CBF | CCAAT-HAP2 | DNA binding and protein-protein interaction | 5 | 4 (80) | 4 (80) | 100.0 | 0.9272 | 0.073 | Edskes et al. (1998)
CBFD_NFYB_HMF | CCAAT-DR1, CCAAT-HAP3, CCAAT-HAP5 | DNA binding and protein-protein interaction | 9 | 4 (45) | 3 (34) | 40.0 | 0.0857 | 0.891 | Li et al. (1992)
CG-1 | AtSR | DNA binding | 2 | 0 (0) | 0 (0) | n.d. | n.d. | n.d. | da Costa e Silva (1994)
CXC | CPP | DNA binding | 2 | 2 (100) | 1 (50) | 50.0 | n.d. | n.d. | Hobert et al. (1996)
DUF573 | GeBP | Unknown | 2 | 2 (100) | 2 (100) | 100.0 | n.d. | n.d. | –
E2F_TDP | E2F-DP | DNA binding | 3 | 0 (0) | 3 (100) | 0.0 | n.a. | n.a. | Zheng et al. (1999)
EIN3 | EIL | Unknown | 2 | 1 (50) | 2 (100) | 50.0 | n.d. | n.d. | Solano et al. (1998)
FLO_LFY | LFY | Unknown | 1 | 0 (0) | 0 (0) | n.d. | n.d. | n.d. | Weigel et al. (1992)
GATA | C2C2-Gata | DNA binding | 6 | 6 (100) | 6 (100) | 100.0 | 0.9289 | 0.007 | Omichinski et al. (1993)
GRAS | GRAS | Unknown | 16 | 3 (19) | 5 (32) | 14.3 | -0.6860 | 0.132 | Pysh et al. (1999)
HLH | bHLH | DNA binding | 52 | 31 (60) | 26 (50) | 32.6 | 0.1739 | 0.265 | Littlewood and Evan (1995)
Homeobox | HB | DNA binding | 28 | 10 (36) | 12 (43) | 46.7 | -0.0108 | 0.970 | Scott et al. (1989)
HSF_DNA-bind | HSF | DNA binding | 8 | 1 (13) | 4 (50) | 25.0 | -0.4937 | 0.506 | Fujita et al. (1989)
Myb_DNA-binding | MYB, G2-like, MYB-related, Trihelix | DNA binding | 90 | 47 (53) | 46 (52) | 47.6 | 0.1835 | 0.150 | Klempnauer and Sippel (1987)
NAM | NAC | DNA binding, protein-protein interaction | 26 | 9 (35) | 16 (62) | 38.9 | 0.6043 | 0.008 | Ernst et al. (2004)
PHD | Alfin-like | Unknown | 30 | 12 (40) | 7 (24) | 35.7 | -0.1399 | 0.633 | Aasland et al. (1995)
RWP-RK | NIN | Unknown | 5 | 3 (60) | 2 (40) | 66.7 | -0.5000 | 0.667 | Schauser et al. (1999)
SBP | SBP | DNA binding | 5 | 2 (40) | 3 (60) | 66.7 | 0.9878 | 0.099 | Klein et al. (1996)
SRF-TF | MADS | DNA binding | 15 | 9 (60) | 10 (67) | 46.2 | 0.6352 | 0.020 | Pellegrini et al. (1995)
TCP | TCP | Unknown | 8 | 5 (63) | 6 (75) | 57.1 | 0.7201 | 0.068 | Cubas et al. (1999)
Tub | TUB | DNA binding | 5 | 0 (0) | 3 (60) | 0.0 | n.a. | n.a. | Kleyn et al. (1996)
WRKY | WRKY | DNA binding | 21 | 10 (48) | 10 (48) | 33.3 | 0.8513 | 5.690e-05 | Eulgem et al. (2000)
YABBY | C2C2-YABBY | DNA binding | 3 | 1 (34) | 2 (67) | 50.0 | n.d. | n.d. | Bowman (2000)
zf-B_box | C2C2-CO-like, STO | Unknown | 10 | 8 (80) | 8 (80) | 60.0 | 0.0696 | 0.849 | Borden (1998)
zf-C2H2 | C2H2 | Nucleic acid binding | 40 | 17 (43) | 15 (38) | 39.1 | 0.8706 | 6.600e-08 | Klug and Rhodes (1987)
zf-C3HC4 | C3H | Protein-protein interaction | 94 | 36 (39) | 35 (38) | 39.2 | 0.8706 | 6.600e-08 | Borden and Freemont (1996)
zf-Dof | C2C2-Dof | DNA binding | 5 | 2 (40) | 4 (80) | 50.0 | 0.9683 | 0.032 | Shimofurutani et al. (1998)
n.a. | GRF | – | 2 | 1 (50) | 1 (50) | 50.0 | n.d. | n.d. | van der Knaap et al. (2000)
All TF families | – | – | 582 | 272 (47) | 286 (49) | 44.4 | 0.4623 | 2.200e-16 | –
All families | – | – | 9,553 | 2,387 (25) | 2,549 (27) | 31.5 | 0.0699 | 1.800e-05 | –
(a) For each family, only sequences with the specified Pfam domain were analyzed. n.a., Pfam domain designation not available, and the family was
defined by similarity criteria outlined in "Materials and Methods."
(b) TF families are designated according to Riechmann et al. (2000), AGRIS (http://arabidopsis.med.ohio-state.edu/), and Jen Sheen's
transcription regulator site (http://genetics.mgh.harvard.edu/sheenweb/AraTRs.html).
(c) GainA, Number of OGs with expansion in the Arabidopsis lineage. GainO, Number of OGs with expansion in the rice lineage.
The percentages of OGs with expansion in Arabidopsis or rice are shown in parentheses.
(d) Percentage of OGs with lineage-specific gains in both Arabidopsis and rice for each domain family. n.d., Not determined because there is
no OG with gains in the domain family.
(e) r, Pearson's correlation coefficient of Arabidopsis versus rice gains for each family. OGs with no Arabidopsis or rice gain were excluded.
The P values indicate the significance level of its associated correlation. n.d., Not determined because the sample size is too small.
n.a., Not available because this family has no gain in any OG in Arabidopsis or rice.
RESULTS
TF Family Sizes and Organismal Complexity
For the identification of TFs in 26 eukaryote genomes, we first consolidated the Arabidopsis TF family annotations from three databases (see ‘‘Materials
and Methods’’). The DNA-binding domain sequences
of these Arabidopsis TFs (Fig. 1; Table I) were then
used to recover related protein sequences in other
genomes. Figure 1 shows the number of genes in each
TF family among the eukaryote genomes analyzed.
There are more plant TF families because some of the
plant TFs may contain domains that are (1) too divergent from homologous sequences in other genomes or
(2) plant specific. Due to this methodological bias, only
relevant shared families are analyzed in all subsequent
cross-species comparisons.
The numbers of members in the TF families shared
among plants, animals, and fungi roughly correlate
with organismal complexity. The TF families in animals and land plants are larger than those in fungi.
The multicellular land plants have a much larger TF
repertoire compared to the unicellular alga Chlamydomonas
reinhardtii. In addition, human, chicken, and
Takifugu rubripes in general have larger TF families
than those of other animals with simpler body plans.
However, the TF families in multicellular fungi are
only slightly larger than those in unicellular fungi,
which may be explained by the limited levels of
tissue/organ differentiation in some of these multicellular fungi. Among the 19 families shared between
plants and animals, most families are larger in plants
than in animals. Between Arabidopsis and human, 14
shared families are larger in Arabidopsis, only four are
larger in human, and one family is of equal size. This
finding suggests that the TF duplication and/or retention rate is higher in plants than in animals.
Higher Plant TFs Have Undergone Dramatic
Lineage-Specific Expansions
Figure 1. The prevalence of different TF families in eukaryotes. The tree on the left indicates the phylogenetic relationships
among the eukaryote genomes analyzed. The top row contains the names of the representative TF domains we analyzed. The
number indicates the count of each representative protein domain in each genome. The metazoan and the plant sections are
enclosed in solid and dotted lines, respectively. The number of TFs in domain families shared between metazoans and higher
plants are in black boxes if they are the highest count in each column.
Comparisons of family sizes between genomes are
rather rudimentary measures of expansion because
family sizes tell us little about the timing and degree
of expansion. To determine if plant TFs have undergone more dramatic expansions than their animal
counterparts, we examined TFs in five species pairs
(Arabidopsis-rice, human-chicken, fly-mosquito, C.
elegans-Caenorhabditis briggsae, and Magnaporthe-Neurospora)
that diverged approximately 100 to 250 MYA
and evaluated the lineage-specific gains in OGs. Expansion has occurred if any lineage-specific clade in an
OG has more than one gene. For example, the GATA
family in Arabidopsis and rice contains 10 OGs (Fig.
2A; Table I), and nine of them have expansion in at
least one plant lineage and seven in both.
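The grouping step behind these OGs (the Cross-Species Best Match criterion described in "Materials and Methods") can be sketched as follows. This is only an illustrative reading of that description, not the authors' implementation: the function names, the symmetric score lookup, and the toy scores are all hypothetical, and unlisted gene pairs are assumed to score 0.

```python
def pair_score(scores, p, q):
    """Look up a symmetric similarity score; unlisted pairs score 0 (assumption)."""
    return scores.get(frozenset((p, q)), 0.0)


def best_match_group(x, genes_a, genes_b, scores):
    """Assemble the putative OG seeded by gene x of species A."""
    # Y: the highest-scoring hit of X in species B.
    y = max(genes_b, key=lambda b: pair_score(scores, x, b))
    # X set: genes in A that score higher to X than to Y.
    x_set = {a for a in genes_a
             if a != x and pair_score(scores, a, x) > pair_score(scores, a, y)}
    # Y set: genes in B that score higher to Y than to X.
    y_set = {b for b in genes_b
             if b != y and pair_score(scores, b, y) > pair_score(scores, b, x)}
    return {x} | x_set, {y} | y_set


# Hypothetical toy similarity scores (not taken from the paper).
toy_scores = {
    frozenset(("a1", "a2")): 150.0,
    frozenset(("a1", "b1")): 100.0,
    frozenset(("a2", "b1")): 90.0,
    frozenset(("a3", "b2")): 120.0,
    frozenset(("a1", "b2")): 25.0,
    frozenset(("a3", "b1")): 30.0,
}
species_a = ["a1", "a2", "a3"]
species_b = ["b1", "b2"]

og_a, og_b = best_match_group("a1", species_a, species_b, toy_scores)
# A lineage-specific clade with more than one gene counts as an expansion.
expanded_in_a = len(og_a) > 1
```

In this toy case the OG seeded by a1 holds two genes from species A and one from species B, so it would count as expanded in lineage A only.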
The degree of expansion was evaluated in 14 TF
families shared among plants, animals, and fungi. We
separated the OGs into four different classes: no
expansion (1:1), expansion in one lineage only (x:1 or
1:y; x, y > 1), or parallel expansion in two lineages (x:y;
Fig. 2B). In the animal and fungal genome pairs
examined, less than 10% of the OGs have undergone
lineage-specific expansion. In contrast, 68% of the OGs
between Arabidopsis and rice have expanded in at
least one lineage. The OGs for other species pairs are
mostly 1:1. The Arabidopsis-rice divergence was approximately 150 MYA (Chaw et al., 2004). The C.
briggsae/C. elegans divergence is estimated to have
occurred 80 to 110 MYA (Stein et al., 2003). The chicken
and human lineages diverged approximately 310 MYA
(Reisz and Modesto, 1996). The D. melanogaster and
Anopheles gambiae lineages diverged approximately
247 to 283 MYA (Gaunt and Miles, 2002). The Magnaporthe grisea-Neurospora crassa divergence is reported to
have occurred approximately 200 MYA (Hamer et al.,
2001). Two points can be made from our findings and
the divergence dates of the eukaryotes analyzed. First,
except in plants, TF expansion is relatively rare regardless of divergence time. Second, the plant TF
family expansion is not simply a consequence of
longer divergence time. A large number of lineage-specific expansions also occur, for example, in D.
melanogaster and A. gambiae (Zdobnov et al., 2002), in
nematodes (Stein et al., 2003), and in mammals (S.-H.
Shiu and W.-H. Li, unpublished data). Our findings
indicate TF families have expanded at a much higher
rate in plants than in other organisms.
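The four OG classes used above are simple to tabulate once each OG is reduced to a pair of gene counts. The sketch below is a minimal illustration with hypothetical OG sizes. It assumes every OG has at least one gene per lineage, and it takes the parallel percentage relative to OGs with a gain in at least one lineage, which is one reading of the "% Parallel" definition in Table I.

```python
from collections import Counter


def classify_og(n_a, n_b):
    """Classify an OG by its gene counts in lineage A and lineage B."""
    if n_a < 1 or n_b < 1:
        raise ValueError("each lineage must contribute at least one gene")
    if n_a == 1 and n_b == 1:
        return "1:1"   # no expansion
    if n_b == 1:
        return "x:1"   # expansion in lineage A only
    if n_a == 1:
        return "1:y"   # expansion in lineage B only
    return "x:y"       # parallel expansion


def summarize(og_sizes):
    """Return class counts, % of OGs expanded, and % parallel among expanded OGs."""
    counts = Counter(classify_og(a, b) for a, b in og_sizes)
    total = sum(counts.values())
    expanded = total - counts["1:1"]
    pct_expanded = 100.0 * expanded / total
    pct_parallel = 100.0 * counts["x:y"] / expanded if expanded else None
    return counts, pct_expanded, pct_parallel


# Hypothetical OG sizes (lineage A count, lineage B count).
toy_sizes = [(1, 1), (2, 1), (1, 3), (2, 2), (4, 2)]
counts, pct_expanded, pct_parallel = summarize(toy_sizes)
```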
Figure 2. Determination of OGs and the degrees of lineage-specific
expansion in different genome pairs. A, The similarity cluster of the
GATA family members from Arabidopsis and rice. The tree is subdivided into OGs. Within OGs, the Arabidopsis and rice members are
in orange and blue, respectively. The gene names are in black if they are
not classified into OGs. The OG sizes are shown on the far right
representing the numbers of genes in the Arabidopsis and rice clades. B,
The OGs are classified into four types based on their OG sizes (with x >
1 and y > 1). The percentage of total OGs in each type is determined for
five species pairs. The species abbreviations are taken from the species
names shown in Figure 1.
Higher Duplicability of TFs than Other Genes in Plants
Since whole-genome duplications occur at a higher
frequency in plants than in animals and fungi, the TF
family expansion may simply be the consequence of
a higher gene duplication rate in plants. Alternatively,
the expansion of TF families may be due to elevated
rates of retention, i.e. higher duplicability. To determine if TFs have higher duplicability than other genes,
we examined the degrees of expansion of GO categories of Arabidopsis. We classified 7,298 OGs between
Arabidopsis and rice into two classes: unexpanded
(1:1) and expanded (x:1 and x:y) in the Arabidopsis
lineage after the Arabidopsis-rice split. For each GO
category, we compared the numbers of genes in expanded and unexpanded OGs against the average
numbers of the whole dataset. The four functional
categories related to transcriptional regulation all have
higher proportions of genes derived from lineage-specific
expansion than most other categories (Fig. 3, A
and B). Nearly all the genes in these four categories are
TF family members.
We then determined the expected numbers of genes
in expanded and unexpanded OGs in these categories based on the whole data set (see ‘‘Materials and
Methods’’). These two numbers are compared to the
observed numbers with chi-squared tests (Table II). In
the four TF-related categories, the proportions of genes
in expanded OGs are significantly higher than the
average of all annotated genes. These findings indicate
that TF families in general have higher duplicability
than genes involved in most other functions in Arabidopsis. Interestingly, three of the same four categories
in human and mouse have significantly lower than
average duplicability. The only TF-related category
with higher than average duplicability in these two
mammals is DNA-dependent regulation of transcription, contributed only by the zinc-finger C2H2 family
that has undergone lineage-specific expansions in both
human and mouse. The fact that most TF-related
categories have low duplicability in human and mouse
is consistent with our conclusion that most TF families
have expanded at much higher rates in plants than in
other organisms. In addition, plant TFs are retained at
higher rates compared to most other plant genes.
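The comparison of observed and expected counts can be illustrated with the first row of Table II (GO:0016563: 19 genes in expanded OGs, 4 in unexpanded OGs). The sketch below makes two assumptions: expected counts are the category total multiplied by the whole-dataset fraction of genes in expanded OGs (the 53.0% listed in Table II), and the test is a one-degree-of-freedom chi-squared comparison. Under those assumptions the expected counts come out close to the 12.2 and 10.8 listed in Table II, and the P value lands near the reported 0.005. The authors' exact procedure may differ, and the function names are hypothetical.

```python
import math


def expected_counts(obs_expanded, obs_unexpanded, frac_expanded_all):
    """Expected genes in expanded/unexpanded OGs if the category matched the whole dataset."""
    total = obs_expanded + obs_unexpanded
    return total * frac_expanded_all, total * (1.0 - frac_expanded_all)


def chi2_p_df1(observed, expected):
    """Chi-squared statistic and its P value for one degree of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For df = 1, the chi-squared survival function equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Table II, GO:0016563 (transcriptional activator activity).
observed = (19, 4)
expected = expected_counts(19, 4, 0.53)   # 0.53 is the Exp. % in EOG from Table II
stat, p_value = chi2_p_df1(observed, expected)
```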
Figure 3. The overrepresentation of transcription regulation categories.
For each GO category, the total percentage of genes that are in
expanded OGs is determined. The total percentages of value distributions are shown for molecular function (A) and biological process
categories (B). Note that each transcription regulation-related category
has higher than average percentage of genes in expanded OGs.
Pronounced Parallel Expansions of TF Families
in Arabidopsis and Rice
We showed above that 69% of OGs in plant TF
families have undergone expansion (Fig. 2B). Among
these expanded TF OGs, 98 and 115 expanded in only
the Arabidopsis and in only the rice lineage, respectively. The rest of the TF OGs have expanded in a
parallel fashion. While lineage-specifically expanded
TFs may be responsible for lineage-specific adaptation,
the parallel expansion suggests common selection
pressure contributing to the retention of certain TFs
in both lineages. Of all the OGs between Arabidopsis
and rice, 39% (3,672 out of 9,345) show various degrees
of parallel expansion. However, the expansion of OGs
in general does not occur in parallel as indicated by the
rather poor linear fit (r² = 0.07; Fig. 4A). In contrast, the
OGs of TF families have a much better linear fit (r² =
0.46; Fig. 4B). Although the degrees of parallel expansion vary greatly among TF families (Table I), our
findings indicate that, if a particular ancestral TF is
duplicated and retained in the Arabidopsis lineage,
the corresponding gene in the rice lineage will tend to
be retained. In addition, this parallel expansion is
more prominent in TFs than in most other plant genes.
Table II. Overrepresentation of Arabidopsis GO categories related to transcriptional regulation
Categories | Description | Oi,E (a) | Oi,U (a) | Obs. % in EOG (b) | Ei,E (a) | Ei,U (a) | Exp. % in EOG (b) | P Value (c)
Molecular function
GO:0016563 | Transcriptional activator activity | 19 | 4 | 82.6 | 12.2 | 10.8 | 53.0 | 0.005
GO:0003700 | TF activity | 565 | 242 | 70.0 | 428.7 | 378.3 | 53.0 | 6.890e-22
Biological process
GO:0045449 | Regulation of transcription | 170 | 73 | 70.0 | 126.6 | 116.4 | 52.0 | 2.560e-08
GO:0006355 | Regulation of transcription, DNA dependent | 398 | 188 | 67.9 | 305.4 | 280.6 | 52.0 | 1.870e-14
(a) For each category i, Oi,E is the observed number of genes in expanded OGs, Oi,U is the observed number of genes in unexpanded OGs, Ei,E is the
expected number of genes in expanded OGs, and Ei,U is the expected number of genes in unexpanded OGs.
(b) Observed (Obs.) or expected (Exp.) percentage of genes in expanded OGs (EOG).
(c) For each category, the observed numbers of genes were compared to the expected in a 2 × 2
contingency table with chi-squared test. The significance (P value) of the chi-squared statistics is shown.
Whole-genome duplications have occurred in both
lineages after the Arabidopsis-rice split (Blanc et al.,
2000; Yu et al., 2002). In addition, we showed in the
previous section that TFs have a higher retention rate
compared to other plant genes. Therefore, the higher
degree of parallel expansion among TF OGs may
simply be the consequence of independent gains of
large numbers of TFs. To determine if this is the case,
we randomly shuffled the number of gains in the
Arabidopsis and the rice lineages independently, and
the correlation coefficient of the shuffled dataset was
calculated. This random shuffling was repeated 10,000
times, and none of the r² values in the randomly
shuffled dataset was larger than 0.037, substantially
lower than the r² value of Arabidopsis-rice TF gains
(Fig. 4B). This finding indicates that the parallel
retention of TFs in plants is not simply due to independent gain but likely a property of the ancestral TF.
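The shuffling test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: gains are given as two equal-length lists (one value per OG), r² is the squared Pearson correlation, and the function names, seed handling, and toy data are hypothetical rather than taken from the paper.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation; returns 0.0 when either list has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def max_shuffled_r2(gains_a, gains_b, n_iter=10000, seed=0):
    """Largest r^2 seen when both gain lists are shuffled independently."""
    rng = random.Random(seed)
    a, b = list(gains_a), list(gains_b)
    largest = 0.0
    for _ in range(n_iter):
        rng.shuffle(a)
        rng.shuffle(b)
        largest = max(largest, r_squared(a, b))
    return largest


# Hypothetical per-OG gains in two lineages (perfectly correlated toy data).
toy_a = [1, 2, 3, 4, 5, 6, 7, 8]
toy_b = [1, 2, 3, 4, 5, 6, 7, 8]
observed_r2 = r_squared(toy_a, toy_b)
null_max = max_shuffled_r2(toy_a, toy_b, n_iter=200, seed=1)
```

Comparing the observed r² with the largest shuffled value mirrors the comparison in the text (0.46 observed against a shuffled maximum of 0.037); with a dataset as small as this toy example the shuffled maximum can be large, so the contrast is only meaningful for realistic numbers of OGs.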
DISCUSSION
It is commonly believed that changes in cis-regulatory
systems more often underlie the evolution of morphological diversity than do changes in gene number or
protein function (Doebley and Lukens, 1998; Carroll,
2000). While this may be true, plant TF families tend to
expand at much higher rates compared to their animal
counterparts. It is known that polyploidization occurs
frequently in plants (for review, see Wendel, 2000). At
first look, our finding may simply be the consequence
of a higher gene duplication rate in plants. However,
the sizes of the TF families are in general similar
between human and T. rubripes. Since one round of
whole-genome duplication likely occurred in the teleost lineage (Taylor et al., 2001), the similarity in TF
family sizes between human and T. rubripes indicates
that most of the duplicated TFs are not retained. In
addition, similar to the findings of prior studies (Blanc
and Wolfe, 2004; Seoighe and Gehring, 2004), we
showed that the functional categories containing predominantly TFs have significantly higher rates of expansion compared to most other categories in
Arabidopsis. Judging from the gene family and OG
sizes of rice TFs, this is most likely to be true in rice as
well. Our findings indicate TF duplication likely contributes to regulatory novelties in development and/or
responses to external stimuli much more significantly in plants than in animals. However, other
nonadaptive scenarios may be involved as well, as
discussed below.
We showed that TF OGs have a significantly higher
degree of parallel expansion. It should be noted that
genes with higher duplicability do not necessarily expand in parallel. For example, the receptor-like kinase
family has high duplicability, but most of the OGs in
this gene family have not expanded in parallel (Shiu
et al., 2004). There are several possible explanations.
First, parallel expansion may indicate the presence of
common selection pressure (Hughes and Friedman,
2003). One possible common selection pressure faced
by plants is environmental stresses, biotic or abiotic. It
is conceivable that elaborate systems were selected
for perceiving environmental stresses and for adjusting plant growth and development accordingly. The
expansion of the disease resistance gene family is consistent with the first role, whereas the latter may be fulfilled by TF duplicates. Second, the parallel expansion
in TFs may be due to the requirement for dosage
balance (Papp et al., 2003; Yang et al., 2003). The
dosage balance hypothesis asserts that duplications
of all genes involved in any protein complex, as one
would expect from whole-genome duplication, would
be more tolerable than single gene duplication. Whole-genome
duplications have occurred in both the Arabidopsis and rice lineages (Blanc et al., 2000; Vision
et al., 2000; Yu et al., 2002). The higher duplicability of
plant TFs may be due to stronger deleterious effects of
TF losses than losses of most other plant genes after
whole-genome duplication. If this is true, TFs are more
likely to form complexes than most other plant genes.
Finally, parallel expansion may be due to the ease
of subfunctionalization among TF duplicates. Subfunctionalization is the process by which duplicates
lose different subfunctions of their common ancestor,
resulting in the indispensability of both copies (Force
et al., 1999). If ease of subfunctionalization explains
the higher duplicability of TFs, TFs will have on
average more functional modules than other plant
genes.
Figure 4. Pronounced parallel expansion of TF families between rice
and Arabidopsis. The number of Arabidopsis genes is plotted against
the number of rice genes in the same OGs for all gene families (A) and
TF families only (B). The equations for the linear fits and the r² values
are shown.
These three explanations are not necessarily mutually exclusive, as dosage effect and subfunctionalization may result in the initial retention of duplicates
followed by functional divergence. To elucidate their
relative importance, it will be of great interest to
examine the dosage effect of TF duplicates and the
expression patterns and functional differences of duplicates with outgroup species that do not have duplication. Since TF families have various degrees
of expansion (Table I), between-family comparison
should provide insights into their differential expansions. Regardless of the mechanisms of retention,
we found the degree of TF family expansion in plants is
substantially higher than that in other eukaryotes or
other plant genes. Given the importance of plant TFs in
plant development and responses to environmental
factors, we argue that the larger repertoire of recently
acquired TF duplicates in plants plays a more significant role in developmental or other regulatory
novelties than their animal counterparts. Several gene
families and functional categories have similar or
even greater rates of expansion compared to TF families, e.g. the kinase family, the proteolysis category,
and the defense response category. The relative importance of different mechanisms in retaining genes with
diverse functions remains an intriguing question.
MATERIALS AND METHODS
Identification of TFs
A list of Arabidopsis (Arabidopsis thaliana) TFs was compiled based on two
resources: the Arabidopsis Gene Regulatory Information Server (AGRIS;
http://arabidopsis.med.ohio-state.edu/AtTFDB/index.jsp; Davuluri et al.,
2003) and the Arabidopsis Transcription Regulators homepage from the Jen
Sheen laboratory (http://genetics.mgh.harvard.edu/sheenweb/AraTRs.html).
The protein domains in these putative TFs were identified by searching
against the SMART (Schultz et al., 2000) and Pfam (Sonnhammer et al., 1998)
databases. We include only those with at least one DNA-binding domain that
has been demonstrated to directly modulate gene expression. With this
criterion, a total of 32 DNA-binding domains are present in the Arabidopsis
TF set. To identify these TF-associated DNA-binding domains from other
eukaryotes, the hidden Markov models were retrieved from Pfam to search
against the protein sequences from 26 eukaryote genomes (Fig. 1). The
Arabidopsis genome was analyzed in the same way to account for potential
changes in annotation.
Lineage-Specific Expansion of TFs
Lineage-specific expansions are gene-gain events that occur specifically in
a lineage. The lineage-specific expansion is determined by lineage-specific
gains in putative OGs. The OGs were defined by the Cross-Species Best Match
criterion detailed below. A distance matrix of each TF domain family of each
organism was constructed by determining the pairwise scores in an all-against-all
member BLAST search. For each TF (X) in species A, we first
identified the highest scoring hit (Y) in species B. Then within-species searches
were conducted to identify all TFs in A that have a higher score to X than to Y
(referred to as the X set). Similarly, all TFs in B that have a higher score to Y than
to X were identified as the Y set. In this example, X, the X set, and Y, the Y set,
belong to the same OG.
Delineation of Arabidopsis Gene Families
The sequences of Arabidopsis proteins were used in an all-against-all
BLAST search. The expected (E) values were transformed by taking the
absolute values of their logarithm. A score matrix constructed with these
transformed E values was used for similarity clustering with Markov
Clustering (http://micans.org/mcl/; Van Dongen, 2000). The clusters generated were regarded as gene families. OGs in each family were identified as
described in the previous section.
GO Categories and Identification of
Overrepresented Categories
The GO annotations of Arabidopsis genes were obtained from The
Arabidopsis Information Resource (ftp://ftp.arabidopsis.org/home/tair/
Genes/Gene_Ontology/). Only GO categories with at least 10 genes were
analyzed to provide sufficient data points for statistical analyses. Using rice
(Oryza sativa) genes as references, we determined the numbers of genes
residing in OGs with or without expansion in the Arabidopsis lineage for each
category X (G_{X,Obs,E} and G_{X,Obs,U}, respectively). We also determined the
numbers of genes in expanded and unexpanded OGs for all categories (G_{All,E}
and G_{All,U}, respectively). For each category X, the expected numbers of genes in
expanded or unexpanded OGs (G_{X,Exp,E} and G_{X,Exp,U}, respectively) are generated by the following.
$$G_{X,\mathrm{Exp},E} = \frac{G_{X,\mathrm{Obs},E}}{G_{X,\mathrm{Obs},E} + G_{X,\mathrm{Obs},U}} \times G_{\mathrm{All},E}$$
$$G_{X,\mathrm{Exp},U} = \frac{G_{X,\mathrm{Obs},U}}{G_{X,\mathrm{Obs},E} + G_{X,\mathrm{Obs},U}} \times G_{\mathrm{All},U}$$
The expected values were then compared to the observed numbers of genes in
expanded and unexpanded OGs with a chi-squared test to determine if the
observed values were significantly different from the expected values.
ACKNOWLEDGMENTS
We thank Donna E. Fernandez, Melissa D. Lehti-Shiu, Geoffrey Morris,
and Arnar Palsson for comments and for discussion.
Received May 4, 2005; revised June 22, 2005; accepted July 11, 2005; published
September 12, 2005.
LITERATURE CITED
Aasland R, Gibson TJ, Stewart AF (1995) The PHD finger: implications for
chromatin-mediated transcriptional regulation. Trends Biochem Sci 20:
56–59
Blanc G, Barakat A, Guyot R, Cooke R, Delseny M (2000) Extensive
duplication and reshuffling in the Arabidopsis genome. Plant Cell 12:
1093–1101
Blanc G, Wolfe KH (2004) Functional divergence of duplicated genes
formed by polyploidy during Arabidopsis evolution. Plant Cell 16:
1679–1691
Borden KL (1998) RING fingers and B-boxes: zinc-binding protein-protein
interaction domains. Biochem Cell Biol 76: 351–358
Borden KL, Freemont PS (1996) The RING finger domain: a recent example
of a sequence-structure family. Curr Opin Struct Biol 6: 395–401
Bowman JL (2000) The YABBY gene family and abaxial cell fate. Curr Opin
Plant Biol 3: 17–22
Carroll SB (2000) Endless forms: the evolution of gene regulation and
morphological diversity. Cell 101: 577–580
Chaw SM, Chang CC, Chen HL, Li WH (2004) Dating the monocot-dicot
divergence and the origin of core eudicots using whole chloroplast
genomes. J Mol Evol 58: 424–441
Cubas P, Lauter N, Doebley J, Coen E (1999) The TCP domain: a motif
found in proteins regulating plant growth and development. Plant J 18:
215–222
da Costa e Silva O (1994) CG-1, a parsley light-induced DNA-binding
protein. Plant Mol Biol 25: 921–924
Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M,
Grotewold E (2003) AGRIS: Arabidopsis gene regulatory information
server, an information resource of Arabidopsis cis-regulatory elements
and transcription factors. BMC Bioinformatics 4: 25
Doebley J, Lukens L (1998) Transcriptional regulators and the evolution of
plant form. Plant Cell 10: 1075–1082
Edskes HK, Ohtake Y, Wickner RB (1998) Mak21p of Saccharomyces
cerevisiae, a homolog of human CAATT-binding protein, is essential for
60 S ribosomal subunit biogenesis. J Biol Chem 273: 28912–28920
Ernst HA, Nina Olsen A, Skriver K, Larsen S, Lo Leggio L (2004) Structure
of the conserved domain of ANAC, a member of the NAC family of
transcription factors. EMBO Rep 5: 297–303
Eulgem T, Rushton PJ, Robatzek S, Somssich IE (2000) The WRKY
superfamily of plant transcription factors. Trends Plant Sci 5: 199–206
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J (1999)
Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151: 1531–1545
Fujita A, Kikuchi Y, Kuhara S, Misumi Y, Matsumoto S, Kobayashi H
(1989) Domains of the SFL1 protein of yeasts are homologous to Myc
oncoproteins or yeast heat-shock transcription factor. Gene 85: 321–328
Gaunt MW, Miles MA (2002) An insect molecular clock dates the origin of
the insects and accords with palaeontological and biogeographic landmarks. Mol Biol Evol 19: 748–761
Hamer L, Pan H, Adachi K, Orbach MJ, Page A, Ramamurthy L, Woessner
JP (2001) Regions of microsynteny in Magnaporthe grisea and Neurospora crassa. Fungal Genet Biol 33: 137–143
Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck
K, Lewis S, Marshall B, Mungall C, et al (2004) The Gene Ontology
(GO) database and informatics resource. Nucleic Acids Res (Database
issue) 32: D258–D261
Herrscher RF, Kaplan MH, Lelsz DL, Das C, Scheuermann R, Tucker PW
(1995) The immunoglobulin heavy-chain matrix-associating regions are
bound by Bright: a B cell-specific trans-activator that describes a new
DNA-binding protein family. Genes Dev 9: 3067–3082
Hobert O, Jallal B, Ullrich A (1996) Interaction of Vav with ENX-1,
a putative transcriptional regulator of homeobox gene expression. Mol
Cell Biol 16: 3066–3073
Hughes AL, Friedman R (2003) Parallel evolution by gene duplication in
the genomes of two unicellular fungi. Genome Res 13: 794–799
Klein J, Saedler H, Huijser P (1996) A new family of DNA binding
proteins includes putative transcriptional regulators of the Antirrhinum
majus floral meristem identity gene SQUAMOSA. Mol Gen Genet 250:
7–16
Klempnauer KH, Sippel AE (1987) The highly conserved amino-terminal
region of the protein encoded by the v-myb oncogene functions as
a DNA-binding domain. EMBO J 6: 2719–2725
Kleyn PW, Fan W, Kovats SG, Lee JJ, Pulido JC, Wu Y, Berkemeier LR,
Misumi DJ, Holmgren L, Charlat O, et al (1996) Identification and
characterization of the mouse obesity gene tubby: a member of a novel
gene family. Cell 85: 281–290
Klug A, Rhodes D (1987) Zinc fingers: a novel protein fold for nucleic acid
recognition. Cold Spring Harb Symp Quant Biol 52: 473–482
Landschulz WH, Johnson PF, McKnight SL (1988) The leucine zipper:
a hypothetical structure common to a new class of DNA binding
proteins. Science 240: 1759–1764
Levine M, Tjian R (2003) Transcription regulation and animal diversity.
Nature 424: 147–151
Li XY, Mantovani R, Hooft van Huijsduijnen R, Andre I, Benoist C,
Mathis D (1992) Evolutionary variation of the CCAAT-binding transcription factor NF-Y. Nucleic Acids Res 20: 1087–1091
Littlewood TD, Evan GI (1995) Transcription factors 2: helix-loop-helix.
Protein Profile 2: 621–702
Ohme-Takagi M, Shinshi H (1995) Ethylene-inducible DNA binding
proteins that interact with an ethylene-responsive element. Plant Cell
7: 173–182
Omichinski JG, Clore GM, Schaad O, Felsenfeld G, Trainor C, Appella E,
Stahl SJ, Gronenborn AM (1993) NMR structure of a specific DNA
complex of Zn-containing DNA binding domain of GATA-1. Science
261: 438–446
Papp B, Pal C, Hurst LD (2003) Dosage sensitivity and the evolution of
gene families in yeast. Nature 424: 194–197
Pellegrini L, Tan S, Richmond TJ (1995) Structure of serum response factor
core bound to DNA. Nature 376: 490–498
Pysh LD, Wysocka-Diller JW, Camilleri C, Bouchez D, Benfey PN (1999)
The GRAS gene family in Arabidopsis: sequence characterization and
basic expression analysis of the SCARECROW-LIKE genes. Plant J 18:
111–119
Reeves R, Nissen MS (1990) The A.T-DNA-binding domain of mammalian
high mobility group I chromosomal proteins. A novel peptide motif for
recognizing DNA structure. J Biol Chem 265: 8573–8582
Reisz RR, Modesto SP (1996) Archerpeton anthracos from the Joggins
formation of Nova Scotia: a microsaur, not a reptile. Can J Earth Sci 33:
703–709
Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L,
Pineda O, Ratcliffe OJ, Samaha RR, et al (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes.
Science 290: 2105–2110
Schauser L, Roussis A, Stiller J, Stougaard J (1999) A plant regulator
controlling development of symbiotic root nodules. Nature 402: 191–195
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids
Res 28: 231–234
Scott MP, Tamkun JW, Hartzell GW III (1989) The structure and function
of the homeodomain. Biochim Biophys Acta 989: 25–48
Seoighe C, Gehring C (2004) Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet 20: 461–464
Shimofurutani N, Kisu Y, Suzuki M, Esaka M (1998) Functional analyses
of the Dof domain, a zinc finger DNA-binding domain, in a pumpkin
DNA-binding protein AOBP. FEBS Lett 430: 251–256
Shiu SH, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004)
Comparative analysis of the receptor-like kinase family in Arabidopsis
and rice. Plant Cell 16: 1220–1234
Solano R, Stepanova A, Chao Q, Ecker JR (1998) Nuclear events in
ethylene signaling: a transcriptional cascade mediated by ETHYLENE-INSENSITIVE3 and ETHYLENE-RESPONSE-FACTOR1. Genes Dev 12:
3703–3714
Sonnhammer ELL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam:
multiple sequence alignments and HMM-profiles of protein domains.
Nucleic Acids Res 26: 320–322
Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla
A, Clarke L, Clee C, Coghlan A, et al (2003) The genome sequence of
Caenorhabditis briggsae: a platform for comparative genomics. PLoS
Biol 1: E45
Suzuki M, Kao CY, McCarty DR (1997) The conserved B3 domain of
VIVIPAROUS1 has a cooperative DNA binding activity. Plant Cell 9:
799–807
Taylor JS, Van de Peer Y, Braasch I, Meyer A (2001) Comparative genomics
provides evidence for an ancient genome duplication event in fish.
Philos Trans R Soc Lond B Biol Sci 356: 1661–1679
Ulmasov T, Hagen G, Guilfoyle TJ (1997) ARF1, a transcription factor that
binds to auxin response elements. Science 276: 1865–1868
van der Knaap E, Kim JH, Kende H (2000) A novel gibberellin-induced
gene from rice and its potential regulatory role in stem growth. Plant
Physiol 122: 695–704
Van Dongen SM (2000) Graph clustering by flow simulation. PhD thesis.
University of Utrecht, The Netherlands
Vision TJ, Brown DG, Tanksley SD (2000) The origins of genomic
duplications in Arabidopsis. Science 290: 2114–2117
Weigel D (1995) The APETALA2 domain is related to a novel type of DNA
binding domain. Plant Cell 7: 388–389
Weigel D, Alvarez J, Smyth DR, Yanofsky MF, Meyerowitz EM (1992) LEAFY
controls floral meristem identity in Arabidopsis. Cell 69: 843–859
Wendel JF (2000) Genome evolution in polyploids. Plant Mol Biol 42:
225–249
Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV,
Romano LA (2003) The evolution of transcriptional regulation in
eukaryotes. Mol Biol Evol 20: 1377–1419
Yang J, Lusk R, Li WH (2003) Organismal complexity, protein complexity,
and gene duplicability. Proc Natl Acad Sci USA 100: 15661–15665
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X,
et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp.
indica). Science 296: 79–92
Zdobnov EM, von Mering C, Letunic I, Torrents D, Suyama M, Copley
RR, Christophides GK, Thomasova D, Holt RA, Subramanian GM,
et al (2002) Comparative genome and proteome analysis of Anopheles
gambiae and Drosophila melanogaster. Science 298: 149–159
Zheng N, Fraenkel E, Pabo CO, Pavletich NP (1999) Structural basis of
DNA recognition by the heterodimeric cell cycle transcription factor
E2F-DP. Genes Dev 13: 666–674