Polymorphism, shared functions and convergent evolution of genes

Human Molecular Genetics, 2003, Vol. 12, No. 22
DOI: 10.1093/hmg/ddg329
2967–2979
Polymorphism, shared functions and
convergent evolution of genes with
sequences coding for polyalanine domains
Hugo Lavoie1, François Debeane1, Quoc-Dien Trinh1, Jean-François Turcotte2,
Louis-Philippe Corbeil-Girard1, Marie-Josée Dicaire1, Anik Saint-Denis1,
Martin Pagé1, Guy A. Rouleau3 and Bernard Brais1,*
1
Laboratoire de Neurogénétique, Centre de Recherche du Centre Hospitalier de l’Université de Montréal,
Montréal, Canada, 2Département de Biochimie, Université de Montréal, Montréal, Canada and
3
Montreal General Hospital, McGill Health Center, McGill University, Montréal, Canada
Received July 14, 2003; Revised and Accepted September 18, 2003
Mutations causing expansions of polyalanine domains are responsible for nine hereditary diseases. Other
GC-rich sequences coding for some polyalanine domains were found to be polymorphic in human. These
observations prompted us to identify all sequences in the human genome coding for polyalanine stretches
longer than four alanines and establish their degree of polymorphism. We identified 494 annotated human
proteins containing 604 polyalanine domains. Thirty-two percent (31/98) of tested sequences coding for more
than seven alanines were polymorphic. The length of the polyalanine-coding sequence and its GCG or GCC
repeat content are the major predictors of polymorphism. GCG codons are over-represented in human
polyalanine coding sequences. Our data suggest that GCG and GCC codons play a key role in polyalaninecoding sequence appearance and polymorphism. The grouping by shared function of polyalanine-containing
proteins in Homo sapiens, Drosophila melanogaster and Caenorhabditis elegans shows that the majority are
involved in transcriptional regulation. Phylogenetic analyses of HOX, GATA and EVX protein families
demonstrate that polyalanine domains arose independently in different members of these families,
suggesting that convergent molecular evolution may have played a role. Finally polyalanine domains in
vertebrates are conserved between mammals and are rarer and shorter in Gallus gallus and Danio rerio.
Together our results show that the polymorphic nature of sequences coding for polyalanine domains makes
them prime candidates for mutations in hereditary diseases and suggests that they have appeared in many
different protein families through convergent evolution.
INTRODUCTION
Since 1996, mutations causing expansions of polyalanine tracts
have been documented in nine human diseases and in one
mouse strain (1–10). Polyalanine expansions have been found
in: synpolydactily (from 15 to 22–29 alanines in HOXD13)
(3,11), cleidocranial dysplasia (from 17 to 27 alanines in
RUNX2) (5), oculopharyngeal muscular dystrophy (from 10 to
11–17 alanines in PABPN1) (1), familial holoprosencephaly
(from 15 to 25 alanines in ZIC2) (2), hand–foot–genital
syndrome (from 18 to 26 alanines in HOXA13) (4), blepharophimosis/ptosis/epicanthus inversus syndrome type II (from 15
to 30 alanines in FOXL2) (6), X-linked mental retardation and
epilepsy (from 12–16 to 20–23 alanines in ARX) (8), X-linked
mental retardation with growth hormone deficiency (from 15 to
26 alanines in SOX3) (9) and congenital central hypoventilation
syndrome (from 25 to 29 alanines in PHOX2B) (10). The
shortest disease-causing expansion of a polyalanine domain was
observed in recessive OPMD, in which the addition of a single
alanine residue caused a recessive late-onset disease (1).
Interestingly, this polymorphism is present in 2% of the general
population and, when inherited in a compound heterozygote
fashion with a dominant expansion, causes a more severe
phenotype (1). Non-pathological polyalanine polymorphisms
*To whom correspondence should be addressed at: Laboratoire de Neurogénétique, M4211-L5, Hôpital Notre-Dame-CHUM, 1560 Sherbrooke est,
Montréal, Québec, Canada H2L 4M1. Tel: þ1 5148908000; Fax: þ1 5144127525; Email: [email protected]
Human Molecular Genetics, Vol. 12, No. 22 # Oxford University Press 2003; all rights reserved
2968
Human Molecular Genetics, 2003, Vol. 12, No. 22
were reported in polyalanine coding sequences in seven other
human genes prior to this study: MICA, GPX1, TGFBR1,
RPL14, MSH3 and FOXE1 (12–17). Two mechanisms leading
to lengthening of polyalanine coding sequences have been
proposed: expansions of stretches of GCN codons in PABPN1,
ARX, MICA, GPX1, TGFBR1, RPL14 and FOXE1 and duplications of mixed (GCN)n polyalanine-coding sequences in
PABPN1, HOXD13, HOXA13, RUNX2, ZIC2, FOXL2, ARX,
SOX3, PHOX2B and MSH3 (1–6, 12–17). These observations
prompted us to identify human proteins with domains of five
alanines or more and to investigate if variations could be found
in their polyalanine-coding sequences.
Knowledge of the functional roles of polyalanine domains
in proteins is very limited. Studies have suggested that
polyalanine may play a role in transcriptional repression in
drosophila kruppel, engrailed and even-skipped, and in human
glucocorticoid receptor and octamer binding protein 1 (18–23).
In contrast, the deletion of the polyalanine domains in SCIP did
not alter its transcriptional activity (24). Together, these
observations suggest that polyalanine stretches in these proteins
play a secondary role in transcription. Polyalanine domains
in vivo are believed to form b-pleated sheets in dragline spider
silk, moth silk and mollusk shell and tendons (25–27). The
interaction between numerous polyalanine domains is known to
play a major role in ensuring fiber tensile strength in spider’s
silk (25). The biophysical characteristics of homopolymeric
alanine stretches in vitro and in silico have been extensively
studied but are still somewhat controversial since polyalanine
polymers have served as models in protein folding research for
both b-sheet and a helical conformations (28–30). Polyalanine
peptides of 10 residues or more form stable fibrillar macromolecules resistant to chemical denaturation and enzymatic
degradation in vitro (29,31). Although the structure of polyalanine peptides in solution is still debated, a recent study
shows that b-sheet structures are favored in hydrophylic
physiological conditions and a helical conformation in hydrophobic medium (28).
The appearance and conservation of polyalanine domains has
not been extensively studied. The polyalanine domains of
HOXA13 and class III POU transcription factors appeared only
in mammals (32,33). In the case of HOXA13, it has been
shown that the human protein is 35% longer than its zebrafish
ortholog. These differences in size are primarily due to
accumulation of polyalanine and flanking regions rich in
proline and glycine (33). Similarly, the Drosophila ortholog of
PABPN1 lacks a polyalanine domain and a large portion of
the protein’s N-terminus (34). It was suggested that polyalanine
domains in arthropod silk and mollusk shell and tendon
have appeared by convergent evolution (35), implying that
these domains appeared independently in orthologs and were
conserved for functional reasons.
The complete sequence of the human genome was a
prerequisite to this study’s identification by data mining of all
human proteins with polyalanine domains and allows the
screening for polymorphisms of a large set of sequences coding
for polyalanine. The increasing number of sequenced eukaryotic genomes further permitted the comparative analyses of
species differences in size and differential appearance of
polyalanine domains through evolution. This study establishes
that polyalanine-coding sequences are frequently polymorphic
in the human population, that polyalanine domains are more
frequent in specific functional categories, especially in
transcription regulators, that they likely appeared by convergent
molecular evolution in some protein families and that they are
longer in mammals than in other vertebrates.
RESULTS
Sequences coding for polyalanine domains in human
are frequently polymorphic
We identified 494 known human proteins with polyalanine
domains longer than four alanines in the human refseq protein
database (Supplementary Material) (36). These 494 proteins
contained a total of 604 polyalanine domains. We selected the
124 proteins with domains longer than seven alanines to
investigate if the polyalanine coding sequences are polymorphic
in humans. We retrieved genomic coding sequences for 135
polyalanine domains (117 genes) and 98 were successfully
amplified by PCR on 42 unrelated DNA samples. Amplification
products were run on denaturing gels and samples showing two
or more alleles were sequenced. Thirty-one of the 98 sequenced
regions (32%) were found to have at least two sequenced-proven
alleles. Table 1 lists the genes with polymorphic sequences
sorted by the frequency of the most common allele observed.
Four of these polymorphisms were previously reported in the
literature (Table 1).
Triplet repeat expansions and contractions are responsible for
77% (24/31) of the observed polymorphisms (Table 1). Most are
due to changes in the number of GCG or GCC codons (20/24).
Changes involving tracts of GCA and GCT codons are observed
only in four genes. In seven genes, the polymorphisms are due to
duplications or deletions of mixed (GCN)n sequences (Table 1).
In 48% of the polymorphic loci, the minor alleles have a
frequency inferior to 5% (Table 1). Codon repeat expansions/
contractions and deletions/insertions of larger sequences are both
responsible for rare and more frequent alleles. It is noteworthy
that we also identified new polymorphisms in two genes
containing small polyalanine domains of seven alanines (GDF1
and CHD4, data not shown). This raises the possibility that
sequences coding for even shorter polyalanine domains could
also be polymorphic. Polymorphisms with high frequencies
could be predicted by aligning Unigene EST sequences for
nine genes (Table 1). We found that four of the nine genes
with known pathogenic lengthening of a polyalanine tract are
polymorphic in the general population: PABPN1, RUNX2,
ZIC2 and HOXA13 (1,2,4,5) (Table 1). This further underlines
that sequences encoding a polymorphic polyalanine domain are
prime candidates for diseases mapped to the same chromosomal
region.
The size of the polyalanine coding sequence and its repeat
content appear as the major predictors of polymorphism. Most
of the polymorphic sequences (62%) and very few nonpolymorphic sequences (11%) code for more than 13 alanines
or contain repeats of more than five codons (Fig. 1A).
Polymorphism of polyalanine domains coded by mixed
GCN codons seems to be mainly influenced by the length of
the sequence. In fact, the mixed (GCN)n polyalanine-coding
Human Molecular Genetics, 2003, Vol. 12, No. 22
2969
Table 1. Polymorphisms observed in gene sequences coding for domains of eight or more alanines. In bold is the number of repeats or repetition of a mixed
(GCN)n sequence present in the more common allele
Gene name
IDa
Polyalanine
length
Variation
typeb
Number
of alleles
Frequency of the most
frequent allele
Polymorphic sequence
Number of
repeatsc
SRP14d
SOX21
FOXF2
C14orf4
ZIC2
ATBF1d
HOXA13
PABPN1
TBL1e
FBS1
HOXD11e
SLC12A2
ZIC3
PHOX2Be
SGCB
MAZd
ZNF358
6727
11166
2295
64207
7546
463
3209
8106
6907
64319
3237
6558
7547
8929
6443
4150
55136
9
13
9
10
15
13
18
10
8
19
13
15
10
20
8
9
17
E
E
E
C
d
C
d
E
C
i
E
C
E
C
E
d
d
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
2
99%
99%
99%
99%
99%
99%
98%
98%
98%
98%
97%
97%
97%
97%
97%
95%
95%
9,
6,
9,
8,
1,
4,
1,
6,
5,
1,
6,
5,
8,
4,
4,
1,
0,
10
8
10
9
0
0
0
7
4
2
7
7
9
3
5
0
1
MAP3K4d
SOX1
RPL14d,f
HLXB9d
RUNX2
FOXD1
DMRTA2
KCNN2
EOMES
CC1.3d
TGFBR1e,f
PRDM12
FOXE1d,f
MSH3d,f
4216
6656
9045
3110
860
2297
63950
3781
8320
9584
7046
59335
2304
4437
9
8
8
16
17
8
9
8
14
12
9
12
14
12
E
C
E
C
d
E
C
E
E
C
E/C
E/C
E/C
D/d
2
2
2
2
2
2
2
2
2
3
3
4
3
4
93%
92%
91%
90%
88%
87%
69%
66%
65%
63%
63%
55%
55%
53%
GCA
GCC
GCC
GCC
GCTGCGGCGGCGGCG
GCG
GCTGCCGCGGCTGCCGCT
GCG
GCG
GCAGCG
GCG
GCG
GCC
GCG
GCG
GCGGCAGCA
GCTGCTGCAGCTGCA
GCGGCCGCG
GCT
GCG
GCT
GCC
GCGGCGGCTGCGGCGGCG
GCC
GCG
GCG
GCC
GCA
GCG
GCC
GCC
GCTGCAGCG
9,
3,
8,
9,
1,
4,
6,
4,
7,
5,
6,
5,
7,
1,
10
5
9
11
0
5
8
5
9
6, 8
9, 10
6, 8
9, 11
4, 5, 6
a
LocusLink ID; bExpansion/contraction (E/C) and deletion/duplication (D/d); cnumber of units of repeated sequence responsible for each allele; dgene sequence
predicted to be polymorphic based on EST sequence alignment; eregion unable to sequence; and fgene sequence previously found to be polymorphic.
sequences are more frequently polymorphic when they code for
more than 13 alanines (Fig. 1A, arrows).
GCG codons are over-represented in
polyalanine-coding sequences
There is a statistically significant over-representation of GCG
codons in polyalanine-coding sequences compared with its
average relative occurrence in a control set of open reading
frames (57 33% in polyalanine compared with 11 12% of
GCG usage in 597 control sequences) and according to the
codon usage database (11% of GCG codon usage) (37)
(Fig. 1B). This contrasts with GCC codons that are as
frequently used in polyalanine-coding sequences as in control
sequences (Fig. 1B). GCA and GCT codons are overall underrepresented in polyalanine-coding sequences (Fig. 1B). A
noticeable increase in GCG and GCC codons usage is observed
as the size of the polyalanine domains increase while GCA and
GCT codon usage tend to decrease with length (Fig. 1B).
The analysis of the contribution of codon repeats in
polyalanine-coding sequences demonstrates that GCG and
GCC repeats become predominant at a length of three codons
or more (Fig. 1C). Most of the sequences are not coded by
identical repeated codons. Of the 154 sequences coding for
eight or more alanines, only 12% are entirely or in part coded
by tracts of eight or more identical codons (Fig. 1C). Therefore,
long polyalanine domains are usually coded by mixed (GCN)n
sequences. Since GCGGCG and GCCGCC are complementary,
their over-representation in polyalanine-coding sequences
points toward a particular involvement of these GC-rich
sequences in the appearance and lengthening of sequences
coding for polyalanine domains.
The longest polyalanine domains are found
in eukaryotes
To establish if polyalanine domain-containing proteins share
functional roles, we identified all such proteins in three near
complete eukaryotic genomes: Homo sapiens, Drosophila
melanogaster, and Caenorhabditis elegans (Supplementary
Material). Figure 2A, C and E presents the distributions of
polyalanine domains according to their size in each species.
The number of polyalanine domains encoded in each genome
is variable (Fig. 2A, C and E). Based on previous estimates of
gene number in these species, polyalanine-containing proteins
account approximately for 1.4, 3.9 and 0.80% in H. sapiens,
D. melanogaster and C. elegans, respectively (38–42). Thus
polyalanine domains appear to be relatively more frequent in
2970
Human Molecular Genetics, 2003, Vol. 12, No. 22
Figure 1. General characteristics of tested (A) and all human polyalanine coding sequences (B and C). (A) Polymorphic (solid symbols) and non-polymorphic (open symbols) polyalanine-coding sequences were plotted for their
longest run of a single codon and their overall polyalanine length. The rectangle
delimits a region out of which 62% of polymorphic and 11% of non-polymorphic sequences are present. All polyalanine-coding sequences from H.
sapiens were retrieved from mRNA sequence and their codon usage (B) and
repeat content (C) was analyzed. (B) Distribution of the four alanine codons
usage in percentages according to length of the polyalanine domain. GCG
codon is significantly over-represented in polyalanine coding sequences compared with its normal occurrence while GCA and GCT codons are sometimes
under-represented (*P < 0.005; one-way ANOVA). The gray zone delimits a
region defining the mean codon usage in control sequences (black line) 1
SD. The alanine codon usage given by the codon usage database (37) is
depicted by a doted line. (C) GCG and GCC codons are far more frequently
repeated than GCA and GCT codons in polyalanine-coding sequences.
only 39 polyalanine-containing proteins (about 0.65% of
S. cerevisiae genes) and none had a domain of more than 10
alanines (data not shown). The number of polyalanine domains
was even smaller in prokaryotes. Polyalanine-coding genes
account for a mean of 0.74% of archeal genomes and 0.84% of
eubacterial genomes. In fact, 10 completely sequenced archea
and 31 completely sequenced eubacteria contained respectively
162 and 746 polyalanine-coding genes, and the size of the
domains never exceeds nine alanines. Therefore, domains
longer than nine alanines appear to be exclusively found in
eukaryotes. Although 85% of human proteins contain a single
polyalanine domain, some contain more: 48 proteins have two;
15 have three; four have four; and five contain five polyalanine
domains. Proteins bearing more than one domain are also
found in D. melanogaster (563 have one, 108 have two, 49
have three, 18 have four, 18 have five, one has seven and one
contains 10 polyalanine domains) and in C. elegans (122 have
one, 12 have two, two have three and two contain five
polyalanine domains).
Proteins with polyalanine domains are often
involved in transcription regulation
D. melanogaster (Fig. 2C). Interestingly, no domain longer
than 20 alanines was uncovered in the analyzed genomes. This
maximum length was observed in D. melanogaster ovo and asx
and in H. sapiens PHOX2B. In C. elegans, the length of
polyalanine domains never exceeded 14 alanines (Fig. 2E).
We also established that the genome of S. cerevisiae contains
The polyalanine-containing proteins from C. elegans,
D. melanogaster and H. sapiens were grouped by high level
molecular function categories according to Gene Ontology
Consortium functional classification (www.geneontology.org/;
Fig. 2B, D and F). Human, Drosophila and worm proteins were
divided for analysis based on a size threshold of eight alanines
(Fig. 2B, D and F). For all three organisms the ‘binding’
molecular function is the most represented in polyalaninecontaining proteins followed by ‘transcription regulator’ and
‘enzyme’ annotations. The patterns of functions distribution obtained for H. sapiens, D. melanogaster and C. elegans
polyalanine-containing proteins are very similar (Fig. 2B, D
and F). The significant enrichment of polyalanine-containing
proteins in ‘binding’ molecular function is mostly explained by
the presence of DNA binding transcription regulators (Fig. 2B,
D and F). The organization of the ontology tree leads to this
observation. In fact, the ontologies are organized so that one
low level function could be linked to more than one high
level annotation. Transcription regulators account for 7, 9 and
10% of H. sapiens, D. melanogaster and C. elegans complete
Human Molecular Genetics, 2003, Vol. 12, No. 22
2971
Figure 2. Polyalanine domain-containing proteins are enriched in some Gene Ontology molecular functions. Distribution of length of polyalanine domains
retrieved in H. sapiens, D. melanogaster and C. elegans (A, C and E). Bar diagram illustrating the molecular functions of polyalanine-containing proteins and
whole proteomes according to Gene Ontology Consortium annotations in H. sapiens, D. melanogaster and C. elegans (B, D and F). Empty bars correspond
to whole proteome annotations; gray bars to proteins containing polyalanine domains from five to seven alanines long; and black bars, to polyalanine domains
more than sevem alanines. All functions shown on the graph contribute to at least 2% of the whole human proteome. Significant over-representation of functional
categories in five to seven or more than seven alanines polyalanine domain-containing proteins compared with whole proteome is denoted by an asterisk while
significant over-representation of functional categories in proteins having a domain more than seven alanines compared with five to seven alanines group is denoted
by a plus sign (P < 0.001; chi-square calculation).
annotated proteomes respectively (Fig. 2B, D and F). Strikingly,
36% of H. sapiens, 35% of D. melanogaster and 25% of
C. elegans polyalanine-containing proteins bearing a domain
of five to seven alanines are annotated as ‘transcription
regulators’. The latter functional group is significantly enriched
in both the five to seven alanines and more than seven alanines
polyalanine-containing proteins compared with whole proteomes for the three organisms (Fig. 2B, D and F). The groups
of proteins with more than seven alanines domains contain an
even greater proportion of transcription regulators (Fig. 2B, D
and F), but this increased proportion is only significant for D.
melanogaster polyalanine-containing proteins. It is also noteworthy that polyalanine-containing proteins having a protein
binding function often bind transcription factors (data not
shown). Finally, the number of genes annotated as ‘enzyme’ is
significantly smaller in five to seven alanines and more than
seven alanine polyalanine-containing proteins from H. sapiens
and D. melanogaster (Fig. 2B and D).
2972
Human Molecular Genetics, 2003, Vol. 12, No. 22
Polyalanine domains appeared by convergent
evolution in different protein families
Polyalanine domains are more frequent in certain protein
families. For example, polyalanine domains are present in 17%
of proteins from the homeobox superfamily (35 out of 212
homeobox proteins retrieved with PSI-BLAST in the human
refseq database) (43,44). To investigate the appearance and
conservation of polyalanine domains within protein lineages we
first established which families had many members in our set
of proteins. There are 21 protein families represented by three
or more members in the human data set, 13 of which are
transcription factor families: CLCN, DMRT, GATA, FOX,
GPR, HOX, IRX, KCN, MAPK, NRG, PBX, POU, PPP1R,
PRDM, RPL, SLC, SMARC, SOX, TBX, TLE and ZIC. The
over-representation of polyalanine domains in certain protein
families raises the possibility that they may have either
appeared in a common ancestral protein or been acquired
independently.
Polyalanine domains have been studied mostly in proteins of
the HOX family (33). It has been suggested that all homeobox
DNA-binding proteins may have a common origin and form
a single superfamily (43,45,46). In vertebrates 4 HOX gene
clusters (A–D) are believed to have arisen by duplications of
a single ancestral cluster followed by divergence (47–50).
Phylogenic analysis suggests that polyalanine domains
appeared recently and in only some orthologs within paralogous
groups of the HOX family (Fig. 3A). For example, phylogenic
analysis of the paralogous groups HOXA13 and HOXD13
does not support a shared common ancestral origin for their
polyalanine domains but an independent appearance specific to
the mammalian lineage (Fig. 3A, in red and green). The same
phenomenon is observed for the paralogs HOXA11 and
HOXD11 (Fig. 3A, yellow and purple). In addition to these
paralogs, HOXD8 has also acquired a polyalanine domain. This
phenomenon of independent appearance is also observed in the
GATA-zinc-fingers family where mammalian GATA6, GATA4
and GATA1 have acquired a polyalanine sequence independently (Fig. 3C). Convergence is not only observed at the level
of paralogous groups, it is also seen between orthologs as is
exemplified by even-skipped proteins from various species
(Fig. 3E). In this particular case, a domain appeared in all
vertebrate evx2 (Fig. 3E, blue branches) and a second domain
is unique to mammalian evx2 (Fig. 3E, red). A third event
occurred only in drosophila eve of all known insects evenskipped proteins (Fig. 3E, green). Finally, a fourth event of
polyalanine appearance is specific to nematodes VAB-7
(Fig. 3E, yellow). The same phenomenon of evolutionary
convergence within orthologous proteins was also observed in
runt and engrailed (data not shown). This further supports the
independent convergent appearance of polyalanine domains in
proteins sharing similar functional role. It is noteworthy that
polyalanine domains appeared only in mammals in HOXD8,
HOXA11, HOXD11, HOXA13, HOXD13, GATA1, GATA4
and GATA6 phylogenetic trees.
The possible convergent evolutionary appearance of polyalanine domains in proteins of the same family is further supported
by their different positions relative to functional domains or to
other conserved regions. Figure 3B depicts the position of polyalanine insertions (arrowheads) within sequence alignments of
HOX13 and HOX11 paralogous groups. The same was observed
in an alignment of GATA4 with GATA6 (Fig. 3D). This
phenomenon is even more clearly illustrated by the N-terminal
position of the polyalanine domain in vab-7, the ortholog of
even-skipped in nematodes, compared with C-terminal insertions in vertebrates evx2 and drosophila eve (Fig. 3F). The
relative position of polyalanine domains relative to the DNA
binding domain in different transcription factor superfamilies is
clearly variable. For example, 89% of polyalanine are located
in N-terminus in homeobox-containing proteins, 71% are in Cterminus in SOX-containing proteins and 73% are in C-terminus
in forkhead box domain-containing proteins.
Polyalanine domains are longer in mammals
than in other vertebrates
The independent appearance of polyalanine domains in related
proteins and the observation that polyalanine domains arose
mainly in mammals (Fig. 3A and C) led us to assess polyalanine
conservation between human and mouse (Mus musculus),
human and chicken (Gallus gallus) and human and zebrafish
(Danio rerio). Figure 4A depicts the high level of conservation
of polyalanine domains length between H. sapiens and M.
musculus. Figure 4B and C shows that polyalanine domains are
poorly conserved between H. sapiens and G. gallus and H.
sapiens and D. rerio. The Pearson correlation coefficient (r)
between human and the three species polyalanine length
decreases from M. musculus to D. rerio (Fig. 4A–C). The
percentage of identity of pairwise blast alignments seems not to
correlate with the conservation of polyalanine tracts. Even
highly conserved protein segments (80–100% identity; open
triangles) do not share the same polyalanine length (Fig. 4A–C).
Polyalanine size is clearly larger on average in mammals than
in other vertebrates. In fact, the mean polyalanine length
difference between H. sapiens and M. musculus, H. sapiens
and G. gallus and H. sapiens and D. rerio was respectively
1.06 2.22 alanines, 4.53 3.79 alanines and 6.10 2.22
alanines. Figure 4D illustrates the changes in size of 28
polyalanine domains in 23 proteins sharing orthologs in human,
chicken and zebrafish. Polyalanine domains in chicken and
zebrafish proteins are shorter overall than their orthologs in
human. Only two out of the 28 domains studied are conserved in
the three organisms (NRF1 and MAF) and only one is longer
in zebrafish than in human and chicken (PBX1) (Fig. 4D).
DISCUSSION
This study identified by data mining 494 known or predicted
human proteins containing a total of 604 polyalanine domains
longer than four alanines. We demonstrate that sequences
coding for polyalanine domains are frequently polymorphic in
the human genome. In fact, 31 out of 98 tested polyalaninecoding sequences were polymorphic. We established that the
presence of a GCN repeat longer than five or a sequence
coding for polyalanine stretches longer than 13 alanines is a
good predictor of polymorphism. Based on these criteria,
82 polyalanine-coding sequences of our set are likely to be polymorphic in human. Of these, 28 were not tested because either
their genomic sequence was not available or they fell under our
Human Molecular Genetics, 2003, Vol. 12, No. 22
2973
Figure 3. Polyalanine domains appeared independently in many protein phylogenies. The polyalanine domains arose independently in members of the HOX (A)
and GATA (C) protein families. Convergence was also observed over longer evolutionary time as exemplified by the EVX proteins phylogeny (E). Note that vertebrate evx2, Drosophila eve and nematode VAB-7 all contain a polyalanine tract they acquired independently. For (A) and (B), a core tree was generated using the
alignment of the conserved domains only and sub-trees were generated with alignments of the full-length protein sequences. Sites of insertion of polyalanine
domains are shown as arrowheads on schematic representations of alignments of HOX, GATA and EVX (B, D and F). All phylogenies were generated by
the Neighbor joining method based on the alignment of conserved residues of homeobox, zinc-finger and homeobox domains for HOX, GATA and EVX respectively. Bootstrap values in percent (1000 replicates) are indicated at the nodes. Species abbreviations are as follows: Agama agama (Aag), Ambystoma mexicanum
(Ame), Bombyx mori (Bmo), Branchiostoma floridae (Bfl), Caenorhabditis elegans (Cel), Charina trivirgata (Ctr), Cupiennius salei (Csa), Danio aequipinatus
(Dae), Danio rerio (Dre), Drosophila melanogaster (Dme), Eumeces inexpectatus (Ein), Gallus gallus (Gga), Heterodontus francisci (Hfr), Homo sapiens (Hsa),
Latimeria chalumnae (Lch), Monodidelphis domesticus (Mdo), Mus musculus (Mmu), Notophtalmus viridens (Nvi), Petromyzon marinus (Pma), Pristionchus
pacificus (Ppa), Rattus norvegicus (Rno), Sceloporus undulatus (Sun), Schistocerca americana (Sam), Silurana tropicalis (Str), Tribolium castaneum (Tca),
Typhlops richardi (Tri) and Xenopus laevis (Xla).
eight alanines size threshold for screening. Polymorphism of
polyalanine-coding sequences is probably more widespread
because some shorter sequences were also found to have more
than one allele and rarer polymorphisms may have been missed
because of the limited size of our DNA panel (84 chromosomes).
The observation that rare polymorphisms exist in four of the
nine genes with known disease-causing mutations further
emphasizes the importance of screening similar sequences in
diverse pathologies. Furthermore, the finding that the minor
allele of PABPN1 carried by 2% of the population can act as
both a modulator of severity for dominant OPMD and a
recessive OPMD mutation raises the possibility that the rare
2974
Human Molecular Genetics, 2003, Vol. 12, No. 22
Figure 4. Conservation of polyalanine domain length between human and three vertebrate species. Polyalanine domain length in H. sapiens proteins was plotted
against corresponding polyalanine domain length in M. musculus (A), G. gallus (B) and D. rerio (C). Each polyalanine domain pair was associated with a protein
pair. The polyalanine domains were ranked by their associated protein pair percentage of identity in blast alignments and they were given a distinct symbol corresponding to the category they fell into (50–59, 60–69, 70–79, 80–89 and 90–100% of identity). N corresponds to the number of polyalanine domain pairs
between two species and R is the Pearson correlation coefficient between human and species of interest polyalanine sizes. (D) Changes in size of 28 polyalanine
domains found in 23 different proteins that share an ortholog in H. sapiens, G. gallus and D. rerio. Series are labeled according to human protein name.
polymorphisms observed in this study may play similar roles in
other traits (1).
The selection of these polyalanine-coding genes as candidates for numerous conditions is expedited by their known
chromosomal location in most cases. It has been shown that
other types of mutations in HOXD13, HOXA13, RUNX2, ZIC2
and FOXL2 cause phenotypes similar to expansions of the
polyalanine domain in these proteins (2,4–6,51). In this light,
of particular interest are 10 genes with polyalanine domains in
which other types of mutations have been identified in human:
APC, ELN, EMX2, FOXE1, ZIC3, AGPS, AR, IRS2, OFD1
and TBX3 (www.ncbi.nlm.nih.gov/omim). Interestingly, mice
knock-outs for Hoxd13, Hoxa13 and Runx2 show phenotypes
similar to human polyalanine expansions in these genes
(52–55). This phenotypic overlap between knock-out mice
and human disease allows the flagging of at least 34 human
genes with sequences coding for domains of more than seven
alanines with knock-out models as prime candidate genes for
Human Molecular Genetics, 2003, Vol. 12, No. 22
human diseases with phenotypic overlap. For example, this
raises the possibility that polyalanine expansions in SLC12A2
(17 alanines in man) may cause deafness in humans as point
mutations have been found in the Shaker-with-syndactylism
(sy) deaf mouse strain (56). Interestingly, this strain also has
syndactyly, an observation found in two of the already
described polyalanine diseases (56). The ages of onset of the
diseases caused by expansions of polyalanine domains found
to date suggests that such mutations may be involved in traits
and diseases that span human life. Further studies will establish
the functional importance of these relatively rare polymorphisms,
but they already allow the identification of a large set of
mapped genes as prime candidates for disease-causing
mutations considering that the chromosomal localizations are
available for the majority of human genes with sequences
coding for polyalanine domains.
The mechanism responsible for polyalanine-coding sequence
polymorphisms or mutations is still debated. Two hypotheses
have been proposed: expansion/contractions of repeated
sequences through slippage during replication and deletions/
duplications of mixed (GCN)n sequences due to unequal
crossovers during meiosis. Although our sequence data
demonstrate that most changes in size are due to changes in
the number of repeated codons (77%, 24/31), only unequal
crossover can explain both the presence of expanded repeat size
and mixed (GCN)n sequence mutations at the same locus (57).
This work clearly shows that GCG codons are significantly
over-represented in polyalanine-coding sequences. In fact,
GCG codons are on average three times more frequent in
polyalanine-coding sequences than in control coding
sequences. Furthermore, GCG and GCC repeats are more
frequently represented in polyalanine coding sequences than
repeats of GCA or GCT. These results together with
the observation that most of polyalanine polymorphisms are
associated with GCG or GCC repeats suggest that these
codons may be particularly important in the appearance and
polymorphism of polyalanine domains. GCGGCG and
GCCGCC being complementary, they can form stable hairpins
or other higher order structures that may play a key role in
this process and explain the relative abundance of GCG/CGC
and GCC/CGG repeats in polymorphic sequences coding
for polyalanine. It has been shown by many groups that
biophysical properties of GCC/CGG repeats observed in
fragile X mental retardation makes them good substrate for
replication slippage. In fact, oligos consisting of CGG repeats
(corresponding to GCG polyalanine repeats) fold in vitro
as either stable Hogsteen-bounded tetrahelical structures or as
hairpins (58–62). It is also noteworthy that, except for
Friedreich ataxia, diverse frames of GCN/CGN triplet reiteration are the substrates for all other pathogenic triplet repeat
mutations (58). The observation that CGG, CCG and CGA
repeats are in fact prone to proliferation in protein coding
sequences also points to their importance in the appearance and
expansion of amino acid reiterations (63).
Using Gene Ontology molecular function classes we grouped
polyalanine-containing proteins from three phylogenetically
distant organisms. It is striking that polyalanine proteins in the
three eukaryotes studied are mostly involved in transcription
regulation. Proteins with polyalanine domains more than seven
alanines are even more frequently involved in transcription than
2975
proteins with shorter domains. Transcription regulators are at
least five times more abundant in polyalanine-containing
proteins than in the human proteome. This enrichment of
polyalanine domains in proteins sharing the same molecular
function in distantly related organisms indicates that polyalanine probably participates in function. Previous studies have
suggested that, indeed, polyalanine domains play a role in
transcriptional repression (18–23,64,65). It is, however,
unknown how the hydrophobic polyalanine domains participate
in transcription. It is also noteworthy that the relative
importance of each molecular function was similar between
human and drosophila.
This study also demonstrates that polyalanine domains have
appeared independently in paralogous genes. Homeobox
containing proteins account for the largest single functional
group of polyalanine proteins: 49/494 in human and 43/758 in
drosophila. It is believed that all homeoboxes descend from the
same ancestor gene (43,45,46). We have shown that polyalanine domains have appeared independently in many of these
proteins often in a different position relatively to their
homeobox domain. The repeated observation of this phenomenon in many other protein families lead us to conclude that
evolutionary pressure has led to this convergence. Convergent
molecular evolution has been postulated as an explanatory
principle for the presence of polyalanine domains in spider
silks and mollusk shell and tendon (35). In these two examples
polyalanine domains have appeared in very distant species,
suggesting that the acquired functional properties were
responsible for the evolutionary pressure and led to convergence at the molecular level. Molecular convergence was also
invoked in the appearance of a repetitive sequence in antifreeze
glycoproteins shared between phylogenetically distant
Antarctic notothenioid fishes and Arctic cod (66). Another
example of molecular convergence was documented for
glycine-rich domains in cyanobacterial and eukaryotic RNA
binding proteins (67). These various examples illustrate that
repeated codons may be good substrates for convergent
molecular evolution. These sequences probably promote their
own variation, allowing for the appearance of new domains.
The evolutionary pressure responsible for the convergent
appearance of polyalanine domains in eukaryotes is unknown.
As previously discussed, polyalanine domains are frequent in
transcription regulators and they may play a role in protein–
protein interactions (3). Our hierarchical classification of
polyalanine-containing proteins involved in ‘binding’ uncovered that many were transcription factors while another
important group was involved in transcription factor binding.
These observations led us to search the literature for interaction
data involving polyalanine-containing proteins. We found that
many proteins with polyalanine domains interact with each
other. For example, GATA4 and GATA6 interact to regulate
cardiac development (68). The most impressive polyalanine protein network uncovered is the gro/TLE transcription corepressor family (TLE1, 3 and 4) that interact with
polyalanine-containing transcription factors RUNX2, FOXD1/
2/3, EN1, EVX2, PAX7, IRX3, NKX6.1, NKX2D, HEY2 and
UTX1 to form a regulatory network involved in various aspects
of vertebrate development (69–79). It was further found that
HEY2 homo and heterodimerize with HAND1 and HAND2
that also contain a polyalanine-domain (80). This polyalanine
2976
Human Molecular Genetics, 2003, Vol. 12, No. 22
protein network can be enlarged with the interaction of EN1
with PBX1, 2 and 3 (81). TLEs interact with the eh1 motif of
many of these proteins and the role of the polyalanine domains
in these interactions has not been studied. Stable polyalanine
hydrophobic protein–protein interactions could play a role,
considering the known biophysical properties of polyalanine
stretches (28–31). It is possible that polyalanine protein–protein
interfaces may stabilize interactions between transcription
regulators. Thus, new or more stable protein–protein interactions introduced by the appearance of polyalanine domains
contribute to the evolutionary convergence observed in
numerous protein families.
In most cases, when convergent evolution is observed,
polyalanine domains are exclusively found in the mammalian
lineage. Fast evolution of polyalanine domains in mammals has
previously been reported for HOXA13 and POU class III genes
(32,33). We thus carried out a study of polyalanine length
conservation among vertebrates. We comprehensively assessed
the conservation of polyalanine domains between human and
three vertebrate species and we observed good conservation
between human and mouse but poor conservation between
human and chicken and human and zebrafish. We also followed
the evolution of 28 individual polyalanine domains and found
that, except for three of them, all were shorter in chicken
and zebrafish with respect to human domain lengths. The clear
trend is one of increase in the polyalanine domain sizes
between H. sapiens > G. gallus > D. rerio. Together, this data
argues for a lengthening of most of polyalanine domains in
terrestrial vertebrates and especially in the mammalian lineage.
In conclusion, the roles of the polyalanine domains in
evolution, development and diseases clearly need to be further
investigated. The boundaries between monogenic and multifactorial traits are becoming blurred by the identification of
genetic modifiers such as the (GCG)7 recessive PABPN1 allele
in OPMD (1,82,83). Mutations causing polyalanine expansions
in some of the 494 human proteins identified are likely to
contribute to monogenic or polygenic traits having onsets at all
stages of life. Therefore, gene and protein databases should be
annotated to identify the occurrence of such domains and their
variations. This study emphasizes the importance of screening
sequences coding for polyalanine domains for mutations and
functional polymorphisms and design strategies to establish the
function and evolutionary role of these domains.
MATERIALS AND METHODS
Identification of proteins with polyalanine domains
and alanine codon usage analysis
We used a program consisting of a regular expression search
to retrieve the ID and polyalanine length of proteins containing
one or more polyalanine domains (ActivePerl 5.6.1; www.
activeperl.com). The protein databases of H. sapiens, M.
musculus, G. gallus, D. rerio, D. melanogaster and C. elegans
available through NCBI’s taxonomy resource were analyzed
(updated 1 September 2001; www.ncbi.nlm.nih.gov/Taxonomy/
taxonomyhome.html/). Sequence redundancy was eliminated
by automatic means using cd-hi and the data set was manually
curated by using unique identifiers provided by LocusLink
and Refseq for H. sapiens (www.ncbi.nlm.nih.gov/RefSeq/;
www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/), FlyBase
for D. melanogaster (http://flybase.bio.indiana.edu/) and
WormBase for C. elegans (www.wormbase.org/) (36,84–86).
The polyalanine size threshold of five was based on the
observation that the smallest polymorphic sequences previously
observed were the five alanine domains of MICA and GPX1
and on a study of polyglutamine size evolution (12,13,87).
Chromosomal localization, Unigene identifiers, representative
protein and mRNA sequences for human loci coding for
polyalanine were retrieved by using Stanford SOURCE retrieval
tool (88); http://genome-www5.stanford.edu/cgi-bin/SMD/
source/sourceSearch). Contigs of Unigene ESTs for tested
genes were constructed using LASERGENE2000 Seqman
software. All the data were stored in a Microsoft Access
database and most are provided as Supplementary Material at
www.genome.org. Alanine codon usage was determined by
using a perl script (ActivePerl 5.6.1; www.activeperl.com)
extracting ORF from mRNA sequences and counting alanine
codons in ORF sequence. The control gene set was obtained
through RefSeq database (www.ncbi.nlm.nih.gov/ RefSeq/;
Supplementary Material). A one-way ANOVA was used to test
the null hypothesis of no difference between polyalanine and
control alanine codon usage.
Detection of polymorphisms in human sequences
coding for polyalanine domains
The genomic DNA from 42 unrelated participants in our
OPMD study were genotyped. All had signed an informed
consent. Twenty-six were carriers of a dominant PABPN1
OPMD mutation while the others were healthy family
members. The 124 human proteins with polyalanine domains
of eight or more alanines were chosen for polymorphism
search. The genomic sequence was retrieved by tBLASTn for
117 of these genes (135 domains). Primers based on these
genomic sequences were designed using LASERGENE2000
PrimerSelect software. Fragments were amplified by PCR with
either 7-deaza-dGTP or betaine at 1 M concentration (1,89).
Amplicons were run on a 5% polyacrylamide denaturing
gel. Gels were dried and autoradiograms were obtained (1).
Samples that demonstrated two alleles were reamplified and the
purified PCR products were sequenced using an ABIprism
automated DNA sequencer (Applied Biosystems).
Functional annotations of proteins, phylogenetic
analyses and pairwise comparisons of
polyalanine domains
Polyalanine-containing proteins were functionally classified by
using molecular function annotations available through the
Gene Ontology Consortium web page (www.geneontology.org/).
These annotations were based on EBI annotations for human,
Flybase for D. melanogaster and Wormbase for C. elegans
(86,90,91). Given the structure of the ontology tree, one
single protein could have more than one annotation
(www.geneontology.org/ ). Chi-square calculations were used
to test for statistically significant differences in molecular
function distributions between whole proteomes and proteins
with polyalanine domains. The identification of all known
Human Molecular Genetics, 2003, Vol. 12, No. 22
proteins from a specific family was achieved by searching the
human refseq database with PSI-BLAST (blastpgp) and the
consensus sequence for the motifs of interest as a query
sequence (BLAST version 2.1.2) (43,44,92).
Known protein sequences from families of interest were
retrieved by running BLASTp against the GenBank database.
Sequences obtained were aligned with the CLUSTAL W
program and the alignments were manually edited (93). For tree
construction, a distance approach was used (PHYLIP 3.6) (93).
PROTDIST was used to calculate a distance matrix (94). A
gamma distribution was applied to correct for unequal rates of
change at different amino acid positions. The tree topology was
finally inferred using WEIGHBOUR (95). The final unrooted
trees were edited using Treetool. Bootstrapping was carried out
(1000 replicates).
Pairwise comparisons of polyalanine domains between
vertebrate species was done using BLASTp with a threshold
E value of 1E 7 20 and a threshold alignment length of 200
amino acids (BLAST version 2.1.2) (44). We assessed the
conservation of the polyalanine tract by blasting a query file
containing all human polyalanine proteins against a database of
all proteins from the species of interest (mouse, chicken and
zebrafish). The query was also done by running blast with three
files containing all polyalanine proteins of the three species
against a database of all available human proteins. The resulting
pairwise alignments were manually curated for overall protein
percent identity and presence of corresponding polyalanine
domains between aligned protein pairs. Polyalanine domains
were considered orthologous if protein sequence identity was
observed on at least one side of the aligned polyalanine
domains. If more than one domain was present in a pairwise
BLAST alignment, they were considered separately.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at HMG Online.
ACKNOWLEDGEMENTS
We especially want to thank Franz Lang for his help and
suggestions concerning philogenies. H.L. received a scholarship
from the Association française contre les myopathies (AFM).
B.B. is a chercheur-boursier of the Fonds de Recherche en
Santé du Québec (FRSQ). This project was supported by grants
from the AFM and the Neuromuscular Research Partnership
Program of the Canadian Institute of Health Research (CIHR),
the Muscular Dystrophy Association of Canada (MDAC) and
the Canadian Amyotrophic Lateral Sclerosis Association.
REFERENCES
1. Brais, B., Bouchard, J.-P., Xie, Y.-G., Rochefort, D.L., Chrétien, N.,
Tomé, F.M.S., Lafrenière, R.G., Rommens, J.M., Eichiro, U., Osamu, N.
et al. (1998) Short GCG expansions in the PABP2 gene cause
oculopharyngeal muscular dystrophy. Nat. Genet., 18, 164–167.
2. Brown, S.A., Warburton, D., Brown, L.Y., Yu, C.-Y., Roeder, E.R.,
Stengel-Rutkowski, S., Hennekam, R.C.M. and Muenke, M. (1998)
Holoprosencephaly due to mutation in ZIC2 a homologue of Drosophila
odd-paired. Nat. Genet., 20, 180–183.
2977
3. Muragaki, Y., Mundlos, S., Upton, J. and Olsen, B.R. (1996) Altered
growth and branching patterns in synpolydactyly caused by mutations in
HOXD13. Science, 272, 548–551.
4. Goodman, F.R., Bacchelli, C., Brady, A.F., Brueton, L.A., Fryns, J.P.,
Mortlock, D.P., Innis, J.W., Holmes, L.B., Donnenfeld, A.E., Feingold, M.
et al. (2000) Novel HOXA13 mutations and the phenotypic spectrum of
hand-foot-genital syndrome. Am. J. Hum. Genet., 67, 197–202.
5. Mundlos, S., Otto, F., Mundlos, C., Mulliken, J.B., Aylsworth, A.S.,
Albright, S., Lindhout, D., Cole, W.G., Henn, W., Knoll, J.H.M. et al.
(1997) Mutations involving the transcription factor CBFA1 cause
cleidocranial dysplasia. Cell, 89, 773–779.
6. Crisponi, L., Deiana, M., Loi, A., Chiappe, F., Uda, M., Amati, P.,
Bisceglia, L., Zelante, L., Nagaraja, R., Porcu, S. et al. (2001) The putative
forkhead transcription factor FOXL2 is mutated in blepharophimosis/
ptosis/epicanthus inversus syndrome. Nat. Genet., 27, 159–166.
7. Johnson, K.R., Sweet, H.O., Donahue, L.R., Ward-Bailey, P., Bronson, R.T.
and Davisson, M.T. (1998) A new spontaneous mouse mutation of
Hoxd13 with a polyalanine expansion and phenotype similar to human
synpolydactyly. Hum. Mol. Genet., 7, 1033–1038.
8. Stromme, P., Mangelsdorf, M.E., Shaw, M.A., Lower, K.M., Lewis, S.M.,
Bruyere, H., Lutcherath, V., Gedeon, A.K., Wallace, R.H., Scheffer, I.E.
et al. (2002) Mutations in the human ortholog of Aristaless cause X-linked
mental retardation and epilepsy. Nat. Genet., 30, 441–445.
9. Laumonnier, F., Ronce, N., Hamel, B.C., Thomas, P., Lespinasse, J.,
Raynaud, M., Paringaux, C., Van Bokhoven, H., Kalscheuer, V., Fryns, J.P. et al.
(2002) Transcription factor SOX3 is involved in X-linked mental retardation
with growth hormone deficiency. Am. J. Hum. Genet., 71, 1450–1455.
10. Amiel, J., Laudier, B., Attie-Bitach, T., Trang, H., De Pontual, L., Gener, B.,
Trochet, D., Etchevers, H., Ray, P., Simonneau, M. et al. (2003) Polyalanine
expansion and frameshift mutations of the paired-like homeobox gene
PHOX2B in congenital central hypoventilation syndrome. Nat. Genet.,
33, 459–461.
11. Goodman, F.R., Mundlos, S., Muragaki, Y., Donnai, D.,
Giovannucci-Uzielli, M.L., Lapi, E., Majewski, F., McGaughran, J.,
McKeown, C., Reardon, W. et al. (1997) Synpolydactyly phenotypes
correlate with size of expansions in HOXD13 polyalanine tract. Proc. Natl
Acad. Sci. USA, 94, 7458–7463.
12. Mizuki, N., Ota, M., Kimura, M., Ohno, S., Ando, H., Katsuyama, Y.,
Yamazaki, M., Watanabe, K., Goto, K., Nakamura, S. et al. (1997) Triplet
repeat polymorphism in the transmembrane region of the MICA gene: a
strong association of six GCT repetitions with Behcet disease. Proc. Natl
Acad. Sci. USA, 94, 1298–1303.
13. Shen, Q., Townes, P.L., Padden, C. and Newburger, P.E. (1994) An in-frame
trinucleotide repeat in the coding region of the human cellular glutathione
peroxidase (GPX1) gene: in vivo polymorphism and in vitro instability.
Genomics, 23, 292–294.
14. Pasche, B., Luo, Y., Rao, P.H., Nimer, S.D., Dmitrovsky, E., Caron, P.,
Luzzatto, L., Offit, K., Cordon-Cardo, C., Renault, B. et al. (1998) Type I
transforming growth factor beta receptor maps to 9q22 and exhibits a
polymorphism and a rare variant within a polyalanine tract. Cancer Res.,
58, 2727–2732.
15. Shriver, S.P., Shriver, M.D., Tirpak, D.L., Bloch, L.M., Hunt, J.D.,
Ferrell, R.E. and Siegfried, J.M. (1998) Trinucleotide repeat length variation
in the human ribosomal protein L14 gene (RPL14): localization to 3p21.3
and loss of heterozygosity in lung and oral cancers. Mutat. Res., 406, 9–23.
16. Acharya, S., Wilson, T., Gradia, S., Kane, M.F., Guerrette, S.,
Marsischky, G.T., Kolodner, R. and Fishel, R. (1996) hMSH2 forms
specific mispair-binding complexes with hMSH3 and hMSH6. Proc. Natl
Acad. Sci. USA, 93, 13629–13634.
17. Hishinuma, A., Ohyama, Y., Kuribayashi, T., Nagakubo, N., Namatame, T.,
Shibayama, K., Arisaka, O., Matsuura, N. and Ieiri, T. (2001)
Polymorphism of the polyalanine tract of thyroid transcription factor-2 gene
in patients with thyroid dysgenesis. Eur. J. Endocrinol., 145, 385–389.
18. Kim, M.K., Lesoon-Wood, L.A., Weintraub, B.D. and Chung, J.H. (1996)
A soluble transcription factor, Oct-1, is also found in the insoluble nuclear
matrix and possesses silencing activity in its alanine-rich domain. Mol.
Cell. Biol., 16, 4366–4377.
19. Licht, J.D., Grossel, M.J., Figge, J. and Hansen, U.M. (1990) Drosophila
Kruppel protein is a transcriptional repressor. Nature, 346, 76–79.
20. Licht, J.D., Ro, M., English, M.A., Grossel, M. and Hansen, U. (1993)
Selective repression of transcriptional activators at a distance by the
Drosophila Kruppel protein. Proc. Natl Acad. Sci. USA, 90,
11361–11365.
2978
Human Molecular Genetics, 2003, Vol. 12, No. 22
21. Licht, J.D., Hanna-Rose, W., Reddy, J.C., English, M.A., Ro, M.,
Grossel, M., Shaknovich, R. and Hansen, U. (1994) Mapping and
mutagenesis of the amino-terminal transcriptional repression domain of
the Drosophila Kruppel protein. Mol. Cell. Biol., 14, 4057–4066.
22. Han, K. and Manley, J.L. (1993) Transcriptional repression by the
Drosophila even-skipped protein: definition of a minimal repression
domain. Genes Dev., 7, 491–503.
23. Han, K. and Manley, J.L. (1993) Functional domains of the Drosophila
Engrailed protein. EMBO J., 12, 2723–2733.
24. Monuki, E.S., Kuhn, R. and Lemke, G. (1993) Cell-specific action and
mutable structure of a transcription factor effector domain. Proc. Natl Acad.
Sci. USA, 90, 9978–9982.
25. Simmons, A.H., Michal, C.A. and Jelinski, L.W. (1996) Molecular
orientation and two-component nature of the crystalline fraction of spider
dragline silk. Science, 271, 84–87.
26. Sudo, S., Fujikawa, T., Nagakura, T., Ohkubo, T., Sakaguchi, K.,
Tanaka, M., Nakashima, K. and Takahashi, T. (1997) Structures of
mollusc shell framework proteins. Nature, 387, 563–564.
27. Qin, X.X., Coyne, K.J. and Waite, J.H. (1997) Tough tendons. Mussel byssus
has collagen with silk-like domains. J. Biol. Chem., 272, 32623–32627.
28. Levy, Y., Jortner, J. and Becker, O.M. (2001) Solvent effects on the
energy landscapes and folding kinetics of polyalanine. Proc. Natl Acad. Sci.
USA, 98, 2188–2193.
29. Forood, B., Pérez-Payá, E., Houghten, R.A. and Blondelle, S.E. (1995)
Formation of an extremely stable polyalanine B-sheet macromolecule.
Biochem. Biophys. Res. Commun., 211, 7–13.
30. Vila, J.A., Ripoll, D.R. and Scheraga, H.A. (2000) Physical reasons for the
unusual alpha-helix stabilization afforded by charged or neutral polar
residues in alanine-rich peptides. Proc. Natl Acad. Sci. USA, 97,
13075–13079.
31. Blondelle, S.E., Forood, B., Houghten, R.A. and Pérez-Payá, E. (1997)
Polyalanine-based peptides as models for self-associated B-pleated-sheet
complexes. Biochemistry (Mosc.), 36, 8393–8400.
32. Sumiyama, K., Washio-Watanabe, K., Saitou, N., Hayakawa, T. and
Ueda, S. (1996) Class III POU genes: generation of homopolymeric amino
acid repeats under GC pressure in mammals. J. Mol. Evol., 43, 170–178.
33. Mortlock, D.P., Sateesh, P. and Innis, J.W. (2000) Evolution of N-terminal
sequences of the vertebrate HOXA13 protein. Mamm. Genome, 11, 151–158.
34. Benoit, B., Nemeth, A., Aulner, N., Kuhn, U., Simonelig, M., Wahle, E. and
Bourbon, H.M. (1999) The Drosophila poly(A)-binding protein II is
ubiquitous throughout Drosophila development and has the same function
in mRNA polyadenylation as its bovine homolog in vitro. Nucl. Acids Res.,
27, 3771–3778.
35. Gatesy, J., Hayashi, C., Motriuk, D., Woods, J. and Lewis, R. (2001)
Extreme diversity, conservation, and convergence of spider silk fibroin
sequences. Science, 291, 2603–2605.
36. Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E., Madden, T.L.,
Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., Tatusova, T.A.
et al. (2003) Database resources of the National Center for Biotechnology.
Nucl. Acids Res., 31, 28–33.
37. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) Codon usage tabulated
from international DNA sequence databases: status for the year 2000.
Nucl. Acids Res., 28, 292.
38. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J.,
Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial
sequencing and analysis of the human genome. Nature, 409, 860–921.
39. Reboul, J., Vaglio, P., Tzellas, N., Thierry-Mieg, N., Moore, T., Jackson, C.,
Shin-i.T., Kohara, Y., Thierry-Mieg, D., Thierry-Mieg, J. et al. (2001)
Open-reading-frame sequence tags (OSTs) support the existence of at least
17,300 genes in C. elegans. Nat. Genet., 27, 332–336.
40. Gopal, S., Schroeder, M., Pieper, U., Sczyrba, A., Aytekin-Kurban, G.,
Bekiranov, S., Fajardo, J.E., Eswar, N., Sanchez, R., Sali, A. et al. (2001)
Homology-based annotation yields 1,042 new candidate genes in the
Drosophila melanogaster genome. Nat. Genet., 27, 337–340.
41. Roest Crollius, H., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L.,
Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F. et al. (2000)
Estimate of human gene number provided by genome-wide analysis using
Tetraodon nigroviridis DNA sequence. Nat. Genet., 25, 235–238.
42. Ewing, B. and Green, P. (2000) Analysis of expressed sequence tags
indicates 35,000 human genes. Nat. Genet., 25, 232–234.
43. Banerjee-Basu, S. and Baxevanis, A.D. (2001) Molecular evolution of
the homeodomain family of transcription factors. Nucl. Acids Res.,
29, 3258–3269.
44. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W.
and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucl. Acids Res., 25,
3389–3402.
45. Bharathan, G., Janssen, B.J., Kellogg, E.A. and Sinha, N. (1997) Did
homeodomain proteins duplicate before the origin of angiosperms, fungi,
and metazoa? Proc. Natl Acad. Sci. USA, 94, 13749–13753.
46. Kappen, C., Schughart, K. and Ruddle, F.H. (1993) Early evolutionary
origin of major homeodomain sequence classes. Genomics, 18, 54–70.
47. Meyer, A. and Malaga-Trillo, E. (1999) Vertebrate genomics: more fishy
tales about Hox genes. Curr. Biol., 9, R210–213.
48. Ruddle, F.H., Bartels, J.L., Bentley, K.L., Kappen, C., Murtha, M.T. and
Pendleton, J.W. (1994) Evolution of Hox genes. Annu. Rev. Genet., 28,
423–442.
49. Finnerty, J.R. and Martindale, M.Q. (1998) The evolution of the Hox
cluster: insights from outgroups. Curr. Opin. Genet. Dev., 8, 681–687.
50. Gehring, W.J., Affolter, M. and Burglin, T. (1994) Homeodomain proteins.
Annu. Rev. Biochem., 63, 487–526.
51. Goodman, F., Giovannucci-Uzielli, M.L., Hall, C., Reardon, W., Winter, R.
and Scambler, P. (1998) Deletions in HOXD13 segregate with an identical,
novel foot malformation in two unrelated families. Am. J. Hum. Genet.,
63, 992–1000.
52. Davis, A.P. and Capecchi, M.R. (1996) A mutational analysis of the 50
HoxD genes: dissection of genetic interactions during limb development in
the mouse. Development, 122, 1175–1185.
53. Fromental-Ramain, C., Warot, X., Messadecq, N., LeMeur, M., Dolle, P.
and Chambon, P. (1996) Hoxa-13 and Hoxd-13 play a crucial role in the
patterning of the limb autopod. Development, 122, 2997–3011.
54. Otto, F., Thornell, A.P., Crompton, T., Denzel, A., Gilmour, K.C.,
Rosewell, I.R., Stamp, G.W.H., Beddington, R.S.P., Mundlos, S.,
Olsen, B.R. et al. (1997) Cbfa1, a candidate gene for cleidocranial
dysplasia syndrome, is essential for osteoblast differentiation and bone
development. Cell, 89, 765–771.
55. Zakany, J. and Duboule, D. (1996) Synpolydactyly in mice with a targeted
deficiency in the HoxD complex. Nature, 384, 69–71.
56. Dixon, M.J., Gazzard, J., Chaudhry, S.S., Sampson, N., Schulte, B.A. and
Steel, K.P. (1999) Mutation of the Na–K–Cl co-transporter gene Slc12a2
results in deafness in mice. Hum. Mol. Genet., 8, 1579–1584.
57. Warren, S.T. (1997) Polyalanine expansion in synpolydactyly might result
from unequal crossing-over of HOXD13. Science, 275, 408–409.
58. Mitas, M. (1997) Trinucleotide repeats associated with human disease.
Nucl. Acids Res., 25, 2245–2254.
59. Usdin, K. and Woodford, K.J. (1995) CGG repeats associated with DNA
instability and chromosome fragility form structures that block DNA
synthesis in vitro. Nucl. Acids Res., 23, 4202–4209.
60. Fry, M. and Loeb, L.A. (1994) The fragile X syndrome d(CGG)n nucleotide
repeats form a stable tetrahelical structure. Proc. Natl Acad. Sci. USA,
91, 4950–4954.
61. Mariappan, S.V., Catasti, P., Chen, X., Ratliff, R., Moyzis, R.K.,
Bradbury, E.M. and Gupta, G. (1996) Solution structures of the individual
single strands of the fragile X DNA triplets (GCC)n(GGC)n. Nucl. Acids
Res., 24, 784–792.
62. Mitas, M., Yu, A., Dill, J. and Haworth, I.S. (1995) The trinucleotide repeat
sequence d(CGG)15 forms a heat-stable hairpin containing GsynGanti
base pairs. Biochemistry (Mosc.), 34, 12803–12811.
63. Borstnik, B. and Pumpernik, D. (2002) Tandem repeats in protein coding
regions of primate genes. Genome Res., 12, 909–915.
64. Lanz, R.B., Wieland, S., Hug, M. and Rusconi, S. (1995) A transcriptional
repressor obtained by alternative translation of a trinucleotide repeat. Nucl.
Acids Res., 23, 138–145.
65. Meijer, D., Graus, A. and Grosveld, G. (1992) Mapping the transactivation
domain of the Oct-6 POU transcription factor. Nucl. Acids Res., 20,
2241–2247.
66. Chen, L., DeVries, A.L. and Cheng, C.H. (1997) Convergent evolution
of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod.
Proc. Natl Acad. Sci. USA, 94, 3817–3822.
67. Maruyama, K., Sato, N. and Ohta, N. (1999) Conservation of structure and
cold-regulation of RNA-binding proteins in cyanobacteria: probable
convergent evolution with eukaryotic glycinerich RNA-binding proteins.
Nucl. Acids Res., 27, 2029–2036.
68. Charron, F., Paradis, P., Bronchain, O., Nemer, G. and Nemer, M. (1999)
Cooperative interaction between GATA-4 and GATA-6 regulates
myocardial gene expression. Mol. Cell. Biol., 19, 4355–4365.
Human Molecular Genetics, 2003, Vol. 12, No. 22
69. McLarren, K.W., Theriault, F.M. and Stifani, S. (2001) Association with
the nuclear matrix and interaction with Groucho and RUNX proteins
regulate the transcription repression activity of the basic helix loop helix
factor Hes1. J. Biol. Chem., 276, 1578–1584.
70. Jimenez, G., Paroush, Z. and Ish-Horowicz, D. (1997) Groucho acts as a
corepressor for a subset of negative regulators, including Hairy and
Engrailed. Genes Dev., 11, 3072–3082.
71. Tolkunova, E.N., Fujioka, M., Kobayashi, M., Deka, D. and Jaynes, J.B.
(1998) Two distinct types of repression domain in engrailed: one interacts
with the groucho corepressor and is preferentially active on integrated target
genes. Mol. Cell. Biol., 18, 2804–2814.
72. Araki, I. and Nakamura, H. (1999) Engrailed defines the position of dorsal
di-mesencephalic boundary by repressing diencephalic fate. Development,
126, 5127–5135.
73. Kobayashi, M., Goldstein, R.E., Fujioka, M., Paroush, Z. and Jaynes, J.B.
(2001) Groucho augments the repression of multiple Even skipped target
genes in establishing parasegment boundaries. Development, 128,
1805–1815.
74. Muhr, J., Andersson, E., Persson, M., Jessell, T.M. and Ericson, J. (2001)
Groucho-mediated transcriptional repression establishes progenitor cell
pattern and neuronal fate in the ventral neural tube. Cell, 104, 861–873.
75. Briscoe, J., Pierani, A., Jessell, T.M. and Ericson, J. (2000) A homeodomain
protein code specifies progenitor cell identity and neuronal fate in the
ventral neural tube. Cell, 101, 435–445.
76. Aronson, B.D., Fisher, A.L., Blechman, K., Caudy, M. and Gergen, J.P.
(1997) Groucho-dependent and -independent repression activities of Runt
domain proteins. Mol. Cell. Biol., 17, 5581–5587.
77. Grbavec, D. and Stifani, S. (1996) Molecular interaction between TLE1 and
the carboxyl-terminal domain of HES-1 containing the WRPW motif.
Biochem. Biophys. Res. Commun., 223, 701–705.
78. Grbavec, D., Lo, R., Liu, Y., Greenfield, A. and Stifani, S. (1999) Groucho/
transducin-like enhancer of split (TLE) family members interact with the
yeast transcriptional co-repressor SSN6 and mammalian SSN6- related
proteins: implications for evolutionary conservation of transcription
repression mechanisms. Biochem. J., 337, 13–17.
79. Grbavec, D., Lo, R., Liu, Y. and Stifani, S. (1998) Transducin-like
Enhancer of split 2, a mammalian homologue of Drosophila Groucho, acts
as a transcriptional repressor, interacts with Hairy/Enhancer of split
proteins, and is expressed during neuronal development. Eur. J. Biochem.,
258, 339–349.
80. Firulli, B.A., Hadzic, D.B., McDaid, J.R. and Firulli, A.B. (2000) The basic
helix–loop–helix transcription factors dHAND and eHAND exhibit
dimerization characteristics that suggest complex regulation of function.
J. Biol. Chem., 275, 33567–33573.
81. Peltenburg, L.T. and Murre, C. (1996) Engrailed and Hox homeodomain
proteins contain a related Pbx interaction motif that recognizes a common
structure present in Pbx. EMBO J., 15, 3385–3393.
2979
82. Rubinsztein, D.C., Leggo, J., Chiano, M., Dodge, A., Norbury, G.,
Rosser, E. and Craufurd, D. (1997) Genotypes at the GluR6 kainate
receptor locus are associated with variation in the age of onset of
Huntington disease. Proc. Natl Acad. Sci. USA, 94, 3872–3876.
83. Hayes, S., Turecki, G., Brisebois, K., Lopes-Cendes, I., Gaspar, C.,
Riess, O., Ranum, L.P., Pulst, S.M. and Rouleau, G.A. (2000) CAG repeat
length in RAI1 is associated with age at onset variability in spinocerebellar
ataxia type 2 (SCA2). Hum. Mol. Genet., 9, 1753–1758.
84. Li, W., Jaroszewski, L. and Godzik, A. (2001) Clustering of highly
homologous sequences to reduce the size of large protein databases.
Bioinformatics, 17, 282–283.
85. Consortium, T.F. (2003) The FlyBase database of the Drosophila genome
projects and community literature. Nucl. Acids Res., 31, 172–175.
86. Harris, T.W., Lee, R., Schwarz, E., Bradnam, K., Lawson, D., Chen, W.,
Blasier, D., Kenny, E., Cunningham, F., Kishore, R. et al. (2003)
WormBase: a cross-species database for comparative genomics. Nucl. Acids
Res., 31, 133–137.
87. Alba, M.M., Santibanez-Koref, M.F. and Hancock, J.M. (1999)
Conservation of polyglutamine tract size between mice and humans
depends on codon interruption. Mol. Biol. Evol., 16, 1641–1644.
88. Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J.C.,
Hernandez-Boussard, T., Rees, C.A., Cherry, J.M., Botstein, D., Brown, P.O.
et al. (2003) SOURCE: a unified genomic resource of functional
annotations, ontologies, and gene expression data. Nucl. Acids Res.,
31, 219–223.
89. Henke, W., Herdel, K., Jung, K., Schnorr, D. and Loening, S.A. (1997)
Betaine improves the PCR amplification of GC-rich DNA sequences. Nucl.
Acids Res., 25, 3957–3958.
90. Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W.,
Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A. et al. (2003) The
Gene Ontology Annotation (GOA) Project: implementation of GO in
SWISS-PROT, TrEMBL, and InterPro. Genome Res., 13, 662–672.
91. Consortium, T.G.O. (2001) Creating the gene ontology resource: design
and implementation. Genome Res., 11, 1425–1433.
92. Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L.,
Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) Improving the accuracy
of PSI-BLAST protein database searches with composition-based statistics
and other refinements. Nucl. Acids Res., 29, 2994–3005.
93. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucl. Acids Res., 22, 4673–4680.
94. Felsenstein, J. (2001) Taking variation of evolutionary rates between sites
into account in inferring phylogenies. J. Mol. Evol., 53, 447–455.
95. Bruno, W.J., Socci, N.D. and Halpern, A.L. (2000) Weighted neighbor
joining: a likelihood-based approach to distance-based phylogeny
reconstruction. Mol. Biol. Evol., 17, 189–197.