
Genomic organization and functional characterization
of regulatory elements in higher eukaryotes
Boris Lenhard
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen, Norway
Genome comparison
reveals unknown functional elements
[Figure: % identity plot. Actin gene compared between human and mouse.]
Ultraconserved non-coding regions (UCR) in
vertebrate genomes
a.k.a. Conserved non-coding elements (CNE)
a.k.a. Conserved non-genic sequences (CNG)
a.k.a. Highly conserved non-coding regions (HCNR)
There exist unusually highly conserved noncoding
elements in vertebrate genomes
Ultraconserved regions (UCR) in vertebrate genomes

Definition of UCR






>= 50 bp
human:mouse identity >95%
no coding potential
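The two sequence criteria above (length and identity) can be sketched as a sliding-window scan over a pairwise human:mouse alignment. This is a minimal illustration, not the pipeline used for the talk: the function name `ucr_candidates`, the gapped-string input format and the merging of overlapping windows are all assumptions, and the coding-potential filter is left out entirely.

```python
def ucr_candidates(human, mouse, min_len=50, min_identity=0.95):
    """Return (start, end) alignment-column intervals (end exclusive) covered by
    at least one window of min_len columns whose identity exceeds min_identity.

    human, mouse: two rows of a pairwise alignment, equal length, '-' for gaps.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    n = len(human)
    # match[i] is 1 when column i holds the same base in both species;
    # gap columns never count as matches.
    match = [int(a == b and a != "-") for a, b in zip(human.upper(), mouse.upper())]
    regions = []
    if n < min_len:
        return regions
    window_matches = sum(match[:min_len])
    for start in range(n - min_len + 1):
        if start > 0:
            # slide the window one column to the right
            window_matches += match[start + min_len - 1] - match[start - 1]
        if window_matches / min_len > min_identity:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions
```

Candidates returned this way would still need to be checked for coding potential (for example against gene annotation) before being called UCRs.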
3583 human:mouse UCRs have detectable conservation in Fugu
A few [dozen] characterized, all as long-range enhancers
Many UCRs occur in clusters spanning hundreds of kilobases:
What genes are UCRs associated with?
| Nr. | Nr UCRs | Gene Symbol | Description | Interpro domains |
|---|---|---|---|---|
| 1 | 84 | MEIS2 | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Homeobox |
| 2 | 81 | ZFHX1B | zinc finger homeobox 1b | Homeobox; Zn-finger, C2H2 type |
| 3 | 80 | KIAA0390 | KIAA0390 gene product | Znf_C2H2, NLS_BP |
| 4 | 79 | EBF-3 | COE3_HUMAN, Transcription factor COE3 (Early B-cell factor 3) (EBF-3) | COE |
| 5 | 77 | ZNF503 | zinc finger protein 503 | |
| 6 | 64 | IRX-3; IRX-5; IRX-6 | Iroquois-class protein IRX-3; Iroquois-class protein IRX-5; Iroquois-class protein IRX-6 | Homeobox; Homeobox; Homeobox |
| 7 | 62 | PBX3 | pre-B-cell leukemia transcription factor 3 | PBX, Homeobox |
| 8 | 62 | NR2F1 | nuclear receptor subfamily 2, group F, member 1 | Hormone_rec_lig, Stdhrmn_receptor, Str_ncl_receptor, Znf_C4steroid |
| 9 | 60 | FOXP2; TFEC | forkhead box P2 (immune tolerance development); Similar to transcription factor EC | Involucrin_rpt, TF_Fork_head, Znf_C2H2; HLH_basic |
| 10 | 52 | DACH | dachshund homolog (Drosophila) | Transform_Ski |
Znf_PHD
Znf_C2H2 Eggshell
What genes are UCRs associated with?

10 top UCR
clusters:
Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.
| Nr. | Nr UCRs | Gene Symbol | Description | Interpro domains |
|---|---|---|---|---|
| 11 | 52 | PAX2 | paired box gene 2 (kidney, differentiation, eyes, CNS) | Paired_box, Homeobox |
| 12 | 52 | FOXP1 | forkhead box P1 (specification and differentiation of lung epithelium) | TF_Fork_head, Znf_C2H2 |
| 13 | 48 | BCL11A | B-cell lymphoma/leukemia 11A (B-cell CLL/lymphoma 11A) (COUP-TF interacting protein 1) (Ecotropic viral integration site 9 protein) (EVI-9) | Znf_C2H2 |
| 14 | 46 | IRX-4; IRX-2; IRX-1 | IRX-4; IRX-2; IRX-1 | Homeobox; Homeobox; Homeobox |
| 15 | 46 | ATF-2; EVX-2; HOX-D* | activating transcription factor 2 (brain); HOMEOBOX EVEN-SKIPPED HOMOLOG PROTEIN 2 (EVX-2); HOX-D cluster | Znf_C2H2, TF_bZIP; Homeobox, Antifreeze_1, HTH_lambrepressr, CytC_heme_bind; Homeobox, HTH_lambrepressr |
| 16 | 41 | NR4A2 | nuclear receptor subfamily 4, group A, member 2 (brain) | Znf_C4steroid, hormone_rec_lig |
| 17 | 39 | FOXD3 | forkhead, box D3, at chr1:63146833-63147169 | TF_Fork_head |
| 18 | 38 | LMO4; KIAA1221 | LIM domain only 4; KIAA1221 (brain) | LIM; Znf_C2H2 |
| 19 | 38 | ZNF407 | zinc finger protein 407 | Znf_C2H2 |
| 20 | 35 | MEIS1 | Meis1, myeloid ecotropic viral integration site 1 homolog (mouse) | Homeobox |
| 21 | 35 | ZFPM2 (FOG-2) | zinc finger protein, multitype 2 (Friend of GATA-2) (cardiogenesis, hematopoiesis) | Znf_C2H2 |
| 22 | 35 | TNRC9 | trinucleotide repeat containing 9 | Highmoblty_12, HMG-box, HMG_12_box |
| 23 | 33 | ZFH4 | zinc finger homeodomain 4 | AMP-bind, Homeobox, Somatotropin, Znf_C2H2, Znf_U1 |
| 24 | 32 | SOX6 | SRY (sex determining region Y)-box 6 | HMG_12_box, ATP_GTP_A, NLS_BP |
| 25 | 31 | FLJ20043 | Hypothetical protein FLJ20043 | CytC_heme_BS, Znf_C2H2 |
| 26 | 31 | OTP | orthopedia homolog (development of the neuroendocrine hypothalamus) | Homeobox, Homeo_OAR, HTH_lambrepressr |
| 27 | 30 | TCF7L2 | transcription factor 7-like 2 (T-cell specific, HMG-box) | HMG_box |
| 28 | 30 | SALL3 | Sal-like protein 3 (Zinc finger protein SALL3) (hSALL3) | Znf_C2H2 |
| 29 | 27 | BUB3 | Mitotic checkpoint protein BUB3 | WD40 |
| 30 | 26 | TFAP2A | Transcription factor AP-2 alpha (AP2-alpha) (Activating enhancer-binding protein 2 alpha) (AP-2 transcription factor) (Activator protein-2) (AP-2) | TF_AP2, TF_AP2_alpha |
What genes are UCRs associated with?


Of the 150 most prominent UCR clusters, at least 144 coincide
with one or more genes for DNA-binding proteins (generally
transcription factors)
Among them are most key regulators of animal development:


HOX clusters, Iroquois genes, GSH1, GSH2, PPARg, LMO1…
Many are associated with malignancies and recurring
chromosomal breakpoints/rearrangement sites:

MEIS2, PBX3, BCL11A, MEIS1, LMO4, BCL11B, EVI1...
Quantitative evidence I:
Categories of genes in the vicinity of UCRs
Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B.
(2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes
in vertebrate genomes. BMC Genomics 5:99.
| Domain description | INTERPRO ID | Fisher test P value | Corrected P value |
|---|---|---|---|
| HTH_lambrepressr | IPR000047 | 6.40E-20 | 5.36E-17 |
| Homeobox | IPR001356 | 1.60E-12 | 1.34E-09 |
| Antennapedia | IPR001827 | 1.37E-10 | 1.15E-07 |
| Paired_box | IPR001523 | 2.39E-05 | 2.00E-02 |
| HLH_basic | IPR001092 | 2.40E-05 | 2.01E-02 |
| POU_domain | IPR000327 | 3.06E-05 | 2.56E-02 |
| Homeo_OAR | IPR003654 | 3.08E-05 | 2.58E-02 |
| TF_Fork_head | IPR001766 | 6.15E-05 | 5.15E-02 |
| Znf_C4steroid | IPR001628 | 7.45E-05 | 6.23E-02 |
| Hormone_rec_lig | IPR000536 | 1.06E-04 | 8.86E-02 |
| HMG_12_box | IPR000910 | 1.81E-04 | 1.51E-01 |
| Stdhrmn_receptor | IPR001723 | 2.63E-04 | 2.20E-01 |
| COUP_TF | IPR003068 | 7.62E-04 | 6.38E-01 |
| LIM | IPR001781 | 1.10E-03 | 9.18E-01 |
| RtnoidX_receptor | IPR000003 | 1.28E-03 | 1.07E+00 |
| FN_III | IPR003961 | 2.57E-03 | 2.15E+00 |
Over-representation of protein domains in genes
flanking UCRs. Bonferroni-corrected and uncorrected
Fisher Exact Test p-values are shown for the 16 most
over-represented INTERPRO domains. Typical
transcription factor domains are in bold.
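For readers who want to see how such a table can be computed, here is a minimal sketch of a one-sided Fisher exact test followed by a Bonferroni correction. The function names and the gene counts in the usage note are hypothetical and not taken from the study. The correction is left uncapped because the table lists corrected values above 1; the ratio of the two columns (5.36E-17 / 6.40E-20) suggests a multiplier of roughly 840, but the exact number of domains tested is not stated on the slide.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact P-value for over-representation.

    k: UCR-flanking genes carrying the domain
    n: all UCR-flanking genes
    K: all genes in the genome carrying the domain
    N: all genes in the genome
    Returns P(X >= k) under the hypergeometric null.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def bonferroni(p, n_tests):
    """Bonferroni correction: multiply by the number of tests (not capped at 1)."""
    return p * n_tests
```

With made-up counts, `fisher_enrichment_p(40, 300, 200, 20000)` would test whether 40 domain-carrying genes among 300 UCR-flanking genes is more than expected when 200 of 20000 genes carry the domain, and `bonferroni(p, n_tests)` would then scale the result by the number of domains examined.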


50% of homeobox-containing genes, 20% of forkheads, 20% of nuclear receptors and 8% of zinc finger proteins are within 200 kb of a UCR
Only 3% of random genes are within 200 kb of a UCR
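A proximity fraction of this kind can be sketched as an interval lookup. The data layout below (gene tuples, a dictionary of UCR midpoints per chromosome) and the function name `fraction_near_ucr` are assumptions made for illustration. Distance is measured from the gene boundaries, which may differ from the convention used in the study.

```python
from bisect import bisect_left

def fraction_near_ucr(genes, ucr_positions, max_dist=200_000):
    """Fraction of genes with at least one UCR within max_dist bp.

    genes: list of (chrom, start, end) tuples
    ucr_positions: dict mapping chrom -> list of UCR midpoint coordinates
    """
    sorted_ucrs = {chrom: sorted(pos) for chrom, pos in ucr_positions.items()}
    near = 0
    for chrom, start, end in genes:
        pos = sorted_ucrs.get(chrom, [])
        # first UCR at or beyond (start - max_dist); the gene counts as "near"
        # if that UCR also lies at or before (end + max_dist)
        i = bisect_left(pos, start - max_dist)
        if i < len(pos) and pos[i] <= end + max_dist:
            near += 1
    return near / len(genes) if genes else 0.0
```

Running the same function on a gene family (for example all homeobox-containing genes) and on a random gene sample gives the two percentages being compared.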
What is the function of UCRs (cont’d)?


Most known ones: enhancers
A very small fraction: pre-microRNA genes



can be easily distinguished from putative enhancer elements
A distinct conservation pattern between mammals and fish
Different binding site pattern composition than most other UCRs
Pre-miRNA gene
Putative conserved regulatory elements show distinct motif compositions

Mean number of sites (per 400 bp) detected in conserved noncoding elements:

| Motif | Elements containing known pre-miRNA genes | Elements associated with embryonic development genes | Is the difference significant (T-test)? |
|---|---|---|---|
| sox-5 | 4.46 | 6.12 | Yes (P=0.002) |
| oct family | 2.54 | 4.66 | Yes (P=2.3e-06) |
| homeobox domain | 4.06 | 9.56 | Yes (P=1.8e-09) |
| gsh2 | 1.80 | 5.13 | Yes (P=7.8e-15) |
| nkx2.2 | 1.11 | 1.05 | No (P=0.83) |
| nkx6.1 | 2.43 | 5.83 | Yes (P=1.1e-08) |
| pax6 | 0.32 | 0.50 | No (P=0.07) |
MOST UCRs CONTAIN A HIGH DENSITY OF BINDING SITES FOR
KEY DEVELOPMENTAL TRANSCRIPTION FACTORS.
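A comparison of mean site counts between two groups of elements can be sketched as below. The slides only say "T-test"; using Welch's unequal-variance form is an assumption here, and the function name `welch_t` is illustrative. Only the test statistic and degrees of freedom are computed, since the P-value needs the t-distribution, which is not in the Python standard library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b: per-element site counts (sites per 400 bp) for the two groups,
    each with at least two observations.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a two-sided P-value.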
Can we recognize the “neural” ultraconserved
enhancers?

Most UCRs show a high overrepresentation of a number of
putative transcription factor binding site motifs:


General homeobox motifs, Sox (SRY) and Oct (POU)
Sox2 and Oct3/4 are highly expressed in mouse ES cells
(Nagano K et al. (2005) Proteomics 5:1346-61)

Oct and Sox transcription factors
control many different aspects of
neural development and
embryogenesis, often binding to
adjacent sites on DNA
Williams, D. C. et al. (2004) J. Biol. Chem. 279:1449-1457
The SPH (Sox-Oct-Homeobox) model:
A simple screen to select UCRs governing neural expression

The model measures the combined probability of occurrence of
Sox, Oct (POU) and core homeobox motifs in 400 bp regions
centered on UCRs
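One way to picture a combined motif score of this kind is sketched below. It is not the SPH model itself, whose scoring details are not given on the slides. The consensus strings in `MOTIFS` are placeholders chosen for illustration, the background assumes uniform independent bases, and the three motif classes are combined by summing -log10 Poisson tail probabilities, which assumes they occur independently. All function names are hypothetical.

```python
from math import exp, log10

# Placeholder consensus strings: assumptions for illustration only,
# not the binding-site models used in the talk.
MOTIFS = {"sox": "ACAAAG", "oct": "ATGCAAAT", "homeobox": "TAAT"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def centered_window(chrom_seq, ucr_start, ucr_end, size=400):
    """Window of `size` bp centered on the UCR midpoint (clipped at sequence start)."""
    mid = (ucr_start + ucr_end) // 2
    left = max(mid - size // 2, 0)
    return chrom_seq[left:left + size]

def count_sites(seq, motif):
    """Count (possibly overlapping) exact matches of motif on both strands."""
    patterns = {motif, revcomp(motif)}
    return sum(seq.startswith(p, i) for i in range(len(seq)) for p in patterns)

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term, cdf = exp(-lam), 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(1.0 - cdf, 0.0)

def sph_score(window):
    """Sum of -log10 tail probabilities over the three motif classes."""
    window = window.upper()
    score = 0.0
    for motif in MOTIFS.values():
        strands = len({motif, revcomp(motif)})
        k = count_sites(window, motif)
        # expected number of chance matches under uniform, independent bases
        lam = strands * max(len(window) - len(motif) + 1, 0) * 0.25 ** len(motif)
        score += -log10(max(poisson_tail(k, lam), 1e-300))
    return score
```

Higher scores mean that the observed number of Sox, Oct and homeobox matches in the window is less likely under the background; a real screen would replace the exact-match consensus strings with position weight matrices and an empirical background.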
SPH-enriched UCRs around genes coding for known
neural patterning regulators
SPH-model detects genomic regions
with neural expression
UCRs: common to all metazoan genomes?
Drosophila: ETS, TIR, Homeobox, Paired, Cfc4, NHR ligand, von Willebrand factor type C domain, Laminin G, Immunoglobulin, Fibronectin type III, Cadherin, Cyclic nucleotide-binding domain, Neurotransmitter-gated ion-channel transmembrane region, Ligand-gated ion channel, Neurotransmitter-gated ion-channel ligand binding domain, BTB/POZ domain
Vertebrates: HTH_lambrepressr, Homeobox, Antennapedia, Paired_box, HLH_basic, POU_domain, Homeo_OAR, TF_Fork_head, Znf_C4steroid, Hormone_rec_lig, HMG_12_box, Stdhrmn_receptor, COUP_TF, LIM, RtnoidX_receptor, FN_III (the 16 most over-represented INTERPRO domains, with the same IDs and P-values as in the over-representation table above)
UCRs in Drosophila

twist locus
Core promoters and responsiveness to
long-range enhancers
A textbook-type core promoter
GC-box
CAAT
TATA
Large-scale mapping of transcription start sites
using CAGE (Cap Analysis of Gene Expression)

Like SAGE, but sampling the 5’ ends of cDNAs (using RIKEN 5’ GTP
cap trapping technology)
Large-scale sequencing of 5’ ends (CAGE tags of 20-22 nucleotides) of mRNAs:

~6.5 million mouse and ~4 million human CAGE tags
uniquely mapped to genome
CAGE tags mapped to genome
demarcate transcription start sites
Myosin heavy chain 3 (Myh3), 1725 CAGE tags
TATA
Betaine-homocysteine methyltransferase (Bhmt), 1659 CAGE tags
TATA
CAGE tags mapped to genome
demarcate transcription start sites
Oxoglutarate dehydrogenase (Ogdh), 1496 CAGE tags
Adenylosuccinate lyase (Adsl), 278 CAGE tags
Single-peak (SP) vs. broad (BR) core promoters:
“shape classes” of core promoters
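To make the idea of a shape class concrete, here is a minimal sketch that labels a cluster of mapped CAGE tag positions as single-peak or broad. The rule used (share of tags within a few bp of the dominant position) and both thresholds are illustrative assumptions, not the classification criteria behind the results in this talk, and the function name `classify_shape` is hypothetical.

```python
from collections import Counter

def classify_shape(tag_positions, half_width=2, sp_fraction=0.7):
    """Label a CAGE tag cluster as 'SP' (single peak) or 'BR' (broad).

    tag_positions: genomic coordinates of the 5' ends of all tags in one cluster.
    half_width, sp_fraction: illustrative thresholds, not published values.
    """
    if not tag_positions:
        raise ValueError("no tags in cluster")
    counts = Counter(tag_positions)
    # dominant start site: highest tag count, ties broken by smallest coordinate
    dominant, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    near = sum(c for pos, c in counts.items() if abs(pos - dominant) <= half_width)
    return "SP" if near / len(tag_positions) >= sp_fraction else "BR"
```

A cluster where nearly all tags start at one position comes out as SP, while tags spread evenly over a hundred bp come out as BR.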
Association of shape classes with different core
promoter elements
Overrepresentation and underrepresentation of core promoter elements in different shape classes (P-values).

A

| Element | SP | BR | PB | MU |
|---|---|---|---|---|
| TATA (all) | 3.1e-73 | 1.9e-16 | 1.8e-10 | 2.4e-09 |
| CCAAT (all) | 0.04 | 0.42 | 0.37 | 0.49 |
| GC (all) | 1e-4 | 0.20 | 0.40 | 0.33 |
| CpG (all) | 1.0e-137 | 1.4e-65 | 8.7e-06 | 0.02 |

B

| Element | SP | BR | PB | MU |
|---|---|---|---|---|
| TATA (no CpG) | 2.6e-77 | 1.6e-16 | 2.8e-16 | 1.0e-09 |
| CCAAT (no CpG) | 6.8e-23 | 9.2e-16 | 0.11 | 0.42 |
| GC (no CpG) | 7.8e-25 | 5.9e-18 | 0.48 | 0.35 |
| CpG (no TATA, CCAAT or GC) | 4.8e-45 | 4.7e-17 | 3.4e-05 | 0.87 |
SP (single peak) promoters: strongly associated with TATA boxes
BR (broad) promoters: strongly associated with CpG islands and absence of TATA box
Association of shape classes with tissue specificity
| Tissue | SP | BR | PB | MU |
|---|---|---|---|---|
| adipose | 1.98 (P=0.14) | 0.27 (P=0.11) | 1.58 (P=0.29) | 0.44 (P=0.47) |
| cns | 1.02 (P=0.86) | 0.69 (P=0.0020) | 1.22 (P=0.10) | 1.23 (P=0.10) |
| embryo | 4.11 (P=1.21e-22) | 0.00 (P=6.22e-08) | 0.30 (P=0.0099) | 0.00 (P=8.096e-05) |
| liver | 2.15 (P=3.56e-21) | 0.41 (P=1.14e-14) | 0.71 (P=0.0053) | 1.07 (P=0.56) |
| lung | 2.41 (P=1.37e-10) | 0.23 (P=1.42e-08) | 1.11 (P=0.61) | 0.58 (P=0.049) |
| macrophage | 1.39 (P=0.024) | 0.64 (P=0.0041) | 0.89 (P=0.59) | 1.26 (P=0.14) |
| other | 3.59 (P=3.87e-19) | 0.11 (P=4.029e-07) | 0.33 (P=0.0049) | 0.36 (P=0.016) |
| testis | 4.36 (P=7.70e-06) | 0.00 (P=0.058) | 0.00 (P=0.21) | 0.00 (P=0.21) |

[Color legend: P-value scale from 1.00 down to 1e-10, shown separately for overrepresented and underrepresented classes.]
SP (single peak) promoters (and, by association, TATA-box promoters): strongly associated with tissue-specific genes (except brain)
BR (broad) promoters (and, by association, CpG island overlapping & TATA-less promoters): strongly associated with housekeeping genes (and developmental regulatory genes)
Conclusions



Key vertebrate (and most likely invertebrate) transcription factor genes are controlled by arrays of highly conserved regulatory elements; the arrays often span more than a megabase around their target genes.
Highly conserved regulatory elements contain clusters of putative transcription factor binding sites indicative of their function, enabling the building of predictive models.
There are fundamentally different classes of vertebrate core promoters, differing in mechanism of transcriptional initiation and choice of TSS, tissue specificity, evolutionary dynamics and responsiveness to long-range enhancers.
Acknowledgements

Lenhard Group at CGB, Karolinska Institutet (now at Bergen Center for
Computational Science, University of Bergen)





Pär Engström (PhD student)
Ying Sheng (PhD student)
Albin Sandelin (Postdoc) – now at RIKEN GSC
Sara Bruce (Project student) – now at Dept. of Bioscience, Karolinska Institutet
Collaborators

RIKEN Genome Science Center: Piero Carninci and the members of FANTOM3 Consortium
Wyeth Wasserman group (University of British Columbia): Shannan Ho Sui, David Arenillas
Johan Ericson group (CMB, Karolinska Institutet): Peter Bailey, Joanna Klos