Genomic organization and functional characterization of regulatory elements in higher eukaryotes

Boris Lenhard
Computational Biology Unit, Bergen Center for Computational Science, University of Bergen, Norway

Genome comparison reveals unknown functional elements
[Figure: percent identity plots of the actin gene compared between human and mouse]

Ultraconserved non-coding regions (UCR) in vertebrate genomes
- a.k.a. Conserved non-coding elements (CNE)
- a.k.a. Conserved non-genic sequences (CNG)
- a.k.a. Highly conserved non-coding regions (HCNR)
- Vertebrate genomes contain unusually highly conserved noncoding elements

Ultraconserved regions (UCR) in vertebrate genomes
Definition of UCR:
- >= 50 bp
- human:mouse identity > 95%
- no coding potential
3583 human:mouse UCRs have detectable conservation in Fugu
A few [dozen] characterized, all as long-range enhancers
Many UCRs occur in clusters spanning hundreds of kilobases

What genes are UCRs associated with?
Top UCR clusters. Source: Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.

Nr. | Nr. of UCRs | Gene symbol | Description | InterPro domains
1 | 84 | MEIS2 | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Homeobox
2 | 81 | ZFHX1B | zinc finger homeobox 1b | Homeobox; Zn-finger, C2H2 type
3 | 80 | KIAA0390 | KIAA0390 gene product | Znf_C2H2; NLS_BP
4 | 79 | EBF-3 | COE3_HUMAN, Transcription factor COE3 (Early B-cell factor 3) (EBF-3) | COE
5 | 77 | ZNF503 | zinc finger protein 503 |
6 | 64 | IRX-3; IRX-5; IRX-6 | Iroquois-class protein IRX-3; Iroquois-class protein IRX-5; Iroquois-class protein IRX-6 | Homeobox; Homeobox; Homeobox
7 | 62 | PBX3 | pre-B-cell leukemia transcription factor 3 | PBX; Homeobox
8 | 62 | NR2F1 | nuclear receptor subfamily 2, group F, member 1 | Hormone_rec_lig; Stdhrmn_receptor; Str_ncl_receptor; Znf_C4steroid
9 | 60 | FOXP2; TFEC | forkhead box P2 (immune tolerance development); similar to transcription factor EC | FOXP2: Involucrin_rpt, TF_Fork_head, Znf_C2H2; TFEC: HLH_basic
10 | 52 | DACH | dachshund homolog (Drosophila) | Transform_Ski
11 | 52 | PAX2 | paired box gene 2 (kidney, differentiation, eyes, CNS) | Paired_box; Homeobox
12 | 52 | FOXP1 | forkhead box P1 (specification and differentiation of lung epithelium) | TF_Fork_head; Znf_C2H2
13 | 48 | BCL11A | B-cell lymphoma/leukemia 11A (B-cell CLL/lymphoma 11A) (COUP-TF interacting protein 1) (Ecotropic viral integration site 9 protein) (EVI-9) | Znf_C2H2
14 | 46 | IRX-4; IRX-2; IRX-1 | IRX-4; IRX-2; IRX-1 | Homeobox; Homeobox; Homeobox
15 | 46 | ATF-2; EVX-2; HOX-D* | activating transcription factor 2 (brain); homeobox even-skipped homolog protein 2 (EVX-2); HOX-D cluster | ATF-2: Znf_C2H2, TF_bZIP; EVX-2: Homeobox, Antifreeze_1, HTH_lambrepressr, CytC_heme_bind; HOX-D: Homeobox, HTH_lambrepressr
16 | 41 | NR4A2 | nuclear receptor subfamily 4, group A, member 2 (brain) | Znf_C4steroid; hormone_rec_lig
17 | 39 | FOXD3 | forkhead box D3, at chr1:63146833-63147169 | TF_Fork_head
18 | 38 | LMO4; KIAA1221 | LIM domain only 4; KIAA1221 (brain) | LMO4: LIM; KIAA1221: Znf_C2H2
19 | 38 | ZNF407 | zinc finger protein 407 | Znf_C2H2
20 | 35 | MEIS1 | Meis1, myeloid ecotropic viral integration site 1 homolog (mouse) | Homeobox
21 | 35 | ZFPM2 (FOG-2) | zinc finger protein, multitype 2 (Friend of GATA-2) (cardiogenesis, hematopoiesis) | Znf_C2H2
22 | 35 | TNRC9 | trinucleotide repeat containing 9 | Highmoblty_12; HMG-box; HMG_12_box
23 | 33 | ZFH4 | zinc finger homeodomain 4 | AMP-bind; Homeobox; Somatotropin; Znf_C2H2; Znf_U1
24 | 32 | SOX6 | SRY (sex determining region Y)-box 6 | HMG_12_box; ATP_GTP_A; NLS_BP
25 | 31 | FLJ20043 | Hypothetical protein FLJ20043 | CytC_heme_BS; Znf_C2H2
26 | 31 | OTP | orthopedia homolog (development of the neuroendocrine hypothalamus) | Homeobox; Homeo_OAR; HTH_lambrepressr
27 | 30 | TCF7L2 | transcription factor 7-like 2 (T-cell specific, HMG-box) | HMG_box
28 | 30 | SALL3 | Sal-like protein 3 (Zinc finger protein SALL3) (hSALL3) | Znf_C2H2
29 | 27 | BUB3 | Mitotic checkpoint protein BUB3 | WD40
30 | 26 | TFAP2A | Transcription factor AP-2 alpha (AP2-alpha) (Activating enhancer-binding protein 2 alpha) (AP-2 transcription factor) (Activator protein-2) (AP-2) | TF_AP2; TF_AP2_alpha

Additional InterPro domain entries on the first table slide (row assignment unclear): Znf_PHD, Znf_C2H2, Eggshell

What genes are UCRs associated with? (cont’d)
- Out of the 150 most prominent UCR clusters, at least 144 coincide with one or more genes for DNA-binding proteins (generally transcription factors)
- Among them are most key regulators of animal development: HOX clusters, Iroquois genes, GSH1, GSH2, PPARg, LMO1…
- Many are associated with malignancies and recurring chromosomal breakpoints/rearrangement sites: MEIS2, PBX3, BCL11A, MEIS1, LMO4, BCL11B, EVI1...

Quantitative evidence I: Categories of genes in the vicinity of UCRs
Source: Sandelin et al. (2004) BMC Genomics 5:99.

Domain description | INTERPRO ID | Fisher test P value | Corrected P value
HTH_lambrepressr | IPR000047 | 6.40E-20 | 5.36E-17
Homeobox | IPR001356 | 1.60E-12 | 1.34E-09
Antennapedia | IPR001827 | 1.37E-10 | 1.15E-07
Paired_box | IPR001523 | 2.39E-05 | 2.00E-02
HLH_basic | IPR001092 | 2.40E-05 | 2.01E-02
POU_domain | IPR000327 | 3.06E-05 | 2.56E-02
Homeo_OAR | IPR003654 | 3.08E-05 | 2.58E-02
TF_Fork_head | IPR001766 | 6.15E-05 | 5.15E-02
Znf_C4steroid | IPR001628 | 7.45E-05 | 6.23E-02
Hormone_rec_lig | IPR000536 | 1.06E-04 | 8.86E-02
HMG_12_box | IPR000910 | 1.81E-04 | 1.51E-01
Stdhrmn_receptor | IPR001723 | 2.63E-04 | 2.20E-01
COUP_TF | IPR003068 | 7.62E-04 | 6.38E-01
LIM | IPR001781 | 1.10E-03 | 9.18E-01
RtnoidX_receptor | IPR000003 | 1.28E-03 | 1.07E+00
FN_III | IPR003961 | 2.57E-03 | 2.15E+00

Over-representation of protein domains in genes flanking UCRs. Bonferroni-corrected and uncorrected Fisher Exact Test p-values are shown for the 16 most over-represented INTERPRO domains. Typical transcription factor domains are in bold on the original slide.
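The over-representation test behind the domain table can be illustrated with a minimal sketch. It assumes a one-sided Fisher exact test on a 2x2 table (genes with or without a domain, flanking or not flanking a UCR) followed by a plain Bonferroni multiplication. The function names and all gene counts are hypothetical, and the number of tests (838) is only inferred from the ratio between the two P-value columns in the table; the slides do not state it.

```python
from math import comb


def fisher_overrep_p(k, n, K, N):
    """One-sided Fisher exact P value for over-representation.

    k: genes flanking UCRs that carry the domain
    n: all genes flanking UCRs
    K: all genes that carry the domain
    N: all genes considered
    Returns P(X >= k) under the hypergeometric null.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def bonferroni(p, n_tests, cap=False):
    """Multiply p by the number of tests; optionally cap the result at 1."""
    corrected = p * n_tests
    return min(corrected, 1.0) if cap else corrected


# Hypothetical counts, for illustration only (not taken from the study).
p_raw = fisher_overrep_p(k=100, n=600, K=200, N=20000)
print("uncorrected:", p_raw)
print("corrected:  ", bonferroni(p_raw, n_tests=838))
```

Leaving the correction uncapped reproduces corrected values above 1, as in the last two rows of the table; passing `cap=True` is the more common convention. If SciPy is available, `scipy.stats.fisher_exact` with `alternative="greater"` should give the same uncorrected tail probability.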
50% of Homeobox-containing genes, 20% of forkheads, 20% of nuclear receptors and 8% of zinc finger proteins are within 200 kb of a UCR
Only 3% of random genes are within 200 kb of a UCR

What is the function of UCRs (cont’d)?
- Most known ones: enhancers
- A very small fraction: pre-microRNA genes, which can be easily distinguished from putative enhancer elements by:
  - A distinct conservation pattern between mammals and fish
  - A different binding site pattern composition than most other UCRs
[Figure: pre-miRNA gene]

Putative conserved regulatory elements show distinct motif compositions
Mean number of sites (per 400 bp) detected in conserved noncoding elements:

Motif | Elements containing known pre-miRNA genes | Elements associated with embryonic development genes | Is the difference significant (t-test)?
sox-5 | 4.46 | 6.12 | Yes (P=0.002)
oct family | 2.54 | 4.66 | Yes (P=2.3e-06)
homeobox domain | 4.06 | 9.56 | Yes (P=1.8e-09)
gsh2 | 1.80 | 5.13 | Yes (P=7.8e-15)
nkx2.2 | 1.11 | 1.05 | No (P=0.83)
nkx6.1 | 2.43 | 5.83 | Yes (P=1.1e-08)
pax6 | 0.32 | 0.50 | No (P=0.07)

MOST UCRs CONTAIN A HIGH DENSITY OF BINDING SITES FOR KEY DEVELOPMENTAL TRANSCRIPTION FACTORS.

Can we recognize the “neural” ultraconserved enhancers?
- Most UCRs show a high overrepresentation of a number of putative transcription factor binding site motifs: general homeobox motifs, Sox (SRY) and Oct (POU)
- Sox2 and Oct3/4 are highly expressed in mouse ES cells (Nagano K et al. (2005) Proteomics 5:1346-61)
- Oct and Sox transcription factors control many different aspects of neural development and embryogenesis, often binding to adjacent sites on DNA (Williams DC et al. (2004) J. Biol. Chem. 279:1449-1457)

The SPH (Sox-Oct-Homeobox) model: a simple screen to select UCRs governing neural expression
- The model measures the combined probability of occurrence of Sox, Oct (POU) and core homeobox motifs in 400 bp regions centered on UCRs
[Figure: SPH-enriched UCRs around genes coding for known neural patterning regulators]

The SPH model detects genomic regions with neural expression

UCRs: common to all metazoan genomes?
Over-represented protein domains in genes flanking UCRs.
Vertebrates (INTERPRO IDs and P values as in the table under "Quantitative evidence I"): HTH_lambrepressr, Homeobox, Antennapedia, Paired_box, HLH_basic, POU_domain, Homeo_OAR, TF_Fork_head, Znf_C4steroid, Hormone_rec_lig, HMG_12_box, Stdhrmn_receptor, COUP_TF, LIM, RtnoidX_receptor, FN_III
Drosophila: ETS, TIR, Homeobox, Paired, Cfc4, NHR ligand, von Willebrand factor type C domain, Laminin G, Immunoglobulin, Fibronectin type III, Cadherin, Cyclic nucleotide-binding domain, Neurotransmitter-gated ion-channel transmembrane region, Ligand-gated ion channel, Neurotransmitter-gated ion-channel ligand binding domain, BTB/POZ domain

[Figure: UCRs in the Drosophila twist locus]

Core promoters and responsiveness to long-range enhancers

A textbook-type core promoter
[Figure: core promoter elements GC-box, CAAT, TATA]

Large-scale mapping of transcription start sites using CAGE (Cap Analysis of Gene Expression)
- Like SAGE, but 5’ ends of cDNAs (using RIKEN 5’ GTP cap trapping technology)
- Large-scale sequencing of 5’ ends of mRNAs (CAGE tags of 20-22 nucleotides)
- ~6.5 million mouse and ~4 million human CAGE tags uniquely mapped to genome

CAGE tags mapped to genome demarcate transcription start sites
- Myosin heavy chain 3 (Myh3), 1725 CAGE tags (TATA)
- Betaine-homocysteine methyltransferase (Bhmt), 1659 CAGE tags (TATA)
- Oxoglutarate dehydrogenase (Ogdh), 1496 CAGE tags
- Adenylosuccinate lyase (Adsl), 278 CAGE tags

Single-peak (SP) vs. broad (BR) core promoters: “shape classes” of core promoters

Association of shape classes with different core promoter elements
Overrepresentation and underrepresentation of core promoter elements in different shape classes (P values).

A | SP | BR | PB | MU
TATA (all) | 3.1e-73 | 1.9e-16 | 1.8e-10 | 2.4e-09
CCAAT (all) | 0.04 | 0.42 | 0.37 | 0.49
GC (all) | 1e-4 | 0.20 | 0.40 | 0.33
CpG (all) | 1.0e-137 | 1.4e-65 | 8.7e-06 | 0.02

B | SP | BR | PB | MU
TATA (no CpG) | 2.6e-77 | 1.6e-16 | 2.8e-16 | 1.0e-09
CCAAT (no CpG) | 6.8e-23 | 9.2e-16 | 0.11 | 0.42
GC (no CpG) | 7.8e-25 | 5.9e-18 | 0.48 | 0.35
CpG (no TATA, CCAAT or GC) | 4.8e-45 | 4.7e-17 | 3.4e-05 | 0.87

- SP (single peak) promoters: strongly associated with TATA boxes
- BR (broad) promoters: strongly associated with CpG islands and absence of TATA box

Association of shape classes with tissue specificity

Tissue | SP | BR | PB | MU
adipose | 1.98 (P=0.14) | 0.27 (P=0.11) | 1.58 (P=0.29) | 0.44 (P=0.47)
cns | 1.02 (P=0.86) | 0.69 (P=0.0020) | 1.22 (P=0.10) | 1.23 (P=0.10)
embryo | 4.11 (P=1.21e-22) | 0.00 (P=6.22e-08) | 0.30 (P=0.0099) | 0.00 (P=8.096e-05)
liver | 2.15 (P=3.56e-21) | 0.41 (P=1.14e-14) | 0.71 (P=0.0053) | 1.07 (P=0.56)
lung | 2.41 (P=1.37e-10) | 0.23 (P=1.42e-08) | 1.11 (P=0.61) | 0.58 (P=0.049)
macrophage | 1.39 (P=0.024) | 0.64 (P=0.0041) | 0.89 (P=0.59) | 1.26 (P=0.14)
other | 3.59 (P=3.87e-19) | 0.11 (P=4.029e-07) | 0.33 (P=0.0049) | 0.36 (P=0.016)
testis | 4.36 (P=7.70e-06) | 0.00 (P=0.058) | 0.00 (P=0.21) | 0.00 (P=0.21)

[Color legend on the slide: overrepresented and underrepresented cells are shaded on a P-value scale of 1e-10, 1e-06, 0.0001, 0.01, 1.00]

- SP (single peak) promoters (and, by association, TATA-box promoters): strongly associated with tissue-specific genes (except brain)
- BR (broad) promoters (and, by association, CpG island overlapping and TATA-less promoters): strongly associated with housekeeping genes (and developmental regulatory genes)

Conclusions
- Key vertebrate (and most likely invertebrate) transcription factor genes are controlled by arrays of highly conserved regulatory elements; the arrays often span more than a megabase around their target genes.
- Highly conserved regulatory elements contain clusters of putative transcription factor binding sites indicative of their function, enabling the building of predictive models.
- There are fundamentally different classes of vertebrate core promoters, differing in mechanism of transcriptional initiation and choice of TSS, tissue specificity, evolutionary dynamics and responsiveness to long-range enhancers.

Acknowledgements
Lenhard Group at CGB, Karolinska Institutet (now at Bergen Center for Computational Science, University of Bergen)
- Pär Engström (PhD student)
- Ying Sheng (PhD student)
- Albin Sandelin (Postdoc), now at RIKEN GSC
- Sara Bruce (Project student), now at Dept. of Bioscience, Karolinska Institutet

Collaborators
- RIKEN Genome Science Center: Piero Carninci and the members of FANTOM3 Consortium
- Wyeth Wasserman group (University of British Columbia): Shannan Ho Sui, David Arenillas
- Johan Ericson group (CMB, Karolinska Institutet): Peter Bailey, Joanna Klos
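Supplementary sketch: the UCR definition used in this talk (at least 50 bp, human:mouse identity above 95%) can be turned into a simple alignment scan. The version below is an illustration under stated assumptions, not the procedure used in the study: the function name `ucr_candidates`, the fixed-width sliding window, and the treatment of gap columns as mismatches are all hypothetical choices, and the third criterion (no coding potential) is not checked at all.

```python
def ucr_candidates(human, mouse, min_len=50, min_identity=0.95):
    """Return (start, end) alignment-column spans, end exclusive.

    human, mouse: two rows of a pairwise alignment, equal length,
    with '-' marking gaps. A span is reported when every window of
    min_len columns inside it has identity strictly above min_identity.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    n = len(human)
    if n < min_len:
        return []

    # 1 where both rows carry the same base; gap columns count as mismatches.
    match = [int(a == b and a != "-")
             for a, b in zip(human.upper(), mouse.upper())]

    regions = []
    window = sum(match[:min_len])  # matches in the first window
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window one column to the right.
            window += match[start + min_len - 1] - match[start - 1]
        if window / min_len > min_identity:
            end = start + min_len
            if regions and end == regions[-1][1] + 1:
                # Previous window also qualified: extend the current span.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy input: 60 alignment columns, the last 10 gapped in mouse.
print(ucr_candidates("A" * 60, "A" * 50 + "-" * 10))
```

Spans are merged only when consecutive windows qualify, so the threshold holds for each 50-column window inside a reported span rather than for the span as a whole. A real screen would additionally need genome coordinates and an annotation-based filter for coding sequence.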