DESIGNING TRAINING REGULATORY DATASETS

Enrique Blanco, Xavier Messeguer, Roderic Guigó

OUR APPROACH

1. SEQUENCE AND FUNCTION

Transthyretin NP_000362 (human) - NP_036813 (rat):

MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV
*** **:*******:*.***** *:********************::***

HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK
:**:::**.:*********:**********:*:*.**:*:**:*******

ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
*********:*********** *:******************:**::

SIMILAR SEQUENCE
SIMILAR FUNCTION

Enrique Blanco [http://genome.imim.es] - RegCreative Jamboree 2006

2. FUNCTION AND SEQUENCE

ThiL gene (S. typhimurium), encoding thiamin phosphate kinase, can be displaced (functionally equivalent) by THI80 (S. cerevisiae), encoding thiamin pyrophosphokinase.

MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
--MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI
.* .: :: :. :::.. :. .: * *:.** * .:.:

VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD
. :: .* . . . * * * : **:

LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI
*:::*:.:: . .: :* * . :: :.: . . ::

AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD
:: .: * :.. .: : * .*: * * ::

PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC
* : : . :: :.*. :** .::* .* . * : ..

DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF
. : *:* : * .::: ..: * :. : :*

SIMILAR SEQUENCE
SIMILAR FUNCTION

Comparison of the known structure of THI80 with the structure of ThiL reveals different folds. Thus, two different folds might catalyze the same reaction.

Systematic discovery of analogous enzymes in Thiamin biosynthesis.
Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel, Bork. Nature Biotechnology 21, 790 - 795 (2003).

IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
IDTKDDIILNVEIFVDKLIDFL----------
*. . * .:::. * : :

3. FUNCTION AND SEQUENCE (TFBSs)

HNF1-binding sites (human):

AGTTAATCATTGGCC
GTTAATTATTGGCAAATGTCCC
GTATGGGTTACTTATTCTCTCTTTGTTGA
GGTTAAGACTCTAAT
AGTCTAGTTAATAATCTACAATT
TGAGATTAATA
AATGATTAAAA
GTCAAACATTAAC
CCGATTAACCATTAACCCCCACCCC
GTTAATCAGAAAA
GGATGTATGTAGAATTACATAAGAA
CTTACTCAATAAC

SIMILAR SEQUENCE
SIMILAR FUNCTION

4. TF-MAPS: A NEW ALPHABET

5. TF-MAPS: A NEW FORM OF ALIGNMENT

MAP 1
MAP 2

We can align the TF-MAPS in this new alphabet:
• Mapping score
• Gaps
• Positional conservation

6. TF-MAP ALIGNMENT IN PROMOTER CHARACTERIZATION

TTR gene: ENSG00000118271

TTR RECONSTRUCTION

Pairwise TF-map alignments between TTR and 83 COREG(TTR) in cisRED

A.G. Robertson et al. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34:D68–D73, 2006.

PROMOTER

7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY

The HRCZ-set (36 genes)

SEQUENCE ALIGNMENT vs. TF-MAP ALIGNMENT

8. RESULTS

Are TF-map alignments a simple reflection of sequence conservation? NO

                 CLUSTALW              TF-MAP ALIGNMENT
Genomic region   TOP 1   Avg. Score    TOP 1   Avg. Score
Coding           27      3706.72       6       17.15
5'UTR            4       2671.78       2       10.48
PROMOTER         4       2005.67       18      25.41
3'UTR            1       1994.22       7       15.85
Intronic         0       1267.89       2       8.34
Downstream       0       1174.28       0       6.85
5'Intergenic     0       1052.92       0       5.42
3'Intergenic     0       974.69        1       4.14

DESIGN OF THE DATASET

9. PAIRWISE TF-MAP ALIGNMENT TRAINING

Predictions obtained with the database TRANSFAC:
V. Matys et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108 - D110 (2006)

Plots with the program gff2ps:
J. F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics, 16(8):743–744 (2000)

TRAINING: To systematically estimate the parameters that are globally optimal for detecting real TFBSs in a set of well-annotated promoter pairs

10. ACCURACY TESTS

Levels:
• Nucleotide
• Site

REAL pair of TFBSs (H, M) vs. TF-MAP ALIGNMENT (H, M)

Measures:
• Sensitivity [0,1]
• Specificity (PPV) [0,1]
• Correlation Coefficient [-1,1]

Coverage

A set of experimentally annotated promoters:
• The promoter sequences (mapping)
• Coordinates of the real TFBSs (alignment)
• TFBSs present in both promoters (alignment)

Human/Mouse orthologous genes

11. SOURCES OF INFORMATION

• General Regulatory Repositories
• Publications:
  - The datasets of other programs
  - Individual experimental works

FORMATS / QUALITY / AVAILABILITY / STABILITY

12. ABS: ANNOTATED BINDING SITES

13. MY OWN EXPERIENCE (1)

MANUAL DATA CURATION:

* FINDING THE PROMOTER SEQUENCES IN THE GENOME:
1. The original promoter entry does not exist (GenBank)
2. The gene has another name
3. The gene has not been annotated yet (RefSeq)
4. The promoter sequence does not match the current TSS (RefSeq)
5. The promoter sequence is not a promoter sequence (RefSeq)

* FINDING THE MOTIFS IN THE PROMOTER SEQUENCES:
1. The binding motif is not in the original promoter sequence
2. The motif is not at the expected coordinates
3. The motif has changed slightly (a few nucleotides)
4. There are several motifs that could correspond to the real one
5. The relative position among the motifs of the same gene is wrong

14. MY OWN EXPERIENCE (2)

* TF-MAPS AND ANNOTATIONS:
1. The mapping function is not defined for a given TF
2. The TFBS is not predicted by the mapping function in one of the orthologs

* MATCHING THE ALIGNMENTS AND THE ANNOTATIONS:
1. There are several mapping definitions that recognize the same motif

NEW CHALLENGES: DESIGN OF FUTURE DATASETS

15. NON-COLLINEAR CONSERVATION

16. SITES IN OTHER SPECIES

COLLAGENASE-3 GENE (MMP13)
Promoters kindly provided by Dr. López-Otín (Universidad de Oviedo)

17. ENCODE ChIP DATA

TRANSFAC: V$E2F1_Q3
10,000 bp
mouse
human

CONCLUSION

RESEARCH ON GENE REGULATION: DUAL PERSONALITY?

COMPUTER SCIENTIST
BIOINFORMATICIAN
EXPERT 1
EXPERT 2
EXPERIMENTALIST?
EXPERT 3
RESEARCHER
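
SUPPLEMENTARY SKETCH: ACCURACY MEASURES AT THE NUCLEOTIDE LEVEL

The measures listed on slide 10 (Sensitivity, Specificity as PPV, Correlation Coefficient) can be illustrated with a minimal Python sketch. The slides do not give the formulas, so several things here are assumptions: sites are (start, end) pairs with 0-based inclusive coordinates, the correlation coefficient is taken to be the Matthews-style coefficient commonly used for nucleotide-level evaluation, and the names covered and nucleotide_accuracy are hypothetical helpers, not part of any published tool.

```python
from math import sqrt


def covered(sites, length):
    """Per-nucleotide mask: True where any (start, end) site covers the position.

    Coordinates are assumed 0-based and inclusive (an assumption of this sketch).
    """
    mask = [False] * length
    for start, end in sites:
        for i in range(max(start, 0), min(end, length - 1) + 1):
            mask[i] = True
    return mask


def nucleotide_accuracy(real_sites, aligned_sites, length):
    """Return (sensitivity, specificity_ppv, correlation) for one promoter.

    real_sites    -- experimentally annotated TFBS coordinates
    aligned_sites -- coordinates of the sites reported in the alignment
    length        -- promoter length in nucleotides
    """
    real = covered(real_sites, length)
    pred = covered(aligned_sites, length)

    tp = sum(r and p for r, p in zip(real, pred))
    fp = sum((not r) and p for r, p in zip(real, pred))
    fn = sum(r and (not p) for r, p in zip(real, pred))
    tn = length - tp - fp - fn

    # Sensitivity: fraction of real TFBS nucleotides that were recovered.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity in the PPV sense: fraction of predicted nucleotides that are real.
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    # Correlation coefficient in [-1, 1]; defined as 0 when the denominator vanishes.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    # Hypothetical example: one real site, one half-overlapping prediction.
    print(nucleotide_accuracy([(10, 19)], [(15, 24)], 100))
```

A site-level version would count whole sites instead of nucleotides; true negatives are not naturally defined at that level, which is one reason the nucleotide level is the easier place to compute a correlation coefficient.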