DESIGNING TRAINING REGULATORY DATASETS

Enrique Blanco
Xavier Messeguer
Roderic Guigó
OUR APPROACH
1. SEQUENCE AND FUNCTION
Transthyretin NP_000362 (human) - NP_036813 (rat):

MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV
*** **:*******:*.***** *:********************::***

HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK
:**:::**.:*********:**********:*:*.**:*:**:*******

ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
*********:*********** *:******************:**::

SIMILAR SEQUENCE
SIMILAR FUNCTION
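To make "similar sequence" concrete, the fraction of identical columns can be computed directly from two aligned rows. The following is a minimal sketch only: the function name `percent_identity` is hypothetical, and skipping gap columns is just one of several common conventions.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First alignment block of the transthyretin example above.
human = "MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV"
rat = "MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV"
print(f"{percent_identity(human, rat):.1f}% identity")  # 84.0% identity
```

The 42 identical columns out of 50 correspond to the 42 asterisks in the conservation line of that block.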
Enrique Blanco [http://genome.imim.es] - RegCreative Jamboree 2006
2. FUNCTION AND SEQUENCE
MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
--MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI

VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD

LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI

AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD

PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC

DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF

IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
IDTKDDIILNVEIFVDKLIDFL----------

SIMILAR SEQUENCE
SIMILAR FUNCTION

ThiL gene (S. typhimurium) encoding thiamin phosphate kinase can be displaced (functionally equivalent) by THI80 (S. cerevisiae), encoding thiamin pyrophosphokinase.

Comparison of the known structure of THI80 with the structure of ThiL reveals different folds. Thus, two different folds might catalyze the same reaction.

Systematic discovery of analogous enzymes in Thiamin biosynthesis. Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel, Bork. Nature Biotechnology 21, 790-795 (2003).
3. FUNCTION AND SEQUENCE (TFBSs)
HNF1 binding sites (human)

AGTTAATCATTGGCC
GTTAATTATTGGCAAATGTCCC
GTATGGGTTACTTATTCTCTCTTTGTTGA
GGTTAAGACTCTAAT
AGTCTAGTTAATAATCTACAATT
TGAGATTAATA
AATGATTAAAA
GTCAAACATTAAC
CCGATTAACCATTAACCCCCACCCC
GTTAATCAGAAAA
GGATGTATGTAGAATTACATAAGAA
CTTACTCAATAAC

SIMILAR SEQUENCE
SIMILAR FUNCTION
4. TF-MAPS: A NEW ALPHABET
5. TF-MAPS: A NEW FORM OF ALIGNMENT
MAP 1
MAP 2
We can align the TF-MAPS in this new alphabet:
• Mapping score
• Gaps
• Positional conservation
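A minimal sketch of how these three ingredients could be combined into an alignment score. This is an illustration, not the authors' implementation: the `Site` record, the function `align_tf_maps`, the parameters `alpha`, `lam` and `mu`, and the example values are all assumptions. Sites are assumed to be sorted by position, and only sites with the same TF label may be aligned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    tf: str       # symbol in the TF alphabet
    pos: int      # position of the site in the promoter
    score: float  # mapping score of the site


def align_tf_maps(map_a, map_b, alpha=1.0, lam=0.1, mu=0.1):
    """Best-scoring colinear chain of same-TF site pairs between two TF-maps.

    Score of a chain = alpha * (sum of mapping scores)
                       - lam * (number of skipped sites, i.e. gaps)
                       - mu * (difference in spacing between consecutive pairs).
    Returns (score, [(i, j), ...]) with indices into map_a and map_b.
    """
    best = {}  # (i, j) -> (best score of a chain ending at this pair, predecessor)
    for i, a in enumerate(map_a):
        for j, b in enumerate(map_b):
            if a.tf != b.tf:
                continue
            start = alpha * (a.score + b.score)
            best[(i, j)] = (start, None)
            for (pi, pj), (pscore, _) in list(best.items()):
                if pi >= i or pj >= j:
                    continue  # predecessor must come earlier in both maps
                gaps = (i - pi - 1) + (j - pj - 1)
                shift = abs((a.pos - map_a[pi].pos) - (b.pos - map_b[pj].pos))
                cand = pscore + start - lam * gaps - mu * shift
                if cand > best[(i, j)][0]:
                    best[(i, j)] = (cand, (pi, pj))
    if not best:
        return 0.0, []
    end = max(best, key=lambda k: best[k][0])
    score, chain = best[end][0], []
    while end is not None:
        chain.append(end)
        end = best[end][1]
    return score, chain[::-1]


# Hypothetical maps: two shared TFs, one unmatched site in each map.
map_a = [Site("HNF1", 10, 1.0), Site("SP1", 40, 0.8), Site("CEBP", 70, 0.9)]
map_b = [Site("HNF1", 15, 0.9), Site("AP1", 30, 0.5), Site("CEBP", 76, 0.7)]
score, pairs = align_tf_maps(map_a, map_b)
print(pairs)  # [(0, 0), (2, 2)]
```

The quadratic search over predecessors keeps the sketch short; a production version would need a more efficient recurrence.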
6. TF-MAP ALIGNMENT in PROMOTER CHARACTERIZATION
TTR gene: ENSG00000118271
TTR PROMOTER RECONSTRUCTION
Pairwise TF-map alignments between TTR and 83 COREG(TTR) in cisRED
A.G. Robertson et al. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34:D68–D73, 2006.
7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY
The HRCZ-set (36 genes)
SEQUENCE ALIGNMENT vs. TF-MAP ALIGNMENT
8. RESULTS
Are TF-map alignments a simple reflection of sequence conservation? NO

Genomic region    CLUSTALW               TF-MAP ALIGNMENT
                  TOP 1    Avg. Score    TOP 1    Avg. Score
Coding            27       3706.72       6        17.15
5’UTR             4        2671.78       2        10.48
PROMOTER          4        2005.67       18       25.41
3’UTR             1        1994.22       7        15.85
Intronic          0        1267.89       2        8.34
Downstream        0        1174.28       0        6.85
5’Intergenic      0        1052.92       0        5.42
3’Intergenic      0        974.69        1        4.14
DESIGN OF THE DATASET
9. PAIRWISE TF-MAP ALIGNMENT TRAINING
Predictions obtained with the database TRANSFAC: V. Matys et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108-D110 (2006)
Plots with the program gff2ps: J. F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics, 16(8):743–744 (2000)
TRAINING:
To systematically estimate the parameters that are globally optimal, in terms of real TFBS detection, in a set of well-annotated promoter pairs
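One straightforward reading of "systematically estimate" is an exhaustive search over a grid of parameter values, keeping the combination with the best accuracy on the annotated promoter pairs. The sketch below assumes exactly that and is not taken from the slides: `train_parameters`, the parameter names and the toy objective are hypothetical, and in practice `evaluate` would align every training pair and return an accuracy value.

```python
from itertools import product


def train_parameters(evaluate, grid):
    """Exhaustive grid search: return (best_params, best_value).

    evaluate: function mapping a dict of parameters to a value to maximise.
    grid:     dict mapping each parameter name to the list of values to try.
    """
    names = list(grid)
    best_params, best_value = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = evaluate(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_accuracy(params):
    """Stand-in objective (assumption) that peaks at alpha=0.5, lam=0.1, mu=0.1."""
    return -((params["alpha"] - 0.5) ** 2
             + (params["lam"] - 0.1) ** 2
             + (params["mu"] - 0.1) ** 2)


grid = {"alpha": [0.1, 0.5, 1.0], "lam": [0.0, 0.1, 0.5], "mu": [0.0, 0.1, 0.5]}
best_params, best_value = train_parameters(toy_accuracy, grid)
print(best_params)  # {'alpha': 0.5, 'lam': 0.1, 'mu': 0.1}
```

A grid search is globally optimal only over the values listed in the grid, so the resolution of the grid bounds the quality of the estimate.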
10. ACCURACY TESTS
Levels:
• Nucleotide
• Site
Measures:
• Sensitivity [0,1]
• Specificity (PPV) [0,1]
• Correlation Coefficient [-1,1]
Coverage
[Figure: REAL pair of TFBS (H, M) compared with TF-MAP ALIGNMENT (H, M)]
A set of experimentally annotated promoters:
• The promoter sequences (mapping)
• Coordinates of the real TFBSs (alignment)
• TFBSs present in both promoters (alignment)
Human/Mouse orthologous genes
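A minimal sketch of the three measures at the nucleotide level, assuming the usual conventions of the gene-prediction literature: sensitivity = TP/(TP+FN), specificity in the PPV sense = TP/(TP+FP), and the correlation coefficient taken as the Matthews correlation. The function names and the example coordinates are hypothetical.

```python
import math


def nucleotides(sites):
    """Expand (start, end) inclusive site coordinates into a set of positions."""
    return {p for start, end in sites for p in range(start, end + 1)}


def accuracy(real, predicted, universe_size):
    """Return (sensitivity, specificity as PPV, correlation coefficient).

    real, predicted: sets of positions (or of site identifiers at the site level).
    universe_size:   total number of positions, needed for the true negatives.
    """
    tp = len(real & predicted)
    fp = len(predicted - real)
    fn = len(real - predicted)
    tn = universe_size - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Hypothetical 100-nt promoter: two annotated sites, two aligned (predicted) sites.
real = nucleotides([(10, 19), (50, 59)])
predicted = nucleotides([(12, 21), (80, 89)])
sn, sp, cc = accuracy(real, predicted, universe_size=100)
print(f"SN={sn:.2f} SP={sp:.2f} CC={cc:.2f}")  # SN=0.40 SP=0.40 CC=0.25
```

The same `accuracy` function applies at the site level if the sets contain site identifiers instead of nucleotide positions.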
11. SOURCES OF INFORMATION
• General Regulatory Repositories
• Publications:
- The datasets of other programs
- Individual experimental works
FORMATS / QUALITY / AVAILABILITY / STABILITY
12. ABS: ANNOTATED BINDING SITES
13. MY OWN EXPERIENCE (1)
MANUAL DATA CURATION:
* FINDING THE PROMOTER SEQUENCES IN THE GENOME:
1. The original promoter entry does not exist (GenBank)
2. The gene has another name
3. The gene has not been annotated yet (RefSeq)
4. The promoter sequence does not match the current TSS (RefSeq)
5. The promoter sequence is not a promoter sequence (RefSeq)
* FINDING THE MOTIFS IN THE PROMOTER SEQUENCES:
1. The binding motif is not in the original promoter sequence
2. The motif is not at the expected coordinates
3. The motif has changed slightly (a few nucleotides)
4. Several motifs could correspond to the real one
5. The relative position of the motifs within the same gene is wrong
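Several of the motif problems listed above (shifted coordinates, a few changed nucleotides, several candidate matches) can be detected by scanning the promoter for approximate occurrences of the annotated motif. The following is a rough sketch under stated assumptions: `find_motif` is a hypothetical helper, it scans the forward strand only, reports 0-based positions, and counts substitutions but not insertions or deletions.

```python
def find_motif(promoter, motif, max_mismatches=2):
    """Return (position, mismatches) for every window within max_mismatches of motif."""
    promoter, motif = promoter.upper(), motif.upper()
    hits = []
    for start in range(len(promoter) - len(motif) + 1):
        window = promoter[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy promoter in which the annotated motif carries one substitution (C -> G).
print(find_motif("AAGTTAATGATTGG", "GTTAATCATT", max_mismatches=1))  # [(2, 1)]
```

More than one hit for the same motif flags the ambiguous case, and a hit whose position differs from the annotated coordinate flags the shifted case; both still need manual inspection.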
14. MY OWN EXPERIENCE (2)
* TF-MAPS AND ANNOTATIONS:
1. The mapping function is not defined for a given TF
2. The TFBS is not predicted by the mapping function in one of the orthologs
* MATCHING THE ALIGNMENTS AND THE ANNOTATIONS:
1. Several mapping definitions recognize the same motif
NEW CHALLENGES: DESIGN OF FUTURE DATASETS
15. NON-COLLINEAR CONSERVATION
16. SITES IN OTHER SPECIES
COLLAGENASE-3 GENE (MMP13) promoters kindly provided by Dr. López-Otín (Universidad de Oviedo)
17. ENCODE ChIP data
TRANSFAC: V$E2F1_Q3
[Figure: human and mouse, 10,000 bps]
CONCLUSION
RESEARCH ON GENE REGULATION: DUAL PERSONALITY?
COMPUTER SCIENTIST BIOINFORMATICIAN
EXPERT 1
EXPERT 2
EXPERIMENTALIST?
EXPERT 3
RESEARCHER