DESIGNING TRAINING REGULATORY DATASETS Enrique Blanco Xavier Messeguer Roderic Guigó.

Dec 18, 2015

OUR APPROACH

Enrique Blanco [http://genome.imim.es] - RegCreative Jamboree 2006

1. SEQUENCE AND FUNCTION

SIMILAR SEQUENCE

SIMILAR FUNCTION

Transthyretin NP_000362 (human) - NP_036813 (rat):

human  MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
rat    MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV
       *** **:*******:*.***** *:********************::***

human  HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
rat    KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK
       :**:::**.:*********:**********:*:*.**:*:**:*******

human  ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
rat    ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
       *********:*********** *:******************:**::

2. FUNCTION AND SEQUENCE

SIMILAR SEQUENCE

SIMILAR FUNCTION

The ThiL gene (S. typhimurium), encoding thiamin phosphate kinase, can be displaced by the functionally equivalent THI80 (S. cerevisiae), encoding thiamin pyrophosphokinase.

Comparison of the known structure of THI80 with the structure of ThiL reveals different folds. Thus, two different folds might catalyze the same reaction.

Systematic discovery of analogous enzymes in Thiamin biosynthesis. Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel, Bork. Nature Biotechnology 21, 790 - 795 (2003).

ThiL   MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
THI80  --MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI
          .* .:      :: :. :::.. :. .:    *  *:.** * .:.:

ThiL   VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
THI80  DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD
          . :: .*   .    .     . *     *   * :   **:

ThiL   LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
THI80  LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI
       *:::*:.::   .  .: :*   *  .  ::  :.: .    .  ::

ThiL   AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
THI80  SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD
       :: .:     *   :..      .: : * .*: *    *       ::

ThiL   PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
THI80  QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC
        * : :      .  :: :.*.    :** .::* .*   . * :   ..

ThiL   DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
THI80  IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF
        .    :  *:*     : * .:::    ..:  * :. :   :*

ThiL   IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
THI80  IDTKDDIILNVEIFVDKLIDFL-----------
       *.  .  * .:::. * :   :

3. FUNCTION AND SEQUENCE (TFBSs)

HNF1-binding sites (human)

------------AGTTAATCATTGGCC---------
-------------GTTAATTATTGGCAAATGTCCC-
-------GTATGGGTTACTTATTCTCTCTTTGTTGA
------------GGTTAAGACTCTAAT---------
-------AGTCTAGTTAATAATCTACAATT------
---------TGAGATTAATA----------------
---------AATGATTAAAA----------------
-------------GTCAAACATTAAC----------
----------CCGATTAACCATTAACCCCCACCCC-
-------------GTTAATCAGAAAA----------
GGATGTATGTAGAATTACATAAGAA-----------
-------------CTTACTCAATAAC----------

SIMILAR SEQUENCE

SIMILAR FUNCTION
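
One way to see how little strict sequence identity the aligned HNF1 sites share is to tabulate the nucleotides found in each alignment column. The following is a minimal sketch, not part of the original slides: the function name `column_counts` is hypothetical, and only four of the twelve sites are used, with their dash padding rebuilt programmatically.

```python
from collections import Counter

def column_counts(aligned_sites):
    """Count A/C/G/T in every alignment column, ignoring '-' padding."""
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    counts = []
    for col in range(width):
        column = Counter(site[col] for site in aligned_sites if site[col] != "-")
        counts.append({base: column.get(base, 0) for base in "ACGT"})
    return counts

# Four of the HNF1 sites, padded with '-' to a common width of 36 columns.
sites = [
    "-" * 12 + "AGTTAATCATTGGCC" + "-" * 9,
    "-" * 13 + "GTTAATTATTGGCAAATGTCCC" + "-" * 1,
    "-" * 13 + "GTTAATCAGAAAA" + "-" * 10,
    "-" * 13 + "CTTACTCAATAAC" + "-" * 10,
]

counts = column_counts(sites)
for col in range(13, 19):  # the columns spanning the GTTAAT-like core
    print(col, counts[col])
```

Columns where one base dominates correspond to the conserved core; the flanking columns are far more variable, which is the point the slide makes about functional sites.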

4. TF-MAPS: A NEW ALPHABET

5. TF-MAPS: A NEW FORM OF ALIGNMENT

MAP 1

MAP 2

We can align the TF-MAPS in this new alphabet:
• Mapping score
• Gaps
• Positional conservation
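
The three ingredients listed above can be combined in a small dynamic-programming sketch. This is an illustration under stated assumptions, not the authors' implementation: a TF-map is taken to be a position-sorted list of `(TF label, position, mapping score)` tuples, only sites with the same label may be matched, and the weights `alpha` (mapping score), `lam` (gaps) and `mu` (positional conservation) as well as the example values are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[str, int, float]  # (TF label, position in bp, mapping score)

def align_tf_maps(map_a: List[Site], map_b: List[Site],
                  alpha: float = 0.5, lam: float = 0.1, mu: float = 0.1):
    """Return (best score, matched index pairs) for two position-sorted TF-maps."""
    n, m = len(map_a), len(map_b)
    best = [[None] * m for _ in range(n)]  # best[i][j]: alignment ending in match (i, j)
    back = [[None] * m for _ in range(n)]  # previous match on the best path
    top, top_cell = 0.0, None
    for i in range(n):
        for j in range(m):
            if map_a[i][0] != map_b[j][0]:
                continue  # only sites with the same TF label can be matched
            match = alpha * (map_a[i][2] + map_b[j][2])
            score, prev = match, None  # option: start a new alignment here
            for pi in range(i):
                for pj in range(j):
                    if best[pi][pj] is None:
                        continue
                    skipped = (i - pi - 1) + (j - pj - 1)  # unmatched sites count as gaps
                    d_a = map_a[i][1] - map_a[pi][1]
                    d_b = map_b[j][1] - map_b[pj][1]
                    cand = best[pi][pj] + match - lam * skipped - mu * abs(d_a - d_b)
                    if cand > score:
                        score, prev = cand, (pi, pj)
            best[i][j], back[i][j] = score, prev
            if score > top:
                top, top_cell = score, (i, j)
    pairs = []
    while top_cell is not None:  # trace back from the best-scoring match
        pairs.append(top_cell)
        top_cell = back[top_cell[0]][top_cell[1]]
    return top, pairs[::-1]

# Made-up maps: labels are real TF names, positions and scores are invented.
map_a = [("HNF1", 10, 4.0), ("CEBP", 40, 3.0), ("SP1", 90, 2.0)]
map_b = [("HNF1", 15, 5.0), ("AP1", 30, 6.0), ("CEBP", 45, 3.0)]
score, pairs = align_tf_maps(map_a, map_b)
print(score, pairs)
```

The quadratic inner loop keeps the sketch short; it is adequate for maps with a few hundred sites but would need a smarter search for longer ones.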

6. TF-MAP ALIGNMENT in PROMOTER CHARACTERIZATION

TTR gene: ENSG00000118271

Pairwise TF-map alignments between TTR and 83 COREG(TTR) in CISRED

A.G. Robertson et al. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34:D68–D73, 2006.

TTR PROMOTER RECONSTRUCTION

7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY

The HRCZ-set (36 genes)

SEQUENCE ALIGNMENT vs. TF-MAP ALIGNMENT

8. RESULTS

Are TF-map alignments a simple reflection of sequence conservation? NO

                  CLUSTALW              TF-MAP ALIGNMENT
Genomic region    TOP 1   Avg. Score    TOP 1   Avg. Score
Coding            27      3706.72       6       17.15
5’UTR             4       2671.78       2       10.48
PROMOTER          4       2005.67       18      25.41
3’UTR             1       1994.22       7       15.85
Intronic          0       1267.89       2       8.34
Downstream        0       1174.28       0       6.85
5’Intergenic      0       1052.92       0       5.42
3’Intergenic      0       974.69        1       4.14

DESIGN OF THE DATASET

9. PAIRWISE TF-MAP ALIGNMENT TRAINING

TRAINING: To systematically estimate the parameters that are globally optimal, in terms of real TFBS detection, over a set of well-annotated promoter pairs

Predictions obtained with the database TRANSFAC: V. Matys et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108 - D110 (2006)

Plots with the program gff2ps: J. F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics, 16(8):743–744 (2000)
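
The training goal stated on this slide can be pictured as an exhaustive search over parameter combinations, keeping the one with the best mean accuracy across all annotated promoter pairs. The sketch below is only an assumption about how such a search could be organised: `train`, `toy_align`, `toy_accuracy` and the `ideal_alpha` field are hypothetical stand-ins, and a real run would plug in an actual aligner and an accuracy measure computed against the annotated TFBSs.

```python
from itertools import product

def train(promoter_pairs, align, accuracy, grid):
    """Try every parameter combination; keep the one with the best mean accuracy."""
    best_params, best_mean = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [accuracy(align(pair, **params), pair) for pair in promoter_pairs]
        mean = sum(scores) / len(scores)
        if mean > best_mean:
            best_params, best_mean = params, mean
    return best_params, best_mean

# Toy stand-ins so the sketch runs end to end (hypothetical, not real data).
def toy_align(pair, alpha):
    return alpha  # a real aligner would return the predicted site pairs

def toy_accuracy(prediction, pair):
    return 1.0 - abs(prediction - pair["ideal_alpha"])

pairs = [{"ideal_alpha": 0.4}, {"ideal_alpha": 0.6}]
best_params, best_mean = train(pairs, toy_align, toy_accuracy,
                               {"alpha": [0.2, 0.5, 0.8]})
print(best_params, best_mean)
```

Averaging over all pairs, rather than tuning per pair, is what makes the resulting parameters global in the sense used on the slide.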

10. ACCURACY TESTS

[Figure: a REAL pair of TFBSs in the human (H) and mouse (M) promoters, compared with the TF-MAP ALIGNMENT]

Levels:
• Nucleotide
• Site

Measures:
• Sensitivity [0,1]
• Specificity (PPV) [0,1]
• Correlation Coefficient [-1,1]

Coverage

A set of experimentally annotated promoters:
• The promoter sequences (mapping)
• Coordinates of the real TFBSs (alignment)
• TFBSs present in both promoters (alignment)

Human/Mouse orthologous genes
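
At the nucleotide level, the three measures listed above can be computed from the positions covered by real and predicted sites. The following is a minimal sketch using the standard definitions (sensitivity = TP/(TP+FN), specificity as PPV = TP/(TP+FP), and the correlation coefficient over the 2x2 contingency table). The function names and the example coordinates are hypothetical, and intervals are assumed to be 0-based with an exclusive end.

```python
from math import sqrt

def covered(sites):
    """Expand (start, end) intervals, end exclusive, into a set of positions."""
    return {pos for start, end in sites for pos in range(start, end)}

def nucleotide_accuracy(real, predicted, length):
    """Return (sensitivity, specificity as PPV, correlation coefficient)."""
    tp = len(real & predicted)
    fp = len(predicted - real)
    fn = len(real - predicted)
    tn = length - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc

# Invented example: two real sites and two predicted sites on a 100 bp promoter.
real = covered([(10, 20), (50, 60)])
predicted = covered([(12, 22), (80, 90)])
sn, sp, cc = nucleotide_accuracy(real, predicted, length=100)
print(sn, sp, cc)
```

The site-level variant would count whole sites instead of positions, for instance treating a predicted site as correct when it overlaps a real one.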

11. SOURCES OF INFORMATION

• General Regulatory Repositories

• Publications:
  - The datasets of other programs
  - Individual experimental works

FORMATS / QUALITY / AVAILABILITY / STABILITY

12. ABS: ANNOTATED BINDING SITES

13. MY OWN EXPERIENCE (1)

MANUAL DATA CURATION:

* FINDING THE PROMOTER SEQUENCES IN THE GENOME:
1. The original promoter entry does not exist (GenBank)
2. The gene has another name
3. The gene has not been annotated yet (RefSeq)
4. The promoter sequence does not match the current TSS (RefSeq)
5. The promoter sequence is not a promoter sequence (RefSeq)

* FINDING THE MOTIFS IN THE PROMOTER SEQUENCES:
1. The binding motif is not in the original promoter sequence
2. The motif is not at the coordinates where it was expected to be
3. The motif has changed slightly (a few nucleotides)
4. There are several motifs that could correspond to the real one
5. The relative position among the motifs of the same gene is wrong
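
Several of the motif problems listed above (shifted coordinates, a few changed nucleotides, more than one candidate) can be checked with a simple mismatch-tolerant scan. This is a sketch of one possible check, not the curation procedure actually used: `locate_motif`, its parameters and the toy promoter are hypothetical.

```python
def locate_motif(promoter, motif, expected_start, max_mismatches=1):
    """List (start, mismatches) candidates, best first.

    Candidates are ranked by number of mismatches, then by distance
    from the annotated (expected) start coordinate.
    """
    hits = []
    for start in range(len(promoter) - len(motif) + 1):
        window = promoter[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], abs(hit[0] - expected_start)))

# Toy promoter: an exact copy of the motif at 12 and a one-mismatch copy at 32,
# while the (made-up) annotation claims the site starts at position 10.
promoter = "ACGT" * 3 + "GTTAATCATTGG" + "ACGT" * 2 + "GTTAATTATTGG"
hits = locate_motif(promoter, "GTTAATCATTGG", expected_start=10)
print(hits)
```

More than one returned candidate signals exactly the ambiguity described in item 4, and such cases still need a manual decision.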

14. MY OWN EXPERIENCE (2)

* TF-MAPS AND ANNOTATIONS:
1. The mapping function is not defined for a given TF
2. The TFBS is not predicted by the mapping function in one of the orthologs

* MATCHING THE ALIGNMENTS AND THE ANNOTATIONS:
1. There are several mapping definitions that recognize the same motif

NEW CHALLENGES: DESIGN OF FUTURE DATASETS

15. NON-COLLINEAR CONSERVATION

16. SITES IN OTHER SPECIES

COLLAGENASE-3 GENE (MMP13) promoters kindly provided by Dr. López-Otín (Universidad de Oviedo)

17. ENCODE ChIP data

TRANSFAC: V$E2F1_Q3

[Figure: human vs. mouse, 10,000 bps]

CONCLUSION

RESEARCH ON GENE REGULATION: DUAL PERSONALITY?

COMPUTER SCIENTIST / EXPERT 1

BIOINFORMATICIAN / EXPERT 2

EXPERIMENTALIST? / EXPERT 3

RESEARCHER

[email protected]@imim.es