MF workshop 10 © Yaron Orenstein 1 Finding sequence motifs in PBM data Workshop Project Yaron Orenstein October 2010

Dec 28, 2015

Page 1:

Finding sequence motifs in PBM data

Workshop Project

Yaron Orenstein
October 2010

Page 2:

Outline

1. Some background again…
2. The project

Page 3:

1. Background

Slides with Ron Shamir and Chaim Linhart

Page 4:

Gene: from DNA to protein

[Diagram: DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein]

Page 5:

DNA

• DNA: a “string” over the alphabet of 4 bases (nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
  Forward/sense strand: 5’ AACTTGCG 3’
  Reverse-complement/anti-sense strand: 3’ TTGAACGC 5’ (read 5’ to 3’: CGCAAGTT)
• Directional: from 5’ to 3’:
  5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
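The strand convention above can be illustrated in a few lines. This is a minimal Python sketch; the function name `reverse_complement` is illustrative and not part of the workshop material.

```python
# Map each base to its complement (A-T, C-G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of `seq`, read 5' to 3'."""
    # Complement every base, then reverse so the result is again 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACTTGCG"))  # CGCAAGTT
```

Applying the function twice returns the original sequence, which is a handy sanity check.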

Page 6:

Gene structure (eukaryotes)

[Diagram: DNA (coding strand) with the promoter and transcription start site (TSS); transcription (RNA polymerase) gives pre-mRNA with exons and an intron; splicing (spliceosome) gives mature mRNA with 5’ UTR, start codon, coding region, stop codon and 3’ UTR; translation (ribosome) gives protein]

Page 7:

Translation

• Codon - a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation
• Stop codons - signal termination of the protein synthesis process

http://ntri.tamuk.edu/cell/ribosomes.html

Page 8:

Genome sequences

• Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
• Human:
  – 23 pairs of chromosomes
  – 3+ Gbps (bps = base pairs), only ~3% are genes
  – ~25,000 genes
• Yeast:
  – 16 chromosomes
  – 20 Mbps
  – 6,500 genes

Page 9:

Regulation of Expression

• Each cell contains an identical copy of the whole genome - but utilizes only a subset of the genes to perform diverse, unique tasks
• Most genes are highly regulated – their expression is limited to specific tissues, developmental stages, and physiological conditions
• Main regulatory mechanism – transcriptional regulation

Page 10:

Transcriptional regulation

• Transcription is regulated primarily by transcription factors (TFs) – proteins that bind to DNA subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s promoter – the DNA sequence upstream of the gene’s transcription start site (TSS)
• BSs of a particular TF share a common pattern, or motif
• Some TFs operate together – TF modules

[Diagram: two TFs bound to BSs upstream of a gene’s TSS, drawn 5’ to 3’]

Page 11:

TFBS motif models

• Consensus (“degenerate”) string: [AC]ACT[CG]T
  [Figure: binding-site instances AACTGT, CACTGT, CACTCT, CACTGT, AACTGT located in the promoters of genes 1-10]
• Statistical models…
• Motif logo representation
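To make the link between binding-site instances, a consensus string and a statistical model concrete, here is a minimal Python sketch. It assumes the instances are already aligned and of equal length; the function names and the `min_freq` cutoff are illustrative choices, not part of the workshop material.

```python
from collections import Counter

BASES = "ACGT"


def build_pfm(sites):
    """Position frequency matrix: pfm[base][position] = fraction of sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [] for b in BASES}
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        for b in BASES:
            pfm[b].append(counts[b] / len(sites))
    return pfm


def degenerate_consensus(pfm, min_freq=0.1):
    """Write each position as a base, or as [..] when several bases occur."""
    out = []
    for pos in range(len(pfm["A"])):
        present = [b for b in BASES if pfm[b][pos] >= min_freq]
        out.append(present[0] if len(present) == 1 else "[" + "".join(present) + "]")
    return "".join(out)


# The five instances shown on the slide.
sites = ["AACTGT", "CACTGT", "CACTCT", "CACTGT", "AACTGT"]
pfm = build_pfm(sites)
print(degenerate_consensus(pfm))  # [AC]ACT[CG]T
```

The frequency matrix keeps the information the consensus string discards, for example that G is four times as common as C at the fifth position.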

Page 12:

Human G2+M cell-cycle genes: the CHR – NF-Y module

CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18

CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23

CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0

CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110

CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45

CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10

BSs are short and non-specific, and they hide on both strands and at various locations along the promoters.

TFs: NF-Y, CHR

Page 13:

Protein Binding Microarrays

Berger et al, Nat. Biotech 2006

• Generate an array of double-stranded DNA with all possible k-mers

• Detect TF binding to specific k-mers

Page 14:

PBM (2)

Page 15:

PBM - implementation

• Use 60-mers (Agilent): 25nt constant primer + 35nt variable region
• De Bruijn sequence of all 10-mers (4^10 long) split into 35nt long fragments with 9nt overlap
• ~40K probes
• For each 8-mer, combine signals from all probes that contain it (or differ in 1nt) to obtain its binding score
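The probe design can be tried at toy scale. The sketch below is an illustration under assumptions: it uses the standard recursive (Lyndon-word) construction of a de Bruijn sequence, which is not necessarily the construction used for the real arrays, and the function names are hypothetical.

```python
def de_bruijn(alphabet: str, n: int) -> str:
    """Cyclic de Bruijn sequence: every n-mer over `alphabet` occurs once."""
    k = len(alphabet)
    a = [0] * (k * n)
    out = []

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def split_with_overlap(seq: str, length: int, overlap: int) -> list:
    """Cut `seq` into fragments of `length` nt that overlap by `overlap` nt."""
    step = length - overlap
    return [seq[i:i + length] for i in range(0, len(seq) - overlap, step)]


# Toy scale: all 3-mers, fragments of 10 nt overlapping by 2 nt.
n = 3
cyclic = de_bruijn("ACGT", n)
linear = cyclic + cyclic[:n - 1]  # unroll the cycle so wrap-around 3-mers appear
fragments = split_with_overlap(linear, length=10, overlap=2)
covered = {f[i:i + n] for f in fragments for i in range(len(f) - n + 1)}
print(len(cyclic), len(fragments), len(covered))
```

An overlap of n - 1 bases guarantees that no n-mer is lost at a fragment boundary, which is why 10-mers need a 9nt overlap. With 35nt fragments the step is 26nt, so 4^10 / 26 ≈ 40,330 fragments, in line with the “~40K probes” above.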

Page 16:

The computational challenge

• Input: PBM data (sequences and binding scores) of one TF.

• Goal: Find a motif (PWM) that is the binding site of that TF.

• Intuition: sequences that match the motif (on one of the two possible strands!) are expected to have high binding scores.

Page 17:

2. The project

Page 18:

General goals

• Research
  – Learn about known solutions
  – Trial and error with training data
• Develop software from A-Z:
  – Design
  – Implementation (Optimization)
  – Execution & analysis of test data
• A taste of bioinformatics
• Have fun
• Get credit…

Page 19:

The computational task

• Given: a set of PBM data of different TFs.
• Find the binding site motif of each TF, in PWM format.
• Main challenges:
  – Performance (time, memory)
  – Accuracy

Page 20:

Input

File with 41,923 lines, each containing a probe sequence of length 35 and a binding intensity.

<sequence 35bp> \t <intensity> \n
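A reader for this format might look like the following. It is a sketch in Python for brevity (the project itself asks for Java), and the function name `read_pbm` and its validation rules are assumptions rather than part of the supplied code.

```python
import io


def read_pbm(handle, probe_len=35):
    """Parse '<sequence>\\t<intensity>' lines into (sequence, intensity) pairs."""
    probes = []
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        seq, intensity = line.rstrip("\n").split("\t")
        seq = seq.strip().upper()
        if len(seq) != probe_len or set(seq) - set("ACGT"):
            raise ValueError(f"line {line_no}: bad probe sequence {seq!r}")
        probes.append((seq, float(intensity)))
    return probes


# Two made-up probes standing in for a real PBM file.
example = io.StringIO(
    "ACGTACGTACGTACGTACGTACGTACGTACGTACG\t1523.5\n"
    "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT\t210\n"
)
probes = read_pbm(example)
print(len(probes), probes[0][1])
```

With a real file, `open(filename)` can be passed in place of the `io.StringIO` object.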

Page 21:

Input (II)

• For the training data, an additional PWM file will be supplied for each PBM data set.

A: <freq1> <freq2> … <freq10>
C: <freq1> … <freq10>
G: …
T: …

• Separated by \t and \n.
• All lines must contain the same number of frequencies (10 is just an example).
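Reading and writing this PWM format could be sketched as below. The function names and the in-memory layout (a dictionary from base to a list of frequencies) are illustrative assumptions, not the interface of the supplied packages.

```python
import io

BASES = "ACGT"


def read_pwm(handle):
    """Parse lines like 'A:\\t0.1\\t0.7...' into {base: [freq, ...]}."""
    pwm = {}
    for line in handle:
        if not line.strip():
            continue
        base, _, rest = line.partition(":")
        pwm[base.strip().upper()] = [float(x) for x in rest.split()]
    if set(pwm) != set(BASES):
        raise ValueError("expected exactly one line for each of A, C, G, T")
    if len({len(row) for row in pwm.values()}) != 1:
        raise ValueError("all lines must contain the same number of frequencies")
    return pwm


def write_pwm(pwm, handle):
    """Write the PWM back in the same tab-separated layout."""
    for base in BASES:
        handle.write(base + ":\t" + "\t".join(str(f) for f in pwm[base]) + "\n")


text = "A:\t0.7\t0.1\t0.25\nC:\t0.1\t0.7\t0.25\nG:\t0.1\t0.1\t0.25\nT:\t0.1\t0.1\t0.25\n"
pwm = read_pwm(io.StringIO(text))
out = io.StringIO()
write_pwm(pwm, out)
print(out.getvalue() == text)
```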

Page 22:

Input (III)

You will be given:
1. 10 training sets (PBM data + PWM)
2. 4 test sets (PBM data). You have to provide the PWM.
3. In the final project presentation, you will be given an online test set (PBM data) and your software will be applied to it.

Page 23:

Output

1. A PWM file describing the binding site found in the given PBM file.
2. The PWM in motif logo format (i.e. displayed on the screen).

The file logo.zip contains a Java package with code that will easily display your motif.

bits = 2 - entropy
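The relation bits = 2 - entropy applies per motif position, with the entropy measured in bits over the four bases. A minimal sketch follows; the function names are illustrative, and scaling each letter by its frequency is the usual logo convention rather than something stated in the slides.

```python
import math


def column_bits(freqs):
    """Information content of one position: 2 - Shannon entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return 2.0 - entropy


def letter_heights(freqs):
    """Height of each letter in a logo column: frequency times column bits."""
    bits = column_bits(freqs)
    return [p * bits for p in freqs]


print(column_bits([1.0, 0.0, 0.0, 0.0]))      # fully conserved position
print(column_bits([0.25, 0.25, 0.25, 0.25]))  # uninformative position
print(letter_heights([0.5, 0.5, 0.0, 0.0]))
```

A fully conserved position carries 2 bits and a uniform one carries 0, so the total height of a logo column shows how much the position constrains binding.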

Page 24:

Output (II)

3. Show graphically how well your motif predicts the binding intensity.

• One example (note it’s not PWM):

Page 25:

Ranking 8-mers

• One possible way to start: rank the 8-mers in some way. Scores for example:
  1. Signal average.
  2. Signal median.
• You can think of other scores that incorporate more information, e.g. position in probe sequence.
• This is just an example. You can think of other ways to start.
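One way to compute the average and median signal per 8-mer is sketched below. It assumes that an 8-mer and its reverse complement are pooled under one key, and that a probe contributes once per 8-mer even if the 8-mer occurs in it several times; both choices, and all names, are assumptions. Near matches that differ in 1nt are not handled here.

```python
from collections import defaultdict
from statistics import mean, median

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Pool a k-mer with its reverse complement under one key."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def score_kmers(probes, k=8):
    """Map each canonical k-mer to (mean, median) intensity of its probes."""
    signals = defaultdict(list)
    for seq, intensity in probes:
        seen = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in seen:
            signals[kmer].append(intensity)
    return {kmer: (mean(v), median(v)) for kmer, v in signals.items()}


def top_kmers(scores, n=10):
    """The n k-mers with the highest median signal."""
    return sorted(scores, key=lambda kmer: scores[kmer][1], reverse=True)[:n]


# Toy data with k = 3 so the result is easy to check by hand.
toy = [("ACGTA", 10.0), ("TTTTT", 1.0), ("TACGT", 8.0)]
scores = score_kmers(toy, k=3)
print(top_kmers(scores, n=2))
```

The median is less sensitive than the average to a few probes with extreme intensities.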

Page 26:

Alignment procedure

• Then, you can align the significant 8-mers.
• You may take into account the relative score.
• Don’t forget about the reverse complement!
• Example: Cebpb TF

Page 27:

Enrichment scores

• To test how good your motif is, you can use an enrichment score.
• An enrichment score tests how well the motif distinguishes between high-ranking probes and the rest of the probes.

Page 28:

Hypergeometric probability

         drawn    not drawn        total
white    k        m − k            m
black    n − k    N + k − n − m    N − m
total    n        N − n            N

Page 29:

Hypergeometric enrichment score

• Let B and T (T ⊆ B) denote the BG and target sets, respectively, and let b and t denote the subset of probes from the BG and target set, respectively, that contain at least one occurrence of the motif.

$$\mathrm{HGtail}(|B|,|T|,|b|,|t|) = \sum_{i=|t|}^{\min(|T|,|b|)} \frac{\binom{|b|}{i}\binom{|B|-|b|}{|T|-i}}{\binom{|B|}{|T|}}$$

Page 30:

Hypergeometric score (2)

• The HG enrichment score computes the probability of observing at least |t| target sequences with a motif occurrence, under the null hypothesis that the probes in the target set were drawn randomly, independently, and without replacement from the BG set.
• Code is provided in math.zip
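For orientation, the hypergeometric tail can be computed directly from its definition: sum, over i from |t| to min(|T|, |b|), the probability that exactly i of the |T| target probes contain the motif. The sketch below is not the code in math.zip; the function name and argument order are assumptions, and exact integer binomials are used for clarity rather than speed.

```python
from math import comb


def hg_tail(B, T, b, t):
    """P(at least t motif-containing probes among T drawn from B, b of which contain the motif)."""
    total = comb(B, T)  # number of ways to choose the target set
    upper = min(T, b)
    hits = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, upper + 1))
    return hits / total


# 4 probes, 2 contain the motif, target set of size 2 containing both of them.
print(hg_tail(4, 2, 2, 2))
# Asking for "at least 0" hits always gives probability 1.
print(hg_tail(10, 5, 5, 0))
```

For array-sized inputs the result can underflow to 0.0 when the enrichment is very strong; summing in log space is a common remedy.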

Page 31:

Wilcoxon-Mann-Whitney (WMW) enrichment score

• Foreground probes are all those containing a match, background are all the others.
• B and F are the sizes of background and foreground, respectively.
• ρB and ρF are the sums of the background and foreground ranks.
• Read more in supplementary info (Berger06).

$$\mathrm{area} = \frac{1}{B+F}\left(\frac{\rho_B}{B} - \frac{\rho_F}{F}\right)$$
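A small sketch of this rank-based score, taking area = (ρB/B − ρF/F) / (B + F) as described in the Berger et al. supplementary methods. It assumes that rank 1 is given to the brightest probe and that ties are broken arbitrarily; the function name and argument layout are illustrative.

```python
def wmw_area(intensities, is_foreground):
    """Modified WMW statistic: (rho_B / B - rho_F / F) / (B + F)."""
    # Rank probes by decreasing intensity; rank 1 is the brightest probe.
    order = sorted(range(len(intensities)), key=lambda i: -intensities[i])
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    fg = [i for i, flag in enumerate(is_foreground) if flag]
    bg = [i for i, flag in enumerate(is_foreground) if not flag]
    if not fg or not bg:
        raise ValueError("both foreground and background must be non-empty")
    rho_f = sum(rank[i] for i in fg)
    rho_b = sum(rank[i] for i in bg)
    return (rho_b / len(bg) - rho_f / len(fg)) / (len(bg) + len(fg))


# The two brightest probes contain the motif: best possible separation.
print(wmw_area([10.0, 9.0, 2.0, 1.0], [True, True, False, False]))
# The two dimmest probes contain the motif: worst possible separation.
print(wmw_area([10.0, 9.0, 2.0, 1.0], [False, False, True, True]))
```

Under these assumptions the score lies between -0.5 and 0.5, and it is positive when motif-containing probes tend to be brighter than the rest.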

Page 32:

Deciding the length of the motif

• Another challenge is to decide the length of the motif.
• Most binding sites are 6-12 bp long.
• You should consider the information each position contains and decide on the length accordingly.
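One simple way to act on this is to trim flanking positions whose information content is low. The sketch below is only an illustration: the 0.5-bit cutoff is an arbitrary assumption, as are the function names, and other strategies are just as reasonable.

```python
import math

BASES = "ACGT"


def column_bits(freqs):
    """Information content of one position: 2 - entropy, in bits."""
    return 2.0 + sum(p * math.log2(p) for p in freqs if p > 0)


def trim_pwm(pwm, min_bits=0.5):
    """Drop leading and trailing positions with less than `min_bits` of information."""
    width = len(pwm["A"])
    bits = [column_bits([pwm[b][i] for b in BASES]) for i in range(width)]
    keep = [i for i, value in enumerate(bits) if value >= min_bits]
    if not keep:
        return {b: [] for b in BASES}
    first, last = keep[0], keep[-1]
    # Interior low-information positions are kept so the motif stays contiguous.
    return {b: pwm[b][first:last + 1] for b in BASES}


# Two informative positions flanked by uniform ones.
pwm = {
    "A": [0.25, 1.0, 0.0, 0.25],
    "C": [0.25, 0.0, 1.0, 0.25],
    "G": [0.25, 0.0, 0.0, 0.25],
    "T": [0.25, 0.0, 0.0, 0.25],
}
print(trim_pwm(pwm))
```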

Page 33:

Scoring your PWM

• One way to score your motif is by ranking the probe sequences according to a match score.
• You may use the given code for match score.
• Compare the ranking of the probes you got to the ranking according to binding intensities. There are different correlation scores for that.
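The comparison described here can be sketched as follows. The match score used (the best product of PWM frequencies over all windows on both strands) is an assumption and may differ from the supplied code; Spearman rank correlation stands in for "a correlation score". All names are illustrative.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def match_score(pwm, seq):
    """Best product of PWM frequencies over all windows on both strands."""
    width = len(pwm["A"])
    best = 0.0
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for start in range(len(strand) - width + 1):
            p = 1.0
            for j in range(width):
                p *= pwm[strand[start + j]][j]
            best = max(best, p)
    return best


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


pwm = {"A": [0.7, 0.1, 0.1], "C": [0.1, 0.7, 0.1],
       "G": [0.1, 0.1, 0.7], "T": [0.1, 0.1, 0.1]}
probes = [("TTACGTT", 10.0), ("TTTTTTT", 1.0), ("CGTCCCC", 7.0)]
scores = [match_score(pwm, seq) for seq, _ in probes]
print(spearman(scores, [intensity for _, intensity in probes]))
```

The third probe matches only on its reverse strand, which is why both strands are scanned.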

Page 34:

Match Score between PWMs

• Already implemented for you:
  1. Euclidean Distance
  2. Pearson Correlation Coefficient
  3. KL Divergence
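The supplied implementations are not reproduced in this transcript. As a rough guide to what such scores compute, here is a sketch that compares two PWMs of equal width column by column and averages the result. The averaging, the pseudocount in the KL divergence, the lack of any shifting or reverse-complement alignment, and all names are assumptions, so the numbers need not agree with the provided code.

```python
import math

BASES = "ACGT"


def column_pairs(p, q):
    """Pair up the columns of two PWMs of the same width."""
    if len(p["A"]) != len(q["A"]):
        raise ValueError("PWMs must have the same width")
    return [([p[b][i] for b in BASES], [q[b][i] for b in BASES])
            for i in range(len(p["A"]))]


def euclidean_distance(p, q):
    pairs = column_pairs(p, q)
    return sum(math.sqrt(sum((x - y) ** 2 for x, y in zip(cp, cq)))
               for cp, cq in pairs) / len(pairs)


def pearson_correlation(p, q):
    pairs = column_pairs(p, q)
    total = 0.0
    for cp, cq in pairs:
        mp, mq = sum(cp) / 4, sum(cq) / 4
        cov = sum((x - mp) * (y - mq) for x, y in zip(cp, cq))
        sp = math.sqrt(sum((x - mp) ** 2 for x in cp))
        sq = math.sqrt(sum((y - mq) ** 2 for y in cq))
        # A uniform column has zero variance; count it as uncorrelated.
        total += cov / (sp * sq) if sp > 0 and sq > 0 else 0.0
    return total / len(pairs)


def kl_divergence(p, q, eps=1e-6):
    """Mean KL(P || Q) in bits per column, smoothed to avoid log(0)."""
    pairs = column_pairs(p, q)
    total = 0.0
    for cp, cq in pairs:
        sp = [(x + eps) / (sum(cp) + 4 * eps) for x in cp]
        sq = [(y + eps) / (sum(cq) + 4 * eps) for y in cq]
        total += sum(x * math.log2(x / y) for x, y in zip(sp, sq))
    return total / len(pairs)


a = {"A": [0.7, 0.1], "C": [0.1, 0.7], "G": [0.1, 0.1], "T": [0.1, 0.1]}
b = {"A": [0.1, 0.7], "C": [0.7, 0.1], "G": [0.1, 0.1], "T": [0.1, 0.1]}
print(euclidean_distance(a, b), pearson_correlation(a, b), kl_divergence(a, b))
```

Note that KL divergence is not symmetric, so the order of the two PWMs matters.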

Page 35:

Implementation

• Java (Eclipse); Linux (other languages are possible, but will not participate in the bonus).
• Input: one single argument, the PBM filename
• Output: PWM file, PWM presented as a logo, and a graphical presentation of the PWM matching distribution among probes.
• Packages for motif logo and statistical scores will be supplied
• Time performance will be measured
• Reasonable documentation
• Separate packages for data-structures, scores, GUI, I/O, etc.

Page 36:

Submission

• Printed design document.
• Printed code – for comments and remarks.
• Printed results document – for each test set, the PWM logo + how good your result is in terms of correlation to the probe ranks.
• 4 PWM files, e.g. Test_1.pwm (submitted by email).
• Executable for the online test.

Page 37:

Grade

• 20% for the design
• 30% for the implementation (20% for modularity, clarity, documentation, 10% for efficiency)
• 30% for the performance and experimental results (20% for the accuracy on the 4 test queries and 10% for the accuracy on the online test query)
• 20% for the final report and presentation
• 10% bonus to the group with the most accurate results
• 10% bonus for the group with the fastest implementation

Page 38:

Bonus grading

• Accuracy will be determined using the provided code that compares two PWMs.
• We will take the average of runs on several different PBM data sets.
• Running time will be measured for Java implementations, and the average will be taken.

Page 39:

Schedule

1. First progress report 23/11
2. Design document 21/12
3. Final presentation 16/2

• We shall meet with each group on each of these dates – mark your calendars!
• Meetings can be scheduled earlier if you are ready.
• You are always welcome to meet us. Contact us by email.

Page 40:

Design document

• Due in week 12 (21/12).
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and output of program
• Describe main data structures, algorithms, and scores.
• Meet with me before submission.

Page 41:

Reference

• Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW III, Bulyk ML. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology. 2006;24(11):1429-1435.

Very important! Read: the_brain.bwh.harvard.edu/UPBMseqn/suppl_methods.doc

• Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007;23(13):i72-i79.

Page 42:

Fin