Transcription Regulation Transcription Factor Motif Finding Xiaole Shirley Liu STAT115, STAT215

Dec 25, 2015


Frank Young
Page 1: Transcription Regulation Transcription Factor Motif Finding Xiaole Shirley Liu STAT115, STAT215.

Transcription Regulation: Transcription Factor Motif Finding

Xiaole Shirley Liu

STAT115, STAT215

Page 2:

Imagine a Chef

Restaurant Dinner Home Lunch

Certain recipes used to make certain dishes

Page 3:

Each Cell Is Like a Chef

Page 4:

Each Cell Is Like a Chef

Infant Skin Adult Liver

Glucose, Oxygen, Amino Acid

Fat, Alcohol, Nicotine

Healthy Skin Cell State

Disease Liver Cell State

Certain genes expressed to make certain proteins

Page 5:

Understanding a Genome

Get the complete sequence (encoded cook book)

Observe gene expressions at different cell states (meals prepared at different situations)

Decode gene regulation (decode the book, understand the rules)

Page 6:

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT

Information in DNA

Milk->Yogurt

Beef->Burger

Egg->Omelet

Fish->Sushi

Flour->Cake

Coding region 2%

What is to be made

Page 7:

Information in DNA

Non-coding region 98%

Regulation: When, Where, Amount, Other Conditions, etc.

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC

ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG

GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA

TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG

CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk->Yogurt

Beef->Burger

Egg->Omelet

Fish->Sushi

Flour->Cake

Morning

Morning

Japanese Restaurant

5 Oz

9 Oz

Butter

Butter

Coding region 2%

Page 8:

Measure Gene Expression

• Microarray or SAGE detects the expression of every gene at a certain cell state

• Clustering finds genes that are co-expressed (and potentially share regulation)

Page 9:

STAT115, 04/01/2008

Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

Look at genes always expressed together:

Upstream Regions

Co-expressed Genes

Page 11:

Decode Gene Regulation

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG

CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Scrambled Egg

Bacon

Cereal

Hash Brown

Orange Juice

Look at genes always expressed together:

Upstream Regions

Co-expressed Genes

Morning

Page 12:

Biology of Transcription Regulation

...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...

...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...

...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...

...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...

[Figure: segments highlighted in the upstream sequences above: atttgctt, ttcact, gcaacct, aactccagt, actca, ccagcgccg; labels: Transcription Factor (TF), TF Binding Motif]

Genes shown: Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, Hemoglobin Gamma

A motif can only be computationally discovered when there are enough cases for machine learning

Page 13:

Computational Motif Finding

• Input data:

– Upstream sequences of gene expression profile cluster

– 20-800 sequences, each 300-5000 bps long

• Output: enriched sequence patterns (motifs)

• Ultimate goals:

– Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?

– Which genes are regulated by this TF? Why is there disease when a TF goes wrong?

– Are there binding partners / competitors for a TF?

Page 14:

Challenges: Where/what the signal

The motif should be abundant

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Water

Water

Water

Water

Water

Page 15:

The motif should be abundant

And Abundant with significance

GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATTTACCACCAAATAAGACACGGACGGC

GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA

TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG

CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Coconut

Coconut

Coconut

Coconut

Coconut

Challenges: Where/what the signal

Page 16:

Challenges: Double stranded DNA

Motif appears in both strands

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
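A minimal sketch of a two-strand search, assuming the motif is the ATTTACC pattern from the earlier "Decode Gene Regulation" slides and that only exact matches count. The function names are illustrative, not from the lecture. A site on the opposite strand shows up on the given strand as the reverse complement (GGTAAAT).

```python
# Hypothetical helper names; sequences are the three top strands on the slide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites_both_strands(seq: str, motif: str) -> list:
    """Return (0-based start, strand) for every exact occurrence of motif.

    A '-' hit means the reverse complement of the motif was found on the
    given strand, i.e. the motif itself lies on the opposite strand.
    """
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if window == motif:
            hits.append((i, "+"))
        if window == rc and rc != motif:
            hits.append((i, "-"))
    return hits

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC",
    "TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG",
]

for s in SEQS:
    print(find_sites_both_strands(s, "ATTTACC"))
# prints [(11, '+')], then [(10, '-')], then [(5, '-')]
```

Scanning one strand for both the motif and its reverse complement avoids materializing the second strand for every sequence.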

Page 17:

Challenges: Base substitutions

Sequences do not have to match the motif perfectly; base substitutions are allowed

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATATGTACCACCAGTTCAGACACGGACGGC

GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA

TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG

CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
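One simple way to tolerate substitutions is to accept any window within a fixed Hamming distance of a consensus. The sketch below assumes the consensus ATTTACC and a limit of one mismatch; both are illustrative choices, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_approx_sites(seq: str, motif: str, max_mismatches: int = 1) -> list:
    """Return (0-based start, window) for windows within max_mismatches of motif."""
    width = len(motif)
    return [
        (i, seq[i:i + width])
        for i in range(len(seq) - width + 1)
        if hamming(seq[i:i + width], motif) <= max_mismatches
    ]

# The five sequences from the slide, each carrying one (possibly mutated) site.
SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATGTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG",
    "CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT",
]

for s in SEQS:
    print(find_approx_sites(s, "ATTTACC", max_mismatches=1))
```

A hard mismatch limit is still a binary decision; the position weight matrix introduced later in the deck replaces it with a graded score.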

Page 18:

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

Page 19:

Challenges: Variable motif copies

Some sequences do not have the motif

Some have multiple copies of the motif

GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC

CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC

TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG

GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA

CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT

Sushi

Hand Roll

Sashimi

Tempura

Sake

Fish Fish Fish

Fish Fish Fish Fish

Page 20:

Challenges: Two-block motifs

Some motifs have two parts

GACACATTTACCTATGC TGGCCCTACGACCTCTCGC

CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC

GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA

TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG

CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT

AATGCG

GCGTAA

or palindromic patterns

Coconut Milk

Page 21:

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR

• Motif representation:

– Regular expression (binary decision):

Consensus CACAAAA

Degenerate CRCAAAW (IUPAC: R = A/G, W = A/T)

Page 22:

IUPAC for DNA

A adenosine

C cytidine

G guanine

T thymidine

U uridine

R G A (purine)

Y T C (pyrimidine)

K G T (keto)

M A C (amino)

S G C (strong)

W A T (weak)

B C G T (not A)

D A G T (not C)

H A C T (not G)

V A C G (not T)

N A C G T (any)
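The table above maps directly onto regular-expression character classes. The following is a small sketch using Python's standard `re` module; the helper names are made up for illustration, and treating U as T when scanning DNA is an assumption.

```python
import re

# IUPAC code -> set of bases, following the table above (U treated as T: an assumption).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern: str) -> str:
    """Translate a degenerate IUPAC consensus into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def scan(seq: str, pattern: str) -> list:
    """Return (0-based start, matched site) for all matches, including overlapping ones."""
    # A lookahead with a capture group reports overlapping occurrences.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq)]

print(iupac_to_regex("CRCAAAW"))                  # C[AG]CAAA[AT]
print(scan("TTCACAAAATTCGCAAATGG", "CRCAAAW"))    # [(2, 'CACAAAA'), (11, 'CGCAAAT')]
```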

Page 23:

Scan for Known TF Motif Sites

• Experimental TF sites: TRANSFAC, JASPAR

• Motif representation:

– Regular expression (binary decision): Consensus CACAAAA, Degenerate CRCAAAW

– Position weight matrix (PWM): need score cutoff

Sites:

Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG

Motif Matrix:

Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score:

$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}, \qquad p(\text{generate ATGCAGCT from background}) = p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$$
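The matrix and score on this slide can be reproduced with a few lines of code. This is a sketch under stated assumptions: a uniform background (0.25 per base) stands in for the genome background, the optional pseudocount is an illustrative addition, and all function names are hypothetical.

```python
BASES = "ACGT"

# The ten aligned sites shown on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]

# Assumption: uniform background; a real scan would estimate this from the genome.
UNIFORM_BG = {b: 0.25 for b in BASES}

def build_pwm(sites, pseudocount=0.0):
    """Column-wise base frequencies; pseudocount is added to every base count."""
    pwm = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def consensus(pwm):
    """Most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pwm)

def likelihood_ratio(segment, pwm, background):
    """p(segment | motif matrix) / p(segment | background)."""
    ratio = 1.0
    for j, base in enumerate(segment):
        ratio *= pwm[j][base] / background[base]
    return ratio

def scan(seq, pwm, background, cutoff):
    """Return (0-based start, score) for windows scoring at or above the cutoff."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = likelihood_ratio(seq[i:i + width], pwm, background)
        if score >= cutoff:
            hits.append((i, score))
    return hits

pwm = build_pwm(SITES)
print(consensus(pwm))                                   # ATGGCATG
print(likelihood_ratio("ATGCAGCT", pwm, UNIFORM_BG))    # 0.0 (A was never seen at position 5)
```

Without pseudocounts a single unseen base forces the whole score to zero, which is one reason practical scanners smooth the matrix before scoring.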

Page 24:

A Word on Sequence Logo

• SeqLogo consists of stacks of symbols, one stack for each position in the sequence

• The overall height of the stack indicates the sequence conservation at that position

• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position

ATGGCATG

AGGGTGCG

ATCGCATG

TTGCCACG

ATGGTATT

ATTGCACG

AGGGCGTT

ATGACATG

ATGGCATG

ACTGGATG
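As a rough illustration of how the stack heights could be computed, the sketch below uses the common information-content convention for DNA logos (2 bits minus the Shannon entropy of the column) and omits the small-sample correction. The slide does not state which formula the logo tool uses, so this choice is an assumption, as are the function names.

```python
import math

BASES = "ACGT"

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]

def column_frequencies(sites, j):
    """Relative frequency of each base at column j."""
    return {b: sum(s[j] == b for s in sites) / len(sites) for b in BASES}

def information_content(freqs):
    """Stack height in bits: 2 - entropy (assumed convention, no small-sample correction)."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy

def logo_heights(sites):
    """Per position, the height of each letter: frequency times stack height."""
    heights = []
    for j in range(len(sites[0])):
        freqs = column_frequencies(sites, j)
        stack = information_content(freqs)
        heights.append({b: freqs[b] * stack for b in BASES})
    return heights

for pos, column in enumerate(logo_heights(SITES), start=1):
    stack = sum(column.values())
    print(pos, round(stack, 3), {b: round(h, 3) for b, h in column.items() if h > 0})
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four bases equally frequent has height 0.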

Page 25:

JASPAR

• User defined cutoff to scan for a particular motif

Page 26:

Drawbacks to Known TF Motif Scans

• Limited number of motifs

• Limited number of sites to represent each motif

– Low sensitivity and specificity

• Poor description of motif

– Binding site borders not clear

– Binding sites have many mismatches

• Many motifs look very similar

– E.g. GC-rich motif, E-box (CACGTG)

Page 27:

De Novo Motif Finding

Page 28:

De novo Sequence Motif Finding

• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

• Regular expression enumeration

– Pattern driven approach

– Enumerate k-mers, check significance in dataset

• Position weight matrix update

– Data driven approach, use data to refine motifs

– EM & Gibbs sampling

– Motif score and Markov background

Page 29:

Regular Expression Enumeration

• Oligonucleotide Analysis: check over-representation for every w-mer:

– Expected w occurrence in data

• Consider genome sequence + current data size

– Observed w occurrence in data

– Over-represented w is potential TF binding motif

[Figure: the observed occurrence of w in the data is compared with the expected occurrence of w in the data, where expected occurrence = p_w (from genome background) × size of sequence data]
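The over-representation check can be sketched as follows. Several simplifications are assumptions rather than the lecture's method: p_w comes from a uniform background instead of real genome frequencies, only the given strand is counted, and significance is a Poisson upper tail. The sequences are the five upstream regions from the "Decode Gene Regulation" slide, and the function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]

def count_wmers(seqs, w):
    """Observed occurrences of every w-mer across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts

def expected_count(wmer, background, n_windows):
    """Expected occurrences = p_w * number of windows (the data size)."""
    p_w = 1.0
    for base in wmer:
        p_w *= background[base]
    return p_w * n_windows

def poisson_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam), with lam > 0."""
    below = sum(exp(-lam + i * log(lam) - lgamma(i + 1)) for i in range(observed))
    return max(0.0, 1.0 - below)

W = 7
background = {b: 0.25 for b in "ACGT"}   # assumption: uniform, not a real genome background
n_windows = sum(len(s) - W + 1 for s in SEQS)
counts = count_wmers(SEQS, W)

for wmer, observed in counts.most_common(3):
    expected = expected_count(wmer, background, n_windows)
    print(wmer, observed, round(expected, 4), poisson_tail(observed, expected))
```

On real data the background would typically come from genome-wide w-mer frequencies or a Markov model, as the later slides note.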

Page 30:

Suffix Tree for Fast Search

• Weeder, Pavesi & Pesole 2006

• Construction is linear in time and space in the length of the string S.

• Quickly locates a substring while allowing a certain number of mismatches

• Provides first linear-time solutions for the longest common substring problem

• Typically requires significantly more space than storing the string itself.

Page 31:

Regular Expression Enumeration

• RE Enumeration Derivatives:

– oligo-analysis, spaced dyads w1.ns.w2

– IUPAC alphabet

– Markov background (later)

– 2-bit encoding, fast index access

– Enumerate limited RE patterns known for a TF protein structure or interaction theme

• Exhaustive, guaranteed to find global optimum, and can find multiple motifs

• Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width
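To illustrate the "2-bit encoding, fast index access" item, here is a sketch in which each base takes two bits, so a k-mer becomes an integer that indexes directly into a count array of size 4^k. The specific mapping A=0, C=1, G=2, T=3 and the function names are assumptions, and ambiguous bases such as N are not handled.

```python
# Assumed mapping; any fixed assignment of the four bases to 0..3 would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode(value: int, k: int) -> str:
    """Unpack an integer produced by encode back into a k-mer."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[value & 3])
        value >>= 2
    return "".join(reversed(bases))

def count_kmers_indexed(seq: str, k: int) -> list:
    """Count all k-mers with a rolling 2-bit code and a flat array of size 4**k."""
    counts = [0] * (4 ** k)
    mask = (1 << (2 * k)) - 1      # keeps only the last k bases
    value = 0
    for i, base in enumerate(seq):
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            counts[value] += 1
    return counts

print(encode("ACGT"))                           # 27
print(decode(27, 4))                            # ACGT
print(count_kmers_indexed("ATATAT", 2)[encode("AT")])   # 3
```

The rolling update touches each base once, so counting every k-mer in a sequence takes time linear in its length.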

Page 32:

Expectation Maximization and Gibbs Sampling Model

• Objects:

– Seq: sequence data to search for motif

– θ0: non-motif (genome background) probability

– θ: motif probability matrix parameter

– π: motif site locations

• Problem: P(θ, π | seq, θ0)

• Approach: alternately estimate

– π by P(π | θ, seq, θ0)

– θ by P(θ | π, seq, θ0)

– EM and Gibbs differ in the estimation methods

Page 33:

Expectation Maximization

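• E step: π | θ, seq, θ0

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

p1 = likelihood ratio:

$$p_1 = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}, \qquad P(\text{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$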

Page 34:

Expectation Maximization

• E step: π | θ, seq, θ0

TTGACGACTGCACGT

TTGAC p1

TGACG p2

GACGA p3

ACGAC p4

CGACT p5

GACTG p6

ACTGC p7

CTGCA p8

...

• M step: θ | π, seq, θ0

p1 TTGAC

p2 TGACG

p3 GACGA

p4 ACGAC

...

• Scale ACGT at each position; θ reflects the weighted average of π

Page 35:

M Step

TTGACGACTGCACGT

0.8 TTGAC
0.2 TGACG
0.6 GACGA
0.5 ACGAC
0.3 CGACT
0.7 GACTG
0.4 ACTGC
0.1 CTGCA
0.9 TGCAC
…
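The E and M steps can be sketched together. The E step below computes the likelihood ratio of every window, and the M step turns weighted windows into a new motif matrix by scaling the weighted base counts in each column to sum to 1. The names `theta` and `theta0` for the motif matrix and background are assumed labels, the background values (A 0.3, T 0.3, G 0.2, C 0.2) are the ones from the E-step slide, and only the nine weighted windows listed above are used.

```python
BASES = "ACGT"

SEQ = "TTGACGACTGCACGT"
THETA0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}   # background from the E-step slide

def windows(seq, w):
    """All overlapping segments of width w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def e_step(seq, theta, theta0):
    """Likelihood ratio P(window | theta) / P(window | theta0) for every window."""
    ratios = []
    for win in windows(seq, len(theta)):
        ratio = 1.0
        for j, base in enumerate(win):
            ratio *= theta[j][base] / theta0[base]
        ratios.append(ratio)
    return ratios

def m_step(wins, weights, w, pseudocount=0.0):
    """Weighted base counts per column, scaled so that each column sums to 1."""
    theta = []
    for j in range(w):
        column = {b: pseudocount for b in BASES}
        for win, weight in zip(wins, weights):
            column[win[j]] += weight
        total = sum(column.values())
        theta.append({b: column[b] / total for b in BASES})
    return theta

# Weights and windows exactly as listed on the M-step slide (first nine windows).
SLIDE_WEIGHTS = [0.8, 0.2, 0.6, 0.5, 0.3, 0.7, 0.4, 0.1, 0.9]
slide_windows = windows(SEQ, 5)[:9]
theta = m_step(slide_windows, SLIDE_WEIGHTS, 5)
for j, column in enumerate(theta, start=1):
    print(j, {b: round(p, 3) for b, p in column.items()})

# One further E step with the updated matrix; a small pseudocount avoids zero ratios.
theta_smooth = m_step(slide_windows, SLIDE_WEIGHTS, 5, pseudocount=0.1)
print([round(r, 3) for r in e_step(SEQ, theta_smooth, THETA0)])
```

Alternating these two functions until the matrix stops changing gives the deterministic iteration described on the following slide.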

Page 36:

EM Derivatives

• First EM motif finder (C Lawrence)

– Deterministic algorithm, guaranteed to reach a local optimum

• MEME (TL Bailey)

– Prior probability allows 0-n sites / sequence

– Runs multiple EMs in parallel with different seeds

– User friendly results

Page 37:

Gibbs Sampling

• Stochastic process, although it may still need multiple initializations

– Sample θ from P(θ | π, seq, θ0)

– Sample π from P(π | θ, seq, θ0)

• Collapsed form: θ estimated with counts, not sampled from Dirichlet

– Sample site from one seq based on sites from other seqs

• Converged motif matrix θ and converged motif sites π represent the stationary distribution of a Markov chain

Page 38:

Gibbs Sampler

• Randomly initialize a probability matrix

[Figure: sequences 1-5, each with one initial site; the initial motif matrix is estimated with counts from these sites]

$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$

Page 39:

Gibbs Sampler

• Take out one sequence with its sites from current motif

[Figure: motif matrix without the sequence 1 segment; the sites from sequences 2-5 remain]

Page 40:

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); segment (1-8) of sequence 1 is scored with the motif matrix built without the sequence 1 segment]

Page 41:

Gibbs Sampler

• Score each possible segment of this sequence

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); segment (2-9) of sequence 1 is scored with the motif matrix built without the sequence 1 segment]

Page 42:

Segment Score

• Use current motif matrix to score a segment

Sites:

Pos 12345678
    ATGGCATG
    AGGGTGCG
    ATCGCATG
    TTGCCACG
    ATGGTATT
    ATTGCACG
    AGGGCGTT
    ATGACATG
    ATGGCATG
    ACTGGATG

Motif Matrix:

Pos  A    C    G    T    Con
1    0.9  0    0    0.1  A
2    0    0.1  0.2  0.7  T
3    0    0.1  0.7  0.2  G
4    0.1  0.1  0.8  0    G
5    0    0.7  0.1  0.2  C
6    0.8  0    0.2  0    A
7    0    0.3  0    0.7  T
8    0    0    0.8  0.2  G

Segment ATGCAGCT score:

$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}, \qquad p(\text{generate ATGCAGCT from background}) = p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$$

Page 43:

Scoring Segments

Motif 1 2 3 4 5 bg

A 0.4 0.1 0.3 0.4 0.2 0.3

T 0.2 0.5 0.1 0.2 0.2 0.3

G 0.2 0.2 0.2 0.3 0.4 0.2

C 0.2 0.2 0.4 0.1 0.2 0.2

Ignore pseudo counts for now…

Sequence: TTCCATATTAATCAGATTCCG… score

TAATC …

AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383

ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185

TCAGA 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667

CAGAT …
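The segment scores can be recomputed directly from the table above. This sketch stores the motif and background columns exactly as listed and, like the slide, ignores pseudocounts; the function names are illustrative.

```python
# Motif probabilities at positions 1-5 and background (bg), copied from the table.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
BG = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
WIDTH = 5

def segment_score(segment: str) -> float:
    """Product over positions of motif probability / background probability."""
    score = 1.0
    for j, base in enumerate(segment):
        score *= MOTIF[base][j] / BG[base]
    return score

def score_all_segments(seq: str) -> list:
    """Return (segment, score) for every window of WIDTH bases."""
    return [
        (seq[i:i + WIDTH], segment_score(seq[i:i + WIDTH]))
        for i in range(len(seq) - WIDTH + 1)
    ]

print(round(segment_score("AATCA"), 6))   # 0.049383
print(round(segment_score("ATCAG"), 5))   # 11.85185

for segment, score in score_all_segments("TTCCATATTAATCAGATTCCG"):
    print(segment, round(score, 6))
```

Each base is divided by its own background probability, so a C or G in the segment always uses 0.2 and an A or T always uses 0.3.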

Page 44:

Gibbs Sampler

• Sample site from one seq based on sites from other seqs

[Figure: bar chart "Segment Scores of Sequence 1"; x-axis: Starting Position of Segment (1-20); y-axis: Segment Score (0-30); a new site is sampled for sequence 1 and the modified motif matrix is estimated with counts]

Page 45:

Hill Climbing vs Sampling


Pos 1 2 3 4 5 6 7 8 9

Score 3 1 12 5 8 9 1 2 6

SubT 3 4 16 21 29 38 39 41 47

• Rand(subtotal) = X

• Find the first position with subtotal larger than X

Pos 1 2 3 4 5 6 7 8 9

Score 3 1 12 5 8 9 500 2 6

SubT 3 4 16 21 29 38 538 540 546
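The two tables can be turned into a small sampling routine. In this sketch hill climbing always takes the highest score, while sampling draws X uniformly below the final subtotal and returns the first position whose subtotal exceeds X, so positions are chosen in proportion to their scores. Function names are hypothetical and positions are 1-based to match the slide.

```python
import bisect
import itertools
import random

def subtotals(scores):
    """Running sums of the scores (the SubT row)."""
    return list(itertools.accumulate(scores))

def hill_climb(scores):
    """Hill climbing: always pick the position with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1

def sample_position(scores, x=None, rng=random):
    """Sampling: draw X in [0, total) and return the first position with subtotal > X."""
    cumulative = subtotals(scores)
    if x is None:
        x = rng.random() * cumulative[-1]
    return bisect.bisect_right(cumulative, x) + 1

scores_a = [3, 1, 12, 5, 8, 9, 1, 2, 6]
scores_b = [3, 1, 12, 5, 8, 9, 500, 2, 6]

print(subtotals(scores_a))              # [3, 4, 16, 21, 29, 38, 39, 41, 47]
print(hill_climb(scores_a))             # 3
print(sample_position(scores_a, x=3.5)) # 2
print(subtotals(scores_b))              # [3, 4, 16, 21, 29, 38, 538, 540, 546]
print(hill_climb(scores_b))             # 7
```

With the first score row, sampling still visits low-scoring positions fairly often; with the second row, position 7 carries 500 of the 546 total and is drawn most of the time, so sampling behaves almost like hill climbing.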

Page 46:

Gibbs Sampler

• Repeat the process until motif converges

[Figure: motif matrix without the sequence 2 segment; the next site is sampled for sequence 2]
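Putting the preceding steps together, a collapsed Gibbs sampler might look like the sketch below. It rests on several assumptions that the slides do not fix: exactly one site per sequence, a uniform background, a pseudocount of 1 per base, and a fixed number of sweeps instead of a convergence test. The input is the set of five upstream regions from the "Decode Gene Regulation" slide, and the function names are illustrative.

```python
import random

BASES = "ACGT"

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]

def motif_without(seqs, starts, w, skip, pseudo=1.0):
    """Motif matrix estimated with counts from all sites except sequence `skip`."""
    columns = [{b: pseudo for b in BASES} for _ in range(w)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for j in range(w):
            columns[j][seq[start + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in columns]

def segment_scores(seq, theta, background):
    """Likelihood-ratio score of every possible segment of seq."""
    w = len(theta)
    scores = []
    for i in range(len(seq) - w + 1):
        score = 1.0
        for j in range(w):
            base = seq[i + j]
            score *= theta[j][base] / background[base]
        scores.append(score)
    return scores

def gibbs_sampler(seqs, w, n_sweeps=200, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in BASES}        # assumption: uniform background
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]   # random initialization
    for _ in range(n_sweeps):
        for i, seq in enumerate(seqs):
            theta = motif_without(seqs, starts, w, skip=i)    # take sequence i out
            scores = segment_scores(seq, theta, background)   # score every segment
            # Sample the new site in proportion to its score (not hill climbing).
            starts[i] = rng.choices(range(len(scores)), weights=scores, k=1)[0]
    return starts

starts = gibbs_sampler(SEQS, w=7)
for seq, start in zip(SEQS, starts):
    print(start, seq[start:start + 7])
```

Because the procedure is stochastic, different seeds can end at different alignments, which is why the earlier slide notes that multiple initializations may still be needed.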

Page 47:

Gibbs Sampler Intuition

• Beginning:

– Randomly initialized motif

– No preference towards any segment

[Figure: "Beginning Iterations"; x-axis: Starting position of segments (1-20); y-axis: 0-40]

Page 48:

Gibbs Sampler Intuition

• Motif appears:

– Motif should have enriched signal (more sites)

– By chance some correct sites come to alignment

– Sites bias motif to attract other similar sites

[Figure: "Some good aligned segments come"; x-axis: Starting position of segments (1-20); y-axis: 0-40]

Page 49:

Gibbs Sampler Intuition

• Motif converges:

– All sites come to alignment

– Motif totally biased to sample sites every time

[Figure: "Motif converges towards the end"; x-axis: Starting position of segments (1-20); y-axis: 0-40]

Page 50:

Gibbs Sampler

• Column shift

[Figure: sequences 1-5 with their current sites]

• Metropolis algorithm:

– Propose π* as π shifted 1 column to left or right

– Calculate motif score u(π) and u(π*)

– Accept π* with prob = min(1, u(π*) / u(π))
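The column-shift move can be sketched as a single Metropolis step. The acceptance rule is the one on the slide; the motif score u used here (a likelihood ratio of the aligned sites against a uniform background, with a pseudocount of 0.5) is an assumption, since the slide does not define u, and the function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def motif_score(seqs, starts, w, pseudo=0.5):
    """Assumed score u: likelihood ratio of the aligned sites vs a uniform background."""
    n = len(seqs)
    log_u = 0.0
    for j in range(w):
        counts = {b: 0 for b in BASES}
        for seq, start in zip(seqs, starts):
            counts[seq[start + j]] += 1
        for b in BASES:
            p = (counts[b] + pseudo) / (n + 4 * pseudo)
            log_u += counts[b] * math.log(p / 0.25)
    return math.exp(log_u)

def metropolis_shift(seqs, starts, w, rng, score=motif_score):
    """Propose shifting every site one column left or right; accept by the Metropolis rule."""
    delta = rng.choice([-1, 1])
    proposal = [s + delta for s in starts]
    # A proposal that runs off either end of any sequence is rejected outright.
    if any(p < 0 or p + w > len(seq) for p, seq in zip(proposal, seqs)):
        return starts
    u_old = score(seqs, starts, w)
    u_new = score(seqs, proposal, w)
    if rng.random() < min(1.0, u_new / u_old):
        return proposal
    return starts

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
]
rng = random.Random(0)
starts = [13, 12, 8]          # sites deliberately offset by two columns from ATTTACC
for _ in range(20):
    starts = metropolis_shift(SEQS, starts, 7, rng)
print(starts, [s[i:i + 7] for s, i in zip(SEQS, starts)])
```

The move helps when the sampler has locked onto a shifted version of the motif, since changing one sequence at a time rarely escapes that state.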

Page 51:

Summary

• Biology and challenge of transcription regulation

• Scan for known TF motif sites: TRANSFAC & JASPAR

• De novo method

– Regular expression enumeration

• Oligonucleotide analysis

– Position weight matrix update

• EM (iterate θ, π; θ ~ weighted average)

• Gibbs Sampler (sample θ, π; Markov chain convergence)
