Transcription Regulation
Transcription Factor Motif Finding
Xiaole Shirley Liu
STAT115, STAT215
Dec 25, 2015
Each Cell Is Like a Chef
Infant Skin Adult Liver
Glucose, Oxygen, Amino Acid
Fat, Alcohol, Nicotine
Healthy Skin Cell State
Disease Liver Cell State
Certain genes expressed to make certain proteins
Understanding a Genome
Get the complete sequence (encoded cook book)
Observe gene expressions at different cell states
(meals prepared at different situations)
Decode gene regulation (decode the book, understand the rules)
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
Information in DNA
Milk->Yogurt
Beef->Burger
Egg->Omelet
Fish->Sushi
Flour->Cake
Coding region 2%
What is to be made
Information in DNA
Non-coding region 98%
Regulation: When, Where,
Amount, Other Conditions, etc
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Milk->Yogurt
Beef->Burger
Egg->Omelet
Fish->Sushi
Flour->Cake
Morning
Morning
Japanese Restaurant
5 Oz
9 Oz
Butter
Butter
Coding region 2%
Measure Gene Expression
• Microarray or SAGE detects the expression of every gene at a certain cell state
• Clustering finds genes that are co-expressed (potentially share regulation)
STAT115, 04/01/2008
Decode Gene Regulation
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
Look at genes always expressed together:
Upstream Regions
Co-expressed Genes
Morning
Biology of Transcription Regulation
...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
atttgctt ttcact gcaacct
aactccagt
actca
gcaacct
gcaacct
gcaacctccagcgccg
gcaacct
Transcription Factor (TF)
TF Binding Motif
Hemoglobin Beta
Hemoglobin Zeta
Hemoglobin Alpha
Hemoglobin Gamma
Motifs can only be computationally discovered when there are enough cases for machine learning
Computational Motif Finding
• Input data:
– Upstream sequences of gene expression profile cluster
– 20-800 sequences, each 300-5000 bps long
• Output: enriched sequence patterns (motifs)
• Ultimate goals:
– Which TFs are involved and their binding motifs and effects (enhance / repress gene expression)?
– Which genes are regulated by this TF, why is there disease when a TF goes wrong?
– Are there binding partners / competitors for a TF?
Challenges: Where/what the signal
The motif should be abundant
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
Water
Water
Water
Water
Water
Challenges: Where/what the signal
The motif should be abundant
And abundant with significance
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
Coconut
Coconut
Coconut
Coconut
Coconut
Challenges: Double stranded DNA
Motif appears in both strands
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG
TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
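The double-strand challenge above can be illustrated with a small scan that checks each window against both the motif and its reverse complement. This is a minimal sketch, not the method used in the lecture; the function names `reverse_complement` and `find_both_strands` are hypothetical, and the motif ATTTACC is taken from the example sequences.

```python
# Translation table for complementing DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_both_strands(seq: str, motif: str) -> list[tuple[int, str]]:
    """Report (0-based start, strand) for exact motif hits on either strand.

    A '-' hit means the window on the given strand equals the reverse
    complement of the motif, i.e. the motif sits on the opposite strand.
    """
    seq = seq.upper()
    motif = motif.upper()
    rc = reverse_complement(motif)
    w = len(motif)
    hits = []
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits


if __name__ == "__main__":
    print(find_both_strands("GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC", "ATTTACC"))
    print(find_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC"))
```

In the second sequence the motif is only visible as GGTAAAT on the printed strand, which is why a single-strand search would miss it.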
Challenges: Base substitutions
Sequences do not have to match the motif
perfectly, base substitutions are allowed
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
Challenges: Variable motif copies
Some sequences do not have the motif
Some have multiple copies of the motif
GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT
Sushi
Hand Roll
Sashimi
Tempura
Sake
Fish Fish Fish
Fish Fish Fish Fish
Challenges: Two-block motifs
Some motifs have two parts
GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT
AATGCG
GCGTAA
or palindromic patterns
Coconut Milk
Scan for Known TF Motif Sites
• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision):
Consensus CACAAAA
Degenerate CRCAAAW (IUPAC: R = A/G, W = A/T)
IUPAC for DNA
A adenine
C cytosine
G guanine
T thymine
U uracil
R G A (purine)
Y T C (pyrimidine)
K G T (keto)
M A C (amino)
S G C (strong)
W A T (weak)
B C G T (not A)
D A G T (not C)
H A C T (not G)
V A C G (not T)
N A C G T (any)
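The IUPAC codes in the table map directly onto regular-expression character classes, which gives the "binary decision" scan for a degenerate consensus such as CRCAAAW. The sketch below is illustrative only; `iupac_to_regex` and `scan` are hypothetical names, and treating U as T is an assumption made because the input here is DNA.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
# Assumption: U is treated as T because we are scanning DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern such as CRCAAAW into a regex string."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def scan(seq: str, pattern: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping sites are all reported.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(seq.upper())]


if __name__ == "__main__":
    example = "TTCACAAAAGGCGCAAATCC"  # made-up sequence for illustration
    print(iupac_to_regex("CRCAAAW"))
    print(scan(example, "CACAAAA"))
    print(scan(example, "CRCAAAW"))
```

On the made-up example, the strict consensus finds one site while the degenerate pattern finds two, which shows the sensitivity gained by allowing R and W.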
Scan for Known TF Motif Sites
• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
– Regular expression (binary decision): Consensus CACAAAA, Degenerate CRCAAAW
– Position weight matrix (PWM): need score cutoff
Sites
Pos 12345678
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
Motif Matrix
Pos A C G T Con
1 0.9 0 0 0.1 A
2 0 0.1 0.2 0.7 T
3 0 0.1 0.7 0.2 G
4 0.1 0.1 0.8 0 G
5 0 0.7 0.1 0.2 C
6 0.8 0 0.2 0 A
7 0 0.3 0 0.7 T
8 0 0 0.8 0.2 G
Segment ATGCAGCT score:
$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
where the denominator is
$$p(\text{generate ATGCAGCT from background}) = p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$$
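The position weight matrix above can be rebuilt from the ten aligned sites, and the segment score is then the motif-versus-background likelihood ratio. The following is a minimal sketch under stated assumptions: the uniform background of 0.25 per base and the cutoff value are made up for illustration, and the names `frequency_matrix`, `likelihood_ratio`, and `scan_pwm` are hypothetical.

```python
BASES = "ACGT"

# The ten aligned sites from the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def frequency_matrix(sites, pseudocount=0.0):
    """One dict of base frequencies per motif position."""
    matrix = []
    total = len(sites) + 4 * pseudocount
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix


def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in matrix)


def likelihood_ratio(segment, matrix, background):
    """p(segment | motif matrix) / p(segment | background)."""
    ratio = 1.0
    for col, base in zip(matrix, segment):
        ratio *= col[base] / background[base]
    return ratio


def scan_pwm(seq, matrix, background, cutoff):
    """Return (start, score) for every window scoring at least `cutoff`."""
    w = len(matrix)
    hits = []
    for i in range(len(seq) - w + 1):
        score = likelihood_ratio(seq[i:i + w], matrix, background)
        if score >= cutoff:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    uniform = {b: 0.25 for b in BASES}  # assumed background
    pwm = frequency_matrix(SITES)
    print(consensus(pwm))
    print(likelihood_ratio("ATGCAGCT", pwm, uniform))
    print(scan_pwm("CCATGGCATGCC", pwm, uniform, cutoff=100.0))
```

Without pseudocounts, any segment containing a base never seen at that position scores exactly zero, which is one reason a pseudocount is usually added before scanning.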
A Word on Sequence Logo
• SeqLogo consists of stacks of symbols, one stack for each position in the sequence
• The overall height of the stack indicates the sequence conservation at that position
• The height of symbols within the stack indicates the relative frequency of each nucleotide at that position
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
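The stack heights described above are commonly computed as information content in bits: for DNA, 2 minus the Shannon entropy of the column. The sketch below assumes that definition and ignores the small-sample correction some logo tools apply; `column_information` and `logo_heights` are hypothetical names.

```python
import math


def column_information(column_bases: str) -> float:
    """Information content (bits) of one aligned column: 2 - entropy."""
    n = len(column_bases)
    entropy = 0.0
    for b in "ACGT":
        f = column_bases.count(b) / n
        if f > 0:
            entropy -= f * math.log2(f)
    return 2.0 - entropy


def logo_heights(sites):
    """Per position, the height of each observed base in the logo stack."""
    heights = []
    for pos in range(len(sites[0])):
        column = "".join(s[pos] for s in sites)
        info = column_information(column)
        heights.append(
            {b: info * column.count(b) / len(column) for b in "ACGT" if b in column}
        )
    return heights


if __name__ == "__main__":
    sites = [
        "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
        "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
    ]
    for pos, stack in enumerate(logo_heights(sites), start=1):
        print(pos, {b: round(h, 2) for b, h in stack.items()})
```

A perfectly conserved column reaches the maximum of 2 bits, and a column with all four bases equally frequent has height 0.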
Drawbacks to Known TF Motif Scans
• Limited number of motifs
• Limited number of sites to represent each motif
– Low sensitivity and specificity
• Poor description of motif
– Binding site borders not clear
– Binding sites have many mismatches
• Many motifs look very similar
– E.g. GC-rich motif, E-box (CACGTG)
De novo Sequence Motif Finding
• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)
• Regular expression enumeration
– Pattern driven approach
– Enumerate k-mers, check significance in dataset
• Position weight matrix update
– Data driven approach, use data to refine motifs
– EM & Gibbs sampling
– Motif score and Markov background
Regular Expression Enumeration
• Oligonucleotide Analysis: check over-representation for every w-mer:
– Expected w occurrence in data
• Consider genome sequence + current data size
– Observed w occurrence in data
– Over-represented w is potential TF binding motif
Compare the observed occurrence of w in the data with the expected occurrence:
$$\text{Expected occurrence of } w \text{ in the data} = p_w \text{ (from genome background)} \times \text{size of sequence data}$$
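The oligonucleotide analysis above can be sketched as counting every w-mer and comparing the observed count with the expected count. This is a simplified illustration: here p_w is approximated as the product of single-base background frequencies (an assumption; a genome-derived w-mer frequency or a Markov background would be more faithful), and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(seqs, w):
    """Observed occurrence of every w-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts


def expected_count(word, base_freq, n_windows):
    """Expected occurrence = p_w * number of windows in the data.

    Assumption: p_w is the product of independent base frequencies.
    """
    p = 1.0
    for b in word:
        p *= base_freq[b]
    return p * n_windows


def enriched_kmers(seqs, w, base_freq, top=5):
    """Rank w-mers by observed / expected; returns (word, obs, exp, ratio)."""
    counts = count_kmers(seqs, w)
    n_windows = sum(len(s) - w + 1 for s in seqs if len(s) >= w)
    rows = []
    for word, obs in counts.items():
        exp = expected_count(word, base_freq, n_windows)
        rows.append((word, obs, exp, obs / exp))
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows[:top]


if __name__ == "__main__":
    seqs = [
        "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
        "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
        "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
        "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
        "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
    ]
    uniform = {b: 0.25 for b in "ACGT"}  # assumed background frequencies
    for row in enriched_kmers(seqs, 7, uniform):
        print(row)
```

With a uniform background every 7-mer has the same expectation, so the ranking reduces to raw counts; a realistic background changes the ranking, which is the "abundant with significance" point made earlier.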
Suffix Tree for Fast Search
• Weeder, Pavesi & Pesole 2006
• Construction is linear in time and space in the length of the string S.
• Quickly locates a substring allowing a certain number of mistakes
• Provides the first linear-time solutions for the longest common substring problem
• Typically requires significantly more space than storing the string itself.
Regular Expression Enumeration
• RE Enumeration Derivatives:
– oligo-analysis, spaced dyads w1.ns.w2
– IUPAC alphabet
– Markov background (later)
– 2-bit encoding, fast index access
– Enumerate limited RE patterns known for a TF protein structure or interaction theme
• Exhaustive, guaranteed to find global optimum, and can find multiple motifs
• Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width
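One plausible reading of "2-bit encoding, fast index access" is that each base is packed into two bits so that a w-mer becomes an integer index into a count array. The sketch below shows that idea only; the specific bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions, not something stated in the slides.

```python
# Assumed 2-bit code for each base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def decode(value: int, w: int) -> str:
    """Unpack an integer back into a k-mer of width w."""
    bases = []
    for _ in range(w):
        bases.append("ACGT"[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def count_kmers_indexed(seq: str, w: int) -> list[int]:
    """Count all w-mers with a rolling 2-bit code as the array index."""
    counts = [0] * (4 ** w)
    mask = (1 << (2 * w)) - 1  # keeps only the last w bases
    value = 0
    for i, base in enumerate(seq):
        value = ((value << 2) | CODE[base]) & mask
        if i >= w - 1:
            counts[value] += 1
    return counts


if __name__ == "__main__":
    counts = count_kmers_indexed("ACGACG", 3)
    print(encode("ACG"), counts[encode("ACG")], decode(encode("ACG"), 3))
```

The count array has 4^w entries, which is one way to see why enumeration is "limited with motif width".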
Expectation Maximization and Gibbs Sampling Model
• Objects:
– Seq: sequence data to search for motif
– $\theta_0$: non-motif (genome background) probability
– $\theta$: motif probability matrix parameter
– $\pi$: motif site locations
• Problem: $P(\theta, \pi \mid \text{seq}, \theta_0)$
• Approach: alternately estimate
– $\pi$ by $P(\pi \mid \theta, \text{seq}, \theta_0)$
– $\theta$ by $P(\theta \mid \pi, \text{seq}, \theta_0)$
– EM and Gibbs differ in the estimation methods
Expectation Maximization
• E step: $\pi \mid \theta, \text{seq}, \theta_0$
TTGACGACTGCACGT
TTGAC p1
TGACG p2
GACGA p3
ACGAC p4
CGACT p5
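GACTG p6
ACTGC p7
CTGCA p8
...
$$p_1 = \text{likelihood ratio} = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$$
$$P(\text{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$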
Expectation Maximization
• E step: $\pi \mid \theta, \text{seq}, \theta_0$
TTGACGACTGCACGT
TTGAC p1
TGACG p2
GACGA p3
ACGAC p4
CGACT p5
GACTG p6
ACTGC p7
CTGCA p8
...
• M step: $\theta \mid \pi, \text{seq}, \theta_0$
p1 TTGAC
p2 TGACG
p3 GACGA
p4 ACGAC
...
• Scale ACGT at each position, $\theta$ reflects weighted average of $\pi$
M Step
TTGACGACTGCACGT
0.8 TTGAC
0.2 TGACG
0.6 GACGA
0.5 ACGAC
0.3 CGACT
0.7 GACTG
0.4 ACTGC
0.1 CTGCA
0.9 TGCAC
…
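The E and M steps above can be sketched for a single sequence: the E step weights every window by its likelihood ratio, and the M step rebuilds the motif matrix as the weighted average of those windows. This is an illustrative sketch only; normalizing the weights to sum to one assumes exactly one site per sequence, the pseudocount value is made up, and `e_step`, `m_step`, and `run_em` are hypothetical names.

```python
BASES = "ACGT"


def e_step(seq, theta, theta0):
    """Weight each window by P(window | theta) / P(window | theta0)."""
    w = len(theta)
    weights = []
    for i in range(len(seq) - w + 1):
        ratio = 1.0
        for j, base in enumerate(seq[i:i + w]):
            ratio *= theta[j][base] / theta0[base]
        weights.append(ratio)
    total = sum(weights)
    # Assumption: one site per sequence, so the weights are normalized.
    return [x / total for x in weights]


def m_step(seq, weights, w, pseudocount=0.01):
    """Rebuild theta as the weighted average of all windows."""
    theta = [{b: pseudocount for b in BASES} for _ in range(w)]
    for i, weight in enumerate(weights):
        for j, base in enumerate(seq[i:i + w]):
            theta[j][base] += weight
    for col in theta:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total  # scale A, C, G, T at each position
    return theta


def run_em(seq, theta, theta0, n_iter=10):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(n_iter):
        weights = e_step(seq, theta, theta0)
        theta = m_step(seq, weights, len(theta))
    return theta, weights


if __name__ == "__main__":
    seq = "TTGACGACTGCACGT"
    theta0 = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    # Seed the motif with a mild preference for the first window, TTGAC.
    start = [{b: (0.4 if b == c else 0.2) for b in BASES} for c in "TTGAC"]
    theta, weights = run_em(seq, start, theta0)
    print([round(x, 3) for x in weights])
```

Because the update is deterministic, the result depends on the starting matrix, which is why several seeds are normally tried.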
EM Derivatives
• First EM motif finder (C Lawrence)
– Deterministic algorithm, guarantee local optimum
• MEME (TL Bailey)
– Prior probability allows 0-n site / sequence
– Parallel running multiple EM with different seed
– User friendly results
Gibbs Sampling
• Stochastic process, although still may need multiple initializations
– Sample $\theta$ from $P(\theta \mid \pi, \text{seq}, \theta_0)$
– Sample $\pi$ from $P(\pi \mid \theta, \text{seq}, \theta_0)$
• Collapsed form: $\theta$ estimated with counts, not sampling from Dirichlet
– Sample site from one seq based on sites from other seqs
• Converged motif matrix $\theta$ and converged motif sites $\pi$ represent stationary distribution of a Markov Chain
Gibbs Sampler
• Randomly initialize a probability matrix
[Figure: sequences 1-5, each contributing one initial site to the initial motif matrix]
Matrix entries are estimated with counts:
$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$
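The count-based estimate above can be written as a one-column helper. The sketch reads n as the number of times each base occurs at the position among the current sites and s as a per-base pseudocount; that reading is an assumption, and `column_probabilities` is a hypothetical name.

```python
def column_probabilities(counts, pseudocounts):
    """Return p_b = (n_b + s_b) / sum over bases of (n_b + s_b)."""
    total = sum(counts[b] + pseudocounts[b] for b in "ACGT")
    return {b: (counts[b] + pseudocounts[b]) / total for b in "ACGT"}


if __name__ == "__main__":
    # Made-up counts for position 1 from four sites, pseudocount 1 per base.
    n = {"A": 3, "C": 0, "G": 1, "T": 0}
    s = {"A": 1, "C": 1, "G": 1, "T": 1}
    print(column_probabilities(n, s))
```

The pseudocounts keep unseen bases at a small nonzero probability, so a segment is never scored as impossible just because the current sites happen to lack a base.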
Gibbs Sampler
• Take out one sequence with its sites from current motif
[Figure: motif matrix recomputed without the segment from sequence 1; sequences 2-5 keep their sites]
Gibbs Sampler
• Score each possible segment of this sequence
[Figure: "Segment Scores of Sequence 1", segment score (0-30) versus starting position of segment (1-20); segment (1-8) of sequence 1 is scored against the motif matrix built without sequence 1]
Gibbs Sampler
• Score each possible segment of this sequence
[Figure: "Segment Scores of Sequence 1", segment score (0-30) versus starting position of segment (1-20); segment (2-9) of sequence 1 is scored against the motif matrix built without sequence 1]
Segment Score
• Use current motif matrix to score a segment
Sites
Pos 12345678
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
Motif Matrix
Pos A C G T Con
1 0.9 0 0 0.1 A
2 0 0.1 0.2 0.7 T
3 0 0.1 0.7 0.2 G
4 0.1 0.1 0.8 0 G
5 0 0.7 0.1 0.2 C
6 0.8 0 0.2 0 A
7 0 0.3 0 0.7 T
8 0 0 0.8 0.2 G
Segment ATGCAGCT score:
$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
where the denominator is
$$p(\text{generate ATGCAGCT from background}) = p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$$
Scoring Segments
Motif 1 2 3 4 5 bg
A 0.4 0.1 0.3 0.4 0.2 0.3
T 0.2 0.5 0.1 0.2 0.2 0.3
G 0.2 0.2 0.2 0.3 0.4 0.2
C 0.2 0.2 0.4 0.1 0.2 0.2
Ignore pseudo counts for now…
Sequence: TTCCATATTAATCAGATTCCG… score
TAATC …
AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA 0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT …
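The worked scores above can be reproduced by multiplying, position by position, the motif probability of each base by the inverse of its background probability. This sketch hard-codes the 5-column matrix and background from the table and, like the slide, ignores pseudocounts; `segment_score` and `score_all_segments` are hypothetical names.

```python
# Motif matrix from the table: one list of 5 position probabilities per base.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column from the same table.
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def segment_score(segment: str) -> float:
    """Likelihood ratio of a 5-base segment: motif versus background."""
    score = 1.0
    for pos, base in enumerate(segment):
        score *= MOTIF[base][pos] / BACKGROUND[base]
    return score


def score_all_segments(seq: str, width: int = 5):
    """Score every window of the sequence, left to right."""
    return [
        (seq[i:i + width], segment_score(seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    for segment, score in score_all_segments("TTCCATATTAATCAGATTCCG"):
        print(segment, round(score, 6))
```

Running it on the example sequence prints one line per window, including AATCA (about 0.049383) and ATCAG (about 11.85185).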
Gibbs Sampler
• Sample site from one seq based on sites from other seqs
[Figure: "Segment Scores of Sequence 1", segment score (0-30) versus starting position of segment (1-20); the newly sampled segment of sequence 1 rejoins the sites from sequences 2-5]
Modified motif matrix estimated with counts
Hill Climbing vs Sampling
Pos 1 2 3 4 5 6 7 8 9
Score 3 1 12 5 8 9 1 2 6
SubT 3 4 16 21 29 38 39 41 47
• Rand(subtotal) = X
• Find the first position with subtotal larger than X
Pos 1 2 3 4 5 6 7 8 9
Score 3 1 12 5 8 9 500 2 6
SubT 3 4 16 21 29 38 538 540 546
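The two tables above contrast picking the best-scoring position (hill climbing) with drawing a position in proportion to its score (sampling) via running subtotals. A minimal sketch of both follows; the function names are hypothetical, and positions are 1-based to match the tables.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def hill_climb(scores):
    """Always take the position with the highest score (1-based)."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def pick_position(scores, x):
    """First position whose running subtotal is larger than x (1-based)."""
    subtotals = list(accumulate(scores))
    # bisect_right gives the index of the first subtotal strictly above x;
    # the clamp guards against x landing exactly on the grand total.
    return min(bisect_right(subtotals, x), len(scores) - 1) + 1


def sample_position(scores, rng=random):
    """Draw X uniformly in [0, total) and map it to a position."""
    return pick_position(scores, rng.random() * sum(scores))


if __name__ == "__main__":
    scores = [3, 1, 12, 5, 8, 9, 1, 2, 6]
    peaked = [3, 1, 12, 5, 8, 9, 500, 2, 6]
    rng = random.Random(0)
    print(hill_climb(scores), [sample_position(scores, rng) for _ in range(10)])
    print(hill_climb(peaked), [sample_position(peaked, rng) for _ in range(10)])
```

With the first table, sampling still visits many positions; with the second, position 7 holds 500 of the 546 total, so sampling picks it about 92% of the time and behaves almost like hill climbing.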
Gibbs Sampler
• Repeat the process until motif converges
[Figure: next, the segment from sequence 2 is taken out of the motif matrix and sequence 2 is resampled]
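The cycle described above (take one sequence out, rebuild the matrix from the others, score every segment, sample a new site, repeat) can be put together as a short site sampler. This is a simplified sketch, not the lecture's implementation: it assumes exactly one site per sequence, runs a fixed number of sweeps instead of testing convergence, and uses a made-up pseudocount; `motif_matrix` and `gibbs_sampler` are hypothetical names.

```python
import random

BASES = "ACGT"


def motif_matrix(seqs, sites, w, skip, pseudocount=0.5):
    """Count-based motif matrix from all current sites except sequence `skip`."""
    matrix = [{b: pseudocount for b in BASES} for _ in range(w)]
    for k, (seq, start) in enumerate(zip(seqs, sites)):
        if k == skip:
            continue
        for j in range(w):
            matrix[j][seq[start + j]] += 1
    for col in matrix:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total
    return matrix


def gibbs_sampler(seqs, w, background, n_sweeps=200, seed=0):
    """Return one sampled site start (0-based) per sequence."""
    rng = random.Random(seed)
    # Random initialization: one site per sequence.
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k, seq in enumerate(seqs):
            # Take sequence k out and rebuild the motif from the others.
            matrix = motif_matrix(seqs, sites, w, skip=k)
            # Score each possible segment of sequence k.
            scores = []
            for i in range(len(seq) - w + 1):
                score = 1.0
                for j in range(w):
                    base = seq[i + j]
                    score *= matrix[j][base] / background[base]
                scores.append(score)
            # Sample the new site in proportion to its score.
            sites[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return sites


if __name__ == "__main__":
    seqs = [
        "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
        "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
        "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
        "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
        "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
    ]
    uniform = {b: 0.25 for b in BASES}  # assumed background
    sites = gibbs_sampler(seqs, 7, uniform)
    print([seq[s:s + 7] for seq, s in zip(seqs, sites)])
```

Because the procedure is stochastic, different seeds can end at different alignments, so in practice several runs are compared by motif score.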
Gibbs Sampler Intuition
• Beginning:
– Randomly initialized motif
– No preference towards any segment
[Figure: "Beginning Iterations", segment score (0-40) versus starting position of segments (1-20)]
Gibbs Sampler Intuition
• Motif appears:
– Motif should have enriched signal (more sites)
– By chance some correct sites come to alignment
– Sites bias motif to attract other similar sites
[Figure: "Some good aligned segments come", segment score (0-40) versus starting position of segments (1-20)]
Gibbs Sampler Intuition
• Motif converges:
– All sites come to alignment
– Motif totally biased to sample sites every time
[Figure: "Motif converges towards the end", segment score (0-40) versus starting position of segments (1-20)]
Gibbs Sampler
• Column shift
[Figure: sequences 1-5 with their current sites in the alignment]
• Metropolis algorithm:
– Propose $\theta^*$ as $\theta$ shifted 1 column to left or right
– Calculate motif score $u(\theta)$ and $u(\theta^*)$
– Accept $\theta^*$ with prob = $\min(1, u(\theta^*) / u(\theta))$
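The Metropolis column-shift step above can be sketched by shifting every site start by one column and accepting the move with probability min(1, u*/u). In this illustration the shift is applied to the site positions (which moves the resulting motif by one column), the motif score `score` is a caller-supplied function assumed to return a positive number, and all names are hypothetical.

```python
import random


def acceptance_probability(u_current: float, u_proposed: float) -> float:
    """Metropolis acceptance probability: min(1, u* / u)."""
    return min(1.0, u_proposed / u_current)


def column_shift(sites, seqs, w, score, rng=random):
    """Propose shifting all sites one column left or right; maybe accept.

    `score` maps a list of site starts to a positive motif score u.
    Returns the accepted site list (possibly unchanged).
    """
    shift = rng.choice([-1, 1])
    proposed = [s + shift for s in sites]
    # A shift that pushes any site off its sequence is rejected outright.
    if any(p < 0 or p + w > len(seq) for p, seq in zip(proposed, seqs)):
        return sites
    if rng.random() < acceptance_probability(score(sites), score(proposed)):
        return proposed
    return sites


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGTAA"]

    def toy_score(sites):
        # Toy score for illustration: favors alignments starting near 0.
        return 1.0 / (1 + sum(sites))

    print(column_shift([2, 2], seqs, 3, toy_score, random.Random(0)))
```

The step helps when the sampler has locked onto the motif off by one column, a local optimum that single-site updates escape only slowly.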
Summary
• Biology and challenge of transcription regulation
• Scan for known TF motif sites: TRANSFAC & JASPAR
• De novo method
– Regular expression enumeration
• Oligonucleotide analysis
– Position weight matrix update
• EM (iterate $\theta$, $\pi$; ~ weighted average)
• Gibbs Sampler (sample $\theta$, $\pi$; Markov chain convergence)