Welcome to Research Simulation 1 PSSMs & Search for Repeated Sequences Monday, 9 June 2003
Jan 05, 2016
Welcome toResearch Simulation 1
PSSMs & Search for Repeated SequencesMonday, 9 June 2003
Free-living Nostoc
heterocysts
Matveyev and Elhai (unpublished)
CO2
sucroseN2
O2
CyanobacteriaAnabaena/Nostoc grown on NO3
-, air
Free-living Nostoc
heterocysts
Matveyev and Elhai (unpublished)
CO2
sucroseN2
NH3
NH3
O2
CyanobacteriaAnabaena/Nostoc grown on NO3
-, air
Tandem Heptameric RepeatsDo they come in complementary pairs?
Tandem Heptameric RepeatsDo they come in complementary pairs?
C2. Consider the sequence below of one strand of a DNA fragment.
5'-AGAGAGAGCTAAGGTCTCTCC-3'
Which of the following is a likely structure for the single-stranded fragment to assume?
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
5'-AGAGAGAGCTAAGGTCTCTCC-3'
A B C
Tandem Heptameric RepeatsDo they come in complementary pairs?
IF:
Tandem Heptameric RepeatsDo they come in complementary pairs?
TCATTGGTCATTGGTCATTGGTCATTTGTCCTTTGT
AACAGTAACAGGAAACAGTAAACAATAAACAGGAAACAGTAAAC
Tandem Heptameric RepeatsDo they come in complementary pairs?
TCATTGGTCATTGGTCATTGGTCATTTGTCCTTTGT
AACAGTAACAGGAAACAGTAAACAATAAACAGGAAACAGTAAAC
Regulatory Protein and their Binding Sites
GTA ..(8).. TAC
5’-GTA ..(8).. TACNNNNNNNNNNTANNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGNNNNNNNNNNNNNNNN3’-CAT ..(8).. ATGNNNNNNNNNNATNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNTACNNNNNNNNNNNNNNNN
hetQ
NtcA
N RNA Polymerase
Differentiation in cyanobacteriaWhat does NtcA bind to?
Herrero et al (2001) J Bacteriol 183:411-425
mRNA
…(20-24)…TAnnnTGTA…(8)…TAC
Position-specific scoring matrices
Table 1: Examples of position-specific scoring matrices from sequence alignment
A. Sequence alignmenta
A T T T A G T A T C A A A A A T A A C A A T T C
G T T C T G T A A C A A A G A C T A C A A A A C
A T T T T G T A G C T A C T T A T A C T A T T T
A A G C T G T A A C A A A A T C T A C C A A A T
C A T T T G T A C A G T C T G T T A C C T T T A
A. Sequence alignmenta
A T T T A G T A T C A A A A A T A A C A A T T C
G T T C T G T A A C A A A G A C T A C A A A A C
A T T T T G T A G C T A C T T A T A C T A T T T
A A G C T G T A A C A A A A T C T A C C A A A T
C A T T T G T A C A G T C T G T T A C C T T T A
B. Table of occurrencesa
A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2
Position-specific scoring matrices
Position-specific scoring matrices
B. Table of occurrencesa
A 0 1 0 0 5 2 1 3 4 3
C 2 0 0 0 0 1 4 0 0 2
G 0 0 5 0 0 1 0 1 0 0
T 3 4 0 5 0 1 0 1 1 0
C. Position-specific scoring matrix (B = 0)b
A 0 .20 0 0 1.0 .40 .20 .60 .80 .60
C .40 0 0 0 0 .20 .80 0 0 .40
G 0 0 1.0 0 0 .20 0 .20 0 0
T .60 .80 0 1.0 0 .20 0 .20 .20 0
Position-specific scoring matrices
Table 2: Scoring a sequence with a PSSM
urt-71 T A G T A T C A A A
Scorea .60 .20 1.0 1.0 1.0 .20 .80 .60 .80 .60
w/ps’countsb .51 .24 .75 .79 .79 .24 .61 .51 .65 .51
Normal’db 1.6 .75 4.2 2.5 2.5 .75 3.4 1.6 2.0 1.6
Score = .60 * .20 * 1.0 * …
Position-specific scoring matricesIntroduction of pseudocounts
A. Sequence alignmenta
A T T T A G T A T C A A A A A T A A C A A T T C
G T T C T G T A A C A A A G A C T A C A A A A C
A T T T T G T A G C T A C T T A T A C T A T T T
A A G C T G T A A C A A A A T C T A C C A A A T
C A T T T G T A C A G T C T G T T A C C T T T A
A?qG,6 = 5 real counts
pG = ? pseudocounts
Position-specific scoring matricesIntroduction of pseudocounts
Score(position,nucleotide) = (q + p) / (N + B)
p = pseudocounts = B * (overall frequency of nucleotide)
[A] = 0.36[T] = 0.36[C] = 0.18[G] = 0.18
B = Total number of pseudocounts
= Square root (N) ?
or = 0.1 ?
Position-specific scoring matricesIntroduction of pseudocounts
C. Position-specific scoring matrix (B = 0)b
A 0 .20 0 0 1.0 .40 .20
C .40 0 0 0 0 .20 .80
G 0 0 1.0 0 0 .20 0
T .60 .80 0 1.0 0 .20 0
D. Position-specific scoring matrix (B = N = 2.2)c
A .099 .24 .099 .099 .79 .38 .24
C .33 .056 .056 .056 .056 .19 .61
G .056 .056 .75 .056 .056 .19 .056
T .51 .65 .099 .79 .099 .24 .099
Position-specific scoring matricesNormalization
A. Sequence alignmenta
A T T T A G T A T C A A A A A T A A C A A T T C
G T T C T G T A A C A A A G A C T A C A A A A C
A T T T T G T A G C T A C T T A T A C T A T T T
A A G C T G T A A C A A A A T C T A C C A A A T
C A T T T G T A C A G T C T G T T A C C T T T A
B. Table of occurrencesa
A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1
C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2
G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2
How to account for similarity due to similar base composition?
Compare ScorePSSM / Scorebackground frequency
0.79 / 0.32 = 2.2
Position-specific scoring matricesLog odds form
E. Position-specific scoring matrix (B = 0.1)c
A .006 .20 .006 .006 .99 .40 .20 .59
C .40 .004 .004 .004 .004 .20 .79 .004
G .004 .004 .98 .004 .004 .20 .004 .20
T .59 .79 .006 .99 .006 .20 .006 .20
F. Position-specific scoring matrix: Log-odds form (B = 0.1)c,d
A 2.2 0.7 2.2 2.2 0.0 0.4 0.7 0.2
C 0.4 2.5 2.5 2.5 2.5 0.7 0.1 2.5
G 2.5 2.5 0.0 2.5 2.5 0.7 2.5 0.7
T 0.2 0.1 2.2 0.0 2.2 0.7 2.2 0.7
Log odds = -log(score)
Score * score * score … log + log + log …
Position-specific scoring matricesDecrease complexity through info analysis
Uncertainty (Hc) = - Sum [pic log2(pic)]
H1 = -{[4/11 log2(4/11)] + [3/11 log2(3/11)] + [1/11 log2(1/11)] + [3/11 log2(3/11)]}
= 1.87
H31 = -{[1/11 log2(1/11)] + [1/11 log2(1/11)] + [1/11 log2(1/11)] + [8/11 log2(8/11)]}
= 1.28
Information content = Sum (Hmax – Hc) (summed over all columns)