Sequence Motif Analysis
Lecture in M.Sc. Biomedizin, Module: "Proteinbiochemie und Bioinformatik"
Jonas Ibn-Salem
Andrade group
Johannes Gutenberg University Mainz
Institute of Molecular Biology
March 7, 2016
Morgane Thomas-Chollier (ENS, Paris) kindly shared some of her slides
Jonas Ibn-Salem (JGU Mainz/IMB) Sequence Motif Analysis March 7, 2016 1 / 46
• Matrix based
  • Position frequency matrix (PFM)
  • Position probability matrix (PPM)
  • Position specific scoring matrix (PSSM)
• Sequence Logos
• Hidden Markov Models (HMM)
Consensus sequence
• A consensus sequence is a motif description as a string (= sequence).
• A strict consensus sequence is derived from the collection of binding sites by taking the predominant letter at each position (column) in the alignment.
• A degenerate consensus sequence is defined by selecting a degenerate nucleotide symbol (IUPAC code, e.g. R = A or G) for each position.
Source: Morgane Thomas-Chollier (ENS, Paris)
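The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the aligned `sites`, the function names, and the `min_fraction` cut-off that decides which bases enter the degenerate symbol are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical aligned binding sites (illustrative data, not from the lecture).
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA", "TGACTAA"]

# IUPAC degeneracy codes, keyed by the set of bases each symbol stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def strict_consensus(sites):
    """Predominant base in each alignment column (ties broken alphabetically)."""
    out = []
    for column in zip(*sites):
        counts = Counter(column)
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)

def degenerate_consensus(sites, min_fraction=0.25):
    """IUPAC symbol covering every base seen in at least min_fraction of the sites.

    The cut-off rule is an assumption of this sketch; other rules are possible.
    """
    out = []
    for column in zip(*sites):
        counts = Counter(column)
        kept = {b for b, n in counts.items() if n / len(column) >= min_fraction}
        if not kept:  # fall back to the predominant base
            kept = {counts.most_common(1)[0][0]}
        out.append(IUPAC[frozenset(kept)])
    return "".join(out)

print(strict_consensus(sites))
print(degenerate_consensus(sites, min_fraction=0.2))
```

Lowering `min_fraction` lets rarer bases into the symbol, so the degenerate consensus becomes more permissive.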
Matrix model of sequence motifs
Problem
TF binding sites (TFBSs) are degenerate: a given TF is able to bind DNA at TFBSs with different sequences.
• A Position Frequency Matrix (PFM) is created by counting the occurrences of each base (rows) at each position (columns) of the alignment.
• This is a more quantitative description of the known binding sites.
Source: Morgane Thomas-Chollier (ENS, Paris)
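Counting bases per alignment column might look like the following sketch. The `sites` are made-up example data and the dictionary-of-lists layout (one row per base) is just one convenient representation, chosen here as an assumption.

```python
BASES = "ACGT"

def position_frequency_matrix(sites):
    """Count each base (rows) at each alignment position (columns)."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for j, base in enumerate(site):
            pfm[base][j] += 1
    return pfm

# Hypothetical aligned binding sites (illustrative data, not from the lecture).
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA", "TGACTAA"]
pfm = position_frequency_matrix(sites)
for base in BASES:
    print(base, pfm[base])
```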
Conversion to Position Probability Matrix (PPM)
Position Frequency Matrix (PFM): [matrix figure]
Position Probability Matrix (PPM): [matrix figure]
Reference: [Hertz and Stormo, 1999]
Source: Morgane Thomas-Chollier (ENS, Paris)
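As a rough sketch of the conversion, each PFM column can be divided by its column total so that the entries of every column sum to 1. The example counts and the function name are assumptions for illustration only.

```python
def position_probability_matrix(pfm):
    """Divide every count by its column total to obtain per-position probabilities."""
    width = len(next(iter(pfm.values())))
    totals = [sum(pfm[base][j] for base in pfm) for j in range(width)]
    return {base: [pfm[base][j] / totals[j] for j in range(width)] for base in pfm}

# Hypothetical PFM built from five aligned sites (illustrative counts).
pfm = {
    "A": [0, 0, 5, 0, 0, 1, 5],
    "C": [0, 0, 0, 4, 0, 4, 0],
    "G": [0, 4, 0, 1, 0, 0, 0],
    "T": [5, 1, 0, 0, 5, 0, 0],
}
ppm = position_probability_matrix(pfm)
print(ppm["G"])
```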
Position specific scoring matrix (PSSM) or weight matrix
• To model the enrichment of an observed frequency over some background, the PFM can be converted to a position specific scoring matrix (PSSM), or weight matrix, $W$:

$$W_{k,j} = \log\left(\frac{M_{k,j}}{b_k}\right)$$

where $M_{k,j}$ is the frequency of base $k$ at position $j$, expressed as a probability, and $b_k$ is the background probability of base $k$ (under a Bernoulli model).
• Usually a pseudo-count of +1 is added to the PFM before the transformation to avoid taking the logarithm of zero.
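The transformation could be sketched as follows. Two choices here are assumptions of the example rather than statements from the lecture: the logarithm is taken to base 2 (the slide does not fix a base), and the default background is uniform (0.25 per base).

```python
import math

def pssm_from_pfm(pfm, background=None, pseudocount=1.0):
    """W[k][j] = log2(M[k][j] / b[k]), with M from pseudo-count corrected counts."""
    bases = sorted(pfm)
    width = len(pfm[bases[0]])
    if background is None:
        # Assumption: uniform Bernoulli background when none is given.
        background = {b: 1.0 / len(bases) for b in bases}
    pssm = {b: [] for b in bases}
    for j in range(width):
        # Add the pseudo-count to every cell of the column before normalising.
        total = sum(pfm[b][j] + pseudocount for b in bases)
        for b in bases:
            prob = (pfm[b][j] + pseudocount) / total
            pssm[b].append(math.log2(prob / background[b]))
    return pssm

# Hypothetical PFM (illustrative counts).
pfm = {
    "A": [0, 0, 5, 0, 0, 1, 5],
    "C": [0, 0, 0, 4, 0, 4, 0],
    "G": [0, 4, 0, 1, 0, 0, 0],
    "T": [5, 1, 0, 0, 5, 0, 0],
}
pssm = pssm_from_pfm(pfm)
print([round(w, 2) for w in pssm["T"]])
```

Positive weights mark bases that are enriched over the background at that position; negative weights mark depleted bases. Without the pseudo-count, every zero count would produce log(0).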
Sequence Logo
• A sequence logo is a graphical representation of a motif.
• Within a column, the height of each letter is proportional to the frequency of that residue at that position.
• The total height of each column is proportional to the sequence conservation and indicates the amount of information contained in that position (information content, measured in bits).
• Allows easy identification of the most important positions in the motif.
Source: Morgane Thomas-Chollier (ENS, Paris)
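Column and letter heights can be computed from a PPM column as in this sketch. It assumes DNA with a uniform background, so the maximum is 2 bits, and it ignores the small-sample correction that some logo tools apply; the function names are made up for the example.

```python
import math

def column_information(ppm_column):
    """Information content in bits: log2(alphabet size) minus the column entropy."""
    entropy = -sum(p * math.log2(p) for p in ppm_column.values() if p > 0)
    return math.log2(len(ppm_column)) - entropy

def letter_heights(ppm_column):
    """Each letter gets its share of the column height, proportional to its frequency."""
    ic = column_information(ppm_column)
    return {base: p * ic for base, p in ppm_column.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_bases = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, col in [("conserved", conserved), ("two_bases", two_bases), ("uniform", uniform)]:
    print(name, column_information(col), letter_heights(col))
```

A perfectly conserved position reaches the full 2 bits, while a position where all four bases are equally frequent carries no information and vanishes from the logo.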
• Multiple testing correction: multiply the p-value by the number of tested words.
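A small worked example of this correction, assuming (for illustration only) that all 4^6 hexamers were tested. Capping the result at 1 is an addition of this sketch so that the corrected value can still be read as a probability.

```python
def corrected_p_value(p_value, n_tested_words):
    """Multiply the nominal p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tested_words)

n_hexamers = 4 ** 6  # assumption: every possible word of length 6 was tested
print(corrected_p_value(1e-6, n_hexamers))
print(corrected_p_value(0.01, n_hexamers))
```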
Assembling overlapping words
Source: Morgane Thomas-Chollier (ENS, Paris)
Motif discovery using matrices
Idea:
Find a matrix model that optimally describes a motif that is enriched in the sequences.
• Comprehensive approach
  • Test all possible motif matrices that can be built from the sequences.
  • Compute a score (e.g. information content, P-value) for each motif matrix.
  • Report the highest scoring matrix.
Problem: The number of possible matrices is too large to be evaluated in a reasonable time.
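To get a feeling for the size of the search space, one can count the candidate matrices under a simplifying assumption made only for this sketch: every matrix is built from exactly one site of width w per sequence, on one strand, in sequences of equal length.

```python
def n_candidate_matrices(n_sequences, seq_length, width):
    """Number of ways to pick one site of the given width from each sequence."""
    positions_per_sequence = seq_length - width + 1
    return positions_per_sequence ** n_sequences

# Hypothetical input: 10 sequences of 100 bp, motif width 8.
print(f"{n_candidate_matrices(10, 100, 8):.2e}")
```

Even this small hypothetical input yields far too many combinations to enumerate one by one.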
Given a known TF recognition motif and a sequence, we want to know all positions where the motif matches the sequence.
Source: Morgane Thomas-Chollier (ENS, Paris)
Motif discovery vs. Pattern Matching
Motif Discovery Problem:
Given a set of related sequences, identify common sequence motifs in these sequences.
Input: Set of sequences
Output: Motif model
Examples
• What is the sequence motif of the TF in a ChIP-seq experiment?
• Are there other co-factors involved?
• Which TFs bind the promoters of a set of co-expressed genes?
Pattern Matching Problem:
Given a known TF recognition motif and a sequence, find all positions where the motif matches the sequence.
Input: Motif, Sequence
Output: Positions of motif hits
Examples
• Where exactly does a TF bind in a ChIP-seq peak region?
• Does a TF bind directly or indirectly at a ChIP-seq peak region?
Searching with Known Motif: String based
Problem:
Given a motif as a consensus sequence, find all matches in an input sequence.
• Compare each position along the sequence with the motif (scanning) and report matches.
Source: Morgane Thomas-Chollier (ENS, Paris)
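One possible way to scan with a (degenerate) consensus is to translate the IUPAC symbols into a regular expression, as sketched below. The function name and the example sequence are assumptions; the sketch searches the given strand only and does not look at the reverse complement.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_consensus_matches(consensus, sequence):
    """Return (0-based start, matched segment) for every match, including overlaps."""
    pattern = "".join(IUPAC_REGEX[symbol] for symbol in consensus.upper())
    # A lookahead lets the search restart at every position, so overlapping hits are kept.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]

# Hypothetical consensus and sequence (illustrative data).
hits = find_consensus_matches("TGASTCA", "AATGACTCATTGAGTCAG")
print(hits)
```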
Searching with Known Motif: Matrix based
Problem:
Given a matrix motif model, find all significant hits (motif instances) in a sequence.
• The similarity of a given sequence segment $S$ to the motif can be computed as the weight score $W_S$:

$$W_S = \log \frac{P(S|M)}{P(S|B)}$$

• $P(S|M)$ is the probability that the sequence segment $S$ occurs according to the motif model $M$. This can be computed directly from the matrix.
• $P(S|B)$ is the probability of $S$ under a background model $B$. This can be, for example, a first-order Markov model.
• The higher the score $W_S$, the more similar the sequence $S$ is to the matrix model $M$.
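As a short worked derivation: if the motif model treats positions independently, so that $P(S|M)$ is a product of matrix entries $M_{r_j,j}$ for the residues $r_j$ of $S$, and the background is a Bernoulli model with residue probabilities $p_{r_j}$ (both assumptions of this sketch), the log-ratio turns into a sum of per-position weights. These are exactly the PSSM entries $W_{k,j} = \log(M_{k,j}/b_k)$ with $b_k = p_k$.

```latex
W_S = \log \frac{P(S|M)}{P(S|B)}
    = \log \frac{\prod_{j=1}^{w} M_{r_j,j}}{\prod_{j=1}^{w} p_{r_j}}
    = \sum_{j=1}^{w} \log \frac{M_{r_j,j}}{p_{r_j}}
    = \sum_{j=1}^{w} W_{r_j,j}
```

With a higher-order Markov background this simple decomposition no longer holds, because $P(S|B)$ then depends on neighbouring residues.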
Probability of a sequence under the matrix model P(S|M)
• Given a matrix and a sequence, we can calculate

$$P(S|M) = \prod_{j=1}^{w} M_{r_j,j}$$

where $M$ is the frequency matrix of width $w$, $S = \{r_1, r_2, ..., r_w\}$ is the sequence segment, and $r_j$ is the residue found at position $j$ of the sequence $S$.
• Note that usually a corrected matrix is used, obtained by adding pseudo-counts (e.g. +1), to avoid zeros in the calculation.
Source: Morgane Thomas-Chollier (ENS, Paris)
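The product over positions can be written directly in Python. The three-column matrix below is a hypothetical, already pseudo-count corrected PPM used only to make the sketch runnable.

```python
import math

def prob_under_motif(segment, ppm):
    """P(S|M): product over positions j of the matrix entry for residue r_j."""
    width = len(next(iter(ppm.values())))
    if len(segment) != width:
        raise ValueError("segment length must equal the matrix width")
    return math.prod(ppm[residue][j] for j, residue in enumerate(segment))

# Hypothetical corrected probability matrix of width 3 (no zero entries).
ppm = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.1, 0.7],
    "G": [0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1],
}
print(prob_under_motif("AGC", ppm))
print(prob_under_motif("TTT", ppm))
```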
Probability of a sequence under the background model P(S|B)
• The probability under the background model can be calculated as

$$P(S|B) = \prod_{j=1}^{w} p_{r_j}$$

where $p_{r_j}$ is the probability of observing the residue $r_j$ according to the background model.
• A background model ($B$) should estimate the probability of a sequence motif in non-binding sites.
• Various background models are possible, e.g.
  • Bernoulli model with residue-specific probabilities ($p_r$)
  • Markov models to take dinucleotide frequencies into account (e.g
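For the Bernoulli case, the residue probabilities can be estimated from base counts in a set of background sequences and then multiplied along the segment. The sequences and function names below are illustrative assumptions; a Markov background would additionally condition each residue on its predecessors and is not shown here.

```python
import math
from collections import Counter

def estimate_background(sequences):
    """Residue-specific probabilities p_r from base counts in background sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}

def prob_under_background(segment, background):
    """P(S|B) under a Bernoulli model: product of the residue probabilities."""
    return math.prod(background[residue] for residue in segment)

# Hypothetical background sequences (illustrative data).
background = estimate_background(["AACCGGTT", "AAAA"])
print(background)
print(prob_under_background("AC", background))
```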
Assigning a score to each segment and threshold usage
• The sequence is scanned with the matrix and a score is assigned to each position.
• The resulting hits depend strongly on the threshold on the weight score:
  • Stringent threshold ⇒ high confidence in predicted sites, but many sites missed.
  • Loose threshold ⇒ real sites hidden among many false positives.
• Some programs compute theoretical p-values from the weight score.
  • A threshold on a p-value is easier to interpret: for $p = 10^{-4}$ we expect 1 false prediction every 10,000 base pairs (on one strand).
Source: Morgane Thomas-Chollier (ENS, Paris)
Summary
• Transcription factors (TFs) recognize specific DNA sequence motifs.
• Sequence motifs can be represented as a consensus sequence (string) or, more quantitatively, as a matrix model to capture the specificity of each residue at each position.
• A sequence logo plot is a graphical representation of a motif matrix.
• An enriched sequence motif can be detected in a set of functionally related sequences using de novo motif discovery approaches.
• Known TF recognition motifs can be retrieved as matrix models from online databases.
• Pattern matching aims at finding putative TF binding sites (for which the binding motif is known) in DNA sequences.
References
Hertz, G. Z. and Stormo, G. D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (Oxford, England), 15(7-8):563–77.
Mathelier, A., Fornes, O., Arenillas, D. J., Chen, C.-y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R., Zhang, A. W., Parcy, F., Lenhard, B., Sandelin, A., and Wasserman, W. W. (2015). JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic acids research, 44(November 2015):gkv1176.
Stormo, G. D. (2013). Modeling the specificity of protein-DNA interactions. Quantitative biology, 1(2):115–130.
Thomas-Chollier, M., Darbo, E., Herrmann, C., Defrance, M., Thieffry, D., and van Helden, J. (2012). A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature protocols, 7(8):1551–68.
Wasserman, W. W. and Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nature reviews. Genetics, 5(4):276–87.