A Statistical Method for Finding Transcriptional Factor Binding Sites Authors: Saurabh Sinha and Martin Tompa Presenter: Christopher Schlosberg CS598ss


Jan 14, 2016

Transcript
Page 1: A Statistical Method for Finding Transcriptional Factor Binding Sites

A Statistical Method for Finding Transcriptional Factor Binding Sites

Authors: Saurabh Sinha and Martin Tompa

Presenter: Christopher Schlosberg

CS598ss

Page 2: A Statistical Method for Finding Transcriptional Factor Binding Sites

Regulation of Gene Expression

Page 3: A Statistical Method for Finding Transcriptional Factor Binding Sites

Difficulties of Motif Finding

Regulatory sequences do not follow the same orientation as the coding sequence, or as each other

Multiple binding sites might exist for each regulated gene

Large variation in the binding sites of a single factor. Variations are not well understood.

Page 4: A Statistical Method for Finding Transcriptional Factor Binding Sites

Previous & Proposed Methods for Finding Motifs

Previous Methods:

Find longer, general motifs

Use local search algorithms (Gibbs sampling, Expectation Maximization, greedy algorithms)

Proposed Method:

TFBSs are short enough for enumerative methods

Enumerative statistical methods guarantee global optimality and remain affordable

Page 5: A Statistical Method for Finding Transcriptional Factor Binding Sites

Proposed Method Highlights

Allows variations in the binding site instances of a given transcription factor

Allows for motifs to include “spacers”

Allows for overlapping occurrences (in both orientations), which leads to complex dependencies

Statistical significance of a motif (s) is based on the frequencies of shorter (more frequent) oligonucleotides

Use of Markov chain to model background genomic distribution

Use of z-score to measure statistical significance

Allows for multiple binding sites

Page 6: A Statistical Method for Finding Transcriptional Factor Binding Sites

Characteristics of a Motif

Any single TFBS has significant variation

Many motifs have spacers from 1-11bp

Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)

Variation occurs less between a pair of complementary bases.

Indels are uncommon

Page 7: A Statistical Method for Finding Transcriptional Factor Binding Sites

Proposed Motif Definition

A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}

A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base)

TF database (SCPD) confirms this model of variation

Of 50 binding site consensus sequences, 31 fit exactly (62%)

Another 10 fit if slight variations are allowed
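To make the motif alphabet concrete, here is a minimal Python sketch of testing whether a binding site is an instance of a motif. The symbol meanings are the standard ones listed on the slide. The names `CODES` and `matches` and the example motif are illustrative assumptions, not taken from the paper.

```python
# Symbols of the motif alphabet mapped to the bases they permit.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def matches(motif, site):
    """Return True if the DNA string `site` is an instance of `motif`."""
    if len(motif) != len(site):
        return False
    return all(base in CODES[sym] for sym, base in zip(motif, site))

# Hypothetical motif with a 3 bp spacer, for illustration only.
print(matches("ACGNNNRT", "ACGTTTAT"))  # True: A is a purine
print(matches("ACGNNNRT", "ACGTTTCT"))  # False: C is not a purine
```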

Page 8: A Statistical Method for Finding Transcriptional Factor Binding Sites

Measure of Statistical Significance

Given a set of coregulated S. cerevisiae genes, the input to the problem is the corresponding set of 800 bp upstream sequences, each with its 3' end at the gene's translation start site.

The model must measure, from the input sequences:

Absolute number of occurrences (N_s) of motif s

Background genomic distribution

X is a set of random DNA sequences, equal in number and lengths to the input sequences

Generated by a Markov chain of order m

Transition probabilities determined by (m+1)-mer frequencies in the full complement of 6000+ upstream sequences (each 800 bp in length)

Background model chooses m = 3
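The background model can be sketched in a few lines of Python: estimate order-m transition probabilities from (m+1)-mer counts, then draw random sequences from the chain. This is a sketch under stated assumptions, not the authors' code. The function names, the choice of a random observed context as the starting state, and the fallback for unseen contexts are all assumptions; m is taken to be at least 1.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, m=3):
    """Estimate order-m transition probabilities from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq) - m):
            counts[seq[i:i + m]][seq[i + m]] += 1
    return {
        ctx: {base: n / sum(nxt.values()) for base, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

def sample_sequence(model, length, m=3, rng=random):
    """Draw one random sequence of the given length from the chain."""
    contexts = sorted(model)
    seq = rng.choice(contexts)      # assumption: start from a random observed m-mer
    while len(seq) < length:
        probs = model.get(seq[-m:])
        if probs is None:           # assumption: unseen context borrows another context's row
            probs = model[rng.choice(contexts)]
        bases, weights = zip(*probs.items())
        seq += rng.choices(bases, weights=weights)[0]
    return seq
```

In the setting described on the slide, `seqs` would be the genome-wide upstream sequences, and one random sequence would be drawn per input sequence, with matching length.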

Page 9: A Statistical Method for Finding Transcriptional Factor Binding Sites

z-score

X_s: random variable, the number of occurrences of motif s in X

E(X_s): expectation; σ(X_s): standard deviation

z_s: number of standard deviations by which the observed value N_s exceeds the expectation
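The formula itself is not in the transcript. Written out from the definitions on this slide, the z-score takes the standard form:

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```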

Page 10: A Statistical Method for Finding Transcriptional Factor Binding Sites

Implications

Possibility of overlap of a motif with itself (in either orientation)

Previous study of pattern autocorrelation

Generalized computation of SD, treating the motif as a finite set of strings

Higher order Markov chains

Spacers handled at no extra computational cost

Handles motif in either orientation

Page 11: A Statistical Method for Finding Transcriptional Factor Binding Sites

Algorithm

Enumerate over each input sequence

Tabulate the number N_s of occurrences of each motif, in either orientation

Compute the expectation and SD for each motif such that N_s > 0

Calculate z-score

Rank motifs by z-score
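The steps above can be sketched in Python for the simplified case of plain k-mers (no R, Y, S, W or spacer symbols). This is an illustrative assumption-laden sketch, not the authors' implementation: `expectation` and `std_dev` are hypothetical caller-supplied functions standing in for the analytic E(X_s) and σ(X_s) derived later in the talk.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def count_kmers(seqs, k):
    """Tabulate N_s for every k-mer, counting both orientations.

    Each window is credited to the k-mer itself and to its reverse
    complement; a k-mer equal to its own reverse complement is
    credited once.
    """
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            w = seq[i:i + k]
            counts[w] += 1
            rc = revcomp(w)
            if rc != w:
                counts[rc] += 1
    return counts

def rank_by_z(counts, expectation, std_dev):
    """Score every motif with N_s > 0 and sort by decreasing z-score."""
    scored = [
        (motif, (n - expectation(motif)) / std_dev(motif))
        for motif, n in counts.items()
        if n > 0
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Only motifs that actually occur in the input are scored, which keeps the enumeration proportional to the input size rather than to the number of possible motifs.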

Page 12: A Statistical Method for Finding Transcriptional Factor Binding Sites

Algorithm Analysis

For a single motif, complexity is O(c²k²)

k: number of nonspacer characters in the motif

c: number of instantiations of R, Y, S, W in the motif

Only modest values of k

Linear dependence on genome size

Can trim variance calculation to optimize

Page 13: A Statistical Method for Finding Transcriptional Factor Binding Sites

Number of Occurrences

Convert motif s into a multiset W

Add reverse complements for each string in W

Motif s occurs at a position in X iff some string in W occurs at the same position

X_s: total number of occurrences (in X) of the members of W

Handling Palindromes

W_i: a member of W

|W| = T
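A minimal Python sketch of this conversion follows. It expands the motif into concrete strings, adds reverse complements, and counts positions where some member occurs. Two choices here are assumptions rather than the paper's treatment: W is stored as a set, so a string equal to its own reverse complement appears once, and spacers are expanded naively (4 strings per N), whereas the slides state the method handles spacers at no extra cost.

```python
from itertools import product

CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def motif_strings(motif):
    """Expand a motif into concrete strings plus their reverse complements."""
    forward = {"".join(p) for p in product(*(CODES[sym] for sym in motif))}
    return forward | {revcomp(w) for w in forward}

def count_occurrences(motif, seq):
    """Number of positions in seq where some member of W occurs."""
    members = motif_strings(motif)
    k = len(motif)
    return sum(seq[i:i + k] in members for i in range(len(seq) - k + 1))

print(sorted(motif_strings("ART")))  # ['AAT', 'ACT', 'AGT', 'ATT']
print(motif_strings("ACGT"))         # {'ACGT'}: its own reverse complement
```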

Page 14: A Statistical Method for Finding Transcriptional Factor Binding Sites

Number of Occurrences, Cont'd

Page 15: A Statistical Method for Finding Transcriptional Factor Binding Sites

Expectation

Linearity of Expectation
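The equation on this slide is not in the transcript. As a sketch of what linearity of expectation gives here, let X_{ij} be an indicator (notation assumed for illustration) that is 1 when W_i occurs at position j of X. Then:

```latex
X_s = \sum_{i=1}^{T} \sum_{j} X_{ij},
\qquad
E(X_s) = \sum_{i=1}^{T} \sum_{j} E(X_{ij})
       = \sum_{i=1}^{T} \sum_{j} \Pr[\, W_i \text{ occurs at position } j \,]
```

Linearity holds even when the indicators are dependent, so overlapping occurrences need no special handling in the expectation.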

Page 16: A Statistical Method for Finding Transcriptional Factor Binding Sites

Variance

B term

C term

Page 17: A Statistical Method for Finding Transcriptional Factor Binding Sites

C Term

A term

Page 18: A Statistical Method for Finding Transcriptional Factor Binding Sites

A Term

Page 19: A Statistical Method for Finding Transcriptional Factor Binding Sites

Overlapping Concatenation

CW (like W) is potentially a multiset

One-to-one correspondence
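The definition on this slide did not survive in the transcript. The sketch below assumes that an overlapping concatenation of two strings is the string formed when a proper, non-empty suffix of the first coincides with a prefix of the second, and that CW collects these over all ordered pairs from W. The function names are hypothetical.

```python
def overlapping_concatenations(u, v):
    """Strings formed by overlaying a suffix of u on an equal prefix of v.

    Only proper, non-empty overlaps are considered (assumption).
    """
    merged = []
    for k in range(1, min(len(u), len(v))):
        if u[-k:] == v[:k]:
            merged.append(u + v[k:])
    return merged

def build_cw(W):
    """Collect overlapping concatenations over all ordered pairs of W.

    A list is returned rather than a set because the same string can
    arise from different pairs or overlap lengths, which is why CW is
    potentially a multiset.
    """
    return [s for u in W for v in W for s in overlapping_concatenations(u, v)]

print(overlapping_concatenations("ACAC", "ACAC"))  # ['ACACAC']
print(overlapping_concatenations("AAT", "ATT"))    # ['AATT']
```

Under this reading, each overlapping pair of occurrences of members of W corresponds to one occurrence of a member of CW, which is the one-to-one correspondence the slide refers to.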

Page 20: A Statistical Method for Finding Transcriptional Factor Binding Sites

C Term Simplification

Page 21: A Statistical Method for Finding Transcriptional Factor Binding Sites

A Term Revisited

Page 22: A Statistical Method for Finding Transcriptional Factor Binding Sites

Si1Si2 Term & Approximation

Kleffe and Borodovsky (1992) Approximation

Page 23: A Statistical Method for Finding Transcriptional Factor Binding Sites

B Term

Page 24: A Statistical Method for Finding Transcriptional Factor Binding Sites

B Term, Cont'd

Page 25: A Statistical Method for Finding Transcriptional Factor Binding Sites

Summary

Page 26: A Statistical Method for Finding Transcriptional Factor Binding Sites

Higher Order Markov Models

Variance calculations remain the same except for Si1Si2 term

Experiments use m = 3

Page 27: A Statistical Method for Finding Transcriptional Factor Binding Sites

Experimental Results & Future Considerations

17 coregulated sets of genes

Known TF with known binding site consensus

In 9 experiments, the known consensus was one of the 3 highest-scoring motifs

Future Topics:

Non-centered spacers

Enumeration loop optimization

Filtering repeats

Page 28: A Statistical Method for Finding Transcriptional Factor Binding Sites

Question

E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason for this complication.