Finding Transcription Factor Binding Sites BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG

Finding Transcription Factor Binding Sites

Feb 24, 2016

Page 1: Finding Transcription Factor Binding Sites

Finding Transcription Factor Binding Sites

BNFO 602/691 Biological Sequence Analysis

Mark Reimers, VIPBG

Page 2: Finding Transcription Factor Binding Sites

Why TFBS?

• Protein transcription factors were among the first gene regulators identified

• Many TFBS have distinct sequences, which were suitable for bioinformatics analysis

• Now seen as one layer of mammalian gene regulation, along with 3D structure of chromatin, histones, anti-sense lncRNAs, miRNAs, sequestration and transport

Page 3: Finding Transcription Factor Binding Sites

Outline

• Transcription factors and DNA-binding proteins

• Factors affecting TF binding

• DNA motifs & PSWM

Page 4: Finding Transcription Factor Binding Sites

Transcription Factors and DNA-Binding Proteins

• Several dozen distinct families of proteins have independently evolved binding to specific DNA configurations (sequences)

• They perform a variety of functions in organizing DNA or regulating transcription

• Usually several are involved in initiating gene transcription

Page 5: Finding Transcription Factor Binding Sites

Transcription Factor Biology

• Most bind in major groove of DNA

• Many bind as dimers

• Typically TFs expressed at low levels

– Small changes in expression have big effects

• Mostly phosphorylated or otherwise activated by other proteins

– Typically end-points of signalling cascades from receptors on cell surface or in cytoplasm

Page 6: Finding Transcription Factor Binding Sites

Transcription Factor Families

• Over 40 known families

• In mammals majority of TFs from three families:

– C2H2 zinc-finger (675 TFs in human)

– Homeodomain (257 TFs)

– Helix–loop–helix (87 TFs)

From Nature education

Page 7: Finding Transcription Factor Binding Sites

Transcription Factor Binding Motifs

• Binding of most factors is very specific, and only significant at a small fraction of sites

• For many factors much of that specificity is captured by the typical DNA sequence

• Sequence specificity often represented by motif

– Not all sites well represented by motif

CTCF binding not captured well by motif

NRF1 binding is well represented by motif

Page 8: Finding Transcription Factor Binding Sites

Transcription Factor Binding Sites

• ChIP-Seq experiments in the human genome typically find from several hundred to >20,000 locations where a particular TF is binding

• Binding sites may be stronger or weaker

A typical set of ChIP-Seq reads for HNF4a (from BayesPeak paper)

Page 9: Finding Transcription Factor Binding Sites

TFs Often Bind Cooperatively

• Most TFBS occur in clusters in promoters or enhancers/silencers with several others of different kinds

• Usually only a few of these are actually functional in any one cell type

• Different clusters operate in different cell types

Page 10: Finding Transcription Factor Binding Sites

Dynamics of TF Binding

• TF comes on and off the DNA site, often cycling in minutes or seconds

• Cooperative binding stabilizes TF

• Many TFs act in response to signals or stresses

– Not captured systematically in most samples

Page 11: Finding Transcription Factor Binding Sites

TFBS Locations Often Evolve Rapidly

• Most enhancer TFBS in human do not align to TFBS in mouse

From Odom et al, Nature Genetics 2007

From Schmidt et al Science 2010

Page 12: Finding Transcription Factor Binding Sites

Factors Affecting TF Binding - I

• Most TFs occupy less than a few percent of their consensus target sites in the genome

• Chromatin state

– Most TFs can only recognize their motif(s) if the DNA is relatively open

– Some ‘pioneer’ factors bind to their sites in 30 nm fiber and open up chromatin for others

Zaret & Carroll, Genes Dev, 2012

Page 13: Finding Transcription Factor Binding Sites

Factors Affecting TF Binding - II

• Allosteric hindrance

– Presence of another TF on opposite side hinders binding by spreading major groove

• Cooperative Binding

– Some TFs provide binding sites or enhance binding of specific others to DNA

– Promotes non-linear step-like expression response to stimuli

Spitz et al Nature Rev Genetics 2012

Kim et al Science 2013

Page 14: Finding Transcription Factor Binding Sites

Transcription Factor Binding Motifs

• Binding of most factors is very specific, and only significant at a small fraction of sites

• For many factors much of that specificity is captured by the typical DNA sequence

• Sequence specificity often represented by motif

– Not all sites well represented by motif

CTCF binding not captured well by motif

NRF1 binding is well represented by motif

Page 15: Finding Transcription Factor Binding Sites

TFBS Motifs Are Stable Over Evolution

• Most transcription factors favor almost the same motifs in humans and in mice (and in lizards … and often even in flies)

From Odom et al, Nature Genetics 2007

Page 16: Finding Transcription Factor Binding Sites

Position-Specific Weight Matrices Represent TFBS Better than Motifs

• Represent log of probability of each base occurring at each position in TFBS

• Often used to scan along genome calculating log-likelihood at each position

A composite PSWM scan for SP1 (from PEAKS webpage)
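The scanning idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method behind the PEAKS figure: the aligned sites are invented toy data, the uniform background and the pseudocount are assumptions, and the function names `build_pswm` and `scan` are hypothetical. Each matrix entry is the log of the base probability at that position divided by the background probability, so the window score is a log-likelihood ratio.

```python
import math

BASES = "ACGT"

def build_pswm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSWM from equal-length aligned binding sites."""
    if background is None:
        # Assumption: uniform base composition in the background.
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    pswm = []
    for pos in range(width):
        # Pseudocounts keep unseen bases from getting probability zero.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pswm.append({b: math.log(counts[b] / total / background[b])
                     for b in BASES})
    return pswm

def scan(sequence, pswm):
    """Return (start, log-likelihood ratio) for every window in the sequence."""
    width = len(pswm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pswm, window))
        hits.append((start, score))
    return hits

# Invented toy sites, loosely GC-rich; not real SP1 data.
toy_sites = ["GGGCGG", "GGGCGG", "GGGCGT", "GGGAGG", "TGGCGG"]
pswm = build_pswm(toy_sites)
hits = scan("ATATGGGCGGATAT", pswm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

A real genome scan would also score the reverse strand and use a background estimated from the genome rather than a uniform one.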

Page 17: Finding Transcription Factor Binding Sites

TFBS Motif Databases

• JASPAR - http://jaspar.genereg.net/

– High-quality curated public data

• TRANSFAC - http://www.biobase-international.com/product/transcription-factor-binding-sites

– Commercial product with dated public version

• Several research groups doing genome-wide characterizations by various means

Page 18: Finding Transcription Factor Binding Sites

Finding TFBS and Motifs in Animals

• Sequence-based methods

– Scanning known TFBS motif

– If have several co-regulated genes, use HMM or Gibbs sampler to identify common motif in them

• Data-based methods

– Use ChIP to identify locations of binding (needs good antibody; often picks up indirect binding)

– Compare promoters across genomes (need depth; miss enhancers and species-related changes)

– Look for DNase footprints

– Use SELEX or DS-DNA microarray to profile TF’s DBD

Page 19: Finding Transcription Factor Binding Sites

Other Approaches to Finding TFBS

• Systematic Evolution of Ligands by Exponential Enrichment (SELEX)

From Jolma et al, Cell, 2013

Generate random DNA sequence library of moderate length. The sequences in the library are exposed to the target ligand, and those that do not bind the target are removed by affinity chromatography. The bound sequences are eluted, and then amplified by PCR, and the process is run again under more stringent elution conditions to purify the tightest-binding sequences.

Page 20: Finding Transcription Factor Binding Sites

Other Approaches to Finding TFBS

• Identify recurrent motifs under DNaseI footprints

From Neph et al, Nature, 2012

Page 21: Finding Transcription Factor Binding Sites

Integrated Approaches to Identifying TFBS

• Combining Scores and TF-Specific ChIP-Seq

• Combining information from scanning and PhastCons or PhyloP conservation

• Combining information from DNase, conservation and histone marks

– Integrating DGF

Page 22: Finding Transcription Factor Binding Sites

Finding TFBS Motif via TF-Specific ChIP-Seq

• ChIP gives approximate (~200bp) TFBS locations

• Sequence can identify loci more specifically within ChIP peaks

• Use HMM or Gibbs

• Indirect binding won’t be found

• Weak binding can be accommodated

From Gelfond et al Biometrics 2009
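As a rough illustration of the "Gibbs" option, here is a minimal site sampler in Python that looks for one shared motif occurrence per peak sequence. It is a generic textbook-style sketch, not the model of Gelfond et al: it assumes exactly one site per sequence, a fixed motif width, a uniform background, and a fixed number of sweeps. The peak sequences are invented, and `gibbs_motif` and `column_probs` are hypothetical names.

```python
import random

BASES = "ACGT"

def column_probs(sites, width, pseudocount=0.5):
    """Per-position base probabilities estimated from the current sites."""
    probs = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        probs.append({b: counts[b] / total for b in BASES})
    return probs

def gibbs_motif(seqs, width, sweeps=200, seed=0):
    """Sample one motif start per sequence; returns the final start positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for i, seq in enumerate(seqs):
            # Build the profile from every sequence except the one being updated.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            probs = column_probs(others, width)
            # Weight each candidate start by its likelihood ratio against
            # an assumed uniform background (0.25 per base).
            weights = []
            for start in range(len(seq) - width + 1):
                weight = 1.0
                for pos in range(width):
                    weight *= probs[pos][seq[start + pos]] / 0.25
                weights.append(weight)
            # Draw the new start in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Invented peak sequences, each containing the planted word TGACTCA.
peaks = ["ACGTTGACTCAGGT", "TTGACTCACCGATA", "GGCATATGACTCAC", "CATGACTCATTGCG"]
starts = gibbs_motif(peaks, width=7)
sites = [seq[p:p + 7] for seq, p in zip(peaks, starts)]
```

Because the start is sampled rather than maximized, weak sites still get nonzero weight, which is one way to read the "weak binding can be accommodated" bullet. The sampler is stochastic and can settle on a shifted version of the motif, so in practice several seeds are run and compared.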

Page 23: Finding Transcription Factor Binding Sites

Finding Active TFBS in Tissues

• Need Bayes model to integrate information from various sources

• Easiest if have some PSWM for binding site

• We will focus on this situation

• Increasingly being done to discover novel motifs or PSWMs

Page 24: Finding Transcription Factor Binding Sites

Bayesian Hierarchical Models

• Prior probability of binding site set very low or estimated from TF-specific ChIP data

• In principle binding should be a continuous variable; we will treat as ‘yes-no’

• Need to estimate probability of various genomic features – conservation, DNase, histone marks – for TFBS and for background sequence

Page 25: Finding Transcription Factor Binding Sites

Bayes Model for Combining Scores and Conservation

• How to estimate P(conserved | TFBS)?

• Depends on depth of time for which conservation is used

– For mammals ~ 40%; primates ~ 80%

– Varies between promoter and enhancer

• Background state can be estimated from genome-wide conservation (typically 5 - 10%)

• Then combine by Bayes Formula

• C and S are conditionally independent given B, so P(C&S|B) = P(C|B)P(S|B) (likewise for ~B)
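With B meaning "site is bound", C the conservation call and S the PSWM score, the conditional independence above gives P(B | C, S) = P(C|B) P(S|B) P(B) / [P(C|B) P(S|B) P(B) + P(C|~B) P(S|~B) P(~B)]. The Python sketch below evaluates this under stated assumptions: the PSWM score is taken to be the log-likelihood ratio log P(S|B)/P(S|~B), the prior of 0.001 is an arbitrary "very low" placeholder, the background conservation of 0.07 is picked from the 5 - 10% range on the slide, and 0.4 is the mammalian figure quoted there. The function name is hypothetical.

```python
import math

def posterior_bound(pswm_log_ratio, conserved, prior=0.001,
                    p_cons_bound=0.4, p_cons_background=0.07):
    """P(B | C, S), assuming C and S are independent given B and given ~B."""
    # Likelihood of the observed conservation call under each state.
    like_c_bound = p_cons_bound if conserved else 1.0 - p_cons_bound
    like_c_background = p_cons_background if conserved else 1.0 - p_cons_background
    # The PSWM score enters only through the ratio P(S|B) / P(S|~B).
    score_ratio = math.exp(pswm_log_ratio)
    numerator = like_c_bound * score_ratio * prior
    denominator = numerator + like_c_background * (1.0 - prior)
    return numerator / denominator

# Same hypothetical PSWM score, with and without conservation.
p_conserved = posterior_bound(5.0, conserved=True)
p_not_conserved = posterior_bound(5.0, conserved=False)
p_neutral_score = posterior_bound(0.0, conserved=True)
```

Dividing numerator and denominator by P(S|~B) is what lets the score appear as a single ratio, so the absolute probabilities P(S|B) and P(S|~B) never need to be computed separately.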

Page 26: Finding Transcription Factor Binding Sites

Bayes Model for Combining Scores and DNase Sensitivity

• How to estimate P(DHS | TFBS)?

• Almost all (~98%) of known TFBS occur in DHS

• Background state can be estimated from genome-wide levels (typically 1 or 2%)

• Then combine by Bayes Formula

• D & S are conditionally independent given B, so P(D&S|B) = P(D|B)P(S|B) (likewise for ~B)
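The same Bayes update is often easier to follow in log-odds form, where each conditionally independent feature adds the log of its Bayes factor P(feature|B)/P(feature|~B) to the prior log-odds. The sketch below does this for DHS plus a PSWM score. The 0.98 figure is from the slide and 0.015 is chosen from its 1 to 2% range; the prior of 0.001 and the PSWM log-likelihood ratio of 6.0 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def dhs_bayes_factor(in_dhs, p_dhs_bound=0.98, p_dhs_background=0.015):
    """P(D | B) / P(D | ~B) for the observed DHS state."""
    if in_dhs:
        return p_dhs_bound / p_dhs_background
    return (1.0 - p_dhs_bound) / (1.0 - p_dhs_background)

def posterior_from_bayes_factors(prior, bayes_factors):
    """Add log Bayes factors to the prior log-odds, then convert back."""
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(bf) for bf in bayes_factors)
    return 1.0 / (1.0 + math.exp(-log_odds))

prior = 0.001                    # assumed "very low" prior
score_factor = math.exp(6.0)     # assumed PSWM log-likelihood ratio of 6.0

p_inside = posterior_from_bayes_factors(
    prior, [score_factor, dhs_bayes_factor(True)])
p_outside = posterior_from_bayes_factors(
    prior, [score_factor, dhs_bayes_factor(False)])
```

With these numbers a good motif match inside a DHS ends up very likely to be bound, while the same match outside a DHS stays unlikely, which illustrates why DNase data carry so much of the information. Further features such as conservation can be appended to the list of Bayes factors as long as the conditional independence assumption is acceptable.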

Page 27: Finding Transcription Factor Binding Sites

What Information from Histone Marks?

• By themselves histone marks, esp H3K4me3, H3K4me1, H3K27me3 can be very informative

• After introducing DNase data, these marks do not add much direct information

• Could be used to adjust probabilities for DHS and conservation (not yet done)

Page 28: Finding Transcription Factor Binding Sites

Chromia – A Method for Using Histone Marks and PSWM

• Uses an HMM approach to integrate PSWM and histone marks (NB P300 ~ H3K27me3)

Page 29: Finding Transcription Factor Binding Sites

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores

• Combines several kinds of genomic information with PSWM to identify putative TFBS

• Confirmation by ChIP-Seq is quite good

Pique-Regi R et al. Genome Res. 2011;21:447-455

Page 30: Finding Transcription Factor Binding Sites

CENTIPEDE – A Method for Combining DNase, Conservation and PSWM Scores

Pique-Regi R et al. Genome Res. 2011;21:447-455

Model learned by the CENTIPEDE approach for the transcription factor NRSF. (A) Empirical density plots for key aspects of the data for sites inferred by CENTIPEDE to be bound (green lines, CENTIPEDE posterior probabilities >0.95) and unbound (red lines, probabilities < 0.5).