Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts. Chris Stoeckert, Ph.D. Center for Bioinformatics & Dept. of Genetics University of Pennsylvania School of Medicine Nov. 15, 2005 University of Nebraska Medical Center


Page 1: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Chris Stoeckert, Ph.D.

Center for Bioinformatics & Dept. of Genetics

University of Pennsylvania School of Medicine

Nov. 15, 2005

University of Nebraska Medical Center

Page 2: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

What is the code for determining where (and when) a gene is expressed?

http://molbio.info.nih.gov/molbio/gcode.html

[Figure: schematic relating combinations of binding sites TFBS1, TFBS2, TFBS3, and TFBS4 to Expression]

TFBS = transcription factor binding site

Page 3: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Goal is to Identify Combinations of TFBS (cis-Regulatory Modules or CRMs) that Specify Tissue Expression

From Wasserman & Sandelin, NRG 2004

Page 4: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

A Genomics Unified Schema approach to understanding gene expression

Jennifer Dommer, Steve Fischer, Thomas Gan, Greg Grant, John Iodice, Junmin Liu, Elisabetta Manduchi, Joan Mazzarelli, Debbie Pinney, Angel Pizarro, Mike Saffitz, Jonathan Schug, Chris Stoeckert, Trish Whetzel

Computational Biology and Informatics Laboratory (CBIL), Penn Center for Bioinformatics

Page 5: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Cryptosporidium Database

Beta Cell Biology Consortium

Plasmodium Genome Resource

Phytophthora Soybean EST Database

GUS

Page 6: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GUS is an open source project

• Sanger Institute
• U. Georgia
• Flora Centromere Database
• U. Chicago
• Kansas U.
• U. Penn
• U. Toronto
• Virginia Bioinformatics Institute

Page 7: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GUS Project Goals

• Provide:
– A platform for broad genomics data integration
– An infrastructure system for functional genomics

• Support:
– Websites with advanced query capabilities
– Research driven queries and mining

Page 8: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GUS Project Resources

• Website -- http://www.gusdb.org
– News, Documentation, Distributable, GUS-based Projects

Page 9: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GUS Components

• Schema

• Application Framework

– Object/Relational Layer

– Plugin API

– Pipeline API

• Plug-ins

• Web Development Kit (WDK)

Page 10: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GUS 3.5 Schemas

Schemas   Domain                    Features
DoTS      Sequence and annotation   EST clusters, Gene models
RAD       Gene expression           MIAME
Prot      Protein expression        Mass spec, Mzdata/pepXML
Study     Experiments               FuGE
TESS      Gene Regulation           TFBS organization
SRes      Shared resources          Ontologies
Core      Administration            Documentation, Data Provenance

Page 11: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

[Figure: workflow diagram. Labels: RAD; EST clustering and assembly; DoTS; Genomic alignment and comparative sequence analysis; Identify shared TF binding sites; TESS; BioMaterial annotation; SRes, Study]

Page 12: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

DoTS integrates sequence annotation including where expressed

Page 13: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

RAD Contains Detailed Expression Experiments Including Tissue Surveys

Page 14: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

TESS Allows You to Find Potential TFBS

But there are too many potential sites!

Page 15: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Promoter Features Related to Tissue-Specificity as Measured by Shannon Entropy

Jonathan Schug1, Winfried-Paul Schuller2, Claudia Kappen2, J. Michael Salbaum2, Maja Bucan1, Christian J. Stoeckert Jr.1

1. Center for Bioinformatics, Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania

2. Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska

Genome Biology 2005 6:R33

Page 16: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

What is a Liver-Specific Gene?

*http://expression.gnf.org/

Page 17: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Assessing Tissue Specificity of Genes Using Shannon Entropy

Shannon entropy is a measure of the uniformity of a discrete probability distribution. Given a set of T tissues, H ranges from 0 for a gene expressed in a single tissue to log2 T for a gene expressed uniformly in all T tissues. It works well as a measure of overall tissue-specificity.

To measure specificity to a particular tissue, we combine entropy H and the relative expression level in that tissue to get Q. Q = 0 for a tissue when the gene is expressed only in that tissue, and Q = 2 log2 T for a typical tissue under uniform expression.

(a) Very specific liver expression: H = 1.6 and Q_liver = 2.2, 98612_at cytochrome P450

(b) Near uniform expression: H = 4.3 and Q_liver = 10.2, 104391_s_at Clcn7 chloride channel 7
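
A minimal sketch of how H and Q could be computed from one gene's expression levels across tissues. The slide does not spell out the formula for Q; the code assumes Q_t = H - log2(p_t), where p_t is the gene's relative expression in tissue t. That assumption reproduces the stated endpoints (Q = 0 for a single-tissue gene, Q = 2 log2 T under uniform expression). The function name and inputs are illustrative only.

```python
import math


def tissue_specificity(levels):
    """Return (H, Q) for one gene given non-negative expression levels per tissue.

    H is the Shannon entropy (base 2) of the relative expression profile.
    Q[t] = H - log2(p_t) is an ASSUMED definition of the per-tissue score.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("gene must be expressed in at least one tissue")
    p = [x / total for x in levels]
    # Terms with p_t == 0 contribute nothing to the entropy.
    h = -sum(pt * math.log2(pt) for pt in p if pt > 0)
    # A tissue with no expression gets an infinite (least specific) Q.
    q = [h - math.log2(pt) if pt > 0 else math.inf for pt in p]
    return h, q


# Gene expressed only in the third of four tissues: H = 0 and Q = 0 there.
h_single, q_single = tissue_specificity([0, 0, 5, 0])

# Gene expressed uniformly in four tissues: H = log2(4), Q = 2 * log2(4).
h_uniform, q_uniform = tissue_specificity([1, 1, 1, 1])
```

Ranking genes by increasing Q for a tissue then puts the most tissue-specific genes first.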

Page 18: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Agreement between Microarrays and ESTs on Tissue Specificity

Page 19: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Specificity Characteristics of Tissues

Tissue       Probe Set ID   H     Q     RefSeq      Description
Amygdala     96055_at       3.2   5.8   NM_031161   cholecystokinin
Amygdala     93178_at       2.7   5.8   NM_019867   neuronal guanine nucleotide exchange factor
Amygdala     93273_at       3.7   5.8   NM_009221   synuclein, alpha
Amygdala     92943_at       3.5   6.0   NM_008165   glutamate receptor, ionotropic, AMPA1 (alpha 1)
Amygdala     95436_at       3.3   6.1   NM_009215   somatostatin
Lymph Node   98406_at       2.7   4.0   NM_013653   chemokine (C-C motif) ligand 5
Lymph Node   98063_at       1.6   4.1   -           glycosylation dependent cell adhesion molecule 1
Lymph Node   99446_at       2.5   4.1   NM_007641   membrane-spanning 4-domains, subfamily A, member 1
Lymph Node   92741_g_at     3.3   4.5   -           immunoglobulin heavy chain 4 (serum IgG1)
Lymph Node   102940_at      2.8   4.6   NM_008518   lymphotoxin B
Liver        94777_at       1.3   2.1   -           albumin 1
Liver        101287_s_at    1.6   2.2   NM_010005   cytochrome P450, 2d10
Liver        99269_g_at     1.5   2.2   NM_019911   tryptophan 2,3-dioxygenase
Liver        100329_at      1.4   2.3   NM_009246   serine protease inhibitor 1-4
Liver        94318_at       1.6   2.3   NM_013475   apolipoprotein H

Page 20: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

CpG Islands are Associated with the Start Sites of Genes with Wide-Spread Expression

CpG island = minimum 200 bp, C+G > 0.6, obs./expect. >=0.5
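
A sketch of how the three criteria on this slide could be checked for a single sequence window. It assumes "C+G" means the fraction of C and G bases and "obs./expect." means the usual CpG observed/expected ratio, (number of CG dinucleotides × length) / (number of C × number of G). The function name and default arguments are illustrative; the thresholds are the ones quoted on the slide.

```python
def is_cpg_island(seq, min_len=200, min_gc=0.6, min_obs_exp=0.5):
    """Check one window against the slide's CpG island thresholds."""
    seq = seq.upper()
    n = len(seq)
    if n < min_len:
        return False
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        # The observed/expected ratio is undefined without both C and G.
        return False
    gc_fraction = (c + g) / n
    # Count CG dinucleotides at every position.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    obs_exp = cpg * n / (c * g)
    return gc_fraction > min_gc and obs_exp >= min_obs_exp


print(is_cpg_island("CG" * 100))   # CpG-rich 200 bp window
print(is_cpg_island("AT" * 100))   # no C or G at all
```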

Page 21: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Tissue-Specific and Non-Specific Promoters Have Distinct Base Compositions

[Figure: base composition of promoters, with panels for CpG+ and CpG- promoters and for Multi-Tissue (H >= 4.4) and Tissue-Specific (H <= 3.5) genes]

Promoters based on DBTSS (http://dbtss.hgc.jp)

Page 22: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

TATA Boxes are Associated with Tissue-Specific Genes

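[Figure: bar chart of % with TATAA Box (y-axis, 0 to 90) against Q-Value bin (x-axis: 0-2, 2-4, 4-6, 6-8, 8-10, >10) for human and mouse. Genes with TATAA Box overall: human 18.8%, mouse 22.9%. P-values shown on the chart (h = human, m = mouse): p_h = 0.13, p_m = 0.15; p_h = 0.00007, p_m = 0.00087; p_h = 0.00005, p_m = 0.00001. Counts shown on the bars: (7/9), (8/9), (4/8), (8/28), (16/80), (3/8), (10/28), (16/80), (4/31), (2/27), (9/35), (3/27).]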

Page 23: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Column headings: Cellular Component, Biological Process, Human Only, Mouse Only

CGI-/TATA+
– Cellular Component: extracellular, extracellular space; microsome, vesicular fraction; intermediate filament (cytoskeleton)
– Biological Process: response to stimulus; organismal physiological process, inflammatory response, innate immune response, cell motility, defense response, response to pest/pathogen/parasite, response to wounding, response to biotic stimulus, cell-cell signaling, morphogenesis, digestion, muscle contraction; chemotaxis, taxis, response to chemical substance, response to abiotic stimulus, muscle development

CGI+/TATA-
– Cellular Component: cell, cytoplasm, intracellular, mitochondrion; nucleus, ribonucleoprotein complex
– Biological Process: nucleobase, nucleoside, nucleotide and nucleic acid metabolism, intracellular transport, metabolism, protein transport, intracellular protein transport, RNA processing, RNA metabolism, cell cycle, mitotic cell cycle

CGI-/TATA-
– Cellular Component: (integral to) (plasma) membrane; extracellular, extracellular space
– Biological Process: organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus; response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction; complement activation, complement activation (classical pathway), humoral defense mechanism (sensu Vertebrata), humoral immune response

Functional relationships of promoter classes based on over-represented GO terms (EASE)

Page 24: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

First Clues: TATA Box indicates Tissue Specific; CpG indicates Wide Spread Expression

Additional clues: CpG-/TATA+ indicates high expression, secreted proteins, while CpG+/TATA- indicates cellular and mitochondrial proteins.

Page 25: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Expanding the Mammalian CArGome

Qiang Sun1, Guang Chen2, Jeffrey W. Streb1, Xiaochun Long1, Yumei Yang1, Christian J. Stoeckert, Jr.2 and Joseph M. Miano1

1. Cardiovascular Research Institute, University of Rochester School of Medicine, Rochester, New York

2. Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania

Genome Research (in press)

Page 26: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Serum Response Factor (SRF) Target Genes

Page 27: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Finding Novel CArG elements

• Expect 1 CArG element about every kb just by chance.
– CCWWWWWWGG with one mismatch allowed

• Use conservation to reduce false positives.
– 188 associated with 4362 orthologous genes
– 116 had orthologous CArGs
– 10/62 known genes found
– Repeated with 9169 orthologous genes
  • 489 predictions
  • 32/62 known genes found

• 60 of 83 predictions were experimentally validated
– Transfection assays
– Binding assays
– Knockdown assays
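
A sketch of the first step above: scanning a sequence for CCWWWWWWGG (W = A or T) with at most one mismatch, plus the chance rate per position. The chance calculation assumes independent, equally likely bases, which real genomes do not satisfy. Under that assumption the rate is 19/16384, roughly one hit every 860 bp, in line with "about every kb". Function names are illustrative and are not taken from the study.

```python
from fractions import Fraction

# Bases accepted at each consensus symbol (W is the IUPAC code for A or T).
ALLOWED = {"C": "C", "G": "G", "W": "AT"}
CARG = "CCWWWWWWGG"


def count_mismatches(window, pattern=CARG):
    """Number of positions where the window violates the consensus."""
    return sum(base not in ALLOWED[sym] for base, sym in zip(window, pattern))


def find_carg(seq, max_mismatch=1):
    """Start positions (0-based) of windows within max_mismatch of the consensus."""
    seq = seq.upper()
    k = len(CARG)
    return [i for i in range(len(seq) - k + 1)
            if count_mismatches(seq[i:i + k]) <= max_mismatch]


def chance_rate(max_mismatch=1, pattern=CARG):
    """Probability that a random window matches, assuming uniform independent bases."""
    # dist[m] = probability of exactly m mismatches over the symbols seen so far.
    dist = [Fraction(1)]
    for sym in pattern:
        p_match = Fraction(len(ALLOWED[sym]), 4)
        new = [Fraction(0)] * (len(dist) + 1)
        for m, prob in enumerate(dist):
            new[m] += prob * p_match
            new[m + 1] += prob * (1 - p_match)
        dist = new
    return sum(dist[:max_mismatch + 1])


hits = find_carg("AAACCATATATGGAAA")
rate = chance_rate()
```

Conservation filtering would then keep only hits that also appear at the aligned position in the orthologous sequence.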

Page 28: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Serum Response Factor (SRF) Target Genes

Page 29: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

More Clues: Human-mouse conservation enables identification of valid CArG elements

CArG elements are associated with many cytoskeletal genes, suggesting a role for SRF in cytoskeletal dynamics.

Page 30: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Using Bounded Collection Grammars to Identify cis-Regulatory Modules in Tissue Specific Genes

Jonathan Schug

Max Mintz (CIS, U Penn)

Page 31: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Bounded Collection Grammars

Collection production rules for the GR response element in the PEPCK promoter

Page 32: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Rules are evaluated using the receiver operating characteristic (ROC)

Each point is a different parameter setting for a rule applied to training sets. Typically use area under the curve (AUC) to rank rules.
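
A sketch of the evaluation described above. Each parameter setting of a rule yields one (false positive rate, true positive rate) point on the training sets, and the points are integrated with the trapezoid rule to give an AUC. The input format (one boolean per training sequence saying whether the rule fired) and the function names are assumptions made for illustration.

```python
def roc_point(fires_on_positives, fires_on_negatives):
    """(FPR, TPR) for one parameter setting of a rule."""
    tpr = sum(fires_on_positives) / len(fires_on_positives)
    fpr = sum(fires_on_negatives) / len(fires_on_negatives)
    return fpr, tpr


def auc_from_points(points):
    """Trapezoid-rule area under the ROC points, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# One setting: the rule fires on 3 of 4 positives and 1 of 4 negatives.
point = roc_point([True, True, False, True], [False, True, False, False])
area = auc_from_points([point])
```

An AUC of 0.5 corresponds to a rule that does no better than chance on the training sets.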

Page 33: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Rules are built by increasing complexity when AUC improves

Reduce search space by not pursuing unproductive paths, e.g., if (A,B) is not better than A or B, then there is no need to look at (A,B,C), (A,B,D), or (A,B,C,D).
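
A sketch of this pruning idea as a level-wise search. It is an illustration, not the bounded collection grammar implementation. `score` stands in for the AUC of a rule built from a set of motifs, and the pruning criterion (a larger set is examined only if every set one motif smaller was kept, and is kept only if it beats all of them) is an assumption chosen to match the (A,B) example.

```python
def grow_rules(motifs, score, max_size=3):
    """Level-wise search over motif sets, pruning sets that do not improve."""
    # Solos: every single motif is scored and kept as a starting point.
    kept = {frozenset([m]): score(frozenset([m])) for m in motifs}
    frontier = list(kept)
    for _ in range(2, max_size + 1):
        # Extend each surviving set by one motif it does not yet contain.
        candidates = {s | {m} for s in frontier for m in motifs if m not in s}
        frontier = []
        for cand in candidates:
            parents = [cand - {m} for m in cand]
            # Skip the candidate if any smaller set inside it was pruned.
            if any(p not in kept for p in parents):
                continue
            value = score(cand)
            # Keep the larger set only if it beats every one of its parents.
            if value > max(kept[p] for p in parents):
                kept[cand] = value
                frontier.append(cand)
    return kept
```

With toy scores where (A,B) scores below A and B, the set (A,B,C) is never scored, however good it might have been.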

Page 34: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

The 3-set rule for the PEPCK GR element

Note improvements of 2-sets over solos and the 3-set over 2-sets.

Page 35: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores

Elisabetta Manduchi, Jonathan Schug

Klaus Kaestner (Genetics, U Penn)

Page 36: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

If we focus on biological processes that are predominantly taking place in a given tissue, can we identify regulatory modules common to genes involved in these processes?

[Figure: diagram linking Tissue, Biological Process, and Genes]

Page 37: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

For a given tissue survey, we attach “tissue-specificity” profiles to gene sets defined by GO BPs, based on the ranked lists of genes in each tissue according to increasing Q.

• To this end, we use an Enrichment Score (ES) in the spirit of that described in Mootha et al. (2003), as a measure of tissue-specificity for that gene set.

• The ES turns out to be equivalent (i.e. equal up to a multiplicative constant) to a Kolmogorov-Smirnov statistic.
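
A sketch of the equivalence stated above. It assumes the running-sum form of the enrichment score in Mootha et al. (2003): walking down the ranked list, add sqrt((N-G)/G) for each gene in the set and subtract sqrt(G/(N-G)) otherwise, then take the maximum of the running sum. Under that assumption the ES equals sqrt(G(N-G)) times a one-sided Kolmogorov-Smirnov statistic. The exact variant used in the talk is not given, so the details are illustrative.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the weighted running sum down the ranked list."""
    n = len(ranked_genes)
    g = sum(gene in gene_set for gene in ranked_genes)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some but not all ranked genes")
    step_up = math.sqrt((n - g) / g)
    step_down = math.sqrt(g / (n - g))
    running = best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in gene_set else -step_down
        best = max(best, running)
    return best


def ks_statistic(ranked_genes, gene_set):
    """One-sided KS statistic: max gap between the two empirical CDFs."""
    n = len(ranked_genes)
    g = sum(gene in gene_set for gene in ranked_genes)
    hits = misses = 0
    best = 0.0
    for gene in ranked_genes:
        if gene in gene_set:
            hits += 1
        else:
            misses += 1
        best = max(best, hits / g - misses / (n - g))
    return best


# Genes ranked by increasing Q in one tissue; the set sits at the top.
ranked = ["a", "b", "c", "d", "e", "f"]
members = {"a", "b"}
es = enrichment_score(ranked, members)
ks = ks_statistic(ranked, members)
```

Here N = 6 and G = 2, so the constant relating the two is sqrt(2 × 4).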

[Figure: ranked gene lists for liver, muscle, and brain, with asterisks marking genes in the steroid metabolism set]

Page 38: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Application to Tissue Surveys

• Have applied to two different Affymetrix-based datasets
– Shmueli et al. C R Biol 2003. GeneNote (human)
– Su et al. PNAS 2004. GEA2 (human and mouse)

• We looked at ~ 2000 GO BPs that we could map to probe sets

Page 39: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

GO BPs having significantly specific profiles for each tissue can be identified

[Figure panels: significant in liver; significant in heart and skeletal muscle]

Page 40: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Learning Tissue-Specific Promoter Motifs

[Figure: workflow shown side by side for the Mouse Tissue Survey and the Human Tissue Survey (GEA2), illustrated with liver-specific steroid metabolism. Labeled steps: Tissue-specific GO BPs; Reduce and Intersect; Ortholog Pairs (Homologene); Training Set of Promoter Sequences (32 POS, 365 NEG; UCSC conserved sequences); Mm-based and Hs-based consensus conserved sequences; Positive Solos (ROC area > 0.5)]

Page 41: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

[Figure: overlap of Mm-based collections (30) and Hs-based collections (83), with 13 common collections]

Common solos (31): GATA, MYCMAC/USF, AIRE, CAAT, ER-LEFT, TCF11, TTAC/EFC/NCX/VBP, PBX, INI, AATC, SREBP, DBP, Forkhead, E4B, EBOX, CCAA, CREB/ATF, S8/CART1/CHX10/NKX25, G_AA/CEBP/HLF, TAACC, LXR, HNF1, ALX4, HNF4/TCF4/COUP/PPAR, PPAR-LEFT, ROAZ, AML/PEBP, BACH/NFE2/NRF2, ZTA, P53, GNCF/SF1

Legend: Liver-specific from Krivan and Wasserman (2001); Known Liver TFs

Page 42: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Learning Liver Specific CRMs for Steroid Metabolism

• AIRE
• P53
• ER-LEFT
• {CREB/ATF, GATA}
• {GATA, GNCF/SF1}
• {Forkhead, GATA}
• {GATA, G_AA/CEBP/HLF}
• {GATA, SREBP}
• {GATA, TAACC}
• {GATA, ZTA}
• {AATC, SREBP}
• {CAAT, SREBP}
• {BACH/NFE2/NRF2, G_AA/CEBP/HLF}

Without imposing prior knowledge, we end up with rules that are highly enriched for TFs expected to play a role in liver-specific steroid metabolism.

Page 43: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Testing a learned CRM for liver steroid metabolism using a liver HNF3-beta (FoxA2) knock out mouse study

• {Forkhead, GATA} set rule applies to HNF3-beta/FoxA2

• Search promoters of genes down-regulated in liver as measured on PancChip microarray
– Pancreas-focused array with 7356 known genes.
  • 52 (0.7%) map to steroid metabolism.
– 71 genes down-regulated
  • 7 (10%) map to steroid metabolism.

Genes down-regulated by knockout of a forkhead protein (Hnf3-beta) are significantly enriched in steroid metabolism
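
A sketch of one way the enrichment above could be assessed from the slide's counts. The slide does not name the test; a one-sided hypergeometric tail is assumed here as a common choice. It gives the chance of seeing at least 7 steroid-metabolism genes among 71 drawn from an array of 7356 genes, 52 of which carry the annotation. The function name is illustrative.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    # Exact rational arithmetic avoids overflow before converting to float.
    return float(Fraction(favorable, total))


# Counts from the slide: 7356 genes on the array, 52 in steroid metabolism,
# 71 down-regulated genes, 7 of them in steroid metabolism.
p_value = hypergeom_tail(7356, 52, 71, 7)
expected_hits = 71 * 52 / 7356
print(f"expected by chance: {expected_hits:.2f} genes, observed: 7")
print(f"hypergeometric tail probability: {p_value:.2e}")
```

About half a gene would be expected by chance, so 7 observed is a large excess under this model.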

Page 44: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

More Clues: We can identify candidate CRMs from top-ranking GO Biological Processes for tissues

Tested a candidate CRM for liver steroid metabolism with a knock-out mouse. Support for role of one of the factors but not enough sensitivity for seeing both factors.

Page 45: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

Future Directions

• Apply learning methods to many tissues and processes incorporating multiple surveys

• Add novel motifs to learning process

• Use ChIP and tissue-focused expression datasets to better evaluate

Our goal is to make inferences of the form: "The gene set G shows specificity for tissue T and is regulated by module M in this context".

Page 46: Cracking the promoter code: Identifying regulatory modules for tissue-specific transcripts.

http://www.cbil.upenn.edu