1 (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Do not reproduce without permission 1 Gerstein.info/talks (c) 2004 Annotation of Intergenic Regions of the Human Genome Mark B Gerstein Yale (Comp. Bio. & Bioinformatics) Cistrome 2007, Boston, MA 2007.04.30, 15:20-15:55 Slides downloadable from Lectures.GersteinLab.org (Please read permissions statement.) (Genome Annotation Talk without much pgenes, including Tilescope, HMMs, DART, binding sites, and pgene-transcription, All completed comfortably within time.)


Annotation of Intergenic Regions of the Human Genome

Mark B Gerstein
Yale (Comp. Bio. & Bioinformatics)

Cistrome 2007, Boston, MA

2007.04.30, 15:20-15:55
Slides downloadable from Lectures.GersteinLab.org

(Please read permissions statement.)

(Genome Annotation Talk without much pgenes, including Tilescope, HMMs, DART, binding sites, and pgene-transcription, All completed comfortably within time.)


Most of the human genome is not coding sequence

[IHGSC, Nature 409, 2001] [Venter et al. Science 29, 2001]


• Mike Snyder & Sherman Weissman

• Tiling of whole chromosomes into small fragments

• Large-scale hybridization to find transcribed regions in unbiased fashion and TF binding sites (via ChIP-chip)

• Careful Computational Annotation

+ENCODE


Overall Aim of Yale Genomics is Comprehensive Intergenic Annotation

• Regulatory regions, repeats, non-coding RNAs, origins of replication, pseudogenes, segmental duplications, unknown elements…

• Specific Results within ENCODE – 1% of human genome (~30Mb in 44 regions)
  – Pseudogenes (Zheng et al., GR)
  – Classification of Novel Transcribed Regions (Rozowsky et al., GR)
  – Characterization of Novel Structured RNAs (Washietl et al., GR)
  – Grouping and Classification of Binding Sites (from ChIP-chip)
  – Med. Scale (~100kb) deserts and islands (Zhang et al., GR)
  – Novel Promoters (Trinklein et al., GR)
  – CNVs and SDs (from hires-aCGH, Korbel et al.)


Outline: Tiling Array Analysis + Annotation Pipelines

• Tools for Scoring Arrays
• Tools for Segmentation and Validation of Arrays
• Results on Clusters of Novel Transcribed Regions
• Results on Clusters of Binding Sites
• Results on Active, Transcribed Pseudogenes


Tiling Arrays Probing Intergenic Activity: Tools


Tilescope 101

▪ It is available at tilescope.gersteinlab.org

▪ It was designed for high-density tiling microarray data analysis.

▪ It is useful
  ▫ Most existing data processing software was designed for traditional microarrays.
  ▫ It is flexible: several microarray data processing methods are available.
  ▫ It is easy to use
    • It has a graphical user interface.
    • The data analysis process is streamlined.
    • It is online software. No need to install.
  ▫ It is free!

Zhang et al. (2007) GenomeBiology


Tilescope: system implementation

▪ Written in Java

▪ Composed of 3 parts: applet, servlet, and pipeline program

[Figure: system architecture diagram with labels Users, Applet, Internet, Servlet, Pipeline, Server]

Zhang et al. (2007) GenomeBiology


Tilescope: user interface

Zhang et al. (2007) GenomeBiology


Tilescope: data processing

▪ Array data can be normalized by mean, median, quantile, and loess.

▪ Tile scoring generates the signal map and the P-value map.

▪ Feature identification produces ‘hits’.

Zhang et al. (2007) GenomeBiology
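
Two of the normalization options named above (median and quantile) can be sketched in a few lines. This is a minimal illustration in plain Python, not Tilescope's Java implementation; the function names and the toy intensity values are assumptions made for the example.

```python
from statistics import median


def median_normalize(arrays):
    """Scale each array so its median equals the mean of all array medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give equal-length arrays the same empirical intensity distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


# Toy intensities for two arrays of three probes each (hypothetical values).
toy_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
quantile_out = quantile_normalize(toy_arrays)
median_out = median_normalize(toy_arrays)
```

After quantile normalization both toy arrays contain the same set of values, with each probe keeping its within-array rank.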


HMM Segmentation + Optimal Selection of Regions to Validate


Tiling array analyses: modeling

• Goal: identify S = 1 probes based on D
• Method: build a model M’ based on D, compute S
• Possible performance metric: error rate in predicting the state of a probe, experimental validation, …

[Figure labels: "(unknown)", "(may need pre-processing)"]


Tiling array analyses: Hidden Markov model

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.

A model for transcriptional tiling array data

TAR: transcriptionally active region

Source: http://en.wikipedia.org/wiki/Hidden_Markov_model

State transitions in a hidden Markov model (example)
x: hidden states
y: observable outputs
a: transition probabilities
b: output probabilities
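
To make the roles of x, y, a and b concrete, here is a minimal sketch of Viterbi decoding for a two-state model (TAR vs. non-TAR). It is an illustration only: the discretized "low"/"high" probe signal and every probability value below are assumptions for the example, not parameters from Du et al.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (log-space Viterbi)."""
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for y in observations[1:]:
        previous = score
        score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda r: previous[r] + math.log(trans_p[r][s]))
            score[s] = (previous[best] + math.log(trans_p[best][s])
                        + math.log(emit_p[s][y]))
            pointers[s] = best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters: x = hidden states, y = discretized probe signal,
# a = transition probabilities, b = output probabilities.
STATES = ("non-TAR", "TAR")
START = {"non-TAR": 0.5, "TAR": 0.5}
TRANS = {"non-TAR": {"non-TAR": 0.9, "TAR": 0.1},
         "TAR": {"non-TAR": 0.1, "TAR": 0.9}}
EMIT = {"non-TAR": {"low": 0.8, "high": 0.2},
        "TAR": {"low": 0.2, "high": 0.8}}

signal = ["low"] * 3 + ["high"] * 4 + ["low"] * 3
segmentation = viterbi(signal, STATES, START, TRANS, EMIT)
```

With these illustrative numbers, a run of four "high" probes is long enough to be called a TAR, whereas a single isolated "high" probe is absorbed into the non-TAR background because two state switches cost more than one unlikely emission.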


Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.


Tiling array analyses: active sampling & supervised learning

• Active sampling: select a small set of sub-regions for validation first

• Supervised learning: use the validation data to train the statistical model

Source: Gerstein et al. (2007) Gen. Res.


Tiling array analyses: Sample sub-region selection

Can we find a good selection scheme?

Source: Gerstein et al. (2007) Gen. Res.


Tiling array analyses: Sample sub-region selection

• Sampling solely based on data D
  Some candidates:
  – Random selection
  – Entropy based
  – KL-divergence based

• Testing the performance of these schemes
  Simulation
  – Why? So that we know S exactly
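
The entropy-based candidate can be sketched as follows, next to the random baseline. The sketch assumes each probe already carries a rough probability of being in state S = 1 (for instance from a preliminary model fit to D); the region names, the probabilities and the averaging rule are hypothetical choices for illustration, not the scheme evaluated in the cited papers.

```python
import math
import random


def binary_entropy(p):
    """Shannon entropy (bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def region_entropy(probabilities):
    """Mean per-probe entropy of one sub-region."""
    return sum(binary_entropy(p) for p in probabilities) / len(probabilities)


def select_by_entropy(regions, k):
    """Pick the k sub-regions whose probe states are most uncertain."""
    ranked = sorted(regions, key=lambda name: region_entropy(regions[name]),
                    reverse=True)
    return ranked[:k]


def select_randomly(regions, k, seed=0):
    """Baseline: pick k sub-regions uniformly at random."""
    return random.Random(seed).sample(sorted(regions), k)


# Hypothetical per-probe probabilities of S = 1 for three sub-regions.
toy_regions = {
    "r1": [0.01, 0.02, 0.99],  # confident calls, little to learn
    "r2": [0.50, 0.45, 0.60],  # highly uncertain
    "r3": [0.90, 0.10, 0.20],  # moderately uncertain
}
chosen = select_by_entropy(toy_regions, 2)
```

The idea is that validation effort goes to the sub-regions where the current model is least sure, which is where new labels are expected to be most informative.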


Tiling array analyses: Simulation results

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.


Tiling array analyses: Simulation results (cont.)

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.


Tiling array analyses: back to the real world

• Transcriptional tiling array
  – Mainly using gene annotation to train the model
  – Somewhat analogous to using MaxEntropy to select the sample regions
  – The training set is expected to be noisy, but still leads to satisfactory performance

• ChIP-chip tiling array
  – Try to guess the signal distribution according to annotation information

• Ideal scenario
  – Optimally select a medium-sized set of sample sub-regions
  – Do experimental validations to build the model


Tiling array analyses: results on transcriptional data (ENCODE regions (~30Mb), training set (~7.5Mb), ¼ training set (~1.9Mb, ~0.1M probes))

Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.


Tiling Arrays Probing Intergenic Activity: Classifying Un-annotated Transcription


DART: Classification of Unannotated Transcription

• Large number of novel transcribed regions (TARs / transfrags) detected using tiling microarrays.

• Developed DART: Database of Active Regions and Tools
  – Developed a classification procedure for these novel TARs/transfrags
  – Database for storing & visualizing various sets of TARs/transfrags
  – Associated tools for analyzing these sets

Rozowsky et al. Genome Research (2007, in press)


Set of All TARs
  Exonic TARs
  Pseudo TARs
  Novel TARs
    Intronic: Proximal, Distal
    Intergenic: Proximal, Distal
    ESTs

Rozowsky et al. Gen. Res. (2007, in press)


Schematic of Classification Procedure

Set of Novel TARs
  S1A Filter Novel TARs for Unusual Sequence Composition → Peculiar TARs
  S1B Filter Novel TARs for Cross-Hybridization → Cross-Hyb TARs
  S2 Assign Novel TARs to Known Genes using Expression Profiles → Gene Assoc. TARs
  S3A Cluster into Novel Transcribed Loci using Expression Profiles (EP) → Novel EP Loci
  S3B Cluster into Novel Transcribed Loci using Phylogenetic Profiles (PP) → Novel PP Loci
  Remaining: Singlet or Ambiguous TARs

(Each resulting set is shown with a "P D P D E" breakdown.)

DART

Rozowsky et al. Genome Research (in press)


Rozowsky et al. Gen. Res. (2007, in press)

TAR clustering


DART: Database & Tools
- Interfaces with UCSC
- Tools use Ensembl API

Rozowsky et al. Genome Research (in press)


Table 1: Locations of all TARs

                                 Exonic      Pseudogenes   Unannotated Regions
Size of ENCODE Regions (bp)      1,776,157   144,745       28,077,158
Percentage of all ENCODE         5.9%        0.5%          93.6%
Number of TARs                   3,666       195           6,988
Percentage of all TARs           33.8%       1.8%          64.4%

Locations of Novel TARs

                                    ESTs not in Exons   Intronic Proximal   Intronic Distal   Intergenic Proximal   Intergenic Distal
Size of Unannotated Regions (bp)    2,477,910           8,522,559           5,536,879         2,434,101             9,250,454
Percentage of Unannotated Regions   8.8%                30.2%               19.6%             8.6%                  32.8%
Number of Novel TARs                1,194               3,006               864               772                   1,300
Percentage of all Novel TARs        16.7%               42.1%               12.1%             10.8%                 18.2%

Table 2: Sets of Classified Novel TARs

                                                        Number   Percentage
Total                                                   6,988    100.0%
With peculiar sequence composition                      503      7.2%
Assigned to known genes                                 955      13.7%
Caused by cross-hybridization                           -        -
In novel transcribed loci using expression profiles     681      9.7%
In novel transcribed loci using phylogenetic profiles   782      11.2%

Rozowsky et al. Genome Research (in press)
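
The percentages in Table 1 follow directly from the counts, and the same counts give a TAR density per megabase for each category. The short sketch below recomputes both; the dictionary and function names are illustrative, and the density comparison is an added worked calculation rather than a figure from the slide.

```python
# Counts copied from Table 1.
REGION_SIZE_BP = {"Exonic": 1_776_157, "Pseudogenes": 144_745, "Unannotated": 28_077_158}
TAR_COUNT = {"Exonic": 3_666, "Pseudogenes": 195, "Unannotated": 6_988}


def percentages(counts):
    """Share of each category in the total, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


def tars_per_megabase(tar_count, region_size_bp):
    """Number of TARs per Mb of sequence in each category."""
    return {name: tar_count[name] / (region_size_bp[name] / 1e6) for name in tar_count}


size_share = percentages(REGION_SIZE_BP)
tar_share = percentages(TAR_COUNT)
density = tars_per_megabase(TAR_COUNT, REGION_SIZE_BP)
```

Exonic sequence makes up about 6% of the ENCODE regions but carries about a third of all TARs, so its TAR density is several times that of the unannotated sequence, even though the unannotated sequence holds most TARs in absolute terms.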


DART classification has been experimentally validated with small-scale RT-PCR and sequencing experiments

Results:

18/46 (39%) confirmed by RT-PCR

4/5 Sequenced Products Map uniquely to correct genomic region

[Figure: RT-PCR gel with +/- lane pairs for IDs A11 to A23 and B1, plus a ladder lane (L); size markers at 0.5Kb, 1Kb, 1.5Kb and 2Kb]

TAR 1

PCR Sequence          1 ttcttcggaaaagcacatgaactctttggagtctcctgttccacttggtaaatttcctat 60
                        |||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||
Chr21        34,270,569 ttcttcggaaaagcacatgaactcttcggagtctcctgttccacttggtaaatttcctat 34,270,628

PCR Sequence         61 agctccgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc 120
                        |||  |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Chr21        34,270,629 agccacgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc 34,270,688

PCR Sequence        121 ccctgctccacttcctgcttcatttcagctgat 153
                        |||||||||||||||||||||||||||||||||
Chr21        34,270,689 ccctgctccacttcctgcttcatttcagctgat 34,270,721

TAR 2

PCR Sequence        154 ggatgacactccctcgttctaataccatctgaatgcctgagcaattacatcttacaacct 213
                        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Chr21        34,270,898 ggatgacactccctcgttctaataccatctgaatgcctgagcaattacatcttacaacct 34,270,957

PCR Sequence        214 catgaaaaacacagcagcttgtcacgatgaatg 246
                        |||||||||||||||||||||||||||||||||
Chr21        34,270,958 catgaaaaacacagcagcttgtcacgatgaatg 34,270,990

[Figure labels: Forward Primer (FP), Reverse Primer (RP), Novel TARs, PCR Sequence]

Rozowsky et al. Genome Research (in press)
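
The agreement between the sequenced PCR product and the genome can be quantified by counting mismatched columns in the ungapped alignment shown on this slide. The sketch below does this for the TAR 1 portion (PCR positions 1 to 153); the sequences are copied from the slide, while the variable and function names are illustrative.

```python
def mismatch_positions(query, reference):
    """1-based positions where two equal-length, ungapped sequences differ."""
    if len(query) != len(reference):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (q, r) in enumerate(zip(query, reference)) if q != r]


# TAR 1 portion of the PCR product and the matching Chr21 sequence.
PCR_TAR1 = (
    "ttcttcggaaaagcacatgaactctttggagtctcctgttccacttggtaaatttcctat"
    "agctccgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc"
    "ccctgctccacttcctgcttcatttcagctgat"
)
CHR21_TAR1 = (
    "ttcttcggaaaagcacatgaactcttcggagtctcctgttccacttggtaaatttcctat"
    "agccacgcactgaaagtccctgctgccctccttcctctgagcttgtggggcccacagatc"
    "ccctgctccacttcctgcttcatttcagctgat"
)

differences = mismatch_positions(PCR_TAR1, CHR21_TAR1)
identity = 1 - len(differences) / len(PCR_TAR1)
```

For TAR 1 this gives three mismatches in 153 aligned bases, roughly 98% identity, while the TAR 2 portion on the slide matches the genome at every position.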


Tiling Arrays Probing Intergenic Activity: Categories Groups of Binding Sites


Transcriptional regulation

▪ Protein-coding genes are transcribed by RNA polymerase II (Pol2) ...

▪ ... under elaborate regulation by
  ▫ The binding of a complex set of transcription factors to their regulatory elements
  ▫ Histone modifications such as acetylation and methylation
  ▫ Chromatin remodeling

▪ Transcription factor binding sites include
  ▫ Core promoters
  ▫ Promoter proximal elements
  ▫ Other elements such as enhancers, silencers, insulators, and response elements

▪ Transcriptional regulatory elements can be globally mapped by high-throughput experiments such as ChIP-chip or ChIP-PET.

Zhang et al. (2007) Gen. Res.


ENCODE TR study

▪ First concerted effort to systematically identify TREs in the human genome on a large scale
  ▫ 105 lists of transcriptional regulatory elements in the ENCODE regions
  ▫ 29 transcription factors, 9 cell lines, 2 time points
  ▫ 7 laboratories and 3 different microarray platforms

▪ TFs and their TREs can be studied on various genomic levels.

Zhang et al. (2007) Gen. Res.


Our TRE analysis approach

▪ On an intermediate genomic level, involving 10 ~ 100 kb of DNA with several genes on average.

▪ Try to present the problem and subsequently analyze the data in a consistent and coherent statistical framework.

Zhang et al. (2007) Gen. Res.


Landscape of ENCODE TREs

▪ Positive correlation of the TRE density with both non-exonic conservation and gene density in a genomic region

Zhang et al. (2007) Gen. Res.


Non-random distribution of TREs

▪ TREs are not evenly distributed throughout the ENCODE regions (P < 2.2×10^-16).

▪ The actual TRE distribution follows a power law.

▪ The null distribution is ‘Poissonesque.’

▪ Many genomic subregions have extreme numbers of TREs.

Zhang et al. (2007) Gen. Res.
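
One simple way to see the contrast between a ‘Poissonesque’ null and a clustered distribution is the variance-to-mean ratio of per-bin counts, which is close to 1 when elements are placed uniformly at random and much larger when they pile up in a few bins. The sketch below is an illustration on simulated counts under that assumption; it is not the statistical test used by Zhang et al., and the bin numbers are made up.

```python
import random
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of per-bin counts (about 1 for a Poisson-like null)."""
    return pvariance(counts) / mean(counts)


def null_counts(n_elements, n_bins, seed=0):
    """Drop elements into bins uniformly at random (the null model)."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_elements):
        counts[rng.randrange(n_bins)] += 1
    return counts


# Hypothetical example: 3000 elements over 300 equal-sized bins.
uniform = null_counts(3000, 300)
# Same total, but concentrated in 30 bins with the other 270 left empty.
clustered = [0] * 270 + [100] * 30

uniform_index = dispersion_index(uniform)
clustered_index = dispersion_index(clustered)
```

In the clustered toy example the index is 90, far from the value near 1 expected under random placement, which mirrors the qualitative point of the slide: many subregions hold far more or far fewer elements than chance would give.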


Local TRE enrichment and depletion

▪ Hundreds of TRE ‘islands’ and ‘deserts’ are identified in ENCODE regions.

▪ The longest island is composed of 68 various TREs and covers a 35-kb region near the HOXA cluster on chromosome 7.

▪ The entirety of EHD1 on chromosome 11 is covered by TRE islands.

▪ Some of the islands are located in the intergenic regions of the genome.

dart.gersteinlab.org/encode/tr/

Zhang et al. (2007) Gen. Res.
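
As a rough illustration of what ‘islands’ and ‘deserts’ mean operationally, the sketch below groups element positions into islands by a maximum gap and reports long element-free stretches as deserts. The gap-based rule, the thresholds and the toy coordinates are all assumptions for the example; the cited analysis defines islands and deserts within its own statistical framework.

```python
def find_islands(positions, max_gap, min_count):
    """Runs of positions with neighbor gaps <= max_gap and >= min_count members.

    Returns (start, end, count) tuples.
    """
    islands, run = [], []
    for pos in sorted(positions):
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_count:
                islands.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    if len(run) >= min_count:
        islands.append((run[0], run[-1], len(run)))
    return islands


def find_deserts(positions, region_start, region_end, min_length):
    """Element-free intervals of at least min_length inside the region."""
    edges = [region_start] + sorted(positions) + [region_end]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b - a >= min_length]


# Toy element positions (bp) inside a hypothetical 10 kb region.
toy_positions = [100, 150, 180, 220, 5000, 9000, 9050, 9100]
islands = find_islands(toy_positions, max_gap=100, min_count=3)
deserts = find_deserts(toy_positions, 0, 10000, min_length=3000)
```

A fixed gap threshold like this ignores how many elements chance alone would cluster, which is why a null model is needed before calling a cluster significant.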


Tiling Arrays Probing Intergenic Activity: Connecting Intergenic Activity to Pseudogenes


5 Methods of Assignment

• 4 automatic pipelines
  – retroFinder, pseudoFinder, PseudoPipe, GIS
  – Comparing protein or transcript v genomic DNA, filtering, application of rules

• HAVANA manual

• What is a pseudogene?
  – Different criteria

• Conservative approach here
  – Can't overlap gene annotation
  – Need to have a protein alignment
  – 201 pseudogenes (in comparison to ~400 genes)

Zheng et al. (2007) Gen. Res.


Overlap of Pseudogenes by 5 Different Methods

Union of 252

Zheng et al. (2007) Gen. Res.


Ex. Pseudogene Intersecting Transcriptional Evidence

[Figure: special tracks in browser: diTAG, CAGE, TARs, ChIP-chip]

Zheng et al. (2007) Gen. Res.


Intersection of Pseudogenes with Transcriptional Evidence

Zheng et al. (2007) Gen. Res.


Intersection of Pseudogenes with Transcriptional Evidence

Excluding TARs (due to cross-hyb issues)

Targeted RACE expts to 160 pseudogenes give 14

Total Evidence from Sequencing is 38 of 201 (with 5 having cryptic promoters)

Zheng et al. (2007) Gen. Res.


Targeted Transcription Expts.

• RACE expts
  – Interrogated 160 pseudogenes (49 non-processed & 111 processed)
  – In 51 cases (26 non-processed and 25 processed pseudogenes), could design distinguishing primers (>4 mismatched bp v. parent)
  – The resulting data supported transcription from 14 (8 processed and 6 non-processed) of the 160 pseudogenes (9 with pseudogene-specific primers)
  – These numbers might represent a conservative estimate since a RACEfrag was assigned to its parent gene by default if it could be mapped to both a parent locus and a pseudogene locus.

• RACE expts + sequencing (CAGE, PET, EST and mRNA)
  – Unambiguous evidence for pseudogene transcription
  – All together, these data indicate 38 of 201 pseudogenes being the source of novel RNA transcripts
  – 5 of these had cryptic promoters (from TR analysis)

Zheng et al. (2007) Gen. Res.


Extension to Whole Genome

• 233 Transcribed from ~8000 Processed Pseudogenes

• Evidence for Transcription
  – 8% Refseq mRNAs
  – 32% Unigene consensus sequences
  – 72% dbEST expressed sequence tags
  – 32% Oligonucleotide microarray data (extra support)

• Highly decayed
  – Fraction with Ka/Ks ≥ 0.5 is 54%

Harrison et al. (2005) NAR


Genes & Pseudogenes

Zheng & Gerstein, TIG (2007)


Genes or Pseudogenes?

Zheng & Gerstein, TIG (2007)


Conclusions

• Tilescope Processing Pipeline
  – HMM Segmentation & optimal selection of regions to validate

• DART classification of TARs
  – 1300 clusters of transcriptionally active regions in ENCODE

• Deserts and Forests of Binding Activity on ~50kb scale

• Pseudogene Activity
  – Consensus annotation from automatic pipelines and manual curation gives 201 (~2/3 processed)
  – ~19% appear to be transcribed (38/201)


Acknowledgements

P Bertone

J Rozowsky

G Euskirchen

T Royce

S Balasubramanian

J Korbel

A Karpikov

D Zheng

Z Zhang

D Yan

R Sasidharan

O Emanuelsson

J Du

J Rinn

MS

MG

PM

SW

V Stolc

R Martone

P Harrison

pseudogene.org, tiling.gersteinlab.org

N Luscombe

C Bruce

J Chang

N Carriero

N Echols

J Karro


ENCODE Acknowledgements

Adam Frankish, Robert Baertsch, Philipp Kapranov, Alexandre Reymond, Siew Woh Choo, Yontao Lu, France Denoeud, Stylianos Antonarakis, Yijun Ruan, Chia-Lin Wei, Thomas Gingeras, Roderic Guigo, Jennifer Harrow

Sanger, UCSC, GIS, AFFX, Geneva, IMIM


Permissions Statement

This Presentation is copyright Mark Gerstein, Yale University, 2007.

Feel free to use images in it with PROPER acknowledgement (via citation to relevant papers or link to gersteinlab.org).