(c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu. Do not reproduce without permission. Gerstein.info/talks (c) 2004
Annotation of Intergenic Regions of the Human Genome
Mark B Gerstein
Yale (Comp. Bio. & Bioinformatics)
Cistrome 2007, Boston, MA
2007.04.30, 15:20-15:55
Slides downloadable from Lectures.GersteinLab.org
(Please read permissions statement.)
(Genome Annotation Talk without much pgenes, including Tilescope, HMMs, DART, binding sites, and pgene-transcription. All completed comfortably within time.)
• Tools for Scoring Arrays
• Tools for Segmentation and Validation of Arrays
• Results on Clusters of Novel Transcribed Regions
• Results on Clusters of Binding Sites
• Results on Active, Transcribed Pseudogenes
▪ Array data can be normalized by mean, median, quantile, and loess.
▪ Tile scoring generates the signal map and the P-value map.
▪ Feature identification produces ‘hits’.
Zhang et al. (2007) GenomeBiology
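The three steps above (normalization, tile scoring, feature identification) can be sketched roughly as follows. This is a minimal illustration, not the Tilescope implementation: the function names, the sliding-window median for the signal map, the one-sided sign test for the P-value map, and the thresholds are all assumptions, and the input is assumed to be log-ratios centered on zero.

```python
from math import comb
from statistics import median

def quantile_normalize(arrays):
    """Give every array the same empirical distribution (mean of sorted values)."""
    n_probes = len(arrays[0])
    sorted_rows = [sorted(row) for row in arrays]
    reference = [sum(r[k] for r in sorted_rows) / len(arrays) for k in range(n_probes)]
    normalized = []
    for row in arrays:
        order = sorted(range(n_probes), key=lambda i: row[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # probe with this rank gets the reference value
        normalized.append(out)
    return normalized

def score_tiles(values, half_width=1):
    """Return (signal_map, p_value_map) from a sliding window over probes."""
    signal, pvals = [], []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        signal.append(median(window))
        n = len(window)
        k = sum(1 for v in window if v > 0)
        # One-sided sign test: P(at least k positive probes out of n | p = 0.5).
        pvals.append(sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n)
    return signal, pvals

def call_hits(signal, pvals, min_signal=1.0, max_p=0.2):
    """Indices of probes passing both the signal and the P-value threshold (assumed cutoffs)."""
    return [i for i, (s, p) in enumerate(zip(signal, pvals))
            if s >= min_signal and p <= max_p]

signal, pvals = score_tiles([-0.5, 2.0, 1.5, 1.8, -0.2], half_width=1)
hits = call_hits(signal, pvals)
```

Requiring both thresholds mirrors the idea of keeping a signal map and a P-value map side by side; in practice adjacent hit probes would then be merged into regions.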
HMM Segmentation + Optimal Selection of Regions to Validate
Tiling array analyses: modeling
(unknown)
Goal: identify S = 1 probes based on D
Method: build a model M’ based on D, compute S
Possible performance metric: error rate in predicting the state of a probe, experimental validation, …
(may need pre-processing)
Tiling array analyses: Hidden Markov model
Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.
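As a rough illustration of HMM segmentation of probe intensities, the sketch below decodes the most likely state path with the Viterbi algorithm. It is not the model from Du et al.: the two states (0 = background, 1 = active), the Gaussian emissions, and every parameter value are assumptions chosen only to make the example run.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(observations, log_start, log_trans, emission_params):
    """Most likely hidden-state path for a non-empty sequence of probe values."""
    n_states = len(log_start)
    # score[s]: best log-probability of any path ending in state s
    score = [log_start[s] + gaussian_logpdf(observations[0], *emission_params[s])
             for s in range(n_states)]
    back = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + gaussian_logpdf(x, *emission_params[s]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: states tend to persist (0.9) rather than switch (0.1).
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
emission_params = [(0.0, 0.5), (2.0, 0.5)]  # (mean, sd) per state

path = viterbi([0.1, -0.2, 2.1, 1.9, 2.3, 0.0], log_start, log_trans, emission_params)
```

Runs of state 1 in the decoded path correspond to candidate active regions; the "sticky" transition matrix is what discourages single-probe segments.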
Tiling array analyses: results on transcriptional data (ENCODE regions (~30 Mb), training set (~7.5 Mb), ¼ training set (~1.9 Mb, ~0.1M probes))
Source: Du et al. (2006) Bioinformatics, 22, 3016-3024.
Tiling Arrays Probing Intergenic Activity: Classifying Un-annotated Transcription
DART: Classification of Unannotated Transcription
• A large number of novel transcribed regions (TARs / transfrags) have been detected using tiling microarrays.
• Developed DART: Database of Active Regions and Tools
- Developed a classification procedure for these novel TARs/transfrags
- Database for storing & visualizing various sets of TARs/transfrags
- Associated tools for analyzing these sets
[Figure: DART classification flowchart for novel TARs. Steps and resulting sets:]
- Peculiar TARs
- Cross-Hyb TARs
- S2: Assign Novel TARs to Known Genes using Expression Profiles -> Gene Assoc. TARs
- S3A: Cluster into Novel Transcribed Loci using Expression Profiles (EP) -> Novel EP Loci
- S3B: Cluster into Novel Transcribed Loci using Phylogenetic Profiles (PP) -> Novel PP Loci
- Singlet or Ambiguous TARs
Rozowsky et al. Genome Research (in press)
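One way to picture the step that assigns novel TARs to known genes using expression profiles is the sketch below. It is an illustration only, not the DART procedure: the Pearson correlation measure, the distance window, the correlation cutoff, and the dictionary fields (`name`, `chrom`, `start`, `end`, `profile`) are all hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)

def assign_tars_to_genes(tars, genes, max_distance=10_000, min_corr=0.9):
    """Map each TAR name to the best nearby, co-expressed gene name (or None)."""
    assignments = {}
    for tar in tars:
        best_gene, best_corr = None, min_corr
        for gene in genes:
            if gene["chrom"] != tar["chrom"]:
                continue
            # Gap between the two intervals; 0 if they overlap.
            gap = max(gene["start"] - tar["end"], tar["start"] - gene["end"], 0)
            if gap > max_distance:
                continue
            corr = pearson(tar["profile"], gene["profile"])
            if corr >= best_corr:
                best_gene, best_corr = gene["name"], corr
        assignments[tar["name"]] = best_gene
    return assignments

genes = [{"name": "geneA", "chrom": "chr7", "start": 1000, "end": 5000,
          "profile": [1, 2, 3, 4, 5]}]
tars = [
    {"name": "tar1", "chrom": "chr7", "start": 6000, "end": 6200, "profile": [2, 4, 6, 8, 10]},
    {"name": "tar2", "chrom": "chr7", "start": 6000, "end": 6200, "profile": [5, 1, 4, 2, 3]},
    {"name": "tar3", "chrom": "chr7", "start": 50000, "end": 50200, "profile": [2, 4, 6, 8, 10]},
]
result = assign_tars_to_genes(tars, genes)
```

TARs left unassigned here (value `None`) would be the input to the later clustering steps in the flowchart.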
Rozowsky et al. Gen. Res. (2007, in press)
TAR clustering
DART: Database & Tools
- Interfaces with UCSC
- Tools use Ensembl API
Rozowsky et al. Genome Research (in press)
Table 1: Locations of all TARs

                               Exonic       Pseudogenes   Unannotated Regions
Size of ENCODE Regions (bp)    1,776,157    144,745       28,077,158
Percentage of all ENCODE       5.9%         0.5%          93.6%
Number of TARs                 3,666        195           6,988
Percentage of all TARs         33.8%        1.8%          64.4%

Locations of Novel TARs

                                    ESTs not    Intronic    Intronic    Intergenic   Intergenic
                                    in Exons    Proximal    Distal      Proximal     Distal
Size of Unannotated Regions (bp)    2,477,910   8,522,559   5,536,879   2,434,101    9,250,454
Percentage of Unannotated Regions   8.8%        30.2%       19.6%       8.6%         32.8%
Number of Novel TARs                1,194       3,006       864         772          1,300
Percentage of all Novel TARs        16.7%       42.1%       12.1%       10.8%        18.2%

Table 2: Sets of Classified Novel TARs

                                                        Number   Percentage
Total                                                   6,988    100.0%
With peculiar sequence composition                      503      7.2%
Assigned to known genes                                 955      13.7%
Caused by cross-hybridization                           -        -
In novel transcribed loci using expression profiles     681      9.7%
In novel transcribed loci using phylogenetic profiles   782      11.2%
Rozowsky et al. Genome Research (in press)
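The percentages in Table 1 can be re-derived from the raw counts, and the same numbers give a TAR density per megabase, which is not on the slide but follows directly from it. A small sketch (the variable and function names are illustrative):

```python
# Values copied from Table 1 (ENCODE regions).
region_bp = {"exonic": 1_776_157, "pseudogenes": 144_745, "unannotated": 28_077_158}
tar_counts = {"exonic": 3_666, "pseudogenes": 195, "unannotated": 6_988}

total_bp = sum(region_bp.values())
total_tars = sum(tar_counts.values())

def summarize(category):
    """Return (% of ENCODE sequence, % of all TARs, TARs per Mb) for one category."""
    share_of_sequence = 100 * region_bp[category] / total_bp
    share_of_tars = 100 * tar_counts[category] / total_tars
    tars_per_mb = tar_counts[category] / (region_bp[category] / 1e6)
    return round(share_of_sequence, 1), round(share_of_tars, 1), round(tars_per_mb)

for category in region_bp:
    print(category, summarize(category))
```

On these numbers, exonic sequence carries roughly eight times the TAR density of unannotated sequence, even though about two thirds of all TARs fall in unannotated regions.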
DART classification has been experimentally validated with small-scale RT-PCR & sequencing experiments
Results:
18/46 (39%) confirmed by RT-PCR
4/5 Sequenced Products Map uniquely to correct genomic region
▪ First concerted effort to systematically identify TREs in the human genome on a large scale
▫ 105 lists of transcriptional regulatory elements in the ENCODE regions
▫ 29 transcription factors, 9 cell lines, 2 time points
▫ 7 laboratories and 3 different microarray platforms
▪ TFs and their TREs can be studied on various genomic levels.
▪ Hundreds of TRE ‘islands’ and ‘deserts’ are identified in ENCODE regions.
▪ The longest island is composed of 68 different TREs and covers a 35-kb region near the HOXA cluster on chromosome 7.
▪ The entirety of EHD1 on chromosome 11 is covered by TRE islands.
▪ Some of the islands are located in intergenic regions of the genome.
dart.gersteinlab.org/encode/tr/
Zhang et al. (2007) Gen. Res.
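To make the notion of TRE 'islands' and 'deserts' concrete, here is a simple fixed-gap sketch. It is only a heuristic stand-in: the published analysis may define islands and deserts statistically, and the function names, the gap size, the minimum TRE count, and the minimum desert length are all assumptions.

```python
def find_islands(tres, max_gap=1_000, min_tres=3):
    """Group (start, end) TREs on one chromosome into islands.

    Consecutive TREs closer than max_gap are merged; groups with fewer than
    min_tres members are discarded. Returns (start, end, n_tres) tuples.
    """
    islands = []
    current = None
    for start, end in sorted(tres):
        if current and start - current[1] <= max_gap:
            current[1] = max(current[1], end)
            current[2] += 1
        else:
            if current and current[2] >= min_tres:
                islands.append(tuple(current))
            current = [start, end, 1]
    if current and current[2] >= min_tres:
        islands.append(tuple(current))
    return islands

def find_deserts(tres, region_start, region_end, min_length=10_000):
    """Return TRE-free stretches of at least min_length within a region."""
    deserts = []
    cursor = region_start
    for start, end in sorted(tres):
        if start - cursor >= min_length:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if region_end - cursor >= min_length:
        deserts.append((cursor, region_end))
    return deserts

tres = [(100, 200), (500, 600), (900, 1000), (50_000, 50_100)]
islands = find_islands(tres)
deserts = find_deserts(tres, 0, 60_000)
```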
Tiling Arrays Probing Intergenic Activity: Connecting Intergenic Activity to Pseudogenes
5 Methods of Assignment
• 4 automatic pipelines: retroFinder, pseudoFinder, PseudoPipe, GIS
- Comparing protein or transcript v. genomic DNA, filtering, application of rules
• HAVANA manual
• What is a pseudogene?
- Different criteria
• Conservative approach here
- Can't overlap gene annotation
- Need to have a protein alignment
- 201 pseudogenes (in comparison to ~400 genes)
Zheng et al. (2007) Gen. Res.
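The two conservative rules above (no overlap with gene annotation, a protein alignment is required) amount to a simple filter. A minimal sketch, assuming half-open coordinates and hypothetical record fields (`name`, `chrom`, `start`, `end`, `has_protein_alignment`); it does not reproduce any of the five pipelines:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def conservative_pseudogenes(candidates, gene_annotations):
    """Keep candidates that have a protein alignment and overlap no annotated gene."""
    kept = []
    for cand in candidates:
        if not cand["has_protein_alignment"]:
            continue
        clashes = any(
            g["chrom"] == cand["chrom"]
            and overlaps(cand["start"], cand["end"], g["start"], g["end"])
            for g in gene_annotations
        )
        if not clashes:
            kept.append(cand["name"])
    return kept

genes = [{"chrom": "chr1", "start": 1000, "end": 2000}]
candidates = [
    {"name": "pg1", "chrom": "chr1", "start": 3000, "end": 3500, "has_protein_alignment": True},
    {"name": "pg2", "chrom": "chr1", "start": 1500, "end": 2500, "has_protein_alignment": True},
    {"name": "pg3", "chrom": "chr1", "start": 4000, "end": 4500, "has_protein_alignment": False},
    {"name": "pg4", "chrom": "chr2", "start": 1500, "end": 2500, "has_protein_alignment": True},
]
kept = conservative_pseudogenes(candidates, genes)
```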
Overlap of Pseudogenes by 5 Different Methods
Union of 252
Zheng et al. (2007) Gen. Res.
Ex. Pseudogene Intersecting Transcriptional Evidence
Special tracks in browser: diTAG, CAGE, TARs, ChIP-chip
Zheng et al. (2007) Gen. Res.
Intersection of Pseudogenes with Transcriptional Evidence
Zheng et al. (2007) Gen. Res.
Intersection of Pseudogenes with Transcriptional Evidence
Zheng et al. (2007) Gen. Res.
Intersection of Pseudogenes with Transcriptional Evidence
Excluding TARs (due to cross-hyb issues)
Targeted RACE expts on 160 pseudogenes give evidence for 14
Total evidence from sequencing is 38 of 201 (with 5 having cryptic promoters)
Zheng et al. (2007) Gen. Res.
Targeted Transcription Expts.
• RACE expts
- Interrogated 160 pseudogenes (49 non-processed & 111 processed)
- In 51 cases (26 non-processed and 25 processed pseudogenes), could design distinguishing primers (>4 mismatched bp v. parent)
- The resulting data supported transcription from 14 (8 processed and 6 non-processed) of the 160 pseudogenes (9 with pseudogene-specific primers)
- These numbers might represent a conservative estimate, since a RACEfrag was assigned to its parent gene by default if it could be mapped to both a parent locus and a pseudogene locus.
• RACE expts + sequencing (CAGE, PET, EST and mRNA)
- Unambiguous evidence for pseudogene transcription
- All together, these data indicate 38 of 201 pseudogenes being the source of novel RNA transcripts
- 5 of these had cryptic promoters (from TR analysis)
Zheng et al. (2007) Gen. Res.
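The ">4 mismatched bp v. parent" criterion for distinguishing primers can be checked with a few lines of code. This is a sketch under simplifying assumptions: the primer site in the pseudogene and the corresponding site in the parent gene are taken to be already aligned without gaps and of equal length, and the function names and example sequences are made up for illustration.

```python
def mismatches(site_a, site_b):
    """Count mismatched positions between two ungapped, equal-length sequences."""
    if len(site_a) != len(site_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(site_a.upper(), site_b.upper()) if x != y)

def is_distinguishing(pseudogene_site, parent_site, min_mismatches=5):
    """True if the primer site differs from the parent at more than 4 positions."""
    return mismatches(pseudogene_site, parent_site) >= min_mismatches

pseudogene_site = "ACGTACGTACGTACGTACGT"
parent_five_diffs = "TCGTTCGTACCTACGAACGA"   # differs at 5 positions
parent_four_diffs = "TCGTTCGTACCTACGAACGT"   # differs at 4 positions

ok = is_distinguishing(pseudogene_site, parent_five_diffs)
not_ok = is_distinguishing(pseudogene_site, parent_four_diffs)
```

A pseudogene that is too similar to its parent fails this check, which is consistent with distinguishing primers being available for only 51 of the 160 interrogated pseudogenes.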
Extension to Whole Genome
• 233 Transcribed from ~8000 Processed Pseudogenes
• Evidence for Transcription