ENCODE: Understanding the Genome
Michael Snyder
November 6, 2012
Conflicts: Personalis, Genapsys, Illumina
Slides From Ewan Birney, Marc Schaub, Alan Boyle
Encyclopedia of DNA Elements (ENCODE)
• NHGRI-funded consortium
• Goal: delineate all functional elements in the human genome
• Wide array of experimental assays
• Three phases: 1) Pilot 2) Scale Up 1.0 3) Scale Up 2.0
The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature 2012
Project website: http://encodeproject.org
The ENCODE Consortium
Brad Bernstein (Eric Lander, Manolis Kellis, Tony Kouzarides)
Ewan Birney (Jim Kent, Mark Gerstein, Bill Noble, Peter Bickel, Ross Hardison, Zhiping Weng)
Greg Crawford (Ewan Birney, Jason Lieb, Terry Furey, Vishy Iyer)
Jim Kent (David Haussler, Kate Rosenbloom)
John Stamatoyannopoulos (Evan Eichler, George Stamatoyannopoulos, Job Dekker, Maynard Olson, Michael Dorschner, Patrick Navas, Phil Green)
Mike Snyder (Kevin Struhl, Mark Gerstein, Peggy Farnham, Sherman Weissman)
Rick Myers (Barbara Wold)
Scott Tenenbaum (Luiz Penalva)
Tim Hubbard (Alexandre Reymond, Alfonso Valencia, David Haussler, Ewan Birney, Jim Kent, Manolis Kellis, Mark Gerstein, Michael Brent, Roderic Guigo)
Tom Gingeras (Alexandre Reymond, David Spector, Greg Hannon, Michael Brent, Roderic Guigo, Stylianos Antonarakis, Yijun Ruan, Yoshihide Hayashizaki)
Zhiping Weng (Nathan Trinklein, Rick Myers)
Additional ENCODE Participants: Elliott Margulies, Eric Green, Job Dekker, Laura Elnitski, Len Pennacchio, Jochen Wittbrodt
... and many senior scientists, postdocs, students, technicians, computer scientists, statisticians and administrators in these groups
NHGRI: Elise Feingold, Mike Pazin, Peter Good
Experimental Assays
• ChIP-seq (165 TFs + histone marks)
• RNA-seq (292)
• DNase-seq (~200)
RNA-Sequencing
Wang et al. 2009, Nat. Rev. Genet.
Functional data: ChIP-seq
• Immunoprecipitation of the transcription factor with an antibody
• Sequence and align
• Peak: 300-500 bp
• Motif: 8-12 bp
• ChIP-exo; histone marks
Functional data: DNase-seq
• Diagram: DNaseI, histones, transcription factor, region of open chromatin
• Sequence and align
• Result: DNaseI hypersensitivity peak
Functional data: DNase footprints
• Diagram: DNaseI, histones, transcription factor, region of open chromatin
• Sequence and align
• Result: DNaseI footprint
Examples of Signal Tracks
[Figure, panels a-e; recoverable labels listed below]
• Legend: phenotype-associated SNPs; random sampling of matched SNPs; genotyped SNPs; 1000 Genomes; 24 personal genomes
• Axis: fraction of SNPs that overlap feature (0 to 0.3); features: DNaseI peaks, TF
• Axis: log2 (fold enrichment or depletion), −0.5 to 1.5; segment labels: E, WE, TSS, PF, CTCF, T, D/R
• Browser view: Human Feb. 2009 (GRCh37/hg19) chr5:39,274,501-40,819,500 (1,545,000 bp); genes: C9, DAB2, BC026261, PTGER4, TTC33, OSRF, PRKAA1; tracks: GWAS Catalog (Crohn's disease, ulcerative colitis, multiple sclerosis), DNase I, TFs
• Zoomed view: chr5:40,390,001-40,440,000 (50,000 bp); SNPs: rs4613763, rs17234657, rs11742570, rs6451493, rs1992660, rs6896969, rs1373692, rs9292777; tracks: HUVEC GATA2, HUVEC cFOS, HUVEC Input, HUVEC, Jurkat, Th1, Th2
• GM12878 genes above threshold; GO:0006955 immune response; GWAS enrichment; -log10 p-value
Phenotype SNP-Phenotype_associations overlap_any_TF_occupancy Gm12878Ebf Gm12878Pol24h8 Gm12878Pol2 Helas3CebpbIggrab K562Ctcf Gm12878Pu1 Gm12878Mef2a K562Pol2V0416101 Gm12878Pax5c20 Gm12878NfkbIggrab Hepg2Ctcf HuvecGata2Ucd Gm12878Elf1sc631V0416101 Gm12878Egr1V0416101 Hepg2Mafkab50322Iggrab HuvecCfosUcd Gm12878Bcl11a Gm12878Irf4 Gm12878Batf Gm12878Tcf12 Hepg2Fosl2 K562Tal1sc12984Iggmus K562MaxV0416102 Hepg2Tcf4Ucd Hepg2Foxa2 Hepg2P300V0416101 Hepg2Foxa1c20 HuvecCtcf Helas3Ctcf Hepg2Jund CACO2.DS8235 HUVEC.all HepG2.all Jurkat.DS12659 hTH1.all hTH2.DS7842 CD34.DS16814
TOTAL 4860 600 78 57 69 69 72 47 47 71 54 35 54 29 44 28 48 50 38 35 45 37 37 44 62 33 57 46 62 40 55 47 70 85 118 62 192 57 81
Height 204 34 7 3 3 7 6 1 3 2 3 2 6 0 4 6 3 2 3 5 5 2 0 2 3 1 2 0 2 5 4 3 3 6 5 4 9 3 7
Systemic_lupus_erythematosus 62 10 4 6 6 2 1 1 4 0 1 4 1 1 4 2 0 1 2 3 4 2 1 0 1 0 0 0 0 1 1 1 1 2 0 0 4 2 1
Crohn's_disease 105 20 2 2 2 2 1 2 2 0 2 1 2 5 1 1 1 3 2 1 1 0 2 1 1 2 1 2 3 2 3 1 3 6 5 3 9 5 5
Ulcerative_colitis 85 11 2 3 3 0 1 2 3 1 3 3 1 2 3 2 1 1 2 1 2 2 0 2 2 1 0 2 2 0 1 1 3 2 5 3 7 2 3
Multiple_sclerosis 71 15 4 3 3 1 0 3 4 2 4 2 0 2 2 1 0 2 4 3 2 3 0 3 1 0 0 0 0 0 0 0 0 1 1 3 5 4 3
Rheumatoid_arthritis 57 11 4 2 2 1 0 4 3 0 4 4 0 0 1 1 0 0 1 0 2 2 0 1 0 0 0 0 0 0 0 0 0 2 2 1 11 3 1
LDL_cholesterol 45 8 0 0 0 2 2 1 0 4 1 0 1 0 1 0 1 0 0 0 0 0 3 0 3 2 3 2 2 1 1 3 1 0 2 1 0 1 0
Bone_mineral_density 65 9 1 1 1 1 2 2 2 1 2 1 1 0 2 2 2 0 1 2 1 1 0 0 1 0 2 2 3 1 1 1 2 2 4 3 3 2 3
Coronary_heart_disease 107 17 2 0 0 2 4 0 0 4 1 2 0 2 0 0 1 1 1 0 0 1 1 1 3 1 2 2 2 1 1 1 3 2 3 0 6 0 1
Chronic_lymphocytic_leukemia 17 8 1 4 5 0 0 3 1 0 2 1 0 0 2 0 1 0 2 1 1 2 0 1 0 1 0 0 0 0 0 0 1 0 0 0 2 0 1
Prostate_cancer 56 8 0 0 0 0 0 0 0 1 0 0 2 1 0 0 3 2 0 0 0 0 2 1 1 4 3 3 3 0 0 2 2 3 1 1 2 0 1
Triglycerides 48 10 0 0 0 1 2 0 0 2 1 0 2 1 1 0 2 2 0 0 0 0 3 1 2 1 2 2 1 2 2 3 0 2 1 0 2 1 0
Celiac_disease 54 11 4 3 3 0 2 2 1 1 1 2 0 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 2 0 1 2 0 0 0 2 2 1 2
Colorectal_cancer 18 5 0 0 0 1 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 2 3 3 3 0 0 2 2 0 1 0 2 0 1
Hematological_parameters 85 12 3 0 0 3 1 0 3 0 1 1 2 0 0 0 2 0 1 1 2 2 1 1 0 0 1 1 1 0 0 0 1 1 3 3 6 3 5
HIV-1_control 55 10 0 2 4 1 2 0 1 2 2 0 0 0 1 0 0 0 1 1 1 1 1 1 2 1 1 2 1 0 0 1 0 0 0 0 2 1 0
Protein_quantitative_trait_loci 48 7 2 2 2 0 0 2 1 1 1 0 0 1 2 2 1 0 2 1 2 1 0 0 1 0 0 0 0 0 0 1 1 1 2 1 2 1 1
Alzheimer's_disease 42 5 0 0 0 1 2 0 0 1 0 0 2 0 0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 2 2 1 1 2 0 0 2 0 1
HDL_cholesterol 55 8 1 0 0 1 1 0 0 2 1 0 1 0 0 0 1 2 0 0 0 1 2 1 1 1 2 2 2 1 1 2 0 1 2 0 3 1 0
Cholesterol 16 6 1 0 0 0 2 0 0 2 2 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 2 1 1 0 0 2 0 1 1 0
Longevity 30 5 0 2 3 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 1 2 0 2 1 0 0 1
Attention_deficit_hyperactivity_disorder 102 9 0 0 0 1 2 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 1 0 0
Cognitive_performance 111 8 0 0 2 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 1 3 0 0 0 0
Type_2_diabetes 97 13 0 0 0 1 1 2 1 1 1 0 1 1 0 0 2 1 1 1 1 0 0 3 1 0 0 0 0 2 1 0 1 0 2 0 4 0 0
Conduct_disorder 38 5 0 1 1 1 3 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 0 1 2 2 1 0 1
Type_1_diabetes 67 7 2 1 1 0 0 2 1 0 1 1 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 2 2 0 1 0 1 0 2 1 5 1 1
Dialysis-related_mortality 26 6 1 0 0 1 1 1 0 0 0 0 2 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 2 1 0 0 1
Bipolar_disorder 110 6 1 0 0 2 1 0 0 1 0 0 0 1 0 0 0 3 0 0 0 0 1 0 1 0 0 0 0 0 0 1 2 3 1 0 2 0 1
Body_mass 98 5 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 1 0 0
C-reactive_protein 34 7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 0 2 1 0 0 1
Menarche_and_menopause 62 6 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 1 0 0 2 1 1 0
Breast_cancer 43 6 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 2 2 0 0
Mean_platelet_volume 15 5 1 0 0 0 0 1 0 0 1 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 2 0 1
Soluble_levels_of_adhesion_molecules 5 5 1 0 1 1 0 1 0 2 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Psoriasis 38 6 1 1 2 0 0 2 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 4 0 0
Parkinson's_disease 46 5 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0
Obesity 36 5 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fasting_glucose-related_traits 17 5 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 1 0 0
Ross Hardison, Belinda Giardine
ENCODE Dimensions
Axes: Methods/Factors vs. Cells
200 Cell Lines/Tissues
200 Assays (~165 ChIP-Seq of different TFs)
3,010 Experiments
~10 TeraBases
~3000x of the Human Genome
Mouse:
126 TF ChIP
70 RNA-Seq
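The "~3000x" figure is total sequenced bases divided by genome length. A quick check of that arithmetic, assuming a haploid human genome of roughly 3.1 gigabases (the genome size is an assumption, not a number from the slide):

```python
# Rough fold-coverage check for the numbers on this slide.
TOTAL_BASES = 10e12   # ~10 terabases sequenced (from the slide)
GENOME_SIZE = 3.1e9   # ASSUMED haploid human genome length in bases


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return how many times the sequenced bases would tile the genome."""
    return total_bases / genome_size


coverage = fold_coverage(TOTAL_BASES, GENOME_SIZE)
print(f"~{coverage:.0f}x of the genome")
```

With these inputs the result lands a little above 3000x, consistent with the rounded value on the slide.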
ENCODE Uniform Analysis Pipeline
Mapped reads from production (Bam)
Uniform Peak Calling Pipeline (SPP, PeakSeq)
IDR Processing, QC and Blacklist Filtering
Motif Discovery Stats, GSC enrichments, etc.
Signal Aggregation over peaks
Signal Generation (read extension and mappability correction)
Segmentation
[Plots: Rep1 vs. Rep2, showing poor reproducibility and good reproducibility]
Self Organising Maps
ChromHMM/Segway
Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour
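As an illustration of the blacklist-filtering step named above, here is a minimal sketch. It assumes peaks and blacklisted regions are half-open (chromosome, start, end) tuples. The function names and coordinates are hypothetical and are not taken from the ENCODE pipeline code.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_blacklisted(peaks: List[Interval], blacklist: List[Interval]) -> List[Interval]:
    """Keep only peaks that do not overlap any blacklisted region."""
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]


# Hypothetical toy coordinates, for illustration only.
peaks = [("chr1", 100, 400), ("chr1", 1000, 1300), ("chr2", 50, 350)]
blacklist = [("chr1", 350, 500)]
kept = filter_blacklisted(peaks, blacklist)
print(kept)
```

The quadratic scan is fine for a toy example; a production pipeline would more likely use sorted intervals or an interval tree.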
Raw genome coverage of elements
Element Type                 Coverage   Cumulative Coverage
Exons                        3%         3%
ChIP-seq bound motifs        4.5%       5%
DNaseI footprints            5.7%       9%
ChIP-seq bound regions       8.1%       12%
DNaseI HS regions            15.2%      19.4%
Histone modifications (*)    44%        49%
RNA                          62%        80%
(* excluding broad marks)
Legend: Region; Bound Motif/Footprint
(Union over all experiments and cell types)
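The cumulative column grows more slowly than the sum of the individual coverages because element classes overlap, so coverage has to be computed on the union of intervals. A minimal sketch of that computation on a hypothetical toy genome follows. The interval coordinates and genome length are invented for illustration.

```python
from typing import List, Tuple


def union_length(intervals: List[Tuple[int, int]]) -> int:
    """Number of bases covered by at least one half-open [start, end) interval."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running block: bank the previous block first.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


GENOME = 1000                      # hypothetical toy genome length
exons = [(0, 30)]                  # hypothetical element class 1
motifs = [(20, 45), (100, 120)]    # hypothetical element class 2

exon_cov = union_length(exons) / GENOME
motif_cov = union_length(motifs) / GENOME
cumulative_cov = union_length(exons + motifs) / GENOME
print(exon_cov, motif_cov, cumulative_cov)
```

In the toy data the two classes cover 3% and 4.5% individually but only 6.5% together, since 10 bases are shared.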
ENCODE Integrative Segmentations
~7 Major genome segments
25 “elaborations”
1,000s of details
Well known: TSS, gene start, gene bodies
New info: “Enhancers” (2 states), insulators
Unexpected: specific gene end
Experimental Confirmation of New Enhancers
Mann-Whitney p = 0.003, HMM vs. background; p = 1e-7, HMM vs. naïve or biologist picks (Myers Lab)
53% hit rate in mouse assay (Pennacchio Lab)
Jason Gertz, Barbara Wold, Rick Myers, Len Pennacchio
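For readers unfamiliar with the Mann-Whitney comparison quoted on this slide, here is a minimal standard-library sketch of the test. It uses the normal approximation without tie or continuity correction, so the p-value is approximate. The activity values are hypothetical and are not data from the Myers Lab experiments.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, approximate two-sided p-value).

    Ties receive average ranks. The p-value uses the normal approximation
    with no tie or continuity correction.
    """
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    pos = 0
    while pos < len(pooled):
        end = pos
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(pos, end + 1):
            ranks[pooled[k][1]] = avg_rank
        pos = end + 1
    n1, n2 = len(x), len(y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, p


# Hypothetical reporter activities: predicted enhancers vs. background regions.
predicted = [5.1, 6.3, 7.8, 8.4]
background = [1.2, 2.5, 3.3, 4.0]
u_stat, p_value = mann_whitney_u(predicted, background)
print(u_stat, p_value)
```

For real analyses an established statistics library with exact small-sample p-values would be the safer choice.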
Many other stories…
• TF co-association and regulatory code (Mike Snyder + Mark Gerstein)
• Splicing/histone interaction (Roderic Guigo)
• RNA landscape (Tom Gingeras)
• DNaseI footprints (John Stamatoyannopoulos)
• DNA methylation (Rick Myers)
Saturation
[Plot: number of elements (0 to 1,200,000) vs. number of cell lines (0 to 60)]
Most aggressive fit for saturation suggests a maximum of 50% of elements discovered
Likely to be lower due to inaccessible cell types, etc.
Steve Wilder
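The slide does not say how the saturation curve was fitted. As a sketch of only the raw curve behind such a plot, the snippet below tabulates the cumulative number of distinct elements as cell lines are added one at a time. The cell-line names and element identifiers are hypothetical.

```python
from typing import Dict, List, Set


def discovery_curve(elements_by_cell_line: Dict[str, Set[str]], order: List[str]) -> List[int]:
    """Cumulative count of distinct elements after adding each cell line in order."""
    seen: Set[str] = set()
    curve = []
    for cell_line in order:
        seen |= elements_by_cell_line[cell_line]
        curve.append(len(seen))
    return curve


# Hypothetical toy data: each cell line contributes a set of element IDs.
toy = {
    "lineA": {"e1", "e2", "e3"},
    "lineB": {"e2", "e3", "e4"},
    "lineC": {"e3", "e4"},
}
curve = discovery_curve(toy, ["lineA", "lineB", "lineC"])
print(curve)
```

A flattening curve means new cell lines mostly rediscover known elements. Estimating the plateau would need a fitted model, which is outside this sketch.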
Discovering functional genome segments
Well understood:
TSS, Gene Start,
Gene Bodies
Reassuringly Interesting
“Enhancers” (2 states)
Insulators
Definitely There, Unexpected
Specific Gene End
Sub-classification of Repeats
~7 Major segments of the genome
25 “elaborations”
1,000s of details
Michael Hoffman, Jason Ernst, Bill Noble, Manolis Kellis
Irreproducible Discovery Rate (IDR)
If one re-ran the experiment, what is the probability one would observe the same element at this rank or better?
Uses ranked element lists from two replicates, and assumes that there is noise at the bottom of the ranking
ChIP-seq, DNase-seq, RNA-seq
Ben Brown, Qunhua Li, Peter Bickel
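The snippet below is not the IDR model itself. It is a much simpler top-n overlap diagnostic that shows the intuition of comparing two ranked replicate lists, where agreement is expected near the top and noise near the bottom. The element identifiers are hypothetical.

```python
from typing import List


def top_n_overlap(rep1_ranked: List[str], rep2_ranked: List[str]) -> List[float]:
    """For each n, the fraction of elements shared by the top n of both replicates."""
    limit = min(len(rep1_ranked), len(rep2_ranked))
    return [
        len(set(rep1_ranked[:n]) & set(rep2_ranked[:n])) / n
        for n in range(1, limit + 1)
    ]


# Hypothetical ranked lists (best first): p* are consistent, n* are noise.
rep1 = ["p1", "p2", "p3", "n1", "n2"]
rep2 = ["p1", "p3", "p2", "n3", "n4"]
overlap = top_n_overlap(rep1, rep2)
print(overlap)
```

In the toy lists the overlap is complete at n = 3 and then falls as replicate-specific elements enter, which is the pattern a reproducibility threshold is meant to detect.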