1000 Genomes Project: Datasets Gabor Marth Boston College Biology Department 1000 Genomes Project Tutorial ASHG 2010, Washington, DC November 3, 2010
Page 1:

1000 Genomes Project: Datasets

Gabor Marth
Boston College Biology Department

1000 Genomes Project Tutorial
ASHG 2010, Washington, DC
November 3, 2010

Page 2:

3 pilot coverage strategies

Page 3:

Samples

Population  YRI  LWK  CHB  CHD  JPT  CEU  TSI  All
Samples     112  108  109  107  105   90   66  697

Page 4:

Pilot datasets

Pilot         Populations  Samples  Coverage
Trios         2            6        20-40x
Low coverage  4            179      2-4x
Exon pilot    7            697      20-50x

Page 5:

Data processing / variant calling

(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling

[Figure labels: REF, IND]

Page 6:

>80% of the genome accessible with short reads

[Figure labels: >10% non-unique mapping; depth too high; no coverage]

M & D

Page 7:

PCR-duplicate reads

Page 8:

Locally misaligned bases

Page 9:

Un-calibrated base quality values

Page 10:

SNP calling

Page 11:

SNP calling (continued)

Reads observed at one site, per sample (B = bases in the pileup):

  Sample 1: B1 = aacc
  Sample i: Bi = aaaac
  Sample n: Bn = cccc

“genotype likelihoods”

  P(B1=aacc|G1=aa)    P(B1=aacc|G1=cc)    P(B1=aacc|G1=ac)
  P(Bi=aaaac|Gi=aa)   P(Bi=aaaac|Gi=cc)   P(Bi=aaaac|Gi=ac)
  P(Bn=cccc|Gn=aa)    P(Bn=cccc|Gn=cc)    P(Bn=cccc|Gn=ac)

Prior(G1,..,Gi,..,Gn)

Posterior genotype probabilities, conditioned on the reads from all samples:

  P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
  P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
  P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)

“SNP call”

“genotype call”
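The calculation on this slide can be sketched in Python for a single sample. This is a minimal illustration, not the project's caller. It assumes a biallelic site (alleles a and c), independent reads, and one uniform per-base error rate. It also applies a Hardy-Weinberg prior to each sample separately, which is a simplification of the joint Prior(G1,..,Gi,..,Gn) on the slide. The names `genotype_likelihoods`, `genotype_posteriors`, `err`, and `alt_freq` are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")

def genotype_likelihoods(bases, err=0.01):
    """Return P(B | G) for each genotype, assuming independent reads
    and a single per-base error rate (an assumption, not from the slides)."""
    def p_base(base, genotype):
        # Each read samples one of the two alleles with probability 1/2,
        # then is read correctly with probability 1 - err.
        return sum(0.5 * ((1 - err) if base == allele else err)
                   for allele in genotype)
    return {g: prod(p_base(b, g) for b in bases) for g in GENOTYPES}

def genotype_posteriors(bases, alt_freq=0.5, err=0.01):
    """Return P(G | B) using a per-sample Hardy-Weinberg prior
    (a simplification of the joint multi-sample prior)."""
    q = alt_freq
    prior = {"aa": (1 - q) ** 2, "ac": 2 * q * (1 - q), "cc": q ** 2}
    lik = genotype_likelihoods(bases, err=err)
    joint = {g: lik[g] * prior[g] for g in GENOTYPES}
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}

# The three pileups from the slide: the "genotype call" is the
# genotype with the highest posterior probability.
for name, reads in [("B1", "aacc"), ("Bi", "aaaac"), ("Bn", "cccc")]:
    post = genotype_posteriors(reads)
    print(name, max(post, key=post.get))
```

With these toy settings B1 and Bi are called ac and Bn is called cc. A multi-sample caller would instead estimate the allele frequency from all samples together, which is what lets the reads from other samples inform each individual's genotype.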

Page 12:

Data processing / variant calling pipeline

Page 13:

Validation by typing a random sample of novel variants

Overall FDR < 5% (10% for large SVs)

SNP calls from the 3 pilot datasets

                           Trios                      Low coverage  Exon pilot
Samples                    6                          179           697
Raw data                   1.08 Tb                    2.22 Tb       1.43 Tb
SNPs found                 4.03M (CEU), 5.01M (YRI)   14.5M         12,761
% novel                    15% (CEU), 29% (YRI)       55%           70%
Short indels               0.68M                      1.12M         -
Deletions                  ~10,000                    15,765        -
SV breakpoints             6,169                      9,092         -
Mobile element insertions  2,528                      4,774         -

Page 14:

Imputation helps genotype calls

Page 15:

Power (sensitivity)

Page 16:

Novel variants

Page 17:

Variants per sample genome

• 3-4,000,000 variants
• 10-11,000 nonsynonymous changes
• 220-250 in-frame indels
• 80-100 premature stop codons
• 40-50 splice site disruptions
• 50-100 HGMD “recessive disease causing” mutations

Page 18:

Exon Pilot: high sensitivity for rare variants

Page 19:

Exon Pilot: most sites low-frequency and novel

Page 20:

1000G data also supports structural variants

Page 21:

Opportunity: different variants from the same data

Page 22:

Data types delivered

Reads: FASTQ

Alignments: SAM/BAM

Variants: VCF
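As an illustration of the variant format listed above, here is a minimal Python sketch that splits one VCF data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_line` and the sample record are illustrative assumptions. For real work, an established tool such as VCFtools is the safer choice, since this sketch ignores header lines and per-sample genotype columns.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line.

    Header lines (starting with '#') and per-sample genotype
    columns are deliberately not handled in this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    info_dict = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        if "=" in item:
            key, value = item.split("=", 1)
            info_dict[key] = value      # key=value entry, kept as a string
        else:
            info_dict[item] = True      # flag entry with no value

    return {
        "chrom": chrom,
        "pos": int(pos),                # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),          # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }

# Illustrative record, not taken from the project's call sets.
record = parse_vcf_line(
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
)
print(record["chrom"], record["pos"], record["ref"], record["alt"])
```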

Page 23:

Tools for analyzing / manipulating 1000G data

• samtools: http://samtools.sourceforge.net/
• BamTools: http://sourceforge.net/projects/bamtools/
• GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• VCFTools: http://vcftools.sourceforge.net/

Alignments: SAM/BAM

Variants: VCF

Page 24:

Alignment visualization

IGV viewer, GAMBIT viewer

Page 25:

Current status based on 629 samples

         # SNPs                              FN metrics
Samples  Known      Novel       Total       dbSNP missed  HM
629      7,922,125  17,564,935  25,487,060  31.08%        1.21%

• As of 11/02/2010
• Calls present in at least 2 of Broad Institute, University of Michigan, NCBI, and Boston College call sets
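The "present in at least 2 of 4 call sets" rule can be sketched in Python as a simple support count. This is an illustration under stated assumptions, not the project's merging procedure. Each call is represented here as a (chromosome, position, ref, alt) tuple, and the function name `consensus_calls`, the labels `set_A` to `set_D`, and the toy sites are all hypothetical.

```python
from collections import Counter

def consensus_calls(call_sets, min_support=2):
    """Return the sites reported by at least `min_support` call sets.

    `call_sets` maps a call-set label to an iterable of site tuples.
    """
    support = Counter()
    for calls in call_sets.values():
        # set() ensures a call set cannot vote twice for the same site.
        support.update(set(calls))
    return {site for site, n in support.items() if n >= min_support}

# Toy data: four call sets with partially overlapping sites.
toy_sets = {
    "set_A": [("1", 1000, "A", "G"), ("1", 2000, "C", "T")],
    "set_B": [("1", 1000, "A", "G"), ("2", 500, "G", "A")],
    "set_C": [("1", 2000, "C", "T"), ("1", 1000, "A", "G")],
    "set_D": [("3", 42, "T", "C")],
}

for site in sorted(consensus_calls(toy_sets)):
    print(site)
```

Raising `min_support` trades sensitivity for specificity: requiring more call sets to agree drops more false positives but also more true variants found by only one or two callers.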

Page 26:

The full 1000 Genomes Project data

Populations (figure labels): CEU, JPT, CHB, YRI, LWK, MXL, ASW, GBR, FIN, TSI, CHS, CLM, PUR, IBS, CDX, KHV, GWD, ACB, AJM, PEL, PJL, MAB, ADH, KAK, RDH, MRM

1,100 samples early 2011; 2,500 samples 2011/12

Page 27:

Complementary strategies

• Low-coverage WGS (~4x per sample): a near-complete catalog of SNPs with AF > 1% across the genome
• Deep-coverage exomes: rare variants, i.e. AF < 1%, in genes
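A rough calculation shows why ~4x coverage spread over many samples can still catalog variants with AF > 1%. The sketch below is a back-of-the-envelope model, not the project's power analysis. It assumes the number of reads carrying the variant allele, summed over all samples, follows a Poisson distribution with mean samples × AF × coverage, and that a site is detected once at least `min_alt_reads` such reads are seen. The function name, the Poisson model, and the threshold of 2 reads are all assumptions.

```python
from math import exp

def detection_probability(n_samples, allele_freq, coverage, min_alt_reads=2):
    """Approximate P(at least `min_alt_reads` variant-carrying reads
    across all samples) under a Poisson model.

    Expected variant reads: 2 * n_samples chromosomes, each carrying
    the variant with probability allele_freq and covered by
    coverage / 2 reads, giving n_samples * allele_freq * coverage.
    """
    lam = n_samples * allele_freq * coverage
    term = exp(-lam)          # Poisson P(X = 0)
    below = 0.0
    for k in range(min_alt_reads):
        below += term         # accumulate P(X = k)
        term *= lam / (k + 1) # advance to P(X = k + 1)
    return 1.0 - below

# 179 samples is the low-coverage pilot size quoted earlier in the talk.
for af in (0.05, 0.01, 0.001):
    print(af, round(detection_probability(179, af, 4), 3))
```

Under this model, detection is close to certain at AF = 1% with 179 samples at 4x and falls off sharply for much rarer alleles, which is the gap the deep-coverage exon data is meant to cover. The model ignores sequencing errors and the clustering of variant reads within carriers, so it overstates power somewhat.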
