Mining GWAS Catalog & 1000 Genomes Dataset Segun Fatumo

Apr 02, 2018

Page 1: Mining GWAS Catalog & 1000 Genomes Dataset - … · Mining GWAS Catalog & 1000 Genomes Dataset Segun Fatumo . ... Figure S2. 1000 Genomes Proj ect Phase I popul at i ons ... EAST

Mining GWAS Catalog &

1000 Genomes Dataset

Segun Fatumo


NHGRI GWA Catalog

www.genome.gov/GWAStudies

What is GWAS Catalog


Citation

• How to cite the NHGRI GWAS Catalog: Hindorff LA, MacArthur J (European Bioinformatics Institute), Morales J (European Bioinformatics Institute), Junkins HA, Hall PN, Klemm AK, and Manolio TA. A Catalog of Published Genome-Wide Association Studies. Available at: www.genome.gov/gwastudies. Accessed [date of access].

• How to cite the NHGRI GWAS Catalog paper:

Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 2014, Vol. 42 (Database issue): D1001-D1006.


Class Exercise

• Download the NHGRI GWAS Catalog
– In tab-delimited format to your Linux system
– How many columns does your downloaded GWAS Catalog file have?
– What is the header of column 28?
– Using a spreadsheet (MS Excel)

• Extract all SNPs that have been found to be associated with
– HDL
– Cholesterol
– BMI
– Your trait of interest

Print only columns 2, 8, 30, 24, 22 and 28, in this order. Keep only associations that reach genome-wide significance (p-value ≤ 5 × 10^-8).

• Count the number of SNPs found for each trait. Report only the unique SNPs.

• Hint: Use your knowledge of SQL, Bash scripting or Python. If you can't find your way around this, we will solve it together.
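One possible way to approach the exercise in Python is sketched below. It is not a reference solution. The column positions come from the exercise itself, but which field each position holds (trait in column 8, SNP IDs in column 22, numeric p-value in column 28) is an assumption that should be checked against the header row of the file you downloaded.

```python
import csv
from collections import defaultdict

# 1-based column positions taken from the exercise. Which field each one
# holds is an ASSUMPTION; verify it against the header row of your file.
TRAIT_COL = 8      # assumed: disease/trait description
SNP_COL = 22       # assumed: SNP rs IDs
PVALUE_COL = 28    # assumed: numeric p-value
OUTPUT_COLS = [2, 8, 30, 24, 22, 28]
THRESHOLD = 5e-8


def select_significant(lines, keywords):
    """Return (header, rows); rows hold OUTPUT_COLS of the matching records."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    wanted = [k.lower() for k in keywords]
    rows = []
    for fields in reader:
        if len(fields) < max(OUTPUT_COLS):
            continue  # skip short or malformed lines
        trait = fields[TRAIT_COL - 1].lower()
        if not any(k in trait for k in wanted):
            continue
        try:
            p_value = float(fields[PVALUE_COL - 1])
        except ValueError:
            continue  # p-value missing or not numeric
        if p_value <= THRESHOLD:
            rows.append([fields[c - 1] for c in OUTPUT_COLS])
    return header, rows


def unique_snps_per_trait(rows):
    """Count distinct SNP IDs per trait; rows follow the OUTPUT_COLS order."""
    trait_idx = OUTPUT_COLS.index(TRAIT_COL)
    snp_idx = OUTPUT_COLS.index(SNP_COL)
    snps = defaultdict(set)
    for row in rows:
        # a single record may list several SNPs separated by commas
        for snp in row[snp_idx].split(","):
            if snp.strip():
                snps[row[trait_idx]].add(snp.strip())
    return {trait: len(ids) for trait, ids in snps.items()}
```

To use it, open the downloaded file and pass the file handle as `lines`, with keywords such as `["hdl", "cholesterol", "body mass index"]`. `len(header)` then answers the column-count question and `header[27]` gives the header of column 28. Matching is by substring, so a keyword like "BMI" only works if the catalog spells the trait that way.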


1000 Genomes


Glossary

• Pilot: The 1000 Genomes Project ran a pilot study between 2008 and 2010
• Phase 1: The initial round of exome and low-coverage sequencing of 1000 individuals
• Phase 2: Expanded sequencing of 1700 individuals and method improvement
• Phase 3: Sequencing of 2500 individuals and a new variation catalogue
• SAM/BAM: Sequence Alignment/Map Format, an alignment format
• VCF: Variant Call Format, a variant format


The 1000 Genomes Project: Overview

• International project to construct a foundational data set for human genetics
– Discover virtually all common human variations by investigating many genomes at the base pair level
– Consortium with multiple centers, platforms, funders

• Aims
– Discover population-level human genetic variations of all types (95% of variation > 1% frequency)
– Define haplotype structure in the human genome
– Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects


Phase 1 populations


Figure S2. 1000 Genomes Project Phase I populations. Populations collected as part of the HapMap project (blue) and the 1000 Genomes Project (green) include: Europe (IBS (Iberian populations in Spain), GBR (British from England and Scotland), CEU (Utah residents with ancestry from northern and western Europe), FIN (Finnish in Finland), TSI (Toscani in Italia)); East Asia (JPT (Japanese in Tokyo, Japan), CHB (Han Chinese Beijing), CHS (Han Chinese South)); Africa (ASW (African Ancestry in SW USA), YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya)); Americas (MXL (Mexican Ancestry in Los Angeles, CA, USA), PUR (Puerto Rican in Puerto Rico), CLM (Colombian in Medellín, Colombia)). A – Total number of samples sequenced; B – Source of DNA (blood (bld) or LCL); C – Gender composition (Male/Female); D – Number that are part of mother-father-child trios (t), parent-child duos (d) or singletons (s); for trios and duos, only parent samples were sequenced.

Region     Pop.  Location                              A. Samples  B. DNA source  C. Male/Female  D. Trios/duos/singletons
EUROPE     TSI   Italy                                 98          All LCL        50m/48f         98s
EUROPE     FIN   Finland                               93          All LCL        35m/58f         93s
EUROPE     CEU   Utah, USA                             85          All LCL        45m/40f         78t/3d/4s
EUROPE     GBR   Great Britain                         89          All LCL        41m/48f         3d/86s
EUROPE     IBS   Spain                                 14          All LCL        7m/7f           14t
EAST ASIA  JPT   Tokyo, Japan                          89          All LCL        50m/39f         89s
EAST ASIA  CHB   Beijing, China                        97          All LCL        44m/53f         97s
EAST ASIA  CHS   Hu Nan and Fu Jian Provinces, China   100         All LCL        50m/50f         100t
AFRICA     LWK   Webuye, Kenya                         97          All LCL        48m/49f         4d/93s
AFRICA     YRI   Ibadan, Nigeria                       88          All LCL        43m/45f         65t/21d/2s
AFRICA     ASW   Southwest, USA                        61          All LCL        24m/37f         28t/22d/11s
AMERICAS   PUR   Puerto Rico                           55          35bld/20LCL    28m/27f         47t/8d
AMERICAS   CLM   Medellín, Colombia                    60          All LCL        29m/31f         55t/5d
AMERICAS   MXL   Los Angeles, USA                      66          All LCL        31m/35f         59t/3d/4s

Map legend: New 1000 Genomes; HapMap 3
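As a quick consistency check, the per-population sample counts (field A) from Figure S2 can be totalled per region. The short sketch below does that arithmetic. The numbers are copied from the figure, while the dictionary layout and function name are only illustrative.

```python
# Field A (samples sequenced) per population, copied from Figure S2.
SAMPLES = {
    "Europe": {"IBS": 14, "GBR": 89, "CEU": 85, "FIN": 93, "TSI": 98},
    "East Asia": {"JPT": 89, "CHB": 97, "CHS": 100},
    "Africa": {"ASW": 61, "YRI": 88, "LWK": 97},
    "Americas": {"MXL": 66, "PUR": 55, "CLM": 60},
}


def region_totals(samples):
    """Sum the sample counts of all populations within each region."""
    return {region: sum(pops.values()) for region, pops in samples.items()}


totals = region_totals(SAMPLES)
grand_total = sum(totals.values())
print(totals)       # Europe 379, East Asia 286, Africa 246, Americas 181
print(grand_total)  # 1092
```

The grand total of 1092 agrees with the "1,092 human genomes" in the title of the November 2012 Nature paper listed in the timeline.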


Phase 2/3 populations


Locations labelled on the map: Barbados, Peru, Ghana, Nigeria, Sierra Leone, Sri Lanka, India, Pakistan, Bangladesh, Vietnam, USA


HapMap, the Pilot Project and the Main Project

• HapMap
– Started in 2002
– Last release contained ~3 million SNPs
– 1400 individuals
– 11 populations
– High-throughput genotyping chips

• 1000 Genomes Pilot project
– Started in 2008
– Paper release contained ~14 million SNPs
– 179 individuals
– 4 populations
– Low-coverage next-generation sequencing

• 1000 Genomes Phase 1
– Started in 2009
– Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
– 1094 individuals
– 14 populations
– Low-coverage and exome next-generation sequencing

• 1000 Genomes Phase 2
– Started in 2011
– 1722 individuals
– 19 populations
– Low-coverage and exome next-generation sequencing


Timeline

• September 2007: 1000 Genomes Project formally proposed in Cambridge, UK

• April 2008: First submission of data to the Short Read Archive

• May 2008: First public data release

• October 2008: SAM/BAM format defined

• December 2008: First high-coverage variants released

• December 2008: First 1000 Genomes browser released

• May 2009: First indel calls released

• July 2009: VCF format defined

• August 2009: First large-scale deletions released

• December 2009: First main project sequence data released

• March 2010: Low-coverage pilot variant release made

• July 2010: Phased genotypes for 159 individuals released

• October 2010: "A map of human variation from population-scale sequencing" is published in Nature

• January 2011: Final Phase 1 low-coverage alignments are released

• May 2011: @1000genomes appears on Twitter

• May 2011: First variant release made on more than 1000 individuals

• October 2011: Phase 1 integrated variant release made

• March 2012: Phase 2 alignment release

• November 2012: "An integrated map of genetic variation from 1,092 human genomes" is published in Nature


Fraction of variant sites present in an individual that are NOT already represented in dbSNP

Date             Fraction not in dbSNP
February 2000    98%
February 2001    80%
April 2008       10%
February 2011    2%
Now              <1%

Ryan Poplin, David Altshuler


Data Availability

• FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
– Raw Data Files

• Web site: http://www.1000genomes.org
– Release Announcements
– Documentation

• Ensembl Style Browser: http://browser.1000genomes.org
– Browse 1000 Genomes variants in Genomic Context
– Variant Effect Predictor
– Data Slicer
– Other Tools


The 1000 Genomes Project data

• Data are available through:

• The 1000 Genomes website: www.1000genomes.org

• NCBI: ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes

• EBI: ftp://ftp.1000genomes.ebi.ac.uk

• Amazon: http://s3.amazonaws.com/1000genomes


Sequence Data

• Fastq files

– @ERR050087.1 HS18_6628:8:1108:8213:186084#2/1

– GGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG

– +

– DCDHKHKKIJGNNHIJIIKLLMCLKMAILIJH3K>HL1I=>MK.D

– http://www.1000genomes.org/faq/what-format-are-your-sequence-files
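The four-line FASTQ record layout shown above (header, sequence, `+` separator, quality string) can be read with a few lines of Python. This is a minimal sketch that assumes Phred+33 quality encoding and one sequence line per record; the function name is illustrative and it does not replace a dedicated parser. The example record on the slide appears to have been shortened for display, so a real record should be used when trying this out.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines.

    Assumes four lines per record and Phred+33 encoded quality characters.
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return  # end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: %r" % header)
        # Phred+33: quality score = ASCII code of the character minus 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality]
```

Passing an open file handle as `lines` streams the records one at a time, which matters for files of this size.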


Alignment Data

• BAM files

• ERR052835 163 11 60239 0 100M = 60609 469

• http://samtools.sourceforge.net/


NAME DESCRIPTION

QNAME Query NAME of the read or read pair

FLAG Bitwise FLAG (pairing, strand, mate strand, etc.)

RNAME Reference Sequence NAME

POS 1-Based leftmost POSition of clipped alignment

MAPQ MAPping Quality (Phred-scaled)

CIGAR Extended CIGAR string (operations: MIDNSHP)

MRNM Mate Reference NaMe (‘=’ if same as RNAME)

MPOS 1-Based leftmost Mate POSition

ISIZE Inferred Insert SIZE

SEQ Query SEQuence on the same strand as the reference

QUAL Query QUALity (ASCII-33=Phred base quality)
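To make the field table concrete, the sketch below splits a SAM text line (as printed by `samtools view`) into the eleven mandatory fields and decodes the bitwise FLAG. The field names follow the table above and the flag bit meanings follow the SAM specification. The helper names are illustrative, and in the usage note SEQ and QUAL are filled with `*` placeholders because the slide shows only the first nine fields.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]

# Bit values and meanings as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def parse_sam_line(line):
    """Split one tab-delimited SAM line into a dict of mandatory fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, values))
    for key in ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE"):
        record[key] = int(record[key])
    record["TAGS"] = values[len(SAM_FIELDS):]  # optional fields, if any
    return record


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]
```

For the example read above, `decode_flag(163)` gives `['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']`, since 163 = 1 + 2 + 32 + 128.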


More Information About BAM Files

• http://samtools.sourceforge.net/



Variant Call Data


• VCF Files

• TAB Delimited Text Format

NAME DESCRIPTION

CHROM Chromosome name

POS Position in chromosome

ID Unique Identifier of variant

REF Reference Allele

ALT Alternative Allele

QUAL Phred scaled quality value

FILTER Site filter information

INFO User extensible annotation

FORMAT Describes the format of the subsequent fields, must always contain Genotype

Individual Genotype Fields

These columns contain the individual genotype data for each individual in the file


Variant Call Data

• Headers

##fileformat=VCFv4.1

##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/ancestral_alignments/README">

##INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">

##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">
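The structured `##INFO` and `##FORMAT` header lines above can be turned into dictionaries with two regular expressions. This is a simplified sketch: it handles quoted descriptions that contain commas, as in the `AA` line, but does not try to handle escaped quotes inside a description. The function name is illustrative.

```python
import re

# "##KIND=<...>" meta-information line, e.g. ##INFO=<ID=AC,...>
META_LINE = re.compile(r"^##(\w+)=<(.*)>$")
# key=value pairs; a value is either a double-quoted string or runs to the next comma
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, fields) for a structured VCF header line, else None."""
    match = META_LINE.match(line.strip())
    if match is None:
        return None  # e.g. ##fileformat=VCFv4.1 is not a structured line
    kind, body = match.groups()
    fields = {key: value.strip('"') for key, value in KEY_VALUE.findall(body)}
    return kind, fields
```

Collecting the `ID` and `Type` of every `INFO` line this way gives a lookup table that tells you how to convert each INFO value in the data lines.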


Variant Call Data

• Example 1000 Genomes Data
• CHROM 4
• POS 42208061
• ID rs186575857
• REF T
• ALT C
• QUAL 100
• FILTER PASS
• INFO AA=T;AN=2184;AC=1;RSQ=0.8138;AF=0.0005;
• FORMAT GT:DS:GL
• GENOTYPE 0|0:0.000:-0.03,-1.19,-5.00
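The example record can be parsed field by field, which also lets us check that the reported AF matches AC/AN. The sketch below rebuilds the record as one tab-delimited line. The sample name `sample_1` is a placeholder (the slide does not name the individual), and the helper names are illustrative.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info):
    """Turn 'AA=T;AN=2184;...' into a dict; keys without '=' become True."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_record(line, sample_names):
    """Parse one VCF data line into a dict with INFO and per-sample fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, values))
    record["POS"] = int(record["POS"])
    record["INFO"] = parse_info(record["INFO"])
    keys = record["FORMAT"].split(":")
    record["SAMPLES"] = {
        name: dict(zip(keys, sample.split(":")))
        for name, sample in zip(sample_names, values[len(VCF_COLUMNS):])
    }
    return record


line = "\t".join([
    "4", "42208061", "rs186575857", "T", "C", "100", "PASS",
    "AA=T;AN=2184;AC=1;RSQ=0.8138;AF=0.0005;",
    "GT:DS:GL", "0|0:0.000:-0.03,-1.19,-5.00",
])
record = parse_record(line, ["sample_1"])  # placeholder sample name
info = record["INFO"]
print(round(int(info["AC"]) / int(info["AN"]), 4))  # 0.0005, matching AF
print(record["SAMPLES"]["sample_1"]["GT"])          # 0|0
```

Here AN=2184 is the total number of called alleles, so one alternate allele out of 2184 rounds to the AF of 0.0005 given in the INFO column.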


More Information About VCF Files

VCF variant files

All indexed for fast retrieval

http://vcftools.sourceforge.net/



Class Exercise

• Download from 1000 Genomes the VCF data for a gene of interest.
– If you don't have a gene of interest, look in the GWAS Catalog for a gene that is associated with a lipid trait, e.g. PCSK9

• Extract the genotypes of the same gene for the European populations only

• Convert your VCF genotype data to PLINK format (ped & map)

• Hint: You will need to know the chromosome and location of the gene before you can download it
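For the European-only step, one approach is to pick the sample IDs from a sample panel file and then subset the VCF to those columns. The sketch below covers only the first half. It assumes the panel is a whitespace-delimited text file with the sample ID in the first column and the population code in the second; check this against the panel file you actually download. The European population codes are the five listed in the Figure S2 caption.

```python
# Phase 1 European population codes, as listed in the Figure S2 caption.
EUROPEAN_POPULATIONS = {"IBS", "GBR", "CEU", "FIN", "TSI"}


def samples_in_populations(panel_lines, populations):
    """Return sample IDs whose population code is in `populations`.

    ASSUMES each line is 'sample_id population ...' separated by whitespace.
    """
    samples = []
    for line in panel_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in populations:
            samples.append(fields[0])
    return samples
```

The resulting list can be written one ID per line or joined with commas, depending on what the subsetting tool you use expects.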


Data Slicing

• All alignment and variant files are indexed so subsections can be downloaded remotely

• Use samtools to get subsections of BAM files
– samtools view http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01375/alignment/HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20111114.bam 6:31833200-31834200

• Use tabix to get subsections of VCF files
– tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz 6:31833200-31834200

• You can also use the web Data Slicer interface to do this
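When slicing many regions, it can help to build the tabix call from a script. The sketch below constructs the same `tabix -h URL REGION` command shown above and, optionally, runs it. The function names are illustrative, and `fetch_region` assumes that `tabix` is installed on the PATH and that the remote file is reachable.

```python
import subprocess


def tabix_command(url, chrom, start, end):
    """Build the argument list for 'tabix -h <url> <chrom>:<start>-<end>'."""
    if start < 1 or end < start:
        raise ValueError("region must satisfy 1 <= start <= end")
    region = "%s:%d-%d" % (chrom, start, end)
    return ["tabix", "-h", url, region]


def fetch_region(url, chrom, start, end):
    """Run tabix and return the VCF text for the region.

    Requires tabix on the PATH and network access to the remote file.
    """
    result = subprocess.run(
        tabix_command(url, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with long URLs.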


Data Slicing

• VCFtools provides some useful additional functionality on the command line, including:

• vcf-compare: comparison of, and statistics about, two or more VCF files

• vcf-isec: creates an intersection of two or more VCF files

• vcf-subset: subsets a VCF file, retaining only the specified individual columns

• vcf-validator: validates a particular VCF file


Data Slicing

• http://browser.1000genomes.org/tools.html

• http://browser.1000genomes.org/Homo_sapiens/UserData/SelectSlice


Variant Effect Predictor

• Predicts functional consequences of variants
• Both web front end and API script
• Can provide
– SIFT/PolyPhen/Condel consequences
– RefSeq gene names
– HGVS output
• Can run from a cache as well as from the database
• Can convert from one input format to another
• Script available for download from:
– ftp://ftp.ensembl.org/pub/misc-scripts/Variant_effect_predictor/
• http://browser.1000genomes.org/Homo_sapiens/UserData/UploadVariations


Variant Effect Predictor

• perl variant_effect_predictor.pl -input 6_381831625_3184704.vcf -sift p -polyphen p --check_existing

• less variant_effect_output.txt

#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra

rs138094825 6:31831667 A ENSG00000204385 ENST00000414427 Transcript DOWNSTREAM - - - - - rs138094825 -

rs138094825 6:31831667 A ENSG00000204385 ENST00000229729 Transcript INTRONIC - - - - - rs138094825 -

6_31832657_C/T 6:31832657 T ENSG00000204385 ENST00000229729 Transcript NON_SYNONYMOUS_CODING 1883 1862 621 R/H cGc/cAc - PolyPhen=possibly_damaging;SIFT=deleterious
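The output shown above is easy to post-process. The sketch below reads it into dictionaries keyed by the column names in the `#Uploaded_variation ...` header line and splits the `Extra` column into key/value pairs. It assumes the file is tab-delimited, and the function name is illustrative.

```python
def parse_vep_output(lines):
    """Yield one dict per data row of a tab-delimited VEP output file.

    Column names are taken from the line starting with a single '#'.
    The Extra column ('KEY=value;KEY=value' or '-') becomes a dict.
    """
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines
        if line.startswith("#"):
            header = line[1:].split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the header line")
        row = dict(zip(header, line.split("\t")))
        extra = row.get("Extra", "-")
        row["Extra"] = dict(
            item.split("=", 1) for item in extra.split(";") if "=" in item
        )
        yield row
```

For example, `[r for r in parse_vep_output(handle) if r["Extra"].get("SIFT", "").startswith("deleterious")]` keeps only the rows that SIFT predicts to be deleterious, such as the non-synonymous variant at 6:31832657 above.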


VCF to PED

• LD Visualization tools like Haploview require PED files

• VCF to PED converts VCF to PED

• Will divide a file by individual or population

• http://browser.1000genomes.org/Homo_sapiens/UserData/Haploview


VCF to PED

• perl vcf_to_ped_convert.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr6.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.panel -region 6:31830969-31846823 -population CEU

• Output should be two files

• 6_31830969-31846823.info

• 6_31830969-31846823.ped
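The Perl script above writes Haploview-style `.info` and `.ped` files. For the PLINK `ped` and `map` files asked for in the class exercise, the core of the conversion is translating each `GT` value into a pair of alleles. The sketch below illustrates that idea only. It is not how `vcf_to_ped_convert.pl` works internally, it assumes diploid calls, and it fills family ID, parents, sex and phenotype with PLINK's "unknown" codes. In practice a dedicated converter is the safer choice.

```python
def genotype_alleles(gt, ref, alts):
    """Translate a GT value such as '0|1' into a pair of allele strings.

    Missing calls ('.') become '0', the PLINK code for a missing allele.
    """
    alleles = [ref] + alts
    calls = gt.replace("|", "/").split("/")
    if len(calls) != 2:
        return ("0", "0")  # non-diploid call: treat as missing
    return tuple("0" if c == "." else alleles[int(c)] for c in calls)


def vcf_to_ped_map(records, sample_names):
    """Build PED and MAP rows.

    records: iterable of (chrom, pos, variant_id, ref, alt, [GT per sample]).
    Returns (ped_rows, map_rows) as lists of tab-delimited strings.
    """
    map_rows = []
    per_sample = {name: [] for name in sample_names}
    for chrom, pos, variant_id, ref, alt, genotypes in records:
        # MAP columns: chromosome, variant ID, genetic distance (0), position
        map_rows.append("%s\t%s\t0\t%d" % (chrom, variant_id, pos))
        for name, gt in zip(sample_names, genotypes):
            per_sample[name].extend(genotype_alleles(gt, ref, alt.split(",")))
    # PED columns: family, individual, father, mother, sex, phenotype, alleles
    ped_rows = [
        "\t".join([name, name, "0", "0", "0", "-9"] + per_sample[name])
        for name in sample_names
    ]
    return ped_rows, map_rows
```

Each PED row holds one individual with two alleles per variant, in the same order as the rows of the MAP file.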


Access to backend Ensembl databases

• Public MySQL database at mysql-db.1000genomes.org, port 4272

• Full programmatic access with Ensembl API

– The 1000 Genomes Pilot uses Ensembl v60 databases and the NCBI36 assembly (this is frozen)

– The 1000 Genomes main project currently uses Ensembl v63 databases

• http://jun2011.archive.ensembl.org/info/docs/api/variation/index.html

• http://www.ensembl.org/info/docs/api/variation/index.html

• http://www.1000genomes.org/node/517