Top Banner
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows
17

ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Jan 17, 2016

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

ParSNP Hash

Pipeline to parse SNP data and output summary statistics across sliding windows

Page 2: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Objective

• Parse VCF files• Calculate summary statistics across sliding windows

throughout the genome• Implement NTFreq module to calculate nucleotide

frequencies for each population and combined population

• Implement TajimasD module to calculate Tajima’s D • Implement GO module to annotate identified SNPs

Page 3: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Data set• Simulated data set for chromosome 2R in

Drosophila melanogaster• 1.4 Mbp– 2 populations• Pooled individuals per population

– 75bp reads, error rate 1%– 10,000 simulated SNPs• 100x coverage per variant• At least 100bp apart• Allelic Frequencies ranging from .1 to .9 per population

Page 4: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Data to Variant Call Format

Index Reference GenomeOnly chromosome 2R of D. melanogaster -Genome build Dmel 3 from Flybase

Use BWA to Align FastQ to Reference GenomeGap open penalty = 1 Disallowing deletion within 12 bp of 3’UTR

Gap extension max = 12 Maximum level of gap extensions = 12

Use SAMTools to Remove Ambiguously mapped Regions (MAPQ >= 20)

Use BCFTools mpileup to Generate a Binary Code Format (BCF)BCF -> VCF

FastQ -> sai -> SAM -> BAM - > .bcf -> VCF

Page 5: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Formatting data: Parse VCFFor each window:

• Fetch the VCF rows from each BCF file

• Convert the VCF rows into hashes of arrays

• Compute the Theta, Pi, Tajima’s D for each population

• Compute Fst for each window between each population

Page 6: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Sliding windows

• Sliding window size is specified, and called modules are calculated across specified window size

Page 7: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Module 1: Calculate allele frequencies

• Input is taken from parsed VCF file

• Hashes are created for each population with the following structure– {SNP_location} {nucleotide} -> frequency;

• Hashes created for full dataset– {SNP_location}{Population} -> {nucleotide} ->frequency

Page 8: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Output site frequency spectra

• Site frequency spectrum (SFS) output as the following hash:– {nonref_allele}{frequency}->count;

• Allows us to calculate a histogram for the non-reference allele frequencies

• Send output to R to generate SFS graphs

Page 9: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Module 2: Calculate Summary Statistics and Tajima’s D

• theta_pi (index of diversity)

• theta_watterson (index of diversity)

Page 10: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Module 2: Calculate Summary Statistics and Tajima’s D

• Tajima’s D (index of selection/population expansion)

Page 11: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Module 3: FST for DNA sequence

• Calculate FST (index of differentiation) according to Hudson et al. 1992

1 – Hw/Hb

Hw: average number of differences within each population

Hb: average number of differences between the 2 populations

Page 12: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Module 4: GO annotations

• Module takes SNP list as input

• Outputs the following:– List of genes that have overlap with SNP positions– Gene Ontology (GO) IDs and terms associated with

each SNP matched gene– List of genes for a selected window

• Visualization using GOSlim

Page 13: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Data visualization

• Integrated Genomics Viewer (IGV)

• Broad Institute

• http://www.broadinstitute.org/igv/

Page 14: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

SFS for population 1 and 2

Page 15: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Sliding window for summary statistics

Phist greater than 0.1 in window 1080001 - 1100000Go Accession ID Ontology SpecificGO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complexGO:0005703 Cellular Component (Thought to be a site of active transcription)GO:0005634 Cellular Component (Nucleus)GO:0006911 Biological Process Phagosome biosynthesis/formationGO:0045747 Biological Process Up regulation of Notch signaling pathwayGO:0006355 Biological Process Regulation of cellular transcription, DNA-dependentGO:0000910 Biological Process (Cytoplasm division)GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group)GO:0005700 Cellular Component (Polytene associated)GO:0005488 Molecular Function (Ligand, non-covalent partner)GO:0005737 Cellular Component (Ambiguous)GO:0035222 Biological Process (Patterning in wing imaginal disc)GO:0005875 Cellular Component (Microtubule associated)GO:0004672 Molecular Function Protamine kinase activityGO:0000123 Cellular Component Histone acetylase complex

Page 16: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Identify differentiated genomic regions

• For each window with a Fst > 0.1, print the name of the SNP and associated GO term

Phist (Fst) greater than 0.1 in window 1080001 - 1100000Go Accession ID Ontology SpecificGO:0000124 Cellular Component Spt-Ada-Gcn5-acetyltransferase complexGO:0005703 Cellular Component (Thought to be a site of active transcription)GO:0005634 Cellular Component (Nucleus)GO:0006911 Biological Process Phagosome biosynthesis/formationGO:0045747 Biological Process Regulation of cellular transcription, DNA-dependentGO:0000910 Biological Process (Cytoplasm division)GO:0016773 Molecular Function (Intermolecular transfer of phosphorus group to an alcohol group)GO:0005700 Cellular Component (Polytene associated)GO:0005488 Molecular Function (Ligand, non-covalent partner)GO:0005737 Cellular Component (Ambiguous)GO:0035222 Biological Process (Patterning in wing imaginal disc)GO:0005875 Cellular Component (Microtubule associated)GO:0004672 Molecular Function Protamine kinase activityGO:0000123 Cellular Component Histone acetylase complex

Page 17: ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Thank You

Use PERL or die , print “ (X_x)”;

##Hashes to Hashes##Print “ % 2 %”;