EDACC Primary Analysis Pipelines Cristian Coarfa Bioinformatics Research Laboratory Molecular and Human Genetics

Feb 01, 2016

Page 1: EDACC Primary Analysis Pipelines

EDACC Primary Analysis Pipelines

Cristian Coarfa

Bioinformatics Research Laboratory

Molecular and Human Genetics

Page 2: EDACC Primary Analysis Pipelines

Data Levels

Page 3: EDACC Primary Analysis Pipelines

Data Types Submitted To EDACC

• ChIP-Seq
• Shotgun Bisulfite Sequencing
  – Methyl-C
• Reduced Representation Bisulfite Sequencing
  – RRBS
• MRE-Seq
• MeDIP-Seq
• Chromatin Accessibility
• small RNA-Seq
• mRNA-Seq

Page 4: EDACC Primary Analysis Pipelines

Read Mapping

• Common processing step to all pipelines
• High throughput
  – Sequence space: Illumina
  – Color space: SOLiD
• Quick and accurate anchoring
• Read size varies from 36 to 76 bp
• Short read aligners
  – 1st generation: Maq, SOAP
    • Ungapped alignment
  – 2nd generation: Bowtie, BWA, SOAP2
    • Speed/sensitivity tradeoff, good enough for many applications
• Mapping tools
  – Robust to indels
  – Sensitive to variable number of mismatches

Page 5: EDACC Primary Analysis Pipelines

Pash 3.0

• Positional Hashing
• Regular reads mapping
• Bisulfite sequencing mapping
• Integrate basepair variation with epigenetic variation
• SAM output, easy integration with other analysis tools
• Accuracy without sacrificing efficiency

Page 6: EDACC Primary Analysis Pipelines

Bisulfite Sequencing

• Current tools: BSMAP, RMAP-BS, mrsFast, Zoom
• Pash 3.0
  – Integrate mutation discovery with basepair-level methylation discovery
  – Speedup
• General approach
  – Convert Cs to Ts in reads and/or reference
  – Use mappings, reads and reference to determine methylated sites
• Pash 3
  – Generate and hash all possible k-mers for reads
  – CTT: CCC, CCT, CTC, CTT
  – Map against forward and reverse complement chromosome strands
• Superior sensitivity to other tools, without loss of efficiency
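The k-mer expansion on this slide (CTT: CCC, CCT, CTC, CTT) can be illustrated with a short sketch. This is not Pash's code; the function name `bisulfite_kmer_variants` is hypothetical, and the sketch assumes only forward-strand conversion, where every T in a read may be either a genuine T or a converted unmethylated C.

```python
from itertools import product


def bisulfite_kmer_variants(kmer):
    """Return every k-mer that could have produced `kmer` after bisulfite conversion.

    Assumption: each T may be a real T or a converted C, so every T position
    expands to {C, T}; all other bases are kept unchanged.
    """
    choices = [("C", "T") if base == "T" else (base,) for base in kmer.upper()]
    return ["".join(combo) for combo in product(*choices)]


# The slide's example: CTT expands to four candidate k-mers.
print(bisulfite_kmer_variants("CTT"))  # ['CCC', 'CCT', 'CTC', 'CTT']
```

The number of variants doubles with each T in the k-mer, so a real mapper would need to bound this expansion or hash it cleverly; how Pash 3 does so is not described on the slide.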

Page 7: EDACC Primary Analysis Pipelines

Galaxy/Genboree

• Developed at Penn State University
• Benefits
  – Rapid deployment tool
  – Share pipelines w/ others
• Alan Harris, Sriram Raghuram
  – Deployed Galaxy/Genboree
  – Integration w/ Genboree
    • API for upload/download
  – Adaptors for LFF file format support
  – EDACC XML validation tools
• Sriram Raghuram, Andrew Jackson, Cristian Coarfa
  – Integration with compute clusters
• Arpit Tandon, Sriram Raghuram
  – Deployed analysis tools

http://genboree.org/galaxy

Page 8: EDACC Primary Analysis Pipelines

Primary Analysis Pipelines

• Implemented & exposed via Galaxy/Genboree
  – Read mapping
  – Bisulfite Sequencing read mapping
  – Peak calling (ChIP-Seq, MeDIP-Seq)
    • MACS (Harvard), FindPeaks (UBC)
  – Chromatin accessibility
    • HotSpot (UW)
  – Small RNA-seq
• Coming soon
  – mRNA seq
  – Expression, alternative splicing
  – Gene fusion
• Typical user interaction
  – Use Galaxy for user input
  – Submit jobs to a cluster
  – Upload results to Genboree

Page 9: EDACC Primary Analysis Pipelines

Reads Mapping

Page 10: EDACC Primary Analysis Pipelines

ChIP-Seq

• Select uniquely mapping reads
• Build read density maps
  – Extend each read 200bp along the mapping strand
  – Remove monoclonal reads
  – Generate WIG data
  – Can be visualized in Genboree and UCSC
• Peak calling
  – FindPeaks, MACS
• Interpret peaks
  – Overlap with genomic features of interest: gene promoters, etc.
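The read density steps on this slide can be sketched as follows. This is a minimal illustration, not the EDACC pipeline code. It assumes reads are given as hypothetical `(start, end, strand)` tuples in 0-based half-open coordinates, that "monoclonal" means reads sharing the same 5' position and strand, and that the output is written as a WIG `fixedStep` track with one value per base.

```python
def build_density(reads, chrom_length, fragment_length=200):
    """Per-base coverage after duplicate removal and strand-aware extension."""
    seen = set()
    diff = [0] * (chrom_length + 1)  # difference array for range updates
    for start, end, strand in reads:
        five_prime = start if strand == "+" else end
        if (five_prime, strand) in seen:
            continue  # drop monoclonal (duplicate) reads
        seen.add((five_prime, strand))
        # Extend the read to the fragment length along its mapping strand.
        if strand == "+":
            frag_start, frag_end = start, start + fragment_length
        else:
            frag_start, frag_end = end - fragment_length, end
        frag_start = max(frag_start, 0)
        frag_end = min(frag_end, chrom_length)
        if frag_start < frag_end:
            diff[frag_start] += 1
            diff[frag_end] -= 1
    density, running = [], 0
    for delta in diff[:chrom_length]:
        running += delta
        density.append(running)
    return density


def to_wig_lines(chrom, density):
    """Format the density as a WIG fixedStep track (WIG positions are 1-based)."""
    lines = [f"fixedStep chrom={chrom} start=1 step=1"]
    lines.extend(str(value) for value in density)
    return lines


# Two identical '+' reads collapse to one; the '-' read extends leftwards.
example_reads = [(10, 46, "+"), (10, 46, "+"), (300, 336, "-")]
density = build_density(example_reads, chrom_length=400)
print(to_wig_lines("chr1", density)[0])
```

A production pipeline would typically emit variable-step or binned output to keep WIG files small; the per-base track here is chosen only for clarity.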

Page 11: EDACC Primary Analysis Pipelines

MeDIP-Seq

• Select uniquely mapping reads
• Build read density maps
• Determine methylated CpGs
  – FindPeaks

Page 12: EDACC Primary Analysis Pipelines

Finding methylated CpGs

Page 13: EDACC Primary Analysis Pipelines

MeDIP-Seq Signal Visualization

Page 14: EDACC Primary Analysis Pipelines

MRE-Seq

• Select uniquely mapping reads
• Determine unmethylated CpGs

Page 15: EDACC Primary Analysis Pipelines

Bisulfite Sequencing

• Shotgun Bisulfite Sequencing
  – Methyl-C
  – Genome wide
• Reduced Representation Bisulfite Sequencing
  – RRBS
  – Enzyme cocktail
• Map using Pash
• Build methylation maps

Page 16: EDACC Primary Analysis Pipelines

Bisulfite Sequencing Read Mapping

Page 17: EDACC Primary Analysis Pipelines

Methylation Maps

Position   Strand  CHHStatus  Methylation  Unmethylated  TotalReads
50100242   +       CG         1            0             1
50100243   -       CG         40           11            51
50100250   +       CG         1            0             1
50100251   -       CG         37           8             46
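As an illustration of how such a map might be used, the sketch below computes a per-site methylation fraction from the methylated and unmethylated read counts in the table. The function name `methylation_level` is hypothetical, and the fraction is derived from the two count columns rather than from the TotalReads column; this is an assumption about how the counts are meant to be combined, not something the slide states.

```python
def methylation_level(methylated, unmethylated):
    """Fraction of reads supporting methylation, or None if the site is uncovered."""
    total = methylated + unmethylated
    return methylated / total if total else None


# Rows from the slide: (position, strand, context, methylated, unmethylated)
rows = [
    (50100242, "+", "CG", 1, 0),
    (50100243, "-", "CG", 40, 11),
    (50100250, "+", "CG", 1, 0),
    (50100251, "-", "CG", 37, 8),
]

for position, strand, context, meth, unmeth in rows:
    level = methylation_level(meth, unmeth)
    print(f"{position}\t{strand}\t{context}\t{level:.3f}")
```

Sites covered by a single read, such as the two plus-strand rows, give a fraction of 1.0 that carries little statistical weight, so a minimum coverage threshold is a common addition.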

Page 18: EDACC Primary Analysis Pipelines

Small RNA-Seq

• Trim adapters

• Map reads onto target genome
  – up to 100 locations per read

• Interpret
  – Overlap w/ miRNAs, piRNAs, sno/scaRNAs
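The adapter-trimming step can be sketched as below. The slide does not say which tool or adapter the pipeline uses, so both the function `trim_adapter` and the adapter string are illustrative placeholders. The sketch handles only exact matches: a full adapter anywhere in the read, or a partial adapter prefix at the 3' end.

```python
def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read (exact matching only, no mismatches)."""
    index = read.find(adapter)
    if index != -1:
        return read[:index]  # full adapter found: keep everything before it
    # Otherwise look for the longest adapter prefix that ends the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter detected


# Placeholder adapter and insert sequences, for illustration only.
ADAPTER = "TCGTATGCCGTCTTCTGCTTG"
insert = "TGAGGTAGTAGGTTGTATAGTT"
print(trim_adapter(insert + ADAPTER, ADAPTER))
print(trim_adapter(insert + ADAPTER[:8], ADAPTER))
```

Real sequencing data contains errors inside the adapter, so dedicated trimmers allow mismatches; exact matching is used here only to keep the example short.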

Page 19: EDACC Primary Analysis Pipelines

Exercise

• Download the input MeDIP-Seq file from the workshop wiki

• Analyze it using FindPeaks in Galaxy
  – Obtain results in Genboree LFF format

• Upload the results to Genboree database

• View the results in a tabular view

• Find the largest peaks

• Explore them in the Genboree browser
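For the "find the largest peaks" step, the results can also be ranked outside the tabular view with a short script. The sketch below is an assumption-laden illustration: it treats LFF as tab-delimited with the columns class, name, type, subtype, chromosome, start, stop, strand, phase, score in that order, and it interprets "largest" as widest, with score as a tie-breaker. Check the column order against the Genboree LFF documentation before relying on it. The function name and sample records are hypothetical.

```python
def largest_peaks(lff_lines, top_n=10):
    """Return the top_n widest peaks as (width, score, chrom, start, stop, name)."""
    peaks = []
    for line in lff_lines:
        line = line.rstrip("\n")
        # Skip blanks, comments, and section headers such as [annotations].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not an annotation record under the assumed layout
        name, chrom = fields[1], fields[4]
        start, stop = int(fields[5]), int(fields[6])
        score = float(fields[9])
        peaks.append((stop - start + 1, score, chrom, start, stop, name))
    peaks.sort(reverse=True)
    return peaks[:top_n]


# Made-up records in the assumed column order, for illustration only.
sample = [
    "Peaks\tpeak1\tFindPeaks\tMeDIP\tchr1\t1000\t1499\t+\t.\t12.5",
    "Peaks\tpeak2\tFindPeaks\tMeDIP\tchr1\t5000\t6999\t+\t.\t30.0",
    "Peaks\tpeak3\tFindPeaks\tMeDIP\tchr2\t200\t399\t+\t.\t8.0",
]
for width, score, chrom, start, stop, name in largest_peaks(sample, top_n=2):
    print(name, chrom, start, stop, width, score)
```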