Scalable WES Processing And Variant Interpretation With Provenance Recording Using Workflow On The Cloud. Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya, Ryan Kirby. School of Computing Science and Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne, UK. NGS Data Congress, London, June 15th, 2015
Invited cloud-e-Genome project talk at 2015 NGS Data Congress
Transcript
Scalable WES Processing And Variant Interpretation With Provenance Recording
Using Workflow On The Cloud
Paolo Missier, Jacek Cała, Yaobo Xu,
Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
NGS Data Congress
London, June 15th, 2015
The Cloud-e-Genome project at Newcastle
Objectives:
1. NGS data processing:
• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud
2. Traceable variant interpretation:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes

With an aim to:
• Cost control
• Scalability
• Flexibility
  • Of design
  • Of maintenance
• Ensure accountability through traceability
• Enable analytics over past patient cases

• 2 year pilot project: 2013-2015
• Funded by UK’s National Institute for Health Research (NIHR)
• Cloud resources from Azure for Research Award
Part I: data processing
Objectives:
• Design and implement a flexible WES/WGS pipeline
• Using workflow technology for high-level programming
• Providing scalable deployment over a public cloud
Scripted NGS data processing pipeline
Aligns sample sequence to HG19 reference genome using BWA aligner
Cleaning, duplicate elimination: Picard tools
Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer (GATK)
Computes coverage of each read.
Variant calling operates on multiple samples simultaneously. Splits samples into chunks. Haplotype caller detects both SNVs and longer indels
Variant recalibration attempts to reduce the false positive rate from the caller
VCF subsetting by filtering, e.g. non-exomic variants
Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
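
Read in processing order, the steps on this slide form a single chain driven by a script. The following is a minimal dry-run sketch of such a chain, not the project's actual script: the step labels are shortened from the slide, and `run_step` and `run_pipeline` are hypothetical placeholders that only print what they would do.

```shell
#!/usr/bin/env bash
# Dry-run sketch of a scripted pipeline. Step labels are abbreviations of the
# slide's steps; no real tool is invoked.
STEPS=(align clean recalibrate coverage call_variants recalibrate_variants filter annotate)

# run_step STEP BATCH
# Hypothetical stand-in for a tool invocation: it only reports the step.
run_step() {
  local step=$1 batch=$2
  echo "[$batch] $step"
}

# run_pipeline BATCH
# Runs every step in order and stops at the first step that fails.
run_pipeline() {
  local batch=$1 step
  for step in "${STEPS[@]}"; do
    run_step "$step" "$batch" || return 1
  done
}

run_pipeline "batch1"
```

In a real script each call to `run_step` would be replaced by the corresponding tool command, which is what makes the chain rigid: the order and the data passing are buried in the script body.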
Scripts to workflow - Design
Design | Cloud Deployment | Execution | Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction
• Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
Workflow Design
echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p "$PICARD_OUTDIR"
mkdir -p "$PICARD_TEMP"

echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo "Adding read group information to bam file..."
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo "Indexing bam files..."
samtools index $SORTED_BAM_FILE_NODUPS
“Wrapper” blocks
Utility blocks
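
One way to picture a “wrapper” block is as a small unit with a declared input and output around a single tool call. The sketch below is an illustration only, written in shell rather than in the workflow system's own block format: `wrapper_block` is a hypothetical helper, and `cp` stands in for a real tool such as Picard CleanSam.

```shell
#!/usr/bin/env bash
# wrapper_block NAME INPUT OUTPUT CMD...
# Hypothetical helper: checks that INPUT exists, runs CMD with INPUT and
# OUTPUT appended as its last two arguments, then checks OUTPUT was created.
wrapper_block() {
  local name=$1 input=$2 output=$3
  shift 3
  if [ ! -f "$input" ]; then
    echo "$name: missing input $input" >&2
    return 1
  fi
  echo "$name: $input -> $output"
  "$@" "$input" "$output" || return 1
  [ -f "$output" ]
}

# Demo: 'cp' is a placeholder for the wrapped tool; the file is not a real BAM.
workdir=$(mktemp -d)
printf 'reads\n' > "$workdir/sorted.bam"
wrapper_block CleanSam "$workdir/sorted.bam" "$workdir/cleaned.bam" cp
```

Because every block exposes the same input/output shape, a new tool can be added by wrapping it the same way, without touching the blocks around it.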
Workflow design
Conceptual:
Actual:
Anatomy of a complex parallel dataflow
eScience Central: simple dataflow model…
Sample-split: parallel processing of samples in a batch
Anatomy of a complex parallel dataflow
… with hierarchical structure
Phase II, top level
Chromosome-split: parallel processing of each chromosome across all samples
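
The chromosome split can be illustrated on toy data: records from all samples are regrouped by chromosome, and each group is then processed as an independent job. This is a sketch under stated assumptions, not the workflow's implementation: the tab-separated input layout, `split_by_chromosome`, and the record count used as a stand-in for variant calling are all hypothetical.

```shell
#!/usr/bin/env bash
# split_by_chromosome INPUT OUTDIR
# Writes one file per chromosome, keyed on column 1 of a tab-separated file.
split_by_chromosome() {
  local input=$1 outdir=$2
  mkdir -p "$outdir"
  awk -F'\t' -v dir="$outdir" '{ print > (dir "/" $1 ".tsv") }' "$input"
}

# Toy records: chromosome, position, sample (not a real VCF or BAM).
workdir=$(mktemp -d)
printf 'chr1\t100\tsample1\nchr2\t200\tsample1\nchr1\t150\tsample2\n' > "$workdir/all.tsv"
split_by_chromosome "$workdir/all.tsv" "$workdir/by_chr"

# One background job per chromosome; counting records is a placeholder
# for the per-chromosome work.
for f in "$workdir"/by_chr/*.tsv; do
  ( wc -l < "$f" > "$f.count" ) &
done
wait   # all chromosomes must finish before the next phase starts
```

Each per-chromosome file holds records from every sample, which is why this split can only happen after all samples have completed the previous phase.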
Phase III
Sample-split: parallel processing of samples
Implicit parallelism in the pipeline
[Figure: three-stage pipeline diagram]
Stage I: per-sample parallel processing. Sample 1 … Sample n each run align-clean-recalibrate-coverage
Chromosome split
Stage II: per-chromosome parallel processing. Variant calling, recalibration
Stage III: variant filtering, annotation
How does the workflow design exploit this parallelism?
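
The staged structure can be sketched as three fan-out loops separated by barriers. This is an illustration in plain shell, not the workflow engine's mechanism: the sample and chromosome names are made up, the `stage1` to `stage3` functions are hypothetical placeholders that only write a log line, and Stage III is run per sample following the "Phase III" slide.

```shell
#!/usr/bin/env bash
# Hypothetical inputs for illustration.
SAMPLES=(sample1 sample2 sample3)
CHROMOSOMES=(chr1 chr2)
logdir=$(mktemp -d)

# Placeholder stage bodies: each job appends one line to its own log file.
stage1() { echo "stage1 $1" >> "$logdir/$1.log"; }  # align-clean-recalibrate-coverage
stage2() { echo "stage2 $1" >> "$logdir/$1.log"; }  # variant calling, recalibration
stage3() { echo "stage3 $1" >> "$logdir/$1.log"; }  # variant filtering, annotation

# Stage I: one job per sample.
for s in "${SAMPLES[@]}"; do
  stage1 "$s" &
done
wait   # barrier: every sample is done before the chromosome split

# Stage II: one job per chromosome, across all samples.
for c in "${CHROMOSOMES[@]}"; do
  stage2 "$c" &
done
wait   # barrier: every chromosome is done before results go back per sample

# Stage III: one job per sample again.
for s in "${SAMPLES[@]}"; do
  stage3 "$s" &
done
wait
```

The two barriers are where the unit of parallelism changes, from samples to chromosomes and back; within a stage the jobs share nothing and can run on separate workers.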