Bioinformatics- Data Analysis Erin H. Graf, PhD, D(ABMM) Infectious Disease Diagnostics Laboratory, Children’s Hospital of Philadelphia Department of Pathology and Laboratory Medicine, University of Pennsylvania

Jul 06, 2020

dariahiddleston
Page 1: Bioinformatics- Data Analysis · Bioinformatics tools: known virus •Goal: Generate full length virus sequence for downstream analysis •Typing, epidemiology, resistance marker

Bioinformatics- Data Analysis

Erin H. Graf, PhD, D(ABMM)

Infectious Disease Diagnostics Laboratory, Children’s Hospital of Philadelphia

Department of Pathology and Laboratory Medicine, University of Pennsylvania

Outline

• Goal: From raw data to virus detection and/or typing/epidemiology
  • Making sense of large amounts of data can be intimidating
  • Focus on tools you can go home and use immediately

• Pre-processing
  • Quality analysis
  • Filtering steps and tools

• Processing
  • Bioinformatic tools when virus is known
  • Bioinformatic tools when virus is unknown (agnostic)

• Interpretation
  • Important variables
  • Sources of error

• Standardization and Validation

Pre-processing: Raw data

Pre-processing steps

• Goal: Refine sequence data to contain only the best-quality reads to reduce downstream errors
  • Select examples in Interpretation section

• Quality summary provided by instrument

• Bar code/adapter trim, instrument-specific filtering

• Secondary custom filtering

Pre-processing: Instrument report

• Example of Illumina: Shotgun DNAseq

• Example of Ion: Targeted FFPE sequencing

Pre-processing: Filter and Trim

• Reads are filtered and excluded based on instrument quality cutoffs

• Fastq files can be downloaded or analyzed directly through Apps (Illumina) or Plugins (Ion)

• Secondary analysis through FastQC

• Bioinformatic suites capable of custom trimming
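As a rough illustration of what secondary custom filtering can look like, here is a minimal Python sketch that keeps only reads above a mean-quality and length cutoff. It assumes Phred+33 quality encoding and simple 4-line FASTQ records; the function names, the example reads, and the cutoffs (mean Q20, 50 bases) are illustrative assumptions, not values taken from the slides or from any instrument software.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from simple 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def quality_filter(reads: Iterable[Read],
                   min_mean_q: float = 20.0,
                   min_len: int = 50) -> List[Read]:
    """Keep reads that are long enough and meet the mean-quality cutoff."""
    return [r for r in reads
            if len(r[1]) >= min_len and mean_phred(r[2]) >= min_mean_q]


# Two made-up reads: one high quality ('I' = Q40), one low quality ('#' = Q2).
example_lines = [
    "@read1", "ACGT" * 15, "+", "I" * 60,
    "@read2", "ACGT" * 15, "+", "#" * 60,
]
kept = quality_filter(parse_fastq(example_lines))
print([header for header, _seq, _qual in kept])
```

Dedicated trimming tools and the bioinformatic suites do considerably more than this (adapter removal, per-base trimming, paired-end handling); the sketch only shows the basic idea of a per-read quality cutoff.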

[Figure: Q score plot, targeted DNAseq from FFPE]

Processing

• Trimmed, quality filtered reads now ready for alignment

• Bioinformatic tools for known virus

• Bioinformatic tools for unknown virus (Agnostic)

Bioinformatics tools: known virus

• Goal: Generate full-length virus sequence for downstream analysis
  • Typing, epidemiology, resistance marker analysis

• Align to single (or list of) reference genome(s)
  • Various alignment algorithms

• Custom trimming

• Creates Sequence Alignment Map (SAM) file

Virus reference genome:

Fastq files:
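Once an aligner has written a SAM file, a first sanity check is to count how many reads actually mapped to each reference. The sketch below does this in plain Python by reading the FLAG and RNAME columns of each alignment line. The reference name `ref_virus` and the example records are made up for illustration; a real workflow would more likely use an established tool or library rather than hand parsing.

```python
from collections import Counter
from typing import Iterable

# FLAG bits defined by the SAM format.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_mapped_reads(sam_lines: Iterable[str]) -> Counter:
    """Count primary mapped alignments per reference sequence name."""
    counts: Counter = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # ignore unmapped reads and non-primary alignments
        counts[rname] += 1
    return counts


# Hypothetical SAM content: two mapped reads, one unmapped, one secondary.
example_sam = [
    "@SQ\tSN:ref_virus\tLN:35000",
    "r1\t0\tref_virus\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tref_virus\t400\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t256\tref_virus\t900\t0\t50M\t*\t0\t0\t*\t*",
]
mapped = count_mapped_reads(example_sam)
print(dict(mapped))
```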

Bioinformatics tools: known virus

Bioinformatic suites
• Geneious
• CLC Workbench
• BioNumerics
• Others

• Graphical User Interface (GUI)
  • Very intuitive and user friendly

• Alignment plugins
  • Pull reference genomes from NCBI

Bioinformatics tools: known virus

• Downstream analysis tools
  • Annotation
  • Phylogenetic analysis

Giberson et al (2011) NAR Stephanie Mitchell, PhD

Bioinformatic tools: unknown virus (agnostic)

• Goal: Detect any virus sequence present in a clinical sample

• Bioinformatic suites previously mentioned
  • Manually curate list of reference sequences to align reads against

• Web-based metagenomic pipelines
  • OneCodex
  • Taxonomer
  • CosmosID

k-mer classification

Wood & Salzberg (2014), Genome Biology 15(3)
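To make the idea of k-mer classification concrete, here is a deliberately simplified Python sketch: each reference is broken into k-mers, and a read is assigned to the reference that shares the most exact k-mers with it. This is a toy voting scheme, not the algorithm from the cited paper or from any of the web pipelines (it has no taxonomy tree, for example); the reference names, sequences, `k = 8`, and `min_hits` are all made-up values for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, Optional, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of a sequence."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of reference names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def classify(read: str, index: Dict[str, Set[str]], k: int,
             min_hits: int = 2) -> Optional[str]:
    """Return the reference with the most k-mer hits, or None if unclear."""
    votes: Counter = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    if not votes:
        return None  # no k-mer matched any reference
    (best, n_best), *rest = votes.most_common(2)
    if n_best < min_hits or (rest and rest[0][1] == n_best):
        return None  # too few hits, or a tie between references
    return best


# Made-up toy references and reads.
toy_refs = {
    "virus_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACA",
    "virus_B": "TTGACCGGTAAGCTTCGATCGGGCCTATAAG",
}
toy_index = build_index(toy_refs, k=8)
print(classify("GTACGTTAGCCTAGG", toy_index, k=8))
print(classify("CCCCCCCCCCCC", toy_index, k=8))
```

Because matching is exact, a database that lacks a close relative of the virus in the sample will produce few or no hits, which is one reason the number and diversity of reference genomes matters when comparing pipelines.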

OneCodex

Taxonomer

CosmosID

Pipeline feature                 | OneCodex   | Taxonomer  | CosmosID
Analysis time*                   | ~8 minutes | ~5 minutes | ~5 minutes
Number of virus genomes          | 5,137      | >90,000    | 5,025
Comparison between samples       | Yes        | No         | Yes
Upload many samples at once      | Yes        | No         | Yes
Searchable reference genome list | Yes        | No         | No
Independent view of virus hits   | No         | Yes        | Yes
Visual manipulation              | No         | Yes        | Yes
Connection to sequencer’s cloud  | No         | No         | Yes

*For 1 sample with 2 million reads

Pipeline comparison: Adenovirus from a conjunctival swab

Bioinformatic method                | Reads of Adenovirus | Type of Adenovirus
OneCodex                            | 39,175              | B
Taxonomer                           | 5,833               | B
CosmosID                            | 39,930              | B
Manual alignment and BLAST analysis | 38,573              | B, serotype 3

Pipeline comparison: Enterovirus from a nasopharyngeal aspirate

Bioinformatic method                | Reads of Enterovirus | Type of Enterovirus
OneCodex                            | 2                    | Not typed
Taxonomer                           | 609                  | EV-D68
CosmosID                            | 1,498                | EV-D68
Manual alignment and BLAST analysis | 1,174                | EV-D68

Interpretation

• Goal: Make accurate prediction with sequence data
  • Is the patient infected with virus “X”?
  • Is this SNP real?

• Important variables
  • Number of reads
  • Location of reads
  • Depth of coverage

• Sources of error
  • PCR errors during library prep or cluster generation
  • Read length (over-trimming)
  • Sequencing errors
  • Mapping errors
  • Contamination

Interpretation

• You have results from a web-based pipeline, now what?
  • How do you decide what is real/meaningful?

• Manual confirmation is recommended
  • Can differentiate real hits from false positives

• Genotyping/resistance mutation detection
  • SNPs can be the result of quality issues

Interpretation: Important variables

• Number of reads

• Location of reads
  • All in one region?

• Depth of Coverage

• Average coverage across the viral genome

• Coverage at each individual base
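The variables above can be computed directly from alignment coordinates. The following Python sketch derives per-base depth, average depth, and the fraction of the genome covered from a list of read intervals. The interval format (0-based, half-open `(start, end)` pairs), the function names, and the toy 10-base genome are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple


def per_base_depth(genome_len: int,
                   alignments: List[Tuple[int, int]]) -> List[int]:
    """Depth at each base, given 0-based half-open (start, end) intervals."""
    diff = [0] * (genome_len + 1)
    for start, end in alignments:
        start, end = max(start, 0), min(end, genome_len)
        if start < end:
            diff[start] += 1   # a read begins covering here
            diff[end] -= 1     # and stops covering here
    depth, running = [], 0
    for delta in diff[:genome_len]:
        running += delta
        depth.append(running)
    return depth


def summarize(depth: List[int], min_depth: int = 1) -> Dict[str, float]:
    """Average depth plus the fraction of bases covered at >= min_depth."""
    n = len(depth)
    return {
        "mean_depth": sum(depth) / n,
        "breadth": sum(d >= min_depth for d in depth) / n,
        "min_depth": min(depth),
        "max_depth": max(depth),
    }


# Two toy cases on a 10-base genome with the same number of reads:
spread = summarize(per_base_depth(10, [(0, 5), (5, 10)]))  # reads span the genome
piled = summarize(per_base_depth(10, [(0, 5), (0, 5)]))    # reads all in one region
print(spread)
print(piled)
```

Both toy cases have the same read count and the same average depth, but only the first covers the whole genome, which is why read location and per-base coverage are worth checking in addition to the number of reads.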

Depth of coverage example

Low read count example

Manual alignment to the Torque teno midi virus 2 reference genome: 0 reads

Interpretation: Sources of Error

• PCR errors during library prep or cluster generation

• Read length (over-trimming)

• Sequencing errors

• Mapping errors

• Contamination

• Positive and negative controls can help with some of these issues

Mapping error example

[Figure: panels A and B, each labeled "Measles virus reference genome"]

A = in silico Dolphin morbillivirus

B = in silico low-level measles virus

Schlaberg et al (2017) Arch Path & Lab Med

Contamination example

[Figure: percent false-positive Rhinovirus reads in sample with contamination (y-axis) versus Rhinovirus reads in contaminating sample (x-axis)]

A sample with 652,676 reads of Rhinovirus contaminates a neighbor with 100 reads of Rhinovirus. The neighbor was sequenced at a depth of 1 million reads.
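The arithmetic behind this example can be written out in a few lines of Python. The three read counts come from the slide; the two percentages are one plausible way to express them (contaminating reads as a share of the neighbor's total reads, and as a share of the source sample's Rhinovirus reads) and are not necessarily the exact quantity plotted in the original figure.

```python
# Read counts taken from the contamination example.
source_rhino_reads = 652_676      # Rhinovirus reads in the contaminating sample
neighbor_rhino_reads = 100        # false-positive Rhinovirus reads in the neighbor
neighbor_total_reads = 1_000_000  # sequencing depth of the neighbor


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the neighbor's reads that are false-positive Rhinovirus reads.
neighbor_false_positive_pct = percent(neighbor_rhino_reads, neighbor_total_reads)

# Carryover relative to the Rhinovirus reads in the source sample.
carryover_pct = percent(neighbor_rhino_reads, source_rhino_reads)

print(f"False-positive share of neighbor reads: {neighbor_false_positive_pct:.4f}%")
print(f"Carryover relative to source reads: {carryover_pct:.4f}%")
```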

Standardization and Validation

• NY State validation guidelines
  • Minimum of Q20 per base and Q20 per mapped read

• FDA Draft Guidance
  • Need for regulatory-grade sequence database
  • Cutoff values for positivity

• ARUP, UCSF, ASM PPC & CAP MRC clinical validation guidance
  • 5 million total reads per sample: cutoff in CSF
  • Minimum of 3 non-overlapping virus gene reads: cutoff for positive result

• ASM-AAM NGS Report
  • Interpretive guidelines
  • Quality standards
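Cutoffs like these are straightforward to encode as checks. The Python sketch below shows what a Phred score of 20 means as an error probability, and applies the two read-based cutoffs listed above. Treating "3 non-overlapping virus gene reads" as three reads whose mapped intervals do not overlap is an interpretation made for this example, and the function names and interval format are assumptions, not part of any cited guidance.

```python
from typing import List, Tuple


def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: Q20 is a 1-in-100 error rate."""
    return 10 ** (-q / 10)


def max_non_overlapping(intervals: List[Tuple[int, int]]) -> int:
    """Largest number of mutually non-overlapping (start, end) intervals."""
    count, last_end = 0, float("-inf")
    # Greedy choice: always take the interval that ends earliest.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            count += 1
            last_end = end
    return count


def passes_cutoffs(total_reads: int,
                   virus_read_intervals: List[Tuple[int, int]],
                   min_total_reads: int = 5_000_000,
                   min_regions: int = 3) -> bool:
    """Apply the total-read and non-overlapping-read cutoffs together."""
    return (total_reads >= min_total_reads
            and max_non_overlapping(virus_read_intervals) >= min_regions)


# Hypothetical mapped positions (half-open intervals) of four virus reads.
example_reads = [(0, 100), (50, 150), (200, 300), (400, 500)]
print(phred_to_error(20))
print(max_non_overlapping(example_reads))
print(passes_cutoffs(6_000_000, example_reads))
```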

Conclusions

• Lack of standardized NGS analysis protocols
  • Laboratories should look to published guidance

• All analysis pipelines have limitations
  • Number and diversity of curated sequences
  • False-positive hits

• Data should be scrutinized
  • Manual confirmation of pipeline hits
  • Quality filtering

Resources

References:

• Wadsworth validation guidelines: http://www.wadsworth.org/sites/default/files/WebDoc/ID WGS NGS Molecular Guidelines for Isolates_0.pdf

• ARUP, UCSF, ASM PPC, CAP MRC Validation guidance: Schlaberg RS, Chiu CY, et al (2017) Archives of Path and Lab Medicine. doi: http://dx.doi.org/10.5858/arpa.2016-0539-RA

• FDA Draft guidance: https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM500441.pdf

• Broad Institute Best Practices: https://software.broadinstitute.org/gatk/best-practices/

• ASM-AAM Report: https://www.asm.org/images/Colloquia-report/NGS_Report.pdf

Talk to your genomics colleagues; they have probably encountered many of the same issues already.