8/19/2019 NGS ToolsFormats r1 bdg
NGS Data Analysis: Tools & Formats
Basic Workflow
[Figure: workflow diagram with the labels SEQUENCER, FASTQ, REFERENCE, SAM, and BAM]
File Formats
FASTA
● Simple text-based format.
● Each sequence starts with a > followed by the sequence identifier and, optionally, a description
● Usually indicated with the suffix *.fa, *.fasta, or *.fsa
>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGTGCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
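As a rough illustration of the layout above, here is a minimal Python sketch that reads FASTA records into a dictionary. The function name `parse_fasta` is hypothetical, and the sketch assumes the whole input fits in memory; it is not a replacement for an established parser.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its full sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">"; the rest is the description.
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = ""
        elif current is not None:
            # A sequence may wrap over several lines, so keep appending.
            records[current] += line
    return records


example = [
    ">seq_1 description",
    "ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGTGCAGATGCA",
    ">seq_2",
    "GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA",
    "CGATAGCAGATGCA",
]
records = parse_fasta(example)
print(sorted(records))
```

Note that the two lines of seq_2 are joined into one sequence, which is why the parser appends rather than overwrites.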
FASTQ
● text-based format
● four-line entry per sequence
● stores each sequence and its corresponding quality scores
● most commonly used format for storing sequencing reads
● usually indicated with the suffix *.fastq or *.fq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
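The four-line structure can be read with a few lines of Python. The sketch below assumes the strict four-lines-per-record layout shown above (no wrapped sequence or quality lines); `parse_fastq` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
reads = list(parse_fastq(example))
print(reads[0][0], len(reads[0][1]))
```

The length check reflects the rule that every base has exactly one quality character.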
Quality
Q = -10 log10(P), where P is the base-calling error probability (i.e., the probability that the corresponding base call is incorrect)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
http://en.wikipedia.org
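The quality formula and the character encoding can be tied together in a short Python sketch. It assumes the Phred+33 offset (the character "!" encodes Q = 0), which is an assumption not stated on the slide; older Illumina pipelines used an offset of 64. All function names are illustrative.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert a quality string to a list of Phred scores (offset is assumed)."""
    return [ord(ch) - offset for ch in qual]


scores = decode_quality("!5I")
print(scores)
print(phred_to_error_probability(20))
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.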
SAM
● SAM stands for Sequence Alignment/Map format
● TAB-delimited text format
● flexible enough to store all the alignment information generated
● allows most operations on the alignment to work on a stream without loading the whole alignment into memory
● allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
● consists of a header section (optional) and an alignment section
Li et al., 2009.
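To make the TAB-delimited layout concrete, the sketch below splits one alignment line into the eleven mandatory SAM fields and skips header lines, which start with "@". The example line is made up for illustration, and `parse_sam_line` is a hypothetical helper, not part of SAMtools.

```python
from typing import Dict, Optional

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Optional[Dict[str, object]]:
    """Return the mandatory fields of one alignment line, or None for a header line."""
    if line.startswith("@"):
        return None  # header section
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    record: Dict[str, object] = {
        name: int(value) if name in INT_FIELDS else value
        for name, value in zip(SAM_FIELDS, cols)
    }
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


# Hypothetical alignment line, for illustration only.
line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"])
```

Because each line is self-contained, a file can be processed one line at a time, which matches the streaming property listed above.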
BAM
• BAM is the compressed binary version of the SAM format
• compact and indexable representation of nucleotide sequence alignments
• uses a modified form of the gzip format called BGZF (Blocked GNU Zip Format)
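Because BGZF blocks are valid gzip members, a standard gzip reader can decompress a BAM file, and the decompressed stream begins with the four magic bytes "BAM" followed by the byte 1. The sketch below uses that to guess whether a file is a BAM; `looks_like_bam` is a hypothetical helper, and the demonstration file is an ordinary gzip file written only to exercise the check, not a real BAM.

```python
import gzip
import os
import tempfile


def looks_like_bam(path: str) -> bool:
    """Return True if the gzip-decompressed file starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        return False  # not gzip-compressed, or truncated


with tempfile.TemporaryDirectory() as tmp:
    fake_bam = os.path.join(tmp, "fake.bam")
    plain_text = os.path.join(tmp, "reads.sam")
    with gzip.open(fake_bam, "wb") as out:
        out.write(b"BAM\x01" + b"\x00" * 8)  # magic bytes plus padding, illustration only
    with open(plain_text, "w") as out:
        out.write("@HD\tVN:1.6\n")
    fake_result = looks_like_bam(fake_bam)
    text_result = looks_like_bam(plain_text)

print(fake_result, text_result)
```

For real work, a dedicated library or SAMtools is the safer choice, since this check says nothing about whether the rest of the file is valid.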
VCF
Variant Call Format
VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The format can also contain genotype information on samples for each position.
VCF specs v4.2
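The three kinds of lines can be separated with a small Python sketch: meta-information lines start with "##", the header line starts with a single "#", and everything else is a data line with eight fixed columns followed by an optional FORMAT column and one column per sample. The example records are made up, and `parse_vcf` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple

VCF_FIXED = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, object]]]:
    """Split VCF text into meta lines, sample names, and data records."""
    meta: List[str] = []
    samples: List[str] = []
    records: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            samples = line[1:].split("\t")[9:]  # columns after FORMAT are samples
        else:
            cols = line.split("\t")
            record: Dict[str, object] = dict(zip(VCF_FIXED, cols[:8]))
            record["POS"] = int(cols[1])
            if samples:
                keys = cols[8].split(":")  # FORMAT column, e.g. GT:DP
                record["SAMPLES"] = {
                    name: dict(zip(keys, value.split(":")))
                    for name, value in zip(samples, cols[9:])
                }
            records.append(record)
    return meta, samples, records


# Hypothetical VCF content, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT:DP\t0/1:10",
]
meta, samples, records = parse_vcf(example)
print(samples, records[0]["SAMPLES"])
```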
Hapmap
• text-based file format
• information for a series of SNPs as well as the germplasm lines is stored in one file
• the first row contains the header labels, and each additional row contains all the information associated with a single SNP
• the first 11 columns describe attributes of the SNP, while each of the following columns describes the SNP value for a single germplasm line
http://www.maizegenetics.net
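The split between the 11 attribute columns and the per-line genotype columns can be sketched in Python as follows. The sketch assumes a tab-delimited file; the attribute column names in the example follow the TASSEL HapMap convention as best understood here and should be treated as an assumption, and the SNP row and germplasm line names are made up.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_hapmap(lines: Iterable[str],
                 n_attributes: int = 11) -> Iterator[Tuple[Dict[str, str], Dict[str, str]]]:
    """Yield (SNP attributes, genotype call per germplasm line) for each row."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    header = next(rows, None)
    if header is None:
        return  # empty input
    line_names = header[n_attributes:]
    for cols in rows:
        attributes = dict(zip(header[:n_attributes], cols[:n_attributes]))
        calls = dict(zip(line_names, cols[n_attributes:]))
        yield attributes, calls


# Header names are assumed; the data row is hypothetical.
header = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
          "protLSID", "assayLSID", "panelLSID", "QCcode", "line_A", "line_B"]
row = ["snp_1", "A/G", "1", "1000", "+", "NA", "NA", "NA", "NA", "NA", "NA", "AA", "GG"]
example = ["\t".join(header), "\t".join(row)]
parsed = list(parse_hapmap(example))
print(parsed[0][0]["rs#"], parsed[0][1])
```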
GFF : General Feature Format
Ca8 GLEAN mRNA 76315 78595 0.990688 + . ID=Ca_11934;
Ca8 GLEAN CDS 76315 76450 . + 0 Parent=Ca_11934;
Ca8 GLEAN CDS 76668 76852 . + 2 Parent=Ca_11934;
Ca8 GLEAN CDS 77457 77657 . + 0 Parent=Ca_11934;
Ca8 GLEAN CDS 77994 78155 . + 0 Parent=Ca_11934;
Ca8 GLEAN CDS 78233 78595 . + 0 Parent=Ca_11934;
Ca8 GLEAN mRNA 85322 90545 0.655887 + . ID=Ca_11933;
Ca8 GLEAN CDS 85322 86173 . + 0 Parent=Ca_11933;
Ca8 GLEAN CDS 88630 89316 . + 0 Parent=Ca_11933;
Ca8 GLEAN CDS 89970 90545 . + 0 Parent=Ca_11933;
Ca8 GLEAN mRNA 94102 99473 0.967529 - . ID=Ca_11932;
Ca8 GLEAN CDS 98946 99473 . - 0 Parent=Ca_11932;
Ca8 GLEAN CDS 97180 97620 . - 0 Parent=Ca_11932;
Ca8 GLEAN CDS 96589 96819 . - 0 Parent=Ca_11932;
Ca8 GLEAN CDS 95733 95797 . - 0 Parent=Ca_11932;
Ca8 GLEAN CDS 95601 95658 . - 1 Parent=Ca_11932;
Ca8 GLEAN CDS 94282 94350 . - 0 Parent=Ca_11932;
Ca8 GLEAN CDS 94102 94200 . - 0 Parent=Ca_11932;
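Each record above has nine columns: sequence ID, source, feature type, start, end, score, strand, phase, and attributes. The Python sketch below parses one such line, assuming tab-separated columns as in GFF files on disk (the slide text shows spaces only because of how it was extracted). `parse_gff_line` is a hypothetical helper name.

```python
from typing import Dict, Optional

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gff_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one GFF feature line; return None for blank or comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(GFF_COLUMNS):
        raise ValueError("a GFF feature line needs exactly 9 columns")
    feature: Dict[str, object] = dict(zip(GFF_COLUMNS, cols))
    feature["start"] = int(cols[3])
    feature["end"] = int(cols[4])
    # Attributes are key=value pairs separated by semicolons.
    feature["attributes"] = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return feature


line = "Ca8\tGLEAN\tCDS\t76668\t76852\t.\t+\t2\tParent=Ca_11934;"
feature = parse_gff_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

The Parent attribute is what links each CDS row back to the mRNA row carrying the matching ID.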
GTF : General Transfer Format
• the first 8 columns are the same as in GFF
• the 9th column is structured differently
• it must begin with the 'gene_id' and 'transcript_id' attributes
• each attribute must end with a semicolon
GFF:
ID=geneA;Name=geneA
ID=exonA1;Parent=geneA
GTF:
gene_id "geneA"; transcript_id "geneA.1";
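The difference in the 9th column can be shown with two small Python parsers: GFF attributes are key=value pairs, while GTF attributes are a key followed by a quoted value and a terminating semicolon. Both helper names are hypothetical, and the regular expression is a simplification that assumes values contain no embedded double quotes.

```python
import re
from typing import Dict

# key, whitespace, quoted value, optional whitespace, semicolon
_GTF_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(text: str) -> Dict[str, str]:
    """Parse a GTF 9th column such as: gene_id "geneA"; transcript_id "geneA.1";"""
    return dict(_GTF_ATTRIBUTE.findall(text))


def parse_gff_attributes(text: str) -> Dict[str, str]:
    """Parse a GFF 9th column such as: ID=exonA1;Parent=geneA"""
    return dict(pair.split("=", 1) for pair in text.split(";") if "=" in pair)


gtf = parse_gtf_attributes('gene_id "geneA"; transcript_id "geneA.1";')
gff = parse_gff_attributes("ID=exonA1;Parent=geneA")
print(gtf)
print(gff)
```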
Tools
Quality Control
Why Quality Control?
• sequencing a poor library on multiple runs
• time required for analysis
• cost of analyzing data
• raw sequence data storage
• hours spent in analysis could be wasted
QC Tools
• FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • provides a quick overview to tell you in which areas there may be problems
  • summary graphs and tables to quickly assess your data
  • export of results to an HTML-based permanent report
  • offline operation to allow automated generation of reports without running the interactive application
• PRINSEQ Schmieder R and Edwards R, 2011
  • summary statistics for your sequence data
  • reformat and trim your sequences
  • easily configurable
• Trimmomatic Bolger et al., 2014
  • flexible read trimming tool for Illumina NGS data
  • trims adapters
  • fast, multithreaded command line tool
• Sickle https://github.com/najoshi/sickle
  • supports gzipped file inputs
  • works with both paired-end and single-end reads
  • easily configurable
• Cutadapt Marcel Martin, 2011
  • trims reads from current high-throughput sequencing machines
  • errors in the adapter are tolerated
  • input or output file can be gzip-compressed
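To give a feel for what quality trimming does, here is a deliberately simple Python sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by Trimmomatic, Sickle, or Cutadapt; the threshold of 20, the Phred+33 offset, and the name `trim_3prime` are all assumptions chosen for illustration.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_q."""
    keep = len(seq)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1  # step back past each low-quality base at the end
    return seq[:keep], qual[:keep]


# "I" encodes Q40 and "#" encodes Q2 under the assumed Phred+33 offset.
trimmed = trim_3prime("ACGTAC", "IIII##")
print(trimmed)
```

Real trimmers typically use windowed or running-sum criteria rather than a plain per-base cutoff, and they handle adapters and paired reads as well.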
[Figure: example of BAD quality data. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
[Figure: example of BAD quality data. Source: http://prinseq.sourceforge.net]
[Figure: example of GOOD quality data. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
Alignment Tools
• Also called mapping
• for experiments with a known genome
• align reads to the reference genome
• computationally intensive for large data volumes and large reference genomes
Bowtie2 Langmead and Salzberg, 2012
• an ultrafast and memory-efficient tool for aligning sequencing reads
• supports gapped, local, and paired-end alignment
• no upper limit on read length
BWA Li and Durbin, 2009
• fast and requires less memory compared to many other tools
• supports gapped alignment
• supports read lengths up to 1 Mb
• default configuration works for most typical inputs
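As a sketch of how the basic workflow (FASTQ plus reference to SAM to BAM) might be scripted around BWA, the Python below builds the command lines for indexing the reference, aligning with `bwa mem`, and sorting and indexing the result with SAMtools. The file names, the `sample` prefix, and the thread count are hypothetical, and the choice of `bwa mem` with `samtools sort` and `samtools index` is an assumption about a typical setup rather than something prescribed here. Nothing is executed unless `run_steps` is called, which requires both tools to be installed.

```python
import subprocess
from typing import List, Optional, Tuple

# (argument list, file that receives standard output, if any)
Step = Tuple[List[str], Optional[str]]


def build_steps(reference: str, reads1: str, reads2: Optional[str] = None,
                prefix: str = "sample", threads: int = 4) -> List[Step]:
    """Build the commands for index -> align -> sort -> index."""
    sam, bam = f"{prefix}.sam", f"{prefix}.sorted.bam"
    mem = ["bwa", "mem", "-t", str(threads), reference, reads1]
    if reads2 is not None:
        mem.append(reads2)  # second file of a paired-end run
    return [
        (["bwa", "index", reference], None),
        (mem, sam),  # bwa mem writes SAM to standard output
        (["samtools", "sort", "-o", bam, sam], None),
        (["samtools", "index", bam], None),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = build_steps("reference.fa", "reads_1.fastq", "reads_2.fastq")
for argv, _ in steps:
    print(" ".join(argv))
```

Keeping command construction separate from execution makes the plan easy to inspect or log before any long-running alignment starts.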
GS Reference Mapper Roche
• rapidly and accurately align reads to any reference genome
• identify differences compared to the reference
• annotate reference features and variations
• explore the full spectrum of genomic variation
IGV : Integrative Genomics Viewer
Robinson et al., 2011
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
Supports multiple data types
• Sequence alignments
• Genome annotations
• Variants/SNPs
• etc.
CLC Genomics Workbench
• commercial / paid application
• computationally less intensive
• proprietary internal algorithms
• flexible and scalable
• supports all typical NGS workflows
  • Resequencing
  • Mapping
  • Variant Detection
  • RNA-seq
  • De novo assembly
  • etc.
http://www.clcbio.com
References
• Bolger A. M., Lohse M. and Usadel B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics (2014), btu170.
• Robinson J. T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E. S., Getz G. and Mesirov J. P. Integrative Genomics Viewer. Nature Biotechnology (2011), 29:24-26.
• Langmead B. and Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods (2012), 9:357-359.
• Li et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009), 25(16):2078-2079.
• Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics (2009), 25:1754-1760.
• Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal (2011), 17:10-12.
• Schmieder R. and Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics (2011), 27:863-864.
Thank you!