
NGS ToolsFormats r1 bdg

Jul 08, 2018


Rangga K Negara
  • 8/19/2019 NGS ToolsFormats r1 bdg

    1/32

    NGS Data Analysis: Tools & Formats


    Basic Workflow

    SEQUENCER → FASTQ → alignment against REFERENCE → SAM → BAM


    File Formats


    FASTA

    ● Simple text-based format.
    ● Each sequence starts with a > followed by the sequence identifier and, optionally, a description.
    ● Usually indicated with the suffix *.fa, *.fasta, or *.fsa.

    >seq_1 description
    ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGTGCAGATGCA
    >seq_2
    GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
    CGATAGCAGATGCA
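The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name `parse_fasta` is made up for illustration, and it assumes the identifier is the first word after the `>` and that a sequence may span several lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word; anything after it is the description.
            words = line[1:].split(maxsplit=1)
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next '>' header.
            records[current_id] += line
    return records


example = """\
>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGTGCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

For large files a generator that yields one record at a time would avoid holding every sequence in memory.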


    FASTQ

    ● text-based format
    ● four-line entry per sequence
    ● stores each sequence and its corresponding quality scores
    ● most commonly used format to store sequencing reads
    ● usually indicated with the suffix *.fastq or *.fq

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
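The four-line entry structure can be sketched as a small streaming reader. The names `FastqRecord` and `parse_fastq` are hypothetical, and the sketch assumes each record occupies exactly four lines (no wrapped sequences).

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    stream = (line.rstrip("\n") for line in lines)
    for header in stream:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stream, None)
        separator = next(stream, None)
        quality = next(stream, None)
        if quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, quality)


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

fastq_records = list(parse_fastq(example))
print(fastq_records[0].identifier, len(fastq_records[0].sequence))
```

Because the reader is a generator, it processes one read at a time, which matters for files holding millions of reads.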


    Quality

    Q = -10 log10(P), where P is the base-calling error probability (i.e., the probability that the corresponding base call is incorrect)

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
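The relationship between Q and P, and the mapping from quality characters to scores, can be sketched as follows. The slides do not state which ASCII offset is in use; the sketch assumes the common Phred+33 encoding (Sanger / Illumina 1.8+), so `PHRED_OFFSET` is an assumption, and the function names are illustrative.

```python
import math
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; older Illumina files used 64


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Compute Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


scores = decode_quality("!+5I")
print(scores)
print([phred_to_error_probability(q) for q in scores])
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.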


    http://en.wikipedia.org

  • 8/19/2019 NGS ToolsFormats r1 bdg

    9/32

    SAM

    ● SAM stands for Sequence Alignment/Map format
    ● TAB-delimited text format
    ● flexible enough to store all the alignment information generated
    ● allows most operations on the alignment to work on a stream without loading the whole alignment into memory
    ● allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
    ● consists of a header section (optional) and an alignment section

    Li et al., 2009.
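A minimal sketch of stream-wise SAM reading, one line at a time. It relies on two facts from the SAM specification: header lines start with `@`, and every alignment line has 11 mandatory TAB-separated fields. The function name `read_sam` and the example lines are illustrative, not taken from the slides.

```python
from typing import Dict, Iterable, Iterator

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def read_sam(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per alignment line, skipping the optional header."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # header section (or blank line)
        columns = line.split("\t")
        if len(columns) < len(SAM_FIELDS):
            raise ValueError("expected at least 11 columns: " + line)
        record: Dict[str, object] = dict(zip(SAM_FIELDS, columns))
        for name in INTEGER_FIELDS:
            record[name] = int(columns[SAM_FIELDS.index(name)])
        record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, if any
        yield record


# Illustrative input: two header lines followed by one alignment line.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
]

alignments = list(read_sam(example))
print(alignments[0]["QNAME"], alignments[0]["POS"], alignments[0]["CIGAR"])
```

Because `read_sam` is a generator, it can be fed an open file handle or a pipe and never needs the whole alignment in memory.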



    BAM

    • BAM is the compressed binary version of the SAM format

    • compact and index-able representation of nucleotide sequence alignments

    • uses a modified form of gzip format called BGZF (Blocked GNU Zip Format)
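Because BGZF is gzip-compatible, a standard gzip reader can decompress the start of a BAM file. The sketch below uses that to check for the BAM magic bytes (`BAM\1`) defined in the SAM/BAM specification. The function name `looks_like_bam` is hypothetical, and the check only confirms a gzip container holding the magic bytes; it does not validate the BGZF block structure. The demo file is a stand-in written for illustration, not a real alignment.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"


def looks_like_bam(path: str) -> bool:
    """Return True if the file decompresses as gzip and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(len(BAM_MAGIC)) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, unreadable, or truncated.
        return False


# Demo with stand-in files: one gzip stream with the magic bytes, one plain text file.
with tempfile.TemporaryDirectory() as workdir:
    fake_bam = os.path.join(workdir, "standin.bam")
    with open(fake_bam, "wb") as out:
        out.write(gzip.compress(BAM_MAGIC + b"\x00" * 8))
    plain_sam = os.path.join(workdir, "plain.sam")
    with open(plain_sam, "w") as out:
        out.write("@HD\tVN:1.6\n")
    bam_detected = looks_like_bam(fake_bam)
    sam_detected = looks_like_bam(plain_sam)

print(bam_detected, sam_detected)
```

For real work a dedicated library or samtools is the safer route; this only illustrates the gzip compatibility.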


    VCF

    Variant Call Format

    VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.

    VCF specs v4.2
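The three kinds of VCF lines can be told apart by their prefix: `##` for meta-information, a single `#` for the header line, and no prefix for data lines. The sketch below follows the fixed column order of the v4.2 specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample). The function name `parse_vcf` and the example values are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, object]]]:
    """Return (meta lines, sample names, data records) from uncompressed VCF text."""
    meta: List[str] = []
    samples: List[str] = []
    records: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            # Header line: eight fixed columns, FORMAT, then sample names.
            samples = line[1:].split("\t")[9:]
        else:
            fields = line.split("\t")
            record: Dict[str, object] = {
                "CHROM": fields[0],
                "POS": int(fields[1]),
                "ID": fields[2],
                "REF": fields[3],
                "ALT": fields[4].split(","),
                "QUAL": fields[5],
                "FILTER": fields[6],
                "INFO": fields[7],
            }
            if samples:
                keys = fields[8].split(":")
                record["GENOTYPES"] = {
                    sample: dict(zip(keys, value.split(":")))
                    for sample, value in zip(samples, fields[9:])
                }
            records.append(record)
    return meta, samples, records


# Illustrative input with two hypothetical samples.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "20\t14370\t.\tG\tA\t29\tPASS\tDP=14\tGT:DP\t0|0:1\t1|0:8",
]

vcf_meta, vcf_samples, vcf_records = parse_vcf(example)
print(vcf_samples, vcf_records[0]["POS"], vcf_records[0]["GENOTYPES"]["S2"]["GT"])
```

Compressed files would first need to be opened with a gzip reader; the parsing logic stays the same.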



    Hapmap

    • text-based file format
    • information for a series of SNPs as well as the germplasm lines is stored in one file
    • the first row contains the header labels, and each additional row contains all the information associated with a single SNP
    • the first 11 columns describe attributes of the SNP, while the following columns describe the SNP value for a single germplasm line

    http://www.maizegenetics.net
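The column layout described above (11 attribute columns, then one column per germplasm line) can be sketched as follows. The slides do not name the delimiter or the attribute labels; the sketch assumes TAB-delimited text and uses the header labels commonly seen in TASSEL HapMap files, and the SNP values are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

N_ATTRIBUTE_COLUMNS = 11  # from the description: the first 11 columns describe the SNP


def parse_hapmap(lines: Iterable[str], delimiter: str = "\t") -> Tuple[List[str], List[Dict[str, Dict[str, str]]]]:
    """Return (germplasm line names, one entry per SNP row)."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    attribute_names = header[:N_ATTRIBUTE_COLUMNS]
    line_names = header[N_ATTRIBUTE_COLUMNS:]
    snps = []
    for row in data:
        snps.append({
            "attributes": dict(zip(attribute_names, row[:N_ATTRIBUTE_COLUMNS])),
            "genotypes": dict(zip(line_names, row[N_ATTRIBUTE_COLUMNS:])),
        })
    return line_names, snps


# Hypothetical two-line panel; header labels are an assumption, values are invented.
header = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#", "center",
          "protLSID", "assayLSID", "panelLSID", "QCcode", "LINE_A", "LINE_B"]
row = ["snp_1", "A/G", "1", "1234", "+", "NA", "NA", "NA", "NA", "NA", "NA", "A", "G"]
example = ["\t".join(header), "\t".join(row)]

line_names, snps = parse_hapmap(example)
print(line_names, snps[0]["attributes"]["alleles"], snps[0]["genotypes"])
```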


    GFF : General Feature Format

    Ca8  GLEAN  mRNA  76315  78595  0.990688  +  .  ID=Ca_11934;
    Ca8  GLEAN  CDS   76315  76450  .         +  0  Parent=Ca_11934;
    Ca8  GLEAN  CDS   76668  76852  .         +  2  Parent=Ca_11934;
    Ca8  GLEAN  CDS   77457  77657  .         +  0  Parent=Ca_11934;
    Ca8  GLEAN  CDS   77994  78155  .         +  0  Parent=Ca_11934;
    Ca8  GLEAN  CDS   78233  78595  .         +  0  Parent=Ca_11934;
    Ca8  GLEAN  mRNA  85322  90545  0.655887  +  .  ID=Ca_11933;
    Ca8  GLEAN  CDS   85322  86173  .         +  0  Parent=Ca_11933;
    Ca8  GLEAN  CDS   88630  89316  .         +  0  Parent=Ca_11933;
    Ca8  GLEAN  CDS   89970  90545  .         +  0  Parent=Ca_11933;
    Ca8  GLEAN  mRNA  94102  99473  0.967529  -  .  ID=Ca_11932;
    Ca8  GLEAN  CDS   98946  99473  .         -  0  Parent=Ca_11932;
    Ca8  GLEAN  CDS   97180  97620  .         -  0  Parent=Ca_11932;
    Ca8  GLEAN  CDS   96589  96819  .         -  0  Parent=Ca_11932;
    Ca8  GLEAN  CDS   95733  95797  .         -  0  Parent=Ca_11932;
    Ca8  GLEAN  CDS   95601  95658  .         -  1  Parent=Ca_11932;
    Ca8  GLEAN  CDS   94282  94350  .         -  0  Parent=Ca_11932;
    Ca8  GLEAN  CDS   94102  94200  .         -  0  Parent=Ca_11932;
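The records above follow the nine-column GFF layout (sequence, source, feature type, start, end, score, strand, phase, attributes). As a minimal sketch, the code below parses such lines and sums CDS lengths per parent mRNA. It assumes TAB-delimited input, as in real GFF files, so the example rows are re-joined with tabs; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_attributes(column: str) -> Dict[str, str]:
    """Split 'key=value;key=value;' into a dict."""
    pairs = (item.partition("=") for item in column.split(";") if item.strip())
    return {key.strip(): value.strip() for key, _, value in pairs}


def parse_gff(lines: Iterable[str]) -> List[Dict[str, object]]:
    features = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        feature: Dict[str, object] = dict(zip(GFF_COLUMNS, line.split("\t")))
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        feature["attributes"] = parse_attributes(feature["attributes"])
        features.append(feature)
    return features


def cds_length_by_parent(features: List[Dict[str, object]]) -> Dict[str, int]:
    """Sum CDS lengths (coordinates are 1-based, inclusive) per Parent ID."""
    totals: Dict[str, int] = defaultdict(int)
    for feature in features:
        if feature["type"] == "CDS":
            parent = feature["attributes"]["Parent"]
            totals[parent] += feature["end"] - feature["start"] + 1
    return dict(totals)


rows = [
    "Ca8 GLEAN mRNA 76315 78595 0.990688 + . ID=Ca_11934;",
    "Ca8 GLEAN CDS 76315 76450 . + 0 Parent=Ca_11934;",
    "Ca8 GLEAN CDS 76668 76852 . + 2 Parent=Ca_11934;",
]
gff_lines = ["\t".join(row.split()) for row in rows]  # restore TAB delimiters

features = parse_gff(gff_lines)
cds_lengths = cds_length_by_parent(features)
print(cds_lengths)
```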


    GTF : General Transfer Format

    • the first 8 columns are the same as in GFF

    • the 9th column is structured differently

    • it must begin with the 'gene_id' and 'transcript_id' attributes

    • each attribute must end with a semicolon

    GFF

    ID=geneA;Name=geneA

    ID=exonA1;Parent=geneA

    GTF

    gene_id "geneA"; transcript_id "geneA.1";
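The difference in the 9th column can be made concrete with two small parsers, one per style. This is a sketch with hypothetical function names; it assumes attribute values contain no semicolons, which real files do not always guarantee.

```python
from typing import Dict


def parse_gff_attributes(column: str) -> Dict[str, str]:
    """GFF style: key=value pairs separated by semicolons."""
    result = {}
    for item in column.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            result[key] = value
    return result


def parse_gtf_attributes(column: str) -> Dict[str, str]:
    """GTF style: key "value" pairs, each terminated by a semicolon."""
    result = {}
    for item in column.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            result[key] = value.strip().strip('"')
    return result


gff_example = parse_gff_attributes("ID=exonA1;Parent=geneA")
gtf_example = parse_gtf_attributes('gene_id "geneA"; transcript_id "geneA.1";')
print(gff_example)
print(gtf_example)
```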


    Tools


    Quality Control

    Why Quality Control?
    • sequencing a poor library on multiple runs
    • time required for analysis
    • cost of analyzing data
    • raw sequence data storage
    • hours spent in analysis could be wasted


    QC Tools

    • FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
        • provides a quick overview to tell you in which areas there may be problems
        • summary graphs and tables to quickly assess your data
        • export of results to an HTML-based permanent report
        • offline operation to allow automated generation of reports without running the interactive application

    • PRINSEQ Schmieder R and Edwards R, 2011
        • summary statistics for your sequence data
        • reformat and trim your sequences
        • easily configurable


    • Trimmomatic Bolger et al., 2014
        • flexible read trimming tool for Illumina NGS data
        • trims adapters
        • fast, multithreaded command-line tool

    • Sickle https://github.com/najoshi/sickle
        • supports gzipped file inputs
        • works with both paired-end and single-end reads
        • easily configurable

    • Cutadapt Marcel Martin, 2011
        • trims reads from current high-throughput sequencing machines
        • errors in the adapter are tolerated
        • input or output file can be gzip-compressed
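To illustrate what error-tolerant adapter trimming means, here is a deliberately simple sketch. It is not the algorithm used by Cutadapt, Trimmomatic, or any other tool listed above: it only handles a 3' adapter, counts mismatches (no insertions or deletions), and the function name, the adapter sequence, the `max_error_rate` default, and the `min_overlap` default are all assumptions chosen for the example.

```python
def trim_3prime_adapter(read: str, adapter: str,
                        max_error_rate: float = 0.1,
                        min_overlap: int = 3) -> str:
    """Remove the adapter, and everything after it, from the 3' end of a read.

    The leftmost position where the read matches the adapter prefix with at
    most floor(max_error_rate * overlap) mismatches is treated as the adapter
    start. Partial adapters at the very end of the read are also considered.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # too few bases left to call an adapter reliably
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence
clean = trim_3prime_adapter("ACGTACGTAC" + ADAPTER, ADAPTER)
with_error = trim_3prime_adapter("ACGTACGTAC" + "AGATCGGTAGAGC", ADAPTER)  # one mismatch
untouched = trim_3prime_adapter("ACGTACGTAC", ADAPTER)
print(clean, with_error, untouched)
```

Production trimmers also handle indels, quality-aware trimming, and paired-end reads, so the tools above remain the practical choice.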

    [Figure: example of a BAD quality report. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]

    [Figure: example of a BAD quality report. Source: http://prinseq.sourceforge.net]

    [Figure: example of a GOOD quality report. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]


    Alignment Tools

    • Also called mapping
    • used for experiments with a known genome
    • aligns reads to the reference genome
    • computationally intensive for large data volumes and large reference genomes

    Bowtie2 Langmead and Salzberg, 2012

    • an ultrafast and memory-efficient tool for aligning sequencing reads
    • supports gapped, local, and paired-end alignment
    • no upper limit on read length


    BWA Li and Durbin, 2009

    • fast and requires less memory compared to many other tools
    • supports gapped alignment
    • supports read lengths up to 1 Mb
    • default configuration works for most typical inputs

    GS Reference Mapper Roche

    • rapidly and accurately aligns reads to any reference genome
    • identifies differences compared to the reference
    • annotates reference features and variations
    • explores the full spectrum of genomic variation


    IGV : Integrative Genomics Viewer

    Robinson et al., 2011

    The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

    Supports multiple data types
    • Sequence alignments
    • Genome annotations
    • Variants/SNPs
    etc.



    CLC Genomics Workbench

    • commercial / paid application
    • computationally less intensive
    • proprietary internal algorithms
    • flexible and scalable
    • supports all typical NGS workflows
        • Resequencing
        • Mapping
        • Variant Detection
        • RNA-seq
        • De novo assembly
        etc.

    http://www.clcbio.com


    References

    • Bolger A. M., Lohse M., Usadel B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics (2014), btu170.

    • Langmead B., Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods (2012), 9:357-359.

    • Li et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009), 25(16):2078-2079.

    • Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics (2009), 25:1754-1760.

    • Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal (2011), 17:10-12.

    • Robinson J. T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E. S., Getz G., Mesirov J. P. Integrative Genomics Viewer. Nature Biotechnology (2011), 29:24-26.

    • Schmieder R., Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics (2011), 27:863-864.


    Thank you!