Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo

Dec 21, 2015

Page 1

Vanderbilt Center for Quantitative Sciences Summer Institute

Sequencing Analysis (DNA)

Yan Guo

Page 2

Page 3

Alignment

ATCGGGAATGCCGTTAACGGTTGGCGT

Reference genome

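The human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability of unique alignment?

The chance that a random 100 bp sequence matches a given position is:

$1/(4 \times 4 \times \cdots \times 4) = 1/4^{100} \approx 1/(1.60694 \times 10^{60})$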
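The arithmetic on this slide can be checked directly. The sketch below is only an illustration: the function name is made up here, and it assumes the four bases are independent and equally likely, which real genomes (with their repeats) do not satisfy.

```python
from fractions import Fraction

GENOME_LENGTH = 3_000_000_000  # approximate human genome size, from the slide
READ_LENGTH = 100              # read length, from the slide


def random_match_probability(read_length: int) -> Fraction:
    """Probability that one specific sequence of this length sits at a given
    position, assuming 4 equally likely and independent bases."""
    return Fraction(1, 4 ** read_length)


p = random_match_probability(READ_LENGTH)

# Expected number of purely accidental matches over all genome positions.
expected_chance_hits = float(p * GENOME_LENGTH)

print(f"4^{READ_LENGTH} = {float(4 ** READ_LENGTH):.5e}")
print(f"expected chance matches genome-wide = {expected_chance_hits:.3e}")
```

Under these idealized assumptions the expected number of accidental matches is vanishingly small, which is why a 100 bp read usually has a single best location outside repetitive regions.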

Page 4

Alignment Tools

• BWA http://bio-bwa.sourceforge.net/

• Bowtie http://bowtie-bio.sourceforge.net/index.shtml

Aligning 30 million reads by brute-force comparison against the whole genome would take on the order of 30 million x 3 billion time units.

Both tools are based on the Burrows-Wheeler transform (BWT).
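To show why the Burrows-Wheeler transform avoids scanning the whole reference for every read, here is a toy sketch of the transform plus exact-match counting by backward search. It is not how BWA or Bowtie are implemented (they use compressed, sampled indexes and allow mismatches); the function names are hypothetical and the reference string is the short example sequence from the Alignment slide.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(reference: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    text = reference + "$"
    # Naive suffix array; real tools build this far more efficiently.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = [text[i - 1] for i in suffix_array]  # this is the BWT

    # first_row[c]: number of characters in text strictly smaller than c.
    alphabet = sorted(set(text))
    first_row, running = {}, 0
    for c in alphabet:
        first_row[c] = running
        running += text.count(c)

    # occ[c][i]: number of times c appears in last_column[:i].
    occ = {c: [0] * (len(text) + 1) for c in alphabet}
    for i, ch in enumerate(last_column):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)

    # Narrow the suffix-array interval one pattern character at a time,
    # starting from the end of the pattern.
    lo, hi = 0, len(text)
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


reference = "ATCGGGAATGCCGTTAACGGTTGGCGT"
print(bwt(reference))
print(count_occurrences(reference, "GG"))
```

Once the index is built, the search cost depends on the read length rather than the reference length, which is the key saving over the brute-force estimate.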

Page 5

Alignment Results – BAM files

• SAM – uncompressed
• BAM – compressed
• Format specification: http://samtools.github.io/hts-specs/SAMv1.pdf
• Sort and index before performing analysis
• Don’t forget to perform QC on the alignment
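As a small illustration of alignment QC, the sketch below tallies a few basic statistics from SAM text records using the FLAG and MAPQ columns defined in the SAM specification. The function name and the example records are invented for illustration; in practice a dedicated tool would be run on the BAM file instead.

```python
# FLAG bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Tally primary, mapped, and duplicate records from SAM text lines."""
    stats = {"primary": 0, "mapped": 0, "duplicates": 0, "mapq_sum": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        stats["primary"] += 1
        if not flag & FLAG_UNMAPPED:
            stats["mapped"] += 1
            stats["mapq_sum"] += mapq
        if flag & FLAG_DUPLICATE:
            stats["duplicates"] += 1
    return stats


# Made-up records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tchr1\t150\t30\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r4\t1024\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r1\t256\tchr2\t900\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
]
summary = summarize_sam(example)
print(summary)
print("mapping rate:", summary["mapped"] / summary["primary"])
```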

Page 6

How to call SNPs

http://www.broadinstitute.org/igv/

Page 7

Local Realignment

Page 8

Recalibration

Why do we need realignment and recalibration for DNA but not RNA?

Page 9

SNP calling

• GATK https://www.broadinstitute.org/gatk/

• Varscan http://varscan.sourceforge.net/

Page 10

VCF files

Page 11

Annotation using ANNOVAR

http://www.openbioinformatics.org/annovar/

Page 12

Somatic Mutation

• Different from SNP (not germline)
• Both tumor and normal samples are needed to accurately define a somatic mutation
• Tumor sample is almost never 100% tumor
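The last point matters because normal cells in the tumor sample dilute the mutant allele. The sketch below uses the standard purity-weighted allele count to estimate the expected variant allele fraction (VAF); the function name, the default of one mutated copy, and the diploid defaults are assumptions made for illustration.

```python
def expected_vaf(purity, mutant_copies=1, tumor_ploidy=2, normal_ploidy=2):
    """Expected fraction of reads carrying a somatic mutation.

    purity        -- fraction of cells in the sample that are tumor (0..1)
    mutant_copies -- copies of the mutant allele per tumor cell (assumed)
    tumor_ploidy  -- total copies of the locus per tumor cell (assumed)
    normal_ploidy -- total copies of the locus per normal cell (assumed)
    """
    if not 0 <= purity <= 1:
        raise ValueError("purity must be between 0 and 1")
    tumor_alleles = purity * tumor_ploidy
    normal_alleles = (1 - purity) * normal_ploidy
    return purity * mutant_copies / (tumor_alleles + normal_alleles)


for purity in (1.0, 0.7, 0.4, 0.1):
    print(f"purity {purity:.0%}: expected VAF {expected_vaf(purity):.3f}")
```

A heterozygous somatic mutation in a sample that is only 40% tumor is therefore expected in roughly one read in five, which is why somatic callers cannot rely on the 50%/100% allele fractions used for germline genotypes.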

Page 13

Somatic mutation callers

• MuTect http://www.broadinstitute.org/cancer/cga/mutect

• Varscan http://varscan.sourceforge.net/

Page 14

Quality Control on SNPs

• Number of novel non-synonymous SNPs ~ 100 – 200
• Transition / transversion ratio
• Heterozygous / non-reference homozygous ratio
• Heterozygous consistency
• Strand bias
• Cycle bias
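Two of the ratios in this list are simple counts over the called variants. The sketch below computes them from hypothetical inputs: a list of (reference, alternate) base pairs for the Ti/Tv ratio and a list of VCF-style genotype strings for the heterozygous to non-reference homozygous ratio. The function names and example data are made up for illustration.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Transition / transversion ratio for a list of (ref, alt) base pairs."""
    ti = sum((ref.upper(), alt.upper()) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes):
    """Heterozygous / non-reference homozygous ratio for biallelic genotypes."""
    het = sum(1 for g in genotypes if g in ("0/1", "1/0", "0|1", "1|0"))
    hom_alt = sum(1 for g in genotypes if g in ("1/1", "1|1"))
    return het / hom_alt if hom_alt else float("inf")


# Made-up calls, for illustration only.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
genotypes = ["0/1", "0/1", "1/1", "0/0", "0|1"]
print("Ti/Tv:", ti_tv_ratio(snvs))
print("het / non-ref hom:", het_hom_ratio(genotypes))
```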

Page 15

Ti/Tv ratio

Page 16

Heterozygous / non reference homozygous ratio

Page 17

Ti/Tv ratio by race and regions

Page 18

Heterozygous / non reference homozygous ratio by race and regions

Page 19

Heterozygous Genotype Consistency

Page 20

Strand Bias

Table 1. Strand bias examples from real data

Chr  Pos       Depth  a   b   c   d  Forward strand genotype  Reverse strand genotype
6    32975014  21     5   5   10  1  Heterozygous             Homozygous
1    81967962  38     20  11  7   0  Heterozygous             Homozygous
12   10215654  31     15  9   7   0  Heterozygous             Homozygous

a: forward strand reference allele
b: forward strand non-reference allele
c: reverse strand reference allele
d: reverse strand non-reference allele
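One common way to quantify strand bias from the four counts in Table 1 is Fisher's exact test on the 2x2 table of strand by allele. The slides do not say which statistic was used, so the choice of test and the function name below are assumptions; the counts are the three rows of the table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows are strands (forward: a, b; reverse: c, d) and columns are alleles
    (reference: a, c; non-reference: b, d).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum over all tables at least as extreme as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# (chr, pos, a, b, c, d) taken from Table 1.
table_1 = [
    ("6", 32975014, 5, 5, 10, 1),
    ("1", 81967962, 20, 11, 7, 0),
    ("12", 10215654, 15, 9, 7, 0),
]
for chrom, pos, a, b, c, d in table_1:
    p = fisher_exact_two_sided(a, b, c, d)
    print(f"chr{chrom}:{pos} strand bias p-value = {p:.4f}")
```

A small p-value indicates that the non-reference allele is distributed unevenly between the two strands, as in the rows above where one strand looks heterozygous and the other homozygous.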

Page 21

Cycle Bias

Page 22

Pooled Analysis

• Pool samples together without barcodes
• Saves money
• Can only be used to evaluate allele frequency
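Because pooled reads cannot be traced back to individuals, the only quantity that can be estimated per site is the allele frequency. The sketch below shows that estimate and the lowest frequency a pool can represent; both function names are hypothetical, and the code assumes every individual contributes the same amount of DNA to the pool.

```python
def pooled_allele_frequency(alt_reads, total_reads):
    """Estimate the allele frequency at one site from pooled read counts."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def lowest_possible_frequency(n_individuals, ploidy=2):
    """Frequency of a single variant chromosome in a pool of individuals,
    assuming equal DNA contribution from each individual."""
    return 1 / (n_individuals * ploidy)


# Made-up numbers, for illustration only.
print(pooled_allele_frequency(alt_reads=12, total_reads=400))
print(lowest_possible_frequency(n_individuals=25))
```

If the estimated frequency is far below the lowest value the pool can represent, the non-reference reads are more likely to be sequencing errors than a real variant.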

Page 23

Pooled Analysis - Conclusion

Page 24

Advanced Data Mining

Page 25

The known and unknown of sequencing data

Page 26

The known and unknown of sequencing data

Known | Unknown

Page 27

The known and unknown of sequencing data

Known | Known Unknown | Unknown Unknown

Page 28

Known – things we have always known sequencing data can do

SNV, mutation

CNV (Xie et al. BMC Bioinformatics, 2009)

Structural variants (Alkan et al. Nature Reviews Genetics, 2011)

Page 29

Known Unknown – other information we have found that sequencing data contain

Known | Known Unknown | Unknown Unknown

SNVs and mutations in non-targeted regions

Mitochondria

Virus and microbe

Page 30

How is additional data mining possible?

• Data mining is possible because capture techniques are not perfect.

Page 31

Capture Efficiency of The Three Major Capture Kits

Page 32

Potential Functions of Intronic and Intergenic Regions

ENCODE suggested that over 80% of the human genome may be functional.

The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic).

Page 33

Coverage of the Unintended Regions

• Coverage does not drop off suddenly where the capture region ends.

• Capture region example: chr1 1000 1500

[Figure: coverage relative to the capture region boundaries at 1000 and 1500]
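To make use of reads just outside the targets, each position can be labeled by where it falls relative to the capture regions. The sketch below uses the slide's example region; the function name, the 200 bp flank width, and the BED-style half-open coordinate convention are assumptions made for illustration.

```python
def classify_position(pos, targets, flank=200):
    """Label a position as 'target', 'flank', or 'off-target'.

    targets -- list of (start, end) capture intervals on one chromosome,
               treated as 0-based half-open, as in BED files (assumed)
    flank   -- how far beyond a target still counts as flanking (assumed)
    """
    for start, end in targets:
        if start <= pos < end:
            return "target"
    for start, end in targets:
        if start - flank <= pos < end + flank:
            return "flank"
    return "off-target"


capture_regions = [(1000, 1500)]  # the chr1 example from the slide
for pos in (900, 1200, 1600, 5000):
    print(pos, classify_position(pos, capture_regions))
```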

Page 34

Reads Aligned to Non-Target Regions Can Be Used to Detect SNPs

• Tibetan exome study: through exome sequencing of 50 Tibetan subjects, 2 intronic SNPs were identified as associated with high altitude. (Yi, et al. Science, 2010)

• Non-capture region study: reads from outside the capture regions were studied to show that they can be used to infer reliable SNPs. (Guo, et al. BMC Genomics)

Page 35

Known unknown - Mitochondria

However, the mitochondrial genome is only 16,569 bp long.

Assumptions: 40 million reads, 100 bp read length
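These assumptions show why even a small share of reads gives deep mitochondrial coverage. In the sketch below the read count, read length, and mitochondrial genome size come from the slide; the 0.1% fraction of reads mapping to the mitochondrial genome is a made-up value for illustration, and the function name is hypothetical.

```python
MITO_LENGTH = 16_569            # mitochondrial genome size, from the slide
NUCLEAR_LENGTH = 3_000_000_000  # approximate nuclear genome size
TOTAL_READS = 40_000_000        # from the slide's assumptions
READ_LENGTH = 100               # from the slide's assumptions


def mean_depth(n_reads, read_length, region_length):
    """Average depth if n_reads of read_length bp fall evenly on the region."""
    return n_reads * read_length / region_length


# Hypothetical: 0.1% of all reads align to the mitochondrial genome.
mito_fraction = 0.001
mito_depth = mean_depth(TOTAL_READS * mito_fraction, READ_LENGTH, MITO_LENGTH)

# For comparison: the same run spread evenly over the nuclear genome.
genome_wide_depth = mean_depth(TOTAL_READS, READ_LENGTH, NUCLEAR_LENGTH)

print(f"mitochondrial depth at {mito_fraction:.1%} of reads: {mito_depth:.0f}x")
print(f"uniform genome-wide depth: {genome_wide_depth:.2f}x")
```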

Page 36

Dealing with nuMTs

Page 37

Alignment Results

Page 38

Extract mitochondria from exome sequencing

Tools:
• Picardi et al. Nature Methods, 2012
• Guo et al. Bioinformatics, 2013 (MitoSeek)

Diagnosis:
• Dinwiddie et al. Genomics, 2013
• Nemeth et al. Brain, 2013

Page 39

Virus

• Virus sequences can be captured through high throughput sequencing of human samples

• HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012)

• HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)

Page 40

HPV Alignment Example

Page 41

Tools for Detecting Viruses from Sequencing Data

• PathSeq (Kostic, et al. Nature Biotechnology, 2011)
• VirusSeq (Chen, et al. Bioinformatics, 2012)
• ViralFusionSeq (Li, et al. Bioinformatics, 2012)
• VirusFinder (Wang, et al. PLOS ONE, 2013)

Page 42

The Data Mining Ideas Applied to RNA

• RNAseq has been used as a replacement for microarrays.

• Other applications of RNAseq include detection of alternative splicing and fusion genes.

• Additional data mining opportunities are also available for RNAseq data.

Page 43

SNV and Indel

• Difficult due to high false positive rate
• RNAMapper (Miller, et al. Genome Research, 2013)
• SNVQ (Duitama, et al. BMC Genomics, 2013)
• FX (Hong, et al. Bioinformatics, 2012)
• OSA (Hu, et al. Bioinformatics, 2012)

Page 44

Microsatellite instability

Examples:
• Yoon, et al. Genome Research, 2013
• Zheng, et al. BMC Genomics, 2013

Page 45

RNA Editing and Allele-specific expression

RNA editing tools and databases
• DARNED, REDIdb, dbRES, RADAR

Allele-specific expression
• asSeq (Sun, et al. Biometrics, 2012)
• AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)

Page 46

Exogenous RNA

• Virus (same as DNA)
• Food RNA (you are what you eat)

Wang, et al. PLOS ONE, 2012

Page 47

nonCoding RNA

Page 48

Unknown Unknown

Contamination

Reference is not perfect

Unknown treasures

Page 49

Exome

Samuels, et al. Trends in Genetics, 2013

Page 50

RNAseq

Page 51

Quality Control

Quality Quantity

Guo et al. Briefings in Bioinformatics, 2013