Genomics Method Seminar - BWA October 15, 2014 Sora Kim Researcher sora15@yuhs.ac Yonsei Biomedical Science Institute Yonsei University College of Medicine
Dec 25, 2015

Emil Douglas
Transcript
Page 1: Genomics Method Seminar - BWA

October 15, 2014

Sora Kim, Researcher

sora15@yuhs.ac

Yonsei Biomedical Science Institute, Yonsei University College of Medicine

Page 2: Today's paper

• Heng Li, PhD
– A research scientist at the Broad Institute, working with David Reich and David Altshuler.
– Principal developer of several projects, including SAMtools, BWA, MAQ, TreeSoft and TreeFam, most of which he started as a postdoctoral fellow with Richard Durbin at the Wellcome Trust Sanger Institute.

Page 3: Software information

• Purpose
– BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human.

• Category
– Aligner

• Software URL
– http://bio-bwa.sourceforge.net/

• License
– Free, open source (BWA is distributed under GPLv3)

Page 4:

RNA-seq

ChIP-seq

WGS, WES

Page 5: Previous work

• Bowtie
– BWT + FM index
– LF mapping
– Backtracking

Page 6: Conceptual Overview

• BWA: for short reads
• BWA-SW: for long reads
• BWA-MEM: for both

Page 7: CUSHAW2 - MEMs

• Yongchao Liu and Bertil Schmidt, "Long read alignment based on maximal exact match seeds", Bioinformatics (2012) 28(18): i318-i324.

• CUSHAW2 is a parallelized, accurate, and memory-efficient long read aligner. It is based on the seed-and-extend approach and uses maximal exact matches (MEMs) as seeds to find gapped alignments.

Page 8: CUSHAW2 - MEMs

Page 9: CUSHAW2 - MEMs

1. Estimation of the minimal seed size
2. Generation of maximal exact matches

Page 10: 1. Estimation of the minimal seed size

• The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams (substrings of length q), where t = max(|P|,|S|)-q+1-q*e (Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming, Bioinformatics, Vol. 27 no. 10 2011, pages 1351–1358).

• That is, each error may destroy up to q overlapping q-grams, so e errors may destroy up to q*e of them.

• For non-overlapping q-grams, one error can destroy only the q-gram in which it is located.

• Given this assumption, we define the length q of the qgrams as the largest value below such that

Page 11: 1. Estimation of the minimal seed size

• A = ACGT
• B = ACTT
• Assume q = 2 and e = 1.

q(A) = {AC, CG, GT}
q(B) = {AC, CT, TT}

• t = max(|A|,|B|)-q+1-q*e
t = max(4, 4)-2+1-2*1 = 1

• q(A) and q(B) must therefore share at least t = 1 q-gram (here, AC).
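The worked example on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `qgrams` and `qgram_lower_bound` are made up here, and plain sets are enough because no q-gram repeats in either string.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_lower_bound(p, s, q, e):
    """Lower bound t from the q-gram lemma: max(|P|,|S|) - q + 1 - q*e."""
    return max(len(p), len(s)) - q + 1 - q * e


a, b = "ACGT", "ACTT"
q, e = 2, 1

shared = set(qgrams(a, q)) & set(qgrams(b, q))
t = qgram_lower_bound(a, b, q, e)

print(qgrams(a, q))  # q-grams of A
print(qgrams(b, q))  # q-grams of B
print(shared, t)     # shared q-grams and the lemma's lower bound
```

The single substitution (G to T) touches two overlapping 2-grams of A, which is why only one of the three q-grams survives.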

Page 12: 1. Estimation of the minimal seed size

• The estimation is based on the pigeonhole principle for non-overlapping q-grams: at least one q-gram of length Q is shared by S and its aligned substring mate on the genome.

• QL: global lower bound (default 13)
• QH: global upper bound (default 49)

• CUSHAW2 employs a simplified error model for ungapped alignments to estimate e, in which the number of substitutions w follows a binomial distribution.
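A minimal sketch of how such an estimate could be computed, under stated assumptions. It assumes the pigeonhole seed length is floor(|S| / (e + 1)) clamped to [QL, QH], and that e is taken as the smallest number of substitutions whose binomial cumulative probability reaches 1 minus a "miss" probability. The function names, the per-base error rate and the miss probability are illustrative choices, not values taken from the slides.

```python
import math


def estimate_errors(read_len, error_rate, miss_prob):
    """Smallest k such that P(w <= k) >= 1 - miss_prob for w ~ Binomial(read_len, error_rate)."""
    cumulative = 0.0
    for k in range(read_len + 1):
        cumulative += (math.comb(read_len, k)
                       * error_rate ** k
                       * (1.0 - error_rate) ** (read_len - k))
        if cumulative >= 1.0 - miss_prob:
            return k
    return read_len


def minimal_seed_size(read_len, e, q_low=13, q_high=49):
    """Pigeonhole: e errors split a read into e + 1 pieces, at least one of them error-free."""
    pigeonhole = read_len // (e + 1)
    return min(q_high, max(q_low, pigeonhole))


# Hypothetical inputs: 100 bp read, 2% substitution rate, 4% miss probability.
e = estimate_errors(100, 0.02, 0.04)
print(e, minimal_seed_size(100, e))
```

Clamping keeps very noisy reads from producing seeds so short that they match everywhere, and keeps clean reads from requiring seeds so long that a single error removes them.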

Page 13: 2. Generation of maximal exact matches

• To identify MEMs between S and T, we advance the starting position p in S from left to right to find the longest exact matches (LEMs) using the BWT and the FM-index.

• An LEM is both left- and right-maximal (that is, a MEM) if it is not part of any previously identified MEM.

• Discard the MEMs whose lengths are less than Q.
– For each remaining MEM, keep only its first h occurrences (h = 1024 by default) and discard the others.
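The left-to-right scan can be illustrated with a brute-force sketch. Note the substitution: plain substring search stands in for the BWT/FM-index, so this shows the logic only, not the performance. The function names and the toy sequences are hypothetical, and occurrences on the reference (the h cap) are not tracked.

```python
def longest_exact_match(s, t, p):
    """Length of the longest prefix of s[p:] that occurs somewhere in t."""
    length = 0
    while p + length < len(s) and s[p:p + length + 1] in t:
        length += 1
    return length


def find_mems(s, t, min_len):
    """Return (start, length) of MEMs on the query s, dropping those shorter than min_len."""
    mems = []
    covered_end = 0  # exclusive end of the right-most match found so far
    for p in range(len(s)):
        length = longest_exact_match(s, t, p)
        end = p + length
        # An LEM lying inside an earlier match is not left-maximal, so skip it.
        if length > 0 and end > covered_end:
            covered_end = end
            if length >= min_len:
                mems.append((p, length))
    return mems


query = "ACGTTTGCA"
reference = "GGACGTAATTTGCAGG"
print(find_mems(query, reference, min_len=4))
```

Because p only moves rightwards, an LEM is contained in an earlier match exactly when it does not end further right than anything seen so far, which is what `covered_end` tracks.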

Page 14: BWA-MEM

1. Aligning a single query sequence
a. Seeding and re-seeding
b. Chaining and chain filtering
c. Seed extension

2. Paired-end mapping
a. Rescuing missing hits
b. Pairing

Page 15: SE. Seeding and re-seeding

• BWA-MEM follows the canonical seed-and-extend paradigm.

• It seeds an alignment with SMEMs (super-maximal exact matches): at each query position, it essentially finds the longest exact match covering that position.

• To reduce mismappings caused by missing seeds, BWA-MEM introduces re-seeding. Suppose we have an SMEM of length l with k occurrences in the reference genome. If l is too large (over 28 bp by default), BWA-MEM re-seeds with the longest exact matches that cover the middle base of the SMEM and occur at least k+1 times in the genome.

Page 16: SE. Chaining and chain filtering

• A chain is a group of seeds that are colinear and close to each other.

• BWA-MEM greedily chains the seeds while seeding, then filters out short chains that are largely contained in a long chain and are much worse than it (by default, both 50% and 38 bp shorter than the long chain).

• Chain filtering aims to reduce unsuccessful seed extensions at a later step.

• Chains detected here do not need to be accurate.
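A rough sketch of greedy chaining followed by filtering. Everything here is a simplification under assumptions: seeds are (query_pos, ref_pos, length) tuples, "colinear and close" is approximated with the hypothetical thresholds `max_diag_diff` and `max_gap`, chain weight is the plain sum of seed lengths, and "largely contained" is read as overlapping at least half of the shorter chain on the query. Only the 50% and 38 bp figures come from the slide.

```python
def chain_seeds(seeds, max_gap=100, max_diag_diff=10):
    """Greedily append each seed to the first chain it is colinear with and close to."""
    chains = []
    for seed in sorted(seeds):
        q, r, _ = seed
        for chain in chains:
            lq, lr, ll = chain[-1]
            colinear = (q >= lq and r >= lr
                        and abs((r - q) - (lr - lq)) <= max_diag_diff)
            close = q - (lq + ll) <= max_gap and r - (lr + ll) <= max_gap
            if colinear and close:
                chain.append(seed)
                break
        else:
            chains.append([seed])
    return chains


def chain_span(chain):
    """Query interval covered by a chain, as (start, exclusive end)."""
    return chain[0][0], chain[-1][0] + chain[-1][2]


def chain_weight(chain):
    """Simplified weight: total seed length, ignoring overlaps between seeds."""
    return sum(length for _, _, length in chain)


def filter_chains(chains, min_ratio=0.5, min_diff=38):
    """Drop chains largely contained in a heavier kept chain and much worse than it."""
    kept = []
    for chain in sorted(chains, key=chain_weight, reverse=True):
        start, end = chain_span(chain)
        weight = chain_weight(chain)
        dropped = False
        for long_chain in kept:
            long_start, long_end = chain_span(long_chain)
            long_weight = chain_weight(long_chain)
            overlap = min(end, long_end) - max(start, long_start)
            largely_contained = overlap >= 0.5 * (end - start)
            much_worse = (weight < min_ratio * long_weight
                          and long_weight - weight >= min_diff)
            if largely_contained and much_worse:
                dropped = True
                break
        if not dropped:
            kept.append(chain)
    return kept


seeds = [(0, 1000, 30), (35, 1035, 40), (80, 1082, 20), (10, 5010, 20)]
chains = chain_seeds(seeds)
print(filter_chains(chains))
```

In the toy input, the seed at reference position 5010 forms its own short chain inside the query span of the long chain, so it is filtered before any extension is attempted.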

Page 17: SE. Seed extension

• Rank each seed by the length of the chain it belongs to, and then by the seed length.

• Drop the seed if it is already contained in an alignment found before; otherwise, if it potentially leads to a new alignment, extend it with banded affine-gap-penalty dynamic programming (DP).

Page 18: SE. Seed extension

• Banded affine-gap-penalty dynamic programming
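As an illustration of banded affine-gap DP, here is a small Gotoh-style extension that starts at the seed end and returns the best score reached anywhere in the band. It is a teaching sketch, not BWA-MEM's implementation: it allocates full matrices for readability (a real implementation stores only the band), has no drop-off heuristic, and the function name and `band=5` are hypothetical. The scoring defaults (match 1, mismatch 4, gap open 6, gap extend 1) are assumed to mirror BWA-MEM's defaults, so a gap of length k costs 6 + k.

```python
NEG_INF = float("-inf")


def banded_extend(query, ref, band=5, match=1, mismatch=4, gap_open=6, gap_ext=1):
    """Best extension score anchored at (0, 0), restricted to cells with |i - j| <= band."""
    n, m = len(query), len(ref)
    # H: best score ending at (i, j); E: ending with a gap in the query;
    # F: ending with a gap in the reference.
    H = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    H[0][0] = 0
    best = 0
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            if j > 0:  # consume a reference base only
                E[i][j] = max(H[i][j - 1] - gap_open - gap_ext, E[i][j - 1] - gap_ext)
            if i > 0:  # consume a query base only
                F[i][j] = max(H[i - 1][j] - gap_open - gap_ext, F[i - 1][j] - gap_ext)
            diag = NEG_INF
            if i > 0 and j > 0:
                score = match if query[i - 1] == ref[j - 1] else -mismatch
                diag = H[i - 1][j - 1] + score
            H[i][j] = max(diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


left, right = "ACGTACGTAC", "GGATCCGGATCC"
print(banded_extend(left + right, left + "T" + right))          # band wide enough for the gap
print(banded_extend(left + right, left + "T" + right, band=0))  # diagonal only
```

With a band of at least 1 the extension can pay 7 for the one-base gap and gain the 12 matches after it; with `band=0` only the main diagonal is explored and the extension effectively stops at the inserted base.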

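Page 19: SE. Seed extension

• BWA-MEM's seed extension differs from the standard seed extension in two aspects.

1. Suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWA-MEM stops the extension if the difference between the best extension score and the score at (x, y) is larger than Z + |x - y| × (gap extension penalty), where Z is a cutoff. This avoids extending through a poorly aligned region.

2. While extending a seed, BWA-MEM keeps track of the best extension score reaching the end of the query sequence. If the difference between the best local (SW) score and this end-to-end score is below a threshold, the end-to-end alignment is preferred, which lets BWA-MEM choose automatically between local and end-to-end alignments.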

Page 20: PE. Rescuing missing hits

• BWA-MEM estimates the mean and the variance of the insert size distribution from reliable single-end hits.

• For the top 100 hits (by default) of either end, if the mate is unmapped in a window [μ - 4σ, μ + 4σ] from each hit, where μ and σ are the estimated mean and standard deviation of the insert size, BWA-MEM performs SSE2-based Smith-Waterman alignment for the mate within the window.
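A small sketch of deriving a rescue window from observed insert sizes. It assumes the window is the mean plus or minus a fixed number of standard deviations, with `n_sigma=4` as an assumed default. The function name and sample values are made up, and a plain mean and standard deviation are used without any outlier handling.

```python
import statistics


def insert_size_window(insert_sizes, n_sigma=4):
    """Return (mean, std, (low, high)) for the given insert sizes."""
    mu = statistics.mean(insert_sizes)
    sigma = statistics.pstdev(insert_sizes)  # population standard deviation
    return mu, sigma, (mu - n_sigma * sigma, mu + n_sigma * sigma)


# Hypothetical insert sizes taken from confidently mapped pairs.
mu, sigma, window = insert_size_window([280, 300, 320, 300])
print(mu, sigma, window)
```

A mate that has no hit inside this window around its partner's hit would be a candidate for the Smith-Waterman rescue step.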

Page 21: PE. Rescuing missing hits

• Hits found from both the single-sequence alignment and SW rescuing are used for pairing.

Page 22: PE. Rescuing missing hits

• Hits found from both the single-sequence alignment and SW rescuing are used for pairing.

Page 23: PE. Pairing

• Given the i-th hit for the first read and the j-th hit for the second read, BWA-MEM computes their distance d_ij if the two hits are in the right orientation, or sets d_ij to infinity otherwise.

• It scores the pair (i, j) as S_ij = S_i + S_j - min{-a*log4 P(d_ij), U}, where S_i and S_j are the SW scores of the two hits and a is the matching score.

– P(d) gives the probability of observing an insert size larger than d, assuming a normal distribution.
– 'log4' arises when we interpret the SW score as an odds ratio.
– U is a threshold that controls pairing: if d_ij is small enough such that -a*log4 P(d_ij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments.
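The pairing rule can be sketched numerically. The code assumes the pair score has the form S_i + S_j - min(-a*log4 P(d), U), with P(d) taken literally as the upper tail of a normal distribution. The defaults `a=1` and `u=17` are assumed to match BWA-MEM's default matching score and unpaired penalty; the function names and the insert-size parameters in the example are hypothetical.

```python
import math


def insert_tail_probability(d, mu, sigma):
    """P(insert size > d) under Normal(mu, sigma); zero for an infinite distance."""
    if math.isinf(d):
        return 0.0
    return 0.5 * math.erfc((d - mu) / (sigma * math.sqrt(2)))


def pair_score(s_i, s_j, d, mu, sigma, a=1, u=17):
    """Sum of the two hit scores minus an insert-size penalty capped at u."""
    p = insert_tail_probability(d, mu, sigma)
    penalty = u if p <= 0.0 else min(-a * math.log(p, 4), u)
    return s_i + s_j - penalty


# Two hits scoring 100 each, insert size distribution ~ Normal(300, 30).
print(pair_score(100, 100, 300, 300, 30))       # distance at the mean: small penalty
print(pair_score(100, 100, 600, 300, 30))       # implausibly far apart: penalty capped at u
print(pair_score(100, 100, math.inf, 300, 30))  # wrong orientation: penalty capped at u
```

Once the penalty reaches the cap, the pair scores the same as two unpaired alignments, which is the point at which pairing stops being preferred.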

Page 24: Results

Page 25: Running Operation

• MEM mode
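For reference, a typical MEM-mode run is `bwa index` on the reference followed by `bwa mem`, which writes SAM to standard output. The sketch below builds those command lines in Python. The file names are placeholders, the helper functions are hypothetical, and only the `-t` (threads) option is used. `run_to_sam` is defined but not called, since it needs bwa on the PATH.

```python
import subprocess


def bwa_index_command(reference):
    """Command that builds the BWA index for a reference FASTA."""
    return ["bwa", "index", reference]


def bwa_mem_command(reference, reads1, reads2=None, threads=1):
    """Command for single-end (reads1 only) or paired-end (reads1 + reads2) alignment."""
    cmd = ["bwa", "mem", "-t", str(threads), reference, reads1]
    if reads2 is not None:
        cmd.append(reads2)
    return cmd


def run_to_sam(cmd, sam_path):
    """Run a bwa mem command and write its standard output (SAM) to sam_path."""
    with open(sam_path, "w") as sam:
        subprocess.run(cmd, stdout=sam, check=True)


# Placeholder file names, for illustration only.
print(" ".join(bwa_index_command("ref.fa")))
print(" ".join(bwa_mem_command("ref.fa", "reads_1.fq", "reads_2.fq", threads=4)))
```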

Page 26: SAM format - spec

Page 27: SAM format - example
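A SAM alignment line consists of 11 mandatory tab-separated fields followed by optional tags. As a small illustration, the sketch below parses one line into a dictionary. The record itself is made up for this example, and `parse_sam_line` is a hypothetical helper rather than part of any library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus a list of optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, cols)}
    record["TAGS"] = cols[11:]
    return record


# Invented record: first read of a properly paired, forward-strand alignment.
example = "read1\t99\tchr1\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(example)

print(rec["RNAME"], rec["POS"], rec["CIGAR"])
print("paired:", bool(rec["FLAG"] & 0x1), "reverse strand:", bool(rec["FLAG"] & 0x10))
```

FLAG is a bit field, so individual properties are read with bitwise AND, as in the last line.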

Page 28: Discussion

• For clearly long reads of 100 bp or more, the MEM algorithm is mainly used; for short reads of 100 bp or less, aln is recommended.

• Because of seed extension and local alignment, alignments can come out split more than necessary; options need to be adjusted to correct or post-process the results.