Next-Generation Sequencing (NGS) Technologies and Data Analysis · 2010-05-04

Post on 02-Aug-2020


Next-Generation Sequencing (NGS) Technologies and Data Analysis

Christopher E. Mason

TA: Paul Zumbo

Spring 2010

Class #2: Alignments, QC, and data processing

A BWT Human Genome

Copy the version of BWA into the 1KG directory

$ cp BWA ../1KG/

Get Some Data (1KG)

Get one paired-end lane

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA06985/sequence_read/ERR001014_1.filt.fastq.gz

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/NA06985/sequence_read/ERR001014_2.filt.fastq.gz

Gunzip the files:

$ gunzip *.gz

Look at the data:

$ ls -lh

Count the lines:

$ wc ERR001014_2.filt.fastq

Do some math:

$ expr 23494300 / 4
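A FASTQ record occupies four lines (header, sequence, separator, qualities), so the number of reads is the line count divided by four. A minimal Python sketch of the same arithmetic; the function name `reads_from_line_count` is made up for illustration:

```python
def reads_from_line_count(line_count: int) -> int:
    """Return the number of FASTQ reads for a given line count (4 lines per read)."""
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return line_count // 4


# Line count used in the expr step of this walkthrough
print(reads_from_line_count(23494300))  # 5873575
```

The modulo check is an extra safeguard that `expr` does not perform: a remainder usually means the download or decompression was cut short.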

http://www.1000genomes.org

Performing Alignments

Perform the alignment (aln):

$ ./bwa aln ../genomes/hg19.fa ERR001014_1.filt.fastq > ERR001014_1.sai

$ ./bwa aln ../genomes/hg19.fa ERR001014_2.filt.fastq > ERR001014_2.sai

Other options listed at: http://bio-bwa.sourceforge.net/bwa.shtml

Convert Suffix Arrays to Positions

Generate Alignments in SAM format (Single End Reads)

$ ./bwa-0.5.7/bwa samse hg19.fa ERR001014_1.sai ERR001014_1.filt.fastq > ERR001014_1.sam

Generate Alignments in SAM format (Paired End Reads)

$ ./bwa-0.5.7/bwa sampe hg19.fa ERR001014_1.sai ERR001014_2.sai ERR001014_1.filt.fastq ERR001014_2.filt.fastq > ERR001014_PE.sam

Also, your I/O system can make a difference

Get a Lustre filesystem if you can! (short for Linux Cluster)

(NFS) Network File System
(PVFS) Parallel Virtual File System
(TerraFS) TerraScale Tech. File System
(GPFS) IBM General Parallel File System

Sun Microsystems

Threads, Errors, and Indels strongly affect the alignments' speed and accuracy

$ time ./bwa aln ../genomes/hg19.fa ERR001014_1.filt.fastq

[elapsed time not recoverable from the source]

$ time ./bwa aln -t 8 ../genomes/hg19.fa ERR001014_1.filt.fastq

[elapsed time not recoverable from the source]

$ time ./bwa aln -t 8 -e 10 ../genomes/hg19.fa ERR001014_1.filt.fastq

4m10.784s

Examples with a file with known variants:

http://physiology.med.cornell.edu/faculty/mason/lab/data/NGS/

$ ./bwa aln ../genomes/hg19.fa Indels.fastq > Indels.sai

$ ./bwa samse ../genomes/hg19.fa Indels.sai Indels.fastq > Indels.sam

Now find sequences with larger indels

$ ./bwa aln -e 10 ../genomes/hg19.fa Indels.fastq > Indels_e10.sai

$ ./bwa samse ../genomes/hg19.fa Indels_e10.sai Indels.fastq > Indels_e10.sam

Kurtosis and Skewness Estimate PE QC

Skewness: As close to zero as possible

Kurtosis: As high as possible (at least >0.6)
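Both statistics can be computed from the insert sizes of the paired-end alignments. The slides do not say which estimator is meant, so this sketch assumes simple population moments and reports excess kurtosis (a normal distribution scores 0); the insert sizes are hypothetical:

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical insert sizes from properly paired reads
insert_sizes = [180, 195, 200, 200, 200, 205, 220]
skew, kurt = skewness_and_kurtosis(insert_sizes)
print(f"skewness={skew:.3f} excess_kurtosis={kurt:.3f}")
```

A symmetric insert-size distribution gives a skewness of zero, which is the target named above.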

What is SAM?

SAM is a rapidly developing data specification and format for the storage of sequence alignments and their mapping coordinates.

Sequence Alignment/Map (SAM) also has a binary version of the format, called BAM.

SAMtools is a set of tools for manipulating and controlling SAM/BAM files

Bam-Bam of the Flintstones is currently unrelated to Heng Li and Richard Durbin’s work with SAM/BAM

SAM Output

@HD = Header

@SQ = Sequence Dictionary
LN = length of sequence

@RG = Read Group
ID = unique read group identifier
PU = Platform Unit
LB = Library
SM = Sample

SAM Output

QNAME = name of read

FLAG = Bitwise FLAG (0 to 2^16-1)

RNAME = Reference sequence name

POS = Position (1-based)

MAPQ = Mapping Quality (Phred-based)

CIGAR = CIGAR STRING

MRNM = Mate Reference Sequence

MPOS = 1-based Mate Position of the other seq

ISIZE = Inferred Insert Size

SEQ = Sequence reported on the + strand

QUAL = Quality scores (ASCII-33 = Phred)

TAG = TAG
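These fields are tab-separated on each alignment line, so a record can be split into named parts. A minimal Python sketch using the field names from the list above; the example record and the helper names are made up for illustration:

```python
from collections import namedtuple

# Field names follow the list above (SAM as described in these slides)
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual tags",
)


def parse_sam_line(line):
    """Split one tab-delimited alignment line into named fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], mrnm=cols[6], mpos=int(cols[7]),
        isize=int(cols[8]), seq=cols[9], qual=cols[10], tags=cols[11:],
    )


def phred_scores(qual):
    """Decode a QUAL string: Phred score = ASCII code - 33."""
    return [ord(ch) - 33 for ch in qual]


# Made-up record, for illustration only
line = "read1\t99\tchr7\t100\t37\t4M\t=\t250\t154\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, phred_scores(rec.qual))
```

Header lines (those starting with `@`) are not handled here and would need to be skipped before calling `parse_sam_line`.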

Bitwise Flags are Combined Bits

0100101001010

Bit 0 = The read was part of a pair during sequencing

Bit 1 = The read is mapped in a pair

Bit 2 = The query sequence is unmapped

Bit 3 = The mate is unmapped

Bit 4 = Strand of query (0=forward 1=reverse)

The flag value is the sum of the individual flags. If a flag is false, don't add anything to the total. If it's true, add 2 raised to the power of the bit position.

For example:

Bit 0 - false - add nothing

Bit 1 - true - add 2**1 = 2

Bit 2 - false - add nothing

Bit 3 - true - add 2**3 = 8

Bit 4 - true - add 2**4 = 16

Bit pattern = 11010 = 16+8+2 = 26

So the flag value would be 26.

Other Examples:

0 = 0000000

99 = 01100011

147 = 10010011

0 = Not paired, mapped, forward strand.

99 = Paired, Proper Pair, Mapped, Mate Mapped, Forward, Mate Reverse, First in pair, Not second in pair

147 = Paired, Proper Pair, Mapped, Mate Mapped, Reverse, Mate Forward, Not first in pair, Second in pair
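The additive rule can be sketched in Python. The bit labels are shorthand chosen for this sketch; bits 5 to 10 (mate strand, first/second in pair, not primary, QC failure, duplicate) follow the SAM flag definitions rather than the bit list above, which stops at bit 4:

```python
# Bit positions 0-10; the short labels are this sketch's own shorthand
FLAG_BITS = [
    "paired", "proper_pair", "query_unmapped", "mate_unmapped",
    "query_reverse", "mate_reverse", "first_in_pair", "second_in_pair",
    "not_primary", "fails_qc", "duplicate",
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in enumerate(FLAG_BITS) if flag & (1 << bit)]


def encode_flag(names):
    """Add 2**bit for every label that is true."""
    return sum(1 << FLAG_BITS.index(name) for name in names)


for value in (26, 99, 147):
    print(value, format(value, "08b"), decode_flag(value))
```

Decoding 99 yields paired, proper pair, mate reverse and first in pair; the absence of the unmapped and query-reverse bits is what the prose examples spell out as "Mapped, Mate Mapped, Forward".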

Bitwise Flag Explanation

Index  2^Index  MethodName                    Flag
0      1        isReadPartOfAPairedAlignment  0x0001
1      2        isReadAProperPairedAlignment  0x0002
2      4        isQueryUnmapped               0x0004
3      8        isMateUnMapped                0x0008
4      16       isQueryReverseStrand          0x0010  (false +, true -)
5      32       isMateReverseStrand           0x0020
6      64       isReadFirstPair               0x0040
7      128      isReadSecondPair              0x0080
8      256      isAlignmentNotPrimary         0x0100
9      512      doesReadFailVendorQC          0x0200
10     1024     isReadADuplicate              0x0400

SAM Specifications

SAM can store various alignments as a CIGAR format:

1. Standard

2. Clipped

a. Soft-clipped = non-matched sequence present in alignment

b. Hard-clipped = non-matched sequence missing from alignment

4. Spliced (Intron (N))

5. Multi-part

6. Padded (Insertions (I) and Deletions (D))

7. Color-space

SAM Specifications

Li et al, 2010

Some ambiguities remain

CIGAR format is a short way of storing mis-aligned bases to a reference genome.

In certain cases, CIGAR will need pileup-based padding, though this is currently not supported.
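A CIGAR string is a run of length/operation pairs, which makes it easy to take apart. A minimal Python sketch, assuming the usual operation letters (M, I, D, N, S, H, P, plus the = and X codes added in later versions of the specification); the example string is hypothetical:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble and compare so stray characters are not silently dropped
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Read bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


example = "3S10M2I5M1D4M"  # hypothetical alignment
print(parse_cigar(example), reference_span(example), query_length(example))
```

Soft-clipped bases (S) count toward the read length but not the reference span, while hard-clipped bases (H) count toward neither, matching the clipped cases listed for the format.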

What if I don't have fastq files, or I already have alignments? Some tools already exist to change formats:

SAMtools

Converters and Tools

samtools.pl

wgsim_eval.pl

blast2sam.pl

bowtie2sam.pl

export2sam.pl

novo2sam.pl

sam2vcf.pl

soap2sam.pl

zoom2sam.pl

qualfa2fq.pl

solid2fastq.pl (not recommended!)

Predicting Genetic Variation with SAMtools

First, you will need to get a toolkit, and we will use SAMtools:

http://sourceforge.net/projects/samtools/files/

http://samtools.sourceforge.net/

Download the source code:

http://sourceforge.net/projects/samtools/files/samtools/0.1.7/samtools-0.1.7a.tar.bz2/download

Unzip the tarball (or double-click if on Desktop):

$ bzip2 -cd samtools-0.1.7a.tar.bz2 | tar xvf -

Change into the new directory

$ cd samtools-0.1.7a

Compile the program

$ make

Move the executable into main directory

$ cp samtools ../

SAMtools main options

Pileup the Reads to call variants

Import the reads into BAM (binary SAM) format

$ ./samtools import ../genomes/hg19.fa Indels_e10.sam Indels_e10.bam

Sort the BAM file (for faster processing later)

$ ./samtools sort Indels_e10.bam Indels_e10.sorted

Perform a Pileup (layer the reads on top of each other)

$ ./samtools pileup -vcf ../genomes/hg19.fa Indels_e10.sorted.bam > Indels_e10.pileup_raw

If you want to clean up the pileup by depth of coverage:

$ perl samtools-0.1.7/samtools.pl varFilter -d 10 file.pileup.raw > Indels_e10.pileup_10x

If you want to clean up the pileup by quality scores of Q20 (column 6):

Thanks to Alfred Aho, Peter Weinberger, and Brian Kernighan (AWK):

$ awk '($6>=20)' file.pileup.raw > file.pileup_RMS20.out

To look closer at your list of variants:

$ less ERR.pileup_10x

What is in the output?

1. Chromosome: reference sequence name

2. Position: reference coordinate (1-based)

3. Reference Base: base of the genome, or `*' for an indel line

4. Genotype: where heterozygotes are encoded in the IUPAC/IUB code: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.

5. Consensus Quality: Phred-scaled likelihood that the genotype is wrong

6. SNP Quality: Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.

7. RMS: root mean square (RMS) mapping quality, a measure of the variance of quality scores

8. Coverage: # reads covering the position

9. Bases with Support/Indel#1: Bases used for SNP line, “^” from CIGAR N/S/H break, “$” end of read segment; the 1st indel allele otherwise

10. Quality of bases/Indel#2: base quality at a SNP line; the 2nd indel allele otherwise

11. INDEL#1: # reads directly supporting the 1st indel allele

12. INDEL#2: # reads directly supporting the 2nd indel allele

13. INDEL#3: # reads supporting a third indel allele

14. Blank
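With the columns numbered this way, filtering by SNP quality (column 6) or coverage (column 8) is a matter of splitting each line and comparing integers. A rough Python sketch; the function name, thresholds and example rows are made up for illustration, and unlike awk it splits on tabs only:

```python
def filter_pileup(lines, min_snp_quality=20, min_depth=0):
    """Yield pileup lines whose column 6 and column 8 pass the thresholds."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # skip malformed or truncated lines
        snp_quality = int(cols[5])  # column 6: SNP quality
        depth = int(cols[7])        # column 8: coverage
        if snp_quality >= min_snp_quality and depth >= min_depth:
            yield line


# Made-up consensus pileup lines (tab-separated), for illustration only
rows = [
    "chr7\t100\tA\tR\t30\t45\t60\t12\tG.g,\tIIII",
    "chr7\t200\tC\tY\t10\t15\t60\t3\tT.\tII",
]
for kept in filter_pileup(rows, min_snp_quality=20, min_depth=10):
    print(kept)
```

Only the first row survives, because the second has a SNP quality of 15 and a coverage of 3.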

SAMtools pileup uses the SNP model from MAQ, and has a few options

Options:

-c Create the consensus base at each position

-v Show positions that do not agree with the reference genome (hg19.fa)

-f The reference genome is in FASTA format

What are some other filters?

Parameter INTEGER [Default Value]

-Q INT minimum RMS mapping quality for SNPs [25]

-q INT minimum RMS mapping quality for gaps [10]

-d INT minimum read depth [3]

-D INT maximum read depth [100]

-G INT min indel score for nearby SNP filtering [25]

-w INT SNP within X bp around a gap to filter [10]

-W INT window size for filtering dense SNPs [10]

-N INT max number of SNPs in a window [2]

-l INT window size for filtering adjacent gaps [30]

International Union of Biochemistry (IUB) / International Union of Pure and Applied Chemistry (IUPAC) Codes

Code  Definition  Meaning
A     A           Adenine
C     C           Cytosine
G     G           Guanine
T     T           Thymine
R     AG          puRine
Y     CT          pYrimidine
K     GT          Keto
M     AC          aMino
S     GC          Strong
W     AT          Weak
B     CGT         Not A
D     AGT         Not C
H     ACT         Not G
V     ACG         Not T
N     AGCT        aNy
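The code table maps directly onto a lookup dictionary, which is handy when reading the genotype column of a pileup. A minimal Python sketch; the helper names `expand` and `is_heterozygote` are made up for illustration:

```python
# IUB/IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "AGCT",
}


def expand(code):
    """Return the set of bases an IUB/IUPAC code stands for."""
    return set(IUPAC[code.upper()])


def is_heterozygote(code):
    """Two-base codes are how a consensus genotype encodes a heterozygote."""
    return len(expand(code)) == 2


print(sorted(expand("R")), is_heterozygote("R"), is_heterozygote("A"))
```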

Tview

First, make your BAM Index

$ ./samtools index Indels_e10.sorted.bam

Now you can look at your alignments

$ ./samtools tview Indels_e10.sorted.bam ../genomes/hg19.fa

$ ./samtools tview ERR001014_PE.sorted.bam ../genomes/hg19.fa

Go to a specific interval: g

Then type: chr7:57546416

Open Factors of a Variation's Fidelity

How do we know the quality is good?

(N) Number of reads supporting that site,

(Pv) Probability of that platform-specific variant change,

(QVD) The average deviation of the quality values,

(T) The set of alignments with unique start sites,

(D) PCR Duplicates,

(S) Strand representation (half on one, half on the other),

(Z) Zygosity change (CNV regions)

(C) Cellular heterogeneity
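Some of these factors can be tallied directly from the reads that support a site. A rough Python sketch, assuming each supporting read has been reduced to a hypothetical `(start, strand)` tuple; the function name and the duplicate heuristic (identical start and strand) are illustrative assumptions, not a method from the slides:

```python
from collections import Counter


def site_support_summary(reads):
    """Summarise the reads supporting a variant.

    `reads` is a list of (start_position, strand) tuples, strand "+" or "-".
    """
    n = len(reads)                                       # (N) supporting reads
    unique_starts = len({start for start, _ in reads})   # (T) unique start sites
    strands = Counter(strand for _, strand in reads)
    forward_fraction = strands["+"] / n if n else 0.0    # (S) ideally near 0.5
    possible_duplicates = n - len(set(reads))            # (D) same start and strand
    return {
        "reads": n,
        "unique_starts": unique_starts,
        "forward_fraction": forward_fraction,
        "possible_duplicates": possible_duplicates,
    }


# Made-up supporting reads, for illustration only
print(site_support_summary([(100, "+"), (100, "+"), (105, "-"), (110, "-")]))
```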

Genomic Relativity can create interesting compound heterozygotes

chromo: chr11, location: 61652197
patient1: NA (0 reads), -G/+GTCG (12 reads)
patient2: NA (0 reads), -G/-G (8 reads)

Paternal (-G):     ACATTGGG
Reference:         ACATGTGGG
Maternal (+GTCG):  ACATGGTCGTGGG

-GGTCG/+GGTCG
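The -G and +GTCG alleles can be reproduced by editing the reference string. A small Python sketch, assuming the reference shown on the slide is ACATGTGGG and that the 0-based offsets below are where the two indels fall; the helper names are made up:

```python
def delete_bases(seq, index, length):
    """Remove `length` bases starting at 0-based `index`."""
    return seq[:index] + seq[index + length:]


def insert_bases(seq, index, bases):
    """Insert `bases` before 0-based `index`."""
    return seq[:index] + bases + seq[index:]


reference = "ACATGTGGG"
paternal = delete_bases(reference, 4, 1)       # -G: drop the G after ACAT
maternal = insert_bases(reference, 5, "GTCG")  # +GTCG: insert after ACATG

print(paternal)  # ACATTGGG
print(maternal)  # ACATGGTCGTGGG
```

Because the two haplotypes carry different indels at the same locus, neither matches the reference, which is what makes the site a compound heterozygote.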

8-10x coverage sufficient for high-quality SNP calls

Variant Call Format

http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

Columns:

1. #CHROM

2. POS

3. ID

4. REF

5. ALT

6. QUAL

7. FILTER

8. INFO
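Since a VCF data line is tab-separated in this column order, it can be mapped onto the column names with a few lines of Python. A minimal sketch; the function name and the example record are made up, and the INFO field is left as an unparsed string:

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Map the first eight tab-separated fields of a VCF data line to names."""
    if line.startswith("#"):
        raise ValueError("header line, not a variant record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 columns")
    record = dict(zip(VCF_COLUMNS, fields))  # extra columns are ignored
    record["POS"] = int(record["POS"])
    return record


# Made-up record, for illustration only
line = "chr7\t57546416\t.\tA\tG\t50\t0\tDP=12"
print(parse_vcf_line(line))
```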

1Kg

Other Analysis Options

The Genome Analysis Toolkit (GATK)

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

Picard

http://picard.sourceforge.net/index.shtml

BioPerl: Bio::DB::Sam

Analyze Features of Errors for each base with GATK

• Reported quality score

• The position within the read

• The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

• Probability of mismatching the reference genome

• Re-calculate Qscores
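The last two bullets rest on the Phred relation Q = -10 log10(p): an observed mismatch rate can be turned back into a quality score. A simplified Python sketch of that arithmetic only; the +1/+2 pseudocount is an assumption of this sketch to keep zero-mismatch bins finite, and it is not claimed to be how GATK recalibrates:

```python
import math


def phred_from_error_rate(p_error):
    """Convert an error probability into a Phred score: Q = -10 * log10(p)."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches, observations):
    """Phred-scaled mismatch rate for one bin of bases (assumed pseudocount)."""
    rate = (mismatches + 1) / (observations + 2)
    return phred_from_error_rate(rate)


# Hypothetical bin: bases reported as Q30 that mismatch the reference 1% of the time
print(round(phred_from_error_rate(0.01)), round(empirical_quality(99, 9998)))
```

In this made-up bin the reported score of 30 would be revised down to about 20, since one mismatch in a hundred corresponds to Q20.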

GATK Examples (extra)

First, let's make sure you have java

$ java -version

Now, let's get GATK

$ wget ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-latest.tar.bz2

Bunzip2, and untar the file. Cd into that directory.

Let's get some example data to work with:

$ wget ftp://ftp.broadinstitute.org/pub/gsa/exampleFiles/exampleFiles.tar.bz2

How many reads do you have?

$ java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads

How many loci do you have?

$ java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountLoci

GATK Genome Processing

$ /usr/local/cluster/software/jre1.6-amd64/jre1.6.0_01/bin/java -jar GenomeAnalysisTK.jar -R hg18.fasta -I ../751351R.bam.sorted.bam -T CountReads

GATK Options

Some Available analyses:

VariantAnnotator: Annotates variant calls with context information.

DepthOfCoverage: Computes the depth of coverage at all loci in the specified region of the reference.

VariantFiltration: Filters variant calls using a number of user-selectable, parameterizable criteria.

UnifiedGenotyper: A variant caller which unifies the approaches of several disparate callers.

IndelGenotyperV2: This is a simple, counts-and-cutoffs based tool for calling indels from aligned sequencing data.

IndelRealigner: Performs local realignment of reads based on misalignments due to the presence of indels.

RealignerTargetCreator: Emits intervals for the Local Indel Realigner to target for cleaning.

CountLoci: Walks over the input data set, calculating the total # of covered loci for diagnostic purposes.

CountReads: Walks over the input data set, calculating the # of reads seen for diagnostic purposes.

ValidatingPileup: At every locus in the input set, compares the pileup data (reference base, aligned base, ...)

VariantEval: A robust and general purpose tool for characterizing the quality of SNPs, Indels, and other variants that includes basic counting, ti/tv, dbSNP% (if -D is provided), concordance to chip or validation data, and will show interesting sites (-V) that are FNs, FP, etc.

If sharing a cluster, be a good netizen

Use these tips to be a good portmanteau:

1. Check your disk usage (df -h)

2. Submit jobs to the queue, if available

3. Share genome indices in one place

4. Leave the camp site better than you found it. Clean up old files!

5. Never assume a backup is there; make your own

   1. LiveSync for synchronization

   2. Time Capsule for Backup
