Statistical Machine learning from HIV genomic data using HMMs (source: files.meetup.com/2894492/SML talk2.pdf)

Post on 05-Jun-2020


February 3, 2012

Statistical Machine learning from HIV genomic data using HMMs

Jedidiah Francis Twitter: @jedidiahfrancis Email: jedidiah.francis@gmail.com Blog: jedidiahfrancis.com Mobile: 07917184089


Talk outline

§  Primer on Hidden Markov Models (HMMs)
§  Inference in HIV genomic data
§  Conclusion

Practical uses

Uses include:
§  finance (time series modeling), speech recognition, handwriting recognition, medicine (heart attack prediction), genomics (sequence analysis & alignment), robotics, meteorology (weather forecasting / modeling)


Introduction to HMMs

1st order Markov chain:


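$\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$

[Figure: two-state Markov chain over the hidden states S and R, with transition probabilities 0.4, 0.6, 0.8 and 0.2. In the HMM version each state emits an observation T or W, with emission probabilities 0.2 / 0.8 for one state and 0.9 / 0.1 for the other. The chain is unrolled over four time steps, with the observed symbols T, W, T, T.]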
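To make the generative reading of the diagram concrete, here is a minimal Python sketch that samples a hidden state path and its observations from a two-state HMM. The numbers are the ones on the slide, but which probability belongs to which arrow is not recoverable from the transcript, so the assignment in `TRANS` and `EMIT` below is an assumption, as are the function name `sample_hmm` and the choice of start state.

```python
import random

STATES = ("S", "R")

# ASSUMPTION: the slide gives 0.4, 0.6, 0.8, 0.2 but not which arrow each
# belongs to; this is one assignment in which every row sums to 1.
TRANS = {"S": {"S": 0.6, "R": 0.4},
         "R": {"S": 0.2, "R": 0.8}}

# ASSUMPTION: emission probabilities for T and W, assigned to S then R.
EMIT = {"S": {"T": 0.2, "W": 0.8},
        "R": {"T": 0.9, "W": 0.1}}


def sample_hmm(length, rng, start="S"):
    """Sample `length` hidden states and one observation per state."""
    state = start
    states, observations = [], []
    for _ in range(length):
        states.append(state)
        symbols = list(EMIT[state])
        weights = [EMIT[state][s] for s in symbols]
        # Emit one symbol from the current state.
        observations.append(rng.choices(symbols, weights=weights)[0])
        # First-order Markov step: the next state depends only on the current one.
        state = rng.choices(STATES, weights=[TRANS[state][k] for k in STATES])[0]
    return states, observations


rng = random.Random(2012)  # fixed seed so the run is repeatable
hidden, observed = sample_hmm(20, rng)
print("Path:       ", " ".join(hidden))
print("Observation:", " ".join(observed))
```

Only the `observed` line would be visible in practice; the two problems that follow are about reasoning back from it to the model and to the hidden path.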

Problem 1

Given some model and parameters $\lambda = (A, B, \pi)$ and a sequence of observations D, compute $\Pr(D \mid \lambda)$.

Observation: W T T W T W W W T W T T T T W T T W T T

§  Naïve approach: sum over all possible paths ($2^{21} \approx 2.1$ million paths).

§  Luckily we can use dynamic programming (the forward algorithm) to fill in a table of only $mn$ (42) entries instead of enumerating $m^n$ paths.

§  A similar algorithm (the backward algorithm) does the same thing but in reverse order.

Solution 1

Algorithm: forward algorithm


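[Figure: trellis fragment showing the states S and R at two consecutive positions, with the next observation T.]

$\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$, $\lambda = (A, B, \pi)$

Emission probability: $\epsilon_S(X_i)$
Transition probability: $q_{ij}$

Initialisation ($i = 0$): $f_0(0) = 1$, $f_k(0) = 0 \;\; \forall\, k > 0$

Recursion ($i = 1, \ldots, L$): $f_S(i+1) = [f_S(i)\, q_{SS} + f_R(i)\, q_{RS}] \times \epsilon_S(X_{i+1})$

Termination: $\Pr(D \mid \lambda) = \sum_k f_k(L)$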
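The recursion above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the talk: the assignment of the slide's probabilities to `TRANS` and `EMIT` is a guess, the silent begin state 0 is folded into an initial distribution `INIT` that is assumed uniform, and `forward` and `brute_force` are names chosen here. The brute-force sum over every path is included only to show that both approaches give the same value on a short sequence.

```python
from itertools import product

STATES = ("S", "R")

# ASSUMPTION: one row-stochastic assignment of the slide's 0.4/0.6/0.8/0.2.
TRANS = {"S": {"S": 0.6, "R": 0.4},
         "R": {"S": 0.2, "R": 0.8}}
# ASSUMPTION: emission probabilities for T and W, assigned to S then R.
EMIT = {"S": {"T": 0.2, "W": 0.8},
        "R": {"T": 0.9, "W": 0.1}}
# ASSUMPTION: uniform start; this replaces the begin state with f_0(0) = 1.
INIT = {"S": 0.5, "R": 0.5}


def forward(obs):
    """Return Pr(D | lambda) by the forward recursion."""
    # f[k] is the probability of the observations so far, ending in state k.
    f = {k: INIT[k] * EMIT[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        f = {k: sum(f[j] * TRANS[j][k] for j in STATES) * EMIT[k][x]
             for k in STATES}
    # Termination: sum over the possible final states.
    return sum(f.values())


def brute_force(obs):
    """Return Pr(D | lambda) by summing over every possible state path."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INIT[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][x]
        total += p
    return total


observations = "W T T W T W W W T W T T T T W T T W T T".split()
likelihood = forward(observations)
print("Pr(D | lambda) =", likelihood)
```

For long sequences the products underflow, so a practical version would rescale `f` at each step or work with log probabilities; that is left out here to keep the sketch close to the slide's formulas.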

Problem 2

Given some model $\lambda = (A, B, \pi)$ and a sequence of observations D, find the most probable sequence of the underlying states.

Observation: W T T W T W W W T W T T T T W T T W T T
Path:        ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

§  Use the Viterbi algorithm:

$V_k(i+1) = \max_j [V_j(i)\, q_{jk}] \times \epsilon_k(X_{i+1})$

§  A traceback matrix keeps track of which is the most likely path:

$t_k(i+1) = \arg\max_j [V_j(i)\, q_{jk}]$

§  The most likely path can be found by tracing back from the final state that attains:

$\max_k [V_k(L)]$

Solution 2

Observation: W T T W T W W W T W T T T T W T T W T T
Path:        S R R R R S S S S S R R R R R R S S S R


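[Figure: Viterbi trellis over positions $X_{i-1}$, $X_i$, $X_{i+1}$, with the states S and R at each position.]

$V_S(i+1) = \max_j [V_j(i)\, q_{jS}] \times \epsilon_S(X_{i+1})$

$t_S(i+1) = \arg\max_j [V_j(i)\, q_{jS}]$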
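A minimal Python sketch of the Viterbi recursion and traceback follows. As with any reconstruction from this transcript, the parameter values in `TRANS`, `EMIT` and `INIT` are assumptions (the slide's numbers, assigned to arrows by guesswork, with a uniform start), so the decoded path is not expected to reproduce the path printed on the slide. The names `viterbi` and `joint` are chosen here for illustration.

```python
STATES = ("S", "R")

# ASSUMPTION: one row-stochastic assignment of the slide's 0.4/0.6/0.8/0.2.
TRANS = {"S": {"S": 0.6, "R": 0.4},
         "R": {"S": 0.2, "R": 0.8}}
# ASSUMPTION: emission probabilities for T and W, assigned to S then R.
EMIT = {"S": {"T": 0.2, "W": 0.8},
        "R": {"T": 0.9, "W": 0.1}}
# ASSUMPTION: uniform initial state distribution.
INIT = {"S": 0.5, "R": 0.5}


def joint(path, obs):
    """Joint probability of one state path together with the observations."""
    p = INIT[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][x]
    return p


def viterbi(obs):
    """Return (most probable state path, its joint probability)."""
    v = {k: INIT[k] * EMIT[k][obs[0]] for k in STATES}
    traceback = []  # traceback[i][k] = best predecessor of state k at step i+1
    for x in obs[1:]:
        ptr = {k: max(STATES, key=lambda j: v[j] * TRANS[j][k]) for k in STATES}
        v = {k: v[ptr[k]] * TRANS[ptr[k]][k] * EMIT[k][x] for k in STATES}
        traceback.append(ptr)
    # Termination: pick the best final state, then follow the pointers back.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(traceback):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


observations = "W T T W T W W W T W T T T T W T T W T T".split()
best_path, best_prob = viterbi(observations)
print("Observation:", " ".join(observations))
print("Path:       ", " ".join(best_path))
```

The only difference from the forward algorithm is that the sum over predecessor states becomes a max, plus the pointer table needed to recover the path.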

HIV recombination


[Figure: Sequences compared to master (MASTER: 1003-102301p-8). One row per sequence; x-axis: base number (0 to 1000). Colour key: A: Green, T: Red, G: Orange, C: Light blue, IUPAC: Dark blue, Gaps: Gray.]

Generating estimates for 𝜌

Imperfect copying process: builds $h_{k+1}$ as an imperfect mosaic of $h_1, \ldots, h_k$.
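One way to picture the copying process is as a simulation: the new sequence copies from one existing sequence at a time, occasionally switches to a different source, and occasionally miscopies a base. The Python sketch below is only an illustration of that idea. The function name `copy_mosaic`, the per-site `switch_prob` and `error_prob` parameters, and the toy sequences are all hypothetical and do not come from the talk.

```python
import random


def copy_mosaic(haplotypes, switch_prob, error_prob, rng):
    """Build a new sequence as an imperfect mosaic of `haplotypes`.

    Returns the new sequence and, for each site, the index of the
    sequence it was copied from (the hidden state of the copying HMM).
    """
    length = len(haplotypes[0])
    source = rng.randrange(len(haplotypes))  # initial source chosen uniformly
    new_seq, sources = [], []
    for pos in range(length):
        # HYPOTHETICAL switch model: with probability switch_prob, redraw
        # the source uniformly (it may land on the same sequence again).
        if pos > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(haplotypes))
        base = haplotypes[source][pos]
        # HYPOTHETICAL error model: with probability error_prob, replace the
        # copied base by a different nucleotide.
        if rng.random() < error_prob:
            base = rng.choice([b for b in "ACGT" if b != base])
        new_seq.append(base)
        sources.append(source)
    return "".join(new_seq), sources


# Toy input, invented for the example.
existing = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
rng = random.Random(3)
new_sequence, copied_from = copy_mosaic(existing, 0.2, 0.05, rng)
print(new_sequence)
print(copied_from)
```

In this picture the list `copied_from` plays the role of the hidden path and the new sequence plays the role of the observations, which is what makes the forward and Viterbi algorithms applicable.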


Modeling the copy process


[Figure: the copy process for sequences K and K+1. Panel titles: Single time point; Two time points. Labels: t1, t2, Δt.]

Viterbi most likely path


Statistical inference


Closing remarks

Advantages of HMMs
§  Easy enough to implement, and allow for tractable computation
§  Rich enough to model very complex biological processes

Disadvantages
§  Each state is assumed to depend only on the previous state, and each observation only on its own state; this is sometimes not true.
§  Local maxima
    §  Model may not converge to a truly global parameter maximum
§  Speed
    §  Almost everything one does in an HMM involves enumerating all possible paths through the model
    §  Can be sped up in various ways but can still be relatively slow
