Statistical Machine learning from HIV genomic data using HMMs
Jedidiah Francis
February 3, 2012
Twitter: @jedidiahfrancis
Email: [email protected]
Blog: jedidiahfrancis.com
Mobile: 07917184089


Talk outline

§  Primer on Hidden Markov Models (HMMs)
§  Inference in HIV genomic data
§  Conclusion


Practical uses

Uses of HMMs include:
§  finance (time series modeling), speech recognition, handwriting recognition, medicine (heart attack prediction), genomics (sequence analysis & alignment), robotics, meteorology (weather forecasting / modeling)


Introduction to HMMs

1st order Markov chain:

$\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$

[Figure: state diagram of a two-state Markov chain over the states S and R, with transition probabilities 0.4, 0.6, 0.8 and 0.2, shown next to the unrolled chain of states. A second panel adds the emitted symbols T and W to each state, with emission probabilities 0.2, 0.8 and 0.9, 0.1.]
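The generative process behind this diagram can be sketched in Python: hidden states S and R follow a first-order Markov chain, and each state emits T or W. This is only a sketch under stated assumptions. The numbers come from the slide, but which arrow each number labels is a guess (the figure does not survive in this transcript), the uniform start distribution is assumed, and the names `draw` and `sample_hmm` are illustrative.

```python
import random

STATES = ("S", "R")
SYMBOLS = ("T", "W")

# ASSUMED parameter layout: the values appear on the slide, but the mapping of
# each value to a specific transition or emission is a guess. Rows sum to 1.
TRANSITION = {"S": {"S": 0.6, "R": 0.4}, "R": {"S": 0.2, "R": 0.8}}
EMISSION = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}
START = {"S": 0.5, "R": 0.5}  # assumed uniform start distribution


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_hmm(length, seed=0):
    """Return (hidden_path, observations), each a list of `length` symbols."""
    rng = random.Random(seed)
    path, obs = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        obs.append(draw(EMISSION[state], rng))
        # First-order Markov property: the next state depends only on `state`.
        state = draw(TRANSITION[state], rng)
    return path, obs


if __name__ == "__main__":
    hidden, observed = sample_hmm(20)
    print("Path:       ", " ".join(hidden))
    print("Observation:", " ".join(observed))
```

In the inference problems that follow, only the observation line is available and the path line has to be reasoned about.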


Problem 1

Given a model with parameters $\lambda = (A, B, \pi)$ and a sequence of observations D, compute $\Pr(D \mid \lambda)$.

Observation: W T T W T W W W T W T T T T W T T W T T

§  Naïve approach: sum over all possible paths ($2^{21} \approx 2.1$ million paths).
§  Luckily we can use dynamic programming (forward algorithm) to reduce these $m^n$ path evaluations to $mn$ forward values (42), for $m$ states and $n$ positions.
§  A similar algorithm (backward algorithm) does the same thing but in reverse order.


Solution 1

Algorithm: forward algorithm

[Figure: trellis with the states S and R at consecutive positions and the emitted symbol T.]

Model: $\lambda = (A, B, \pi)$, with $\Pr(X_t \mid X_1, X_2, \ldots, X_{t-1}) = \Pr(X_t \mid X_{t-1})$

Emission probability: $\epsilon_S(X_i)$
Transition probability: $q_{ij}$

Initialisation ($i = 0$): $f_0(0) = 1$, $f_k(0) = 0 \;\; \forall\, k > 0$

Recursion ($i = 1, \ldots, L$): $f_S(i+1) = [f_S(i)\, q_{SS} + f_R(i)\, q_{RS}] \times \epsilon_S(X_{i+1})$

Termination: $\Pr(D \mid \lambda) = \sum_k f_k(L)$
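The recursion can be sketched in a few lines of Python and checked against the naive sum over all paths on a short prefix of the observation. Several things here are assumptions rather than the talk's own code: the assignment of the slide's probabilities to specific transitions and emissions is a guess, a uniform start distribution π stands in for the explicit begin state 0, and the names `forward` and `brute_force` are illustrative.

```python
from itertools import product

STATES = ("S", "R")

# ASSUMED parameters: values taken from the slide, layout guessed.
A = {"S": {"S": 0.6, "R": 0.4}, "R": {"S": 0.2, "R": 0.8}}   # transitions q_jk
B = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}   # emissions
PI = {"S": 0.5, "R": 0.5}                                    # assumed start


def forward(obs):
    """Return Pr(obs | lambda) using the forward recursion."""
    # First position: start probability times emission probability.
    f = {k: PI[k] * B[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        # f_k(i+1) = [sum_j f_j(i) * q_jk] * emission_k(x_{i+1})
        f = {k: sum(f[j] * A[j][k] for j in STATES) * B[k][x] for k in STATES}
    # Termination: sum the forward values at the last position.
    return sum(f.values())


def brute_force(obs):
    """Return Pr(obs | lambda) by summing over every possible state path."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = PI[path[0]] * B[path[0]][obs[0]]
        for prev, cur, x in zip(path, path[1:], obs[1:]):
            p *= A[prev][cur] * B[cur][x]
        total += p
    return total


if __name__ == "__main__":
    observation = "WTTWTWWWTWTTTTWTTWTT"
    print("forward:", forward(observation))
    prefix = observation[:8]
    print("agrees with naive sum on prefix:",
          abs(forward(prefix) - brute_force(prefix)) < 1e-12)
```

The naive function grows exponentially with the sequence length, so it is only practical as a check on short inputs. For long sequences the forward values underflow, and a real implementation would typically rescale them or work in log space.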


Problem 2

Given some model $\lambda = (A, B, \pi)$ and a sequence of observations D, find the most probable sequence of the underlying states.

Observation: W T T W T W W W T W T T T T W T T W T T
Path:        ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

§  Use the Viterbi algorithm:
   $V_k(i+1) = \max_j [V_j(i)\, q_{jk}] \times \epsilon_k(X_{i+1})$
§  A traceback matrix keeps track of which is the most likely path:
   $t_k(i+1) = \arg\max_j [V_j(i)\, q_{jk}]$
§  The most likely path ends in the state attaining $\max_k [V_k(L)]$; following the traceback pointers back from that state recovers the rest of the path.


Solution 2

Observation: W T T W T W W W T W T T T T W T T W T T
Path:        S R R R R S S S S S R R R R R R S S S R

[Figure: Viterbi trellis over the states S and R at positions $X_{i-1}$, $X_i$ and $X_{i+1}$.]

$V_S(i+1) = \max_j [V_j(i)\, q_{jS}] \times \epsilon_S(X_{i+1})$

$t_S(i+1) = \arg\max_j [V_j(i)\, q_{jS}]$
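A minimal Python sketch of the Viterbi recursion with traceback follows. It rests on assumptions: the layout of the transition and emission probabilities is guessed from the numbers on the earlier slide, the start distribution is taken to be uniform, and the function name `viterbi` is illustrative. Because of these guesses, the decoded path need not match the path shown above.

```python
STATES = ("S", "R")

# ASSUMED parameters: values taken from the slide, layout guessed.
A = {"S": {"S": 0.6, "R": 0.4}, "R": {"S": 0.2, "R": 0.8}}   # transitions q_jk
B = {"S": {"T": 0.2, "W": 0.8}, "R": {"T": 0.9, "W": 0.1}}   # emissions
PI = {"S": 0.5, "R": 0.5}                                    # assumed start


def viterbi(obs):
    """Return (most_probable_path, joint_probability_of_that_path)."""
    v = {k: PI[k] * B[k][obs[0]] for k in STATES}
    traceback = []  # one {state: best predecessor} dict per later position
    for x in obs[1:]:
        pointers, nxt = {}, {}
        for k in STATES:
            # t_k(i+1) = argmax_j V_j(i) * q_jk
            best_prev = max(STATES, key=lambda j: v[j] * A[j][k])
            pointers[k] = best_prev
            # V_k(i+1) = max_j [V_j(i) * q_jk] * emission_k(x_{i+1})
            nxt[k] = v[best_prev] * A[best_prev][k] * B[k][x]
        traceback.append(pointers)
        v = nxt
    # Start from the best final state and follow the pointers backwards.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(traceback):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    observation = "WTTWTWWWTWTTTTWTTWTT"
    best_path, prob = viterbi(observation)
    print("Observation:", " ".join(observation))
    print("Path:       ", " ".join(best_path))
    print("Probability:", prob)
```

Ties between predecessors are broken in favour of the first state in `STATES`, which is an arbitrary choice.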


HIV recombination

[Figure: "Sequences compared to master". Each row is one sampled sequence compared with the master sequence (MASTER: 1003-102301p-8). x-axis: base number (0 to 1000). Colour key: A: green, T: red, G: orange, C: light blue, IUPAC: dark blue, gaps: gray.]


Generating estimates for 𝜌

Imperfect copying process: builds $h_{k+1}$ as an imperfect mosaic of $h_1, \ldots, h_k$.
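One way to picture an imperfect mosaic is to simulate it. The sketch below is not the talk's model: it is a toy simulation in which the new sequence copies one earlier sequence at a time, occasionally switches template, and occasionally miscopies a base. The parameters `switch_prob` and `error_prob`, their default values, and the function name `copy_mosaic` are all hypothetical; in particular, how 𝜌 enters the real model is not stated on the slide, so `switch_prob` is only a stand-in for a recombination-like event.

```python
import random

BASES = "ACGT"


def copy_mosaic(haplotypes, switch_prob=0.05, error_prob=0.01, seed=0):
    """Build a new sequence as an imperfect mosaic of existing ones.

    Returns (new_sequence, source_indices), where source_indices[i] is the
    index of the template that was being copied at site i.
    """
    rng = random.Random(seed)
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")

    source = rng.randrange(len(haplotypes))  # template copied at the first site
    new, sources = [], []
    for site in range(length):
        # Hypothetical recombination-like event: jump to a random template.
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(haplotypes))
        base = haplotypes[source][site]
        # Imperfect copying: occasionally emit a different base.
        if rng.random() < error_prob:
            base = rng.choice([b for b in BASES if b != base])
        new.append(base)
        sources.append(source)
    return "".join(new), sources


if __name__ == "__main__":
    earlier = ["ACGTACGTACGT", "TTTTCCCCGGGG", "GGGGAAAATTTT"]
    sequence, copied_from = copy_mosaic(earlier, switch_prob=0.3)
    print(sequence)
    print(copied_from)
```

Read as an HMM, the hidden state at each site would be the template being copied (the `source_indices` list), and the observed state would be the emitted base.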


Modeling the copy process

[Figure: diagram of the copy process with the labels K, K+1, t1, t2 and Δt. Panel titles: "Single time point" and "Two time points".]


Viterbi most likely path


Statistical inference


Closing remarks

Advantages of HMMs
§  Easy enough to implement, and allow for tractable computation
§  Rich enough to model very complex biological processes

Disadvantages
§  Each state is assumed to depend only on the previous state, and each observation only on its current state; this is sometimes not true.
§  Local maxima
   §  Model may not converge to a truly global parameter maximum
§  Speed
   §  Almost everything one does in an HMM involves enumerating all possible paths through the model
   §  Can be sped up in various ways but can still be relatively slow.