Top Banner
Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis http://odin.mdacc.tmc.edu/~llzhang/ RiceCourse
24

Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Dec 15, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Bioinformatics lectures at Rice University

Li ZhangLecture 7: HMM models and application in DNA

sequence analysishttp://odin.mdacc.tmc.edu/~llzhang/RiceCourse

Page 2: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

A loaded die

Observed data:

213444244423563232546344443444235632321242454344426

Page 3: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Markov chain and Hidden Markov Model

• Markov chain:

A Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property, namely that, given the present state, the future and past states are independent.

HMM (Hidden Markov Model):

Probabilistic parameters of a hidden Markov model (example)x — statesy — possible observationsa — state transition probabilitiesb — output probabilities

Page 4: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

A simple example of HMM• Consider two friends, Alice and Bob, who live far apart from each other and who talk

together daily over the telephone about what they did that day. Bob is only interested in three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. Alice has no definite information about the weather where Bob lives, but she knows general trends. Based on what Bob tells her he did each day, Alice tries to guess what the weather must have been like.

• Alice believes that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but she cannot observe them directly, that is, they are hidden from her. On each day, there is a certain chance that Bob will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since Bob tells Alice about his activities, those are theobservations. The entire system is that of a hidden Markov model (HMM).

• Alice knows the general weather trends in the area, and what Bob likes to do on average. In other words, the parameters of the HMM are known.

Page 5: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Load HMM package in R#load HMM package:source("http://bioconductor.org/biocLite.R")biocLite("HMM")library("HMM")

Documentation:

Page 6: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.
Page 7: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Infer HMM parameters and hidden states from observations

# Initialize HMMhmm = initHMM(c("A","B"),c("L","R"),transProbs=matrix(c(.9,.1,.1,.9),2),emissionProbs=matrix(c(.5,.51,.5,.49),2))print(hmm)

# Sequence of observationa = sample(c(rep("L",10),rep("R",30)))b = sample(c(rep("L",30),rep("R",10)))observation = c(a,b)#take a look at the sequences:paste(a, collapse='')paste(b, collapse='')

# Baum-Welchbw = baumWelch(hmm,observation,10)print(bw$hmm)

#Observations:"RLRRLRRLRRLRRRRRRRRRLLRRRLLRRRRLRRRRLRRR""LLLLLLLRRLLLRLLLLLLLRLRRRLLLLLLLRRLLLLLR”

…$transposfrom A B A 0.972790696 0.0272093 B 0.003085187 0.9969148

$emissionProbs symbolsstates L R A 0.2585718 0.7414282 B 0.7334568 0.2665432

paste(viterbi(hmm,observation), collapse='')"BBBBBBBBBBBBBBBBBBBBAABBBAABBBBBBBBBBBBBAAAAAAAAAAAAAAAAAAAAAABBBAAAAAAAAAAAAAAA"

Page 8: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Generate simulated observation from a given HMM

Page 9: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Simulation

> simHMM(hmm, 20)$states [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "A" "A" "A"

$observation [1] "L" "L" "L" "L" "L" "L" "R" "L" "R" "L" "L" "L" "R" "R" "R" "L" "R" "L" "L" "L"

> hmm = initHMM(c("A","B"), c("L","R"), transProbs=matrix(c(.9,.1,.1,.9),2),+ emissionProbs=matrix(c(.9,.1,.1,.9),2))> simHMM(hmm, 20)$states [1] "B" "B" "B" "B" "B" "B" "A" "B" "B" "B" "B" "A" "A" "A" "A" "A" "A" "A" "B" "B"

$observation [1] "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "R" "L" "L" "L" "L" "L" "L" "L" "L" "L"

Page 10: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Homework

Dishonest casino demo

Page 11: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

On transprob and emission

High entropy in emission weak surrogate marker of hidden states, hard to uncover hidden states.High entropy in transprob Lack of coherence in states, also makes it hard to uncover hidden states.Given enough sample size, the hmm parameters can be determined accurately, but the hidden states cannot be uncovered in high entropy cases.

Page 12: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

HMM to identify Gene in E Coli

• E coli genome:

Page 13: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Observed data

Page 14: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

The model

Page 15: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

The model with long intergenic region

Page 16: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Hmm models of Exons and Introns

Page 17: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

The model

Page 18: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.
Page 19: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Nucleosome binding on DNA

Page 20: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

HMM of nucleosome binding• Genomic Sequence Is Highly Predictive of Local Nucleosome Depletion• The regulation of DNA accessibility through nucleosome positioning is

important for transcription control. Computational models have been developed to predict genome-wide nucleosome positions from DNA sequences, but these models consider only nucleosome sequences, which may have limited their power. We developed a statistical multi-resolution approach to identify a sequence signature, called the N-score, that distinguishes nucleosome binding DNA from non-nucleosome DNA. This new approach has significantly improved the prediction accuracy. The sequence information is highly predictive for local nucleosome enrichment or depletion, whereas predictions of the exact positions are only modestly more accurate than a null model, suggesting the importance of other regulatory factors in fine-tuning the nucleosome positions. The N-score in promoter regions is negatively correlated with gene expression levels. Regulatory elements are enriched in low N-score regions. While our model is derived from yeast data, the N-score pattern computed from this model agrees well with recent high-resolution protein-binding data in human.

Page 21: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Comparing different models

Page 22: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.
Page 23: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

GENESCAN model• Genscan is a gene-prediction algorithm that,

like other hidden Markov models (HMMs), models the transition probabilities from one part (state) of a gene to another. Here, each circle or square represents a functional unit (a state) of a gene on its forward strand (for example, Einit is the 5' coding sequence (CDS) and Eterm is the 3' CDS, and the arrows represent the transition probability from one state to another. The Genscan algorithm is trained by pre-computing the transition probabilities from a set of known gene structures. Test sequence data can then be run one base position at a time, and the model will predict the optimal state for that position. The model for the reverse strand (beneath the dashed line) is in mirror symmetry to the model shown, with respect to the horizontal axis. Please note that these 'UTRs' (untranslated regions) might contain introns and so should not be confused with the standard UTR. E, exon; I, intron; pro, promoter.

Page 24: Bioinformatics lectures at Rice University Li Zhang Lecture 7: HMM models and application in DNA sequence analysis llzhang/RiceCourse.

Limitations of HMM

• Too many parameters• First order HMMs ignore dependencies

between hidden states