Hidden Markov Chain and Bayes Belief Networks
Graphical Models of Probability
Graphical models use directed or undirected graphs over a set of random variables to specify variable dependencies explicitly. They allow less restrictive independence assumptions while limiting the number of parameters that must be estimated.
Bayesian Networks: Directed acyclic graphs that indicate causal structure.
Markov Networks: Undirected graphs that capture general dependencies.
Middleware, CCNT, ZJU, 04/10/23
Hidden Markov Model
Zhejiang Univ
CCNT
Yueshen Xu
Overview
Markov Chain
HMM
Three Core Problems and Algorithms
Application
Markov Chain
Instance
We can regard the weather as three states:
state 1: Rain
state 2: Cloudy
state 3: Sun

Transition probabilities (row = today, column = tomorrow):

Today \ Tomorrow   Rain   Cloudy   Sun
Rain               0.4    0.3      0.3
Cloudy             0.2    0.6      0.2
Sun                0.1    0.1      0.8

We can obtain the transition matrix from long-term observation.
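As a quick illustration (not from the slides), the probability of any concrete weather sequence factors into a product of one-step transition probabilities. A minimal Python sketch:

```python
# Transition matrix from the slide: rows = today's state, columns = tomorrow's.
STATES = ["Rain", "Cloudy", "Sun"]
A = [
    [0.4, 0.3, 0.3],  # Rain   -> Rain, Cloudy, Sun
    [0.2, 0.6, 0.2],  # Cloudy -> Rain, Cloudy, Sun
    [0.1, 0.1, 0.8],  # Sun    -> Rain, Cloudy, Sun
]

def sequence_probability(seq):
    """Probability of the sequence given its first state (Markov property)."""
    p = 1.0
    for today, tomorrow in zip(seq, seq[1:]):
        p *= A[STATES.index(today)][STATES.index(tomorrow)]
    return p

# Sun -> Sun -> Rain -> Cloudy: 0.8 * 0.1 * 0.3 = 0.024
print(sequence_probability(["Sun", "Sun", "Rain", "Cloudy"]))
```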
Definition
one-step transition probability:

P(q_{t+1} = S_j | q_t = S_i, q_{t-1} = S_k, ...) = P(q_{t+1} = S_j | q_t = S_i)

That is to say, the evolution of the stochastic process relies only on the current state and has nothing to do with the states before it. We call this the Markov property, and such a process is regarded as a Markov process.

State Space: S = {S_1, S_2, ..., S_N}
Observation Sequence: q_1, q_2, ..., q_T
Keystone
state transition matrix:

A = [a_ij],  a_ij = P(q_{t+1} = S_j | q_t = S_i)

where a_ij ≥ 0 and Σ_j a_ij = 1 for every i.

initial state probability vector:

π = (π_1, π_2, ..., π_N),  π_i = P(q_1 = S_i)
HMM
An HMM is a doubly stochastic process consisting of two parallel parts:
Markov chain: describes the transition of the states, which is unobservable, by means of the transition probability matrix.
Ordinary stochastic process: describes the stochastic process of the observable events.

Markov Chain (π, A)  →  State Sequence q1, q2, ..., qT  (unobservable)
Stochastic Process (B)  →  Observation Sequence o1, o2, ..., oT  (observable)

Core feature: the states are unobservable; only the observations are observable.
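The two layers can be sketched by sampling: the hidden chain (π, A) generates states, and B generates an observable symbol from each state. The parameters below are illustrative placeholders, not the slides' example:

```python
import random

# Illustrative two-state HMM: the hidden Markov chain (pi, A) drives state
# transitions, while B emits an observable symbol per state.
pi = [1.0, 0.0]                    # always start in state 0
A  = [[0.7, 0.3], [0.4, 0.6]]      # state transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]      # emission probabilities over SYMBOLS
SYMBOLS = ["a", "b"]

def sample(weights):
    """Draw an index with the given probabilities."""
    return random.choices(range(len(weights)), weights=weights)[0]

def generate(T):
    """Generate a hidden state sequence q1..qT and observations o1..oT."""
    q = sample(pi)
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(SYMBOLS[sample(B[q])])
        q = sample(A[q])
    return states, obs

states, obs = generate(5)
print(states, obs)                 # only `obs` would be visible in practice
```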
Figure: a three-state diagram with states S1, S2, S3. Each arc carries a transition probability and output probabilities for the symbols a and b:

a11 = 0.3, outputs: a 0.8, b 0.2
a12 = 0.5, outputs: a 1.0, b 0.0
a13 = 0.2, outputs: a 0.0, b 1.0
a22 = 0.4, outputs: a 0.3, b 0.7
a23 = 0.6, outputs: a 0.5, b 0.5

Example: what is the probability of producing the sequence "aab" with this stochastic process?
Instance 1:
S1→S1→S2→S3: 0.3×0.8 × 0.5×1.0 × 0.6×0.5 = 0.036
Instance 2:
S1→S2→S2→S3: 0.5×1.0 × 0.4×0.3 × 0.6×0.5 = 0.018
Instance 3:
S1→S1→S1→S3: 0.3×0.8 × 0.3×0.8 × 0.2×1.0 = 0.01152

Therefore, the total probability is: 0.036 + 0.018 + 0.01152 = 0.06552

We only know "aab"; we don't know "S?S?S?". That's the point.
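These hand computations can be checked by brute-force enumeration of every state path. A minimal sketch, assuming (as the worked totals imply) that a valid path starts in S1 and must end in the final state S3:

```python
from itertools import product

# Arc-emitting model from the figure: each arc (i, j) carries a transition
# probability and output probabilities for the symbols "a" and "b".
ARCS = {
    (1, 1): (0.3, {"a": 0.8, "b": 0.2}),
    (1, 2): (0.5, {"a": 1.0, "b": 0.0}),
    (1, 3): (0.2, {"a": 0.0, "b": 1.0}),
    (2, 2): (0.4, {"a": 0.3, "b": 0.7}),
    (2, 3): (0.6, {"a": 0.5, "b": 0.5}),
}

def total_probability(output, start=1, final=3):
    """Sum P(path, output) over all state paths from `start` ending in `final`."""
    total = 0.0
    for path in product([1, 2, 3], repeat=len(output)):
        if path[-1] != final:
            continue
        p, state = 1.0, start
        for nxt, sym in zip(path, output):
            trans, emit = ARCS.get((state, nxt), (0.0, {}))
            p *= trans * emit.get(sym, 0.0)
            state = nxt
        total += p
    return total

print(total_probability("aab"))  # 0.036 + 0.018 + 0.01152 = 0.06552
```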
Description
An HMM can be identified by the parameters below:
N: the number of states
M: the number of observable events for each state
A: the state transition matrix
B: the observable event (emission) probability matrix
π: the initial state probability distribution

We generally record it as λ = (A, B, π).
Three Core Problems
Evaluation: given that the observation sequence O = o1 o2 ... oT and the model λ = (A, B, π) have been preset, how can we calculate P(O | λ)?
Optimization: based on problem 1, how do we choose a state sequence S = q1 q2 ... qT so that the observation sequence O is explained most reasonably?
Training: based on problem 1, how do we adjust the parameters of the model λ = (A, B, π) to maximize P(O | λ)?

In each case we know O, but we don't know Q.
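For the evaluation problem, P(O | λ) is classically computed with the forward algorithm rather than by enumerating all N^T state paths. A minimal sketch for a state-emitting HMM; the parameters here are illustrative placeholders, not from the slides:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[i] = P(o1..ot, q_t = i); returns P(O | lambda)."""
    n = len(pi)
    # Initialization with the first observation.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: sum over predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]   # B[state][symbol], symbols encoded as 0/1
print(forward([0, 1, 0], pi, A, B))  # ≈ 0.10893
```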
Solution
There is no need to expound those algorithms here (the forward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm, respectively), since we should pay attention to the application context.
Naïve Bayes Theorem
The naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions.
Chain rule:
P(C, F1, ..., Fn) = P(C) P(F1 | C) P(F2 | C, F1) ... P(Fn | C, F1, ..., Fn-1)

Conditional independence:
P(Fi | C, Fj) = P(Fi | C)

Therefore:
P(C, F1, ..., Fn) = P(C) ∏_{i=1}^{n} P(Fi | C)

Figure: a network with the class node C as the single parent of the feature nodes F1, F2, ..., Fn.

Naïve Bayes is a simple Bayes net.
Bayes Belief Network: Graph Structure
Directed Acyclic Graph (DAG):
Nodes are random variables (RVs)
Edges indicate causal influences

Figure: Burglary and Earthquake are the parents of Alarm; JohnCalls and MaryCalls are its descendants. The edges encode the parent/descendant relationships.
Bayes Belief Network: Conditional Probability Table
Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents.
Roots (sources) of the DAG that have no parents are given prior probabilities.
Figure: the same network annotated with its CPTs.

P(B) = .001        P(E) = .002

B  E   P(A)
T  T   .95
T  F   .94
F  T   .29
F  F   .001

A  P(J)
T  .90
F  .05

A  P(M)
T  .70
F  .01
Bayes Belief Network: Joint Distributions
A Bayesian network implicitly defines a joint distribution:

P(x1, x2, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))

Example:
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062

Therefore an inefficient approach to inference is:
1) Compute the joint distribution using this equation.
2) Compute any desired conditional probability using the joint distribution.
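The factored joint can be evaluated directly from the CPTs. A minimal sketch using the table values from the slide:

```python
# CPTs of the burglary network (values from the slide).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the chain-rule factorization."""
    p  = P_B if b else 1 - P_B
    p *= P_E if e else 1 - P_E
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(J, M, A, not B, not E) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.00062
print(joint(b=False, e=False, a=True, j=True, m=True))
```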
Conditional Independence & D-separation
D-separation:
Let X, Y, and Z be three sets of nodes.
If X and Y are d-separated by Z, then X and Y are conditionally independent given Z.
A is d-separated from B given C if every undirected path between them is blocked.
Path blocking: three cases that expand on the three basic independence structures (chain, fork, and collider).
Application: Simple Document Classification (1)
Step 1: Assume for the moment that there are only two mutually exclusive classes, S and ¬S (e.g., spam and not spam), such that every element (email) is in either one or the other. With the naïve independence assumption over the words w_i of a document D:

P(D | S) = ∏_i p(w_i | S)   and   P(D | ¬S) = ∏_i p(w_i | ¬S)

Step 2: What we are concerned with is:

P(S | D) = P(S) P(D | S) / P(D)   and   P(¬S | D) = P(¬S) P(D | ¬S) / P(D)
Application: Simple Document Classification (2)
Step 3: Dividing one by the other, and then re-factoring, gives:

P(S | D) / P(¬S | D) = [P(S) / P(¬S)] · ∏_i [p(w_i | S) / p(w_i | ¬S)]

Step 4: Taking the logarithm of these ratios to reduce the amount of calculation:

ln [P(S | D) / P(¬S | D)] = ln [P(S) / P(¬S)] + Σ_i ln [p(w_i | S) / p(w_i | ¬S)]

The document is classified as spam if this log-ratio is > 0 and as non-spam if it is < 0. The priors and word likelihoods are obtained by training on known samples.
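Steps 3 and 4 can be sketched end-to-end. The priors and word likelihoods below are made-up stand-ins for what training on known samples would produce:

```python
import math

# Hypothetical training output: class priors and per-word likelihoods.
prior = {"S": 0.5, "not_S": 0.5}
likelihood = {                      # P(word | class), illustrative values
    "free":    {"S": 0.30, "not_S": 0.01},
    "offer":   {"S": 0.20, "not_S": 0.02},
    "meeting": {"S": 0.01, "not_S": 0.10},
}

def log_ratio(words):
    """ln[P(S|D) / P(not S|D)] for a document D containing the given words."""
    score = math.log(prior["S"] / prior["not_S"])
    for w in words:
        if w in likelihood:
            score += math.log(likelihood[w]["S"] / likelihood[w]["not_S"])
    return score

doc = ["free", "offer"]
print("spam" if log_ratio(doc) > 0 else "not spam")  # -> spam
```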
Application: Overall
Medical diagnosis: the Pathfinder system outperforms leading experts in the diagnosis of lymph-node disease.
Microsoft applications: problem diagnosis (printer problems) and recognizing user intents for HCI.
Text categorization and spam filtering.
Student modeling for intelligent tutoring systems.
Biochemical data analysis: predicting mutagenicity.