Transcript
Page 1: Edinburgh MT lecture2: Probability and Language Models

•Homework 1 posted, due January 28

•Recommended work plan:

•Complete “Getting Started” by TOMORROW

•Complete “Baseline” by next Friday

•Complete “The Challenge” by January 28

Page 2: Edinburgh MT lecture2: Probability and Language Models

Write a function:

def translate(French):
    # do something
    return English

T : Σ_f* → Σ_e*

Page 3: Edinburgh MT lecture2: Probability and Language Models

Learn

Write a function:

def learn(parallel_data):
    # do something
    return parameters

def translate(French, parameters):
    # do something
    return English

T : Σ_f* × Θ → Σ_e*

L : (Σ_f* × Σ_e*)* → Θ

Using probability:
T(f, θ) = argmax_{e ∈ Σ_e*} p_θ(e|f)
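
(A minimal sketch, not from the slides: it makes the argmax concrete by scoring a set of candidate English sentences with a hypothetical p_theta function and keeping the best one; the parameter format and the candidates argument are assumptions for illustration.)

def p_theta(english, french, parameters):
    # hypothetical scoring function: p_theta(e | f) under the learned parameters
    return parameters.get((english, french), 0.0)

def translate(french, parameters, candidates):
    # T(f, theta) = argmax over candidate English sentences e of p_theta(e | f)
    return max(candidates, key=lambda e: p_theta(e, french, parameters))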

Page 4: Edinburgh MT lecture2: Probability and Language Models

Why probability?

•Formalizes...

•the concept of models

•the concept of data

•the concept of learning

•the concept of inference (prediction)

•Derive logical conclusions in the face of ambiguity.

Page 5: Edinburgh MT lecture2: Probability and Language Models

Basic Concepts

•Sample space S: set of all possible outcomes.

•Event space E: any subset of the sample space.

•Random variable: function from S to a set of disjoint events in S.

•Probability measure P: a function from events to positive real numbers satisfying these axioms:

1. ∀E ∈ F, P(E) ≥ 0

2. P(S) = 1

3. ∀E1, ..., Ek: ∩_{i=1}^{k} Ei = ∅ ⇒ P(E1 ∪ ... ∪ Ek) = Σ_{i=1}^{k} P(Ei)
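
(A minimal sketch, not from the slides: it checks the three axioms on a concrete finite distribution, a fair six-sided die.)

P = {s: 1/6 for s in range(1, 7)}                 # probability of each atomic outcome
prob = lambda event: sum(P[s] for s in event)     # P of any event (a subset of S)

assert all(P[s] >= 0 for s in P)                  # axiom 1: non-negativity
assert abs(prob(set(P)) - 1.0) < 1e-9             # axiom 2: P(S) = 1
evens, odds = {2, 4, 6}, {1, 3, 5}                # two disjoint events
assert abs(prob(evens | odds) - (prob(evens) + prob(odds))) < 1e-9   # axiom 3: additivity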

Page 6: Edinburgh MT lecture2: Probability and Language Models

Experiment: roll a six-sided die once. What is the sample space?

Page 7: Edinburgh MT lecture2: Probability and Language Models

S = {1, 2, 3, 4, 5, 6};  r.v. X(s) = s

P(X = x) = 1/6 for each x ∈ S

Page 8: Edinburgh MT lecture2: Probability and Language Models

Experiment: roll a six-sided die twice. What is the sample space?

Page 9: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid of joint outcomes (X, Y), each with probability 1/36]

S = {1, 2, 3, 4, 5, 6}²;  r.v. X(x, y) = x, Y(x, y) = y

Page 10: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

p(X = 1, Y = 1) = 1/36

A probability over multiple events is a joint probability.

Page 11: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

p(1, 1) = 1/36

A probability over multiple events is a joint probability.

Page 12: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

p(Y = 1) = Σ_{x ∈ X} p(X = x, Y = 1) = 1/6    (by axiom 3)

A probability distribution over a subset of variables is a marginal probability.
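
(A minimal sketch, not from the slides: it builds the two-dice joint table explicitly and recovers the marginal p(Y = 1) by summing over x, exactly as in the formula above.)

from fractions import Fraction

joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}   # uniform joint

p_y1 = sum(p for (x, y), p in joint.items() if y == 1)   # marginalize out X
print(p_y1)                                              # 1/6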

Page 13: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

p(X = 1) = Σ_{y ∈ Y} p(X = 1, Y = y) = 1/6

A probability distribution over a subset of variables is a marginal probability.

Page 14: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

P(Y = 1 | X = 1) = P(X = 1, Y = 1) / Σ_{y ∈ Y} P(X = 1, Y = y) = 1/6    (numerator: joint; denominator: marginal)

The probability of a r.v. when the values of the other r.v.’s are known is its conditional probability.
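
(A minimal sketch, not from the slides: the conditional probability computed as joint divided by marginal, on the same two-dice table.)

from fractions import Fraction

joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

p_x1 = sum(p for (x, y), p in joint.items() if x == 1)   # marginal P(X = 1)
p_y1_given_x1 = joint[(1, 1)] / p_x1                     # P(Y = 1 | X = 1) = joint / marginal
print(p_y1_given_x1)                                     # 1/6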

Page 15: Edinburgh MT lecture2: Probability and Language Models

[6 × 6 grid, each cell 1/36]

A variable is conditionally independent of another iff its marginal probability = its conditional probability.

In other words, if knowing X tells me nothing about Y.

P(Y = 1 | X = 1) = P(Y = 1) = 1/6

Page 16: Edinburgh MT lecture2: Probability and Language Models

P(X = x) = 1/6 for each x ∈ {1, ..., 6}

P(Y = y) = 1/6 for each y ∈ {1, ..., 6}

P(X, Y) = P(X) P(Y)

Far fewer parameters!
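
(A minimal sketch, not from the slides: under independence the joint factorizes, so the two dice are described by 6 + 6 table entries rather than 36.)

from fractions import Fraction

p_x = {x: Fraction(1, 6) for x in range(1, 7)}
p_y = {y: Fraction(1, 6) for y in range(1, 7)}

p_xy = lambda x, y: p_x[x] * p_y[y]       # P(X, Y) = P(X) P(Y)
print(p_xy(1, 1))                         # 1/36
print(len(p_x) + len(p_y), "vs", 6 * 6)   # 12 table entries vs 36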

Page 17: Edinburgh MT lecture2: Probability and Language Models

[Table: joint distribution of weather and temperature over {20°C, 15°C, 10°C, 5°C, 0°C, -5°C}]

p(snow|-5°C) = .30

In most interesting models, variables are not conditionally independent.

Page 18: Edinburgh MT lecture2: Probability and Language Models

Under this distribution, temperature and weather r.v.’s are not conditionally independent!

[Table: joint distribution of weather and temperature over {20°C, 15°C, 10°C, 5°C, 0°C, -5°C}]

p(snow) = .043

In most interesting models, variables are not conditionally independent.

p(snow|-5°C) = .30

Page 19: Edinburgh MT lecture2: Probability and Language Models

p(B|A) = p(A,B)/p(A)

Page 20: Edinburgh MT lecture2: Probability and Language Models

p(A, B) = p(A) · p(B|A)

Chain rule

Page 21: Edinburgh MT lecture2: Probability and Language Models

p(A, B) = p(A) · p(B|A)

Page 22: Edinburgh MT lecture2: Probability and Language Models

p(A, B) = p(A) · p(B|A) = p(B) · p(A|B)

Page 23: Edinburgh MT lecture2: Probability and Language Models

p(A) · p(B|A) = p(B) · p(A|B)

Page 24: Edinburgh MT lecture2: Probability and Language Models

p(B|A) = p(B) · p(A|B) / p(A)

Page 25: Edinburgh MT lecture2: Probability and Language Models

p(B|A) = p(B) · p(A|B) / p(A)

Bayes’ Rule

Page 26: Edinburgh MT lecture2: Probability and Language Models

p(B|A) = p(B) · p(A|B) / p(A)

Bayes’ Rule

posterior: p(B|A)    prior: p(B)    likelihood: p(A|B)    evidence: p(A)
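
(A minimal sketch, not from the slides, with made-up numbers: the posterior is the prior times the likelihood, divided by the evidence.)

prior = 0.01        # p(B)
likelihood = 0.9    # p(A|B)
evidence = 0.05     # p(A)

posterior = prior * likelihood / evidence   # p(B|A) by Bayes' Rule
print(posterior)                            # ≈ 0.18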

Page 27: Edinburgh MT lecture2: Probability and Language Models

p(English|Chinese) = p(English) × p(Chinese|English) / p(Chinese)

prior: p(English)    likelihood: p(Chinese|English)    evidence: p(Chinese)

Bayes’ Rule

Page 28: Edinburgh MT lecture2: Probability and Language Models

p(English|Chinese) = p(English) × p(Chinese|English) / p(Chinese)

signal model: p(English)    channel model: p(Chinese|English)
normalization: p(Chinese) (ensures we’re working with valid probabilities)

Noisy Channel

Page 29: Edinburgh MT lecture2: Probability and Language Models

p(English|Chinese) = p(English) × p(Chinese|English) / p(Chinese)

language model: p(English)    translation model: p(Chinese|English)
normalization: p(Chinese) (ensures we’re working with valid probabilities)

Machine Translation

Page 30: Edinburgh MT lecture2: Probability and Language Models

English

p(Chinese|English)

Page 31: Edinburgh MT lecture2: Probability and Language Models

English

p(Chinese|English) × p(English)

Page 32: Edinburgh MT lecture2: Probability and Language Models

English

p(Chinese|English) × p(English) ∝ p(English|Chinese)

Page 33: Edinburgh MT lecture2: Probability and Language Models

p(English|Chinese) = p(English) × p(Chinese|English) / p(Chinese)

language model: p(English)    translation model: p(Chinese|English)    evidence: p(Chinese)

Machine Translation

Page 34: Edinburgh MT lecture2: Probability and Language Models

p(English|Chinese) ∝ p(English) × p(Chinese|English)

Machine Translation

Questions we must answer:

What is the sample space?

How do we define the probability of a sentence?

How do we define the probability of a Chinese sentence, given a particular English sentence?
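
(A minimal sketch, not from the slides: noisy-channel decoding as an argmax over candidate English sentences. The tables lm and tm and the toy sentences are made up for illustration; p(Chinese) is dropped because it is constant across candidates.)

def decode(chinese, candidates, lm, tm):
    # argmax over English candidates of p(English) * p(Chinese|English)
    return max(candidates, key=lambda english: lm[english] * tm[(chinese, english)])

# toy usage with made-up numbers
lm = {"the cat sleeps": 0.02, "cat the sleeps": 0.0001}    # language model p(English)
tm = {("mao shuijiao", "the cat sleeps"): 0.1,             # translation model
      ("mao shuijiao", "cat the sleeps"): 0.1}             # p(Chinese|English)
print(decode("mao shuijiao", list(lm), lm, tm))            # -> the cat sleeps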

Page 35: Edinburgh MT lecture2: Probability and Language Models

n-gram Language Models

S = V*, where V = the set of all English words

Define an infinite set of events: Xi(s) = the ith word in s if len(s) ≥ i, ε otherwise.

Must define: p(X0...X∞)

Page 36: Edinburgh MT lecture2: Probability and Language Models

n-gram Language Models

S = V*, where V = the set of all English words

Define an infinite set of events: Xi(s) = the ith word in s if len(s) ≥ i, ε otherwise.

Must define: p(X0...X∞)

by chain rule:
= p(X0) p(X1|X0) ... p(Xk|X0...Xk−1) ...

Page 37: Edinburgh MT lecture2: Probability and Language Models

n-gram Language Models

S = V*, where V = the set of all English words

Define an infinite set of events: Xi(s) = the ith word in s if len(s) ≥ i, ε otherwise.

Must define: p(X0...X∞)

by chain rule:
= p(X0) p(X1|X0) ... p(Xk|X0...Xk−1) ...

assume conditional independence:
= p(X0) p(X1|X0) ... p(Xk|Xk−1) ...
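
(A minimal sketch, not from the slides: the same word sequence scored with the exact chain-rule factorization and with the bigram approximation. p_cond(word, history) is a hypothetical conditional probability p(word | history).)

from math import prod

def p_chain_rule(words, p_cond):
    # p(X0) p(X1|X0) ... p(Xk|X0...Xk-1): each word conditions on its full history
    return prod(p_cond(w, tuple(words[:i])) for i, w in enumerate(words))

def p_bigram(words, p_cond):
    # p(X0) p(X1|X0) ... p(Xk|Xk-1): each word conditions only on the previous word
    return prod(p_cond(w, tuple(words[max(0, i - 1):i])) for i, w in enumerate(words))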

Page 38: Edinburgh MT lecture2: Probability and Language Models

Key idea: since the language model is a joint model over all words in a sentence, make each word depend on the n−1 previous words in the sentence.

n-gram Language Models

Page 39: Edinburgh MT lecture2: Probability and Language Models

p(However|START)

n-gram Language Models

Page 40: Edinburgh MT lecture2: Probability and Language Models

p(However|START)

A number between 0 and 1.

n-gram Language Models

Page 41: Edinburgh MT lecture2: Probability and Language Models

p(However|START)

A number between 0 and 1.

Σ_x p(x|START) = 1

n-gram Language Models
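
(A minimal sketch, not from the slides, with made-up numbers: whatever values the model assigns, p(x|START) must sum to 1 over the whole vocabulary.)

p_given_start = {"However": 0.01, "The": 0.12, "I": 0.05, "<all other words>": 0.82}
assert abs(sum(p_given_start.values()) - 1.0) < 1e-9   # Σ_x p(x|START) = 1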

Page 42: Edinburgh MT lecture2: Probability and Language Models

However

p(However|START)

Language Models

Page 43: Edinburgh MT lecture2: Probability and Language Models

However ,

p(, |However)

Language Models

Page 44: Edinburgh MT lecture2: Probability and Language Models

However , the

p(the|, )

Language Models

Page 45: Edinburgh MT lecture2: Probability and Language Models

However , the sky

p(sky|the)

Language Models

Page 46: Edinburgh MT lecture2: Probability and Language Models

However , the sky remained

Language Models

p(remained|sky)

Page 47: Edinburgh MT lecture2: Probability and Language Models

However , the sky remained clear

Language Models

p(clear|remained)

Page 48: Edinburgh MT lecture2: Probability and Language Models

However , the sky remained clear ... wind .

Language Models

... p(STOP|.)

Page 49: Edinburgh MT lecture2: Probability and Language Models

p(English) = ∏_{i=1}^{length(English)} p(word_i | word_{i−1})

Language Models

Page 50: Edinburgh MT lecture2: Probability and Language Models

p(English) = ∏_{i=1}^{length(English)} p(word_i | word_{i−1})

Language Models

Note: the probability that word0=START is 1.

Page 51: Edinburgh MT lecture2: Probability and Language Models

p(English) = ∏_{i=1}^{length(English)} p(word_i | word_{i−1})

Language Models

Note: the probability that word0=START is 1.

This model explains every word in the English sentence.

Page 52: Edinburgh MT lecture2: Probability and Language Models

p(English) = ∏_{i=1}^{length(English)} p(word_i | word_{i−1})

Language Models

Note: the probability that word0=START is 1.

This model explains every word in the English sentence. But it makes very strong conditional independence assumptions!
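
(A minimal sketch, not from the slides: the bigram product applied to a shortened version of the running example, with word_0 = START and an explicit STOP. The probability table and its numbers are made up for illustration.)

from math import prod

p_bigram = {
    ("START", "However"): 0.01, ("However", ","): 0.4, (",", "the"): 0.2,
    ("the", "sky"): 0.001, ("sky", "remained"): 0.05, ("remained", "clear"): 0.1,
    ("clear", "."): 0.3, (".", "STOP"): 0.9,
}

def sentence_probability(words, table):
    # p(English) = product over i of p(word_i | word_{i-1}), with word_0 = START and a final STOP
    padded = ["START"] + words + ["STOP"]
    return prod(table[(prev, cur)] for prev, cur in zip(padded, padded[1:]))

print(sentence_probability(["However", ",", "the", "sky", "remained", "clear", "."], p_bigram))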