Transcript
Slide 1
Albert Gatt Corpora and statistical methods
Slide 2
In this lecture: overview of the rules of probability (multiplication
rule, subtraction rule); probability based on prior knowledge
(conditional probability, Bayes' theorem)
Slide 3
Conditional probability and independence Part 1
Slide 4
Prior knowledge Sometimes, our estimate of the probability of
something is affected by what we already know; cf. the many linguistic
examples in Jurafsky 2003. Example: part-of-speech tagging. Task:
assign a label indicating the grammatical category to every word in
a corpus of running text. One of the classic tasks in statistical
NLP.
Slide 5
Part-of-speech tagging example Statistical POS taggers are
first trained on data that has been previously annotated. This yields a
language model. Language models vary based on the n-gram window:
unigrams: probability based on single tokens (a lexicon). E.g. for the
input the_DET tall_ADJ man_NN, the model represents the probability that
the word man is a noun (NB: it could also be a verb). bigrams:
probabilities across a span of 2 words. For the same input, the model
represents the probability that a DET is followed by
an adjective, that an adjective is followed by a noun, etc. Can also do
trigrams, quadrigrams, etc. (A counting sketch follows below.)
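To make the counting concrete, here is a minimal Python sketch of how such unigram and bigram statistics might be collected. The seven-token tagged sample is a hypothetical toy, not real training data:

```python
from collections import Counter

# Hypothetical toy sample in the word_TAG format used above.
tagged = "the_DET tall_ADJ man_NN saw_VV the_DET old_ADJ woman_NN".split()
pairs = [tok.rsplit("_", 1) for tok in tagged]   # [(word, tag), ...]

# Unigram lexicon: counts of (word, tag) pairs.
lexicon = Counter((w.lower(), t) for w, t in pairs)

# Bigram model: counts of adjacent tag pairs (DET followed by ADJ, etc.).
tags = [t for _, t in pairs]
transitions = Counter(zip(tags, tags[1:]))

print(lexicon[("man", "NN")])        # 1
print(transitions[("DET", "ADJ")])   # 2
```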
Slide 6
POS tagging continued Suppose we've trained a tagger on
annotated data. It has: a lexicon of unigrams: P(the=DET),
P(man=NN), etc.; a bigram model: P(DET is followed by ADJ), etc. Assume
we've trained it on a large input sample. We now feed it a new
phrase: the audacious alien. Our tagger knows that the word the is a
DET, but it's never seen the other words. It can: make a wild guess
(not very useful!), or estimate the probability that the is followed by
an ADJ, and that an ADJ is followed by a NOUN.
Slide 7
Prior knowledge revisited Given that I know that the is a DET,
what's the probability that the following word audacious is an ADJ?
This is very different from asking what's the probability that
audacious is an ADJ out of context. We have prior knowledge that a DET
has occurred. This can significantly change the estimate of the
probability that audacious is an ADJ. We therefore distinguish: prior
probability: naive estimate based on long-run frequency; posterior
probability: probability estimate based on prior knowledge.
Slide 8
Conditional probability In our example, we were estimating:
P(ADJ|DET) = probability of ADJ given DET; P(NN|DET) = probability
of NN given DET; etc. In general: the conditional probability P(A|B)
is the probability that A occurs, given that we know that B has
occurred.
Slide 9
Example continued If I've just seen a DET, what's the probability
that my next word is an ADJ? Need to take into account: occurrences
of ADJ in our training data: VV+ADJ (was beautiful), PP+ADJ (with
great concern), DET+ADJ, etc.; occurrences of DET in our training
corpus: DET+N (the man), DET+V (the loving husband), DET+ADJ (the
tall man).
Slide 10
Venn diagram representation of the bigram training data
[Venn diagram: one region for cases where w is an ADJ NOT preceded by a
DET (is+tall, in+terrible, were+nice); one region for cases where w is a
DET NOT followed by an ADJ (the+man, the+woman, a+road); their
intersection for cases where w is a DET followed by an ADJ (the+tall,
a+simple, an+excellent).]
Slide 11
Estimation of conditional probability Intuition: P(A|B) is the
ratio of the chances that both A and B happen to the chances of B
happening alone: P(A|B) = P(A & B) / P(B). For our example:
P(ADJ|DET) = P(DET+ADJ) / P(DET). (A count-based sketch follows below.)
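As a minimal sketch of this ratio in code, assume the hypothetical bigram counts below stand in for what a trained model would contain:

```python
from collections import Counter

# Hypothetical bigram counts over tags, standing in for a trained model.
transitions = Counter({("DET", "ADJ"): 300, ("DET", "NN"): 650,
                       ("DET", "VV"): 50, ("VV", "ADJ"): 120})

# P(ADJ | DET) = count(DET followed by ADJ) / count(DET in first position)
det_total = sum(n for (t1, _), n in transitions.items() if t1 == "DET")
print(transitions[("DET", "ADJ")] / det_total)   # 300 / 1000 = 0.3
```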
Slide 12
Another example If we throw a die, what's the probability that
the number we get is even, given that the number we get is larger
than 4? This works out as the probability of getting the number 6:
P(even|>4) = P(even & >4) / P(>4) = (1/6) / (2/6) = 0.5.
Note the difference from simple, prior probability: using only
frequency, P(6) = 1/6.
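A short enumeration over the six equally likely outcomes confirms the arithmetic:

```python
from fractions import Fraction

# The six equally likely die outcomes.
gt4 = [n for n in range(1, 7) if n > 4]          # {5, 6}
even_and_gt4 = [n for n in gt4 if n % 2 == 0]    # {6}

# P(even | >4) = P(even & >4) / P(>4) = (1/6) / (2/6)
print(Fraction(len(even_and_gt4), 6) / Fraction(len(gt4), 6))  # 1/2
```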
Slide 13
Mind the fallacies! When we speak of prior and posterior, we
don't necessarily mean in time, e.g. the die example. Monte Carlo
fallacy: if 20 turns of the roulette wheel have fallen on black,
what are the chances that the next turn will fall on red? In
reality, prior experience here makes no difference at all: every
turn of the wheel is independent of every other.
Slide 14
The multiplication rule
Slide 15
Multiplying probabilities Often, we're interested in switching
the conditional probability estimate around. Suppose we know P(A|B)
or P(B|A); we want to calculate P(A AND B). For both A and B to
occur, they must occur in some sequence (first A occurs, then
B).
Slide 16
Estimating P(A AND B) P(A AND B) = P(A) × P(B|A), i.e. the probability
that both A and B occur is the probability of A happening overall,
times the probability of B happening given that A has happened.
Slide 17
Multiplication rule: example 1 We have a standard deck of 52
cards. What's the probability of pulling out two aces in a row? NB:
a standard deck has 4 aces. Let A1 stand for an ace on the first pick,
A2 for an ace on the second pick. We're interested in P(A1 AND
A2).
Slide 18
Example 1 continued P(A1 AND A2) = P(A1) × P(A2|A1). P(A1) = 4/52
(since there are 4 aces in a 52-card pack). If we do pick an ace on
the first pick, then we diminish the odds of picking a second ace
(there are now 3 aces left in a 51-card pack): P(A2|A1) = 3/51.
Overall: P(A1 AND A2) = (4/52) × (3/51) ≈ 0.0045.
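The same computation with exact fractions:

```python
from fractions import Fraction

# Multiplication rule for two aces in a row (sampling without replacement):
# P(A1 AND A2) = P(A1) * P(A2 | A1)
p_a1 = Fraction(4, 52)           # 4 aces in a full 52-card deck
p_a2_given_a1 = Fraction(3, 51)  # 3 aces left among the remaining 51 cards
p_both = p_a1 * p_a2_given_a1
print(p_both, float(p_both))     # 1/221, roughly 0.0045
```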
Slide 19
Example 2 We randomly pick two words, w1 and w2, out of a
tagged corpus. What are the chances that both words are adjectives?
Let ADJ be the set of all adjectives in the corpus (tokens, not
types); |ADJ| = total number of adjectives. A1 = the event of picking
out an ADJ on the first try; A2 = the event of picking out an ADJ on
the second try. P(A1 AND A2) is estimated in the same way as in the
previous example: in the event of A1, the chances of A2 are
diminished, and the multiplication rule takes this into account.
Slide 20
Some observations In these examples, the two events are not
independent of each other: the occurrence of one affects the likelihood
of the other, e.g. drawing an ace first diminishes the likelihood of
drawing a second ace. This is sampling without replacement. If we put
the ace back into the pack after we've drawn it, then we have
sampling with replacement; in this case, the probability of one
event doesn't affect the probability of the other. (The contrast is
sketched below.)
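A brief sketch of the contrast, using the ace example:

```python
from fractions import Fraction

# With replacement: the first draw leaves the deck unchanged.
p_with = Fraction(4, 52) * Fraction(4, 52)
# Without replacement: the first ace drawn changes the odds for the second.
p_without = Fraction(4, 52) * Fraction(3, 51)
print(float(p_with), float(p_without))   # ~0.0059 vs ~0.0045
```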
Slide 21
Extending the multiplication rule The logic of the A AND B rule
is: both conditions, A and B, have to be met; A is met a fraction of
the time; B is met a fraction of the times that A is met. This can be
extended indefinitely, e.g. the chances of drawing 4 straight aces from
a pack: P(A1 & A2 & A3 & A4) = P(A1) × P(A2|A1) × P(A3|A1
& A2) × P(A4|A1 & A2 & A3).
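Computing that chain with exact fractions:

```python
from fractions import Fraction
from math import prod

# Chain rule for four straight aces, sampling without replacement:
# P(A1 & A2 & A3 & A4) = P(A1) P(A2|A1) P(A3|A1 & A2) P(A4|A1 & A2 & A3)
factors = [Fraction(4 - i, 52 - i) for i in range(4)]  # 4/52, 3/51, 2/50, 1/49
p = prod(factors)
print(p, float(p))   # 1/270725, roughly 3.7e-06
```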
Slide 22
The subtraction rule
Slide 23
Extending the addition rule It's easy to extend the
multiplication rule. Extending the addition rule isn't so easy: we
need to correct for double-counting events.
Slide 24
Example: P(A OR B OR C) [Venn diagram of three overlapping sets A, B
and C.] Once we've discounted the 2-way intersection of A and B, etc.,
we need to add the 3-way intersection back in: P(A OR B OR C) =
P(A) + P(B) + P(C) - P(A & B) - P(A & C) - P(B & C) + P(A & B & C).
Slide 25
Subtraction rule Fundamental underlying observation: P(E) = 1 - P(not E).
E.g. the probability of getting at least one head in 3 flips of a coin (a
three-set addition problem) can be estimated using the observation
that: P(at least one head out of 3 flips) = 1 - P(no heads) = 1 - P(3 tails)
= 1 - (1/2)^3 = 7/8.
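The shortcut checks out against brute-force enumeration of the equally likely flip sequences:

```python
from itertools import product

# Subtraction rule: P(at least one head in 3 flips) = 1 - P(three tails).
print(1 - 0.5 ** 3)   # 0.875

# Cross-check by enumerating all 2**3 equally likely flip sequences.
flips = list(product("HT", repeat=3))
print(sum("H" in seq for seq in flips) / len(flips))   # 0.875
```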
Slide 26
Bayes' theorem Part 4
Slide 27
Switching conditional probabilities Problem 1: we know the
probability that a test will give a positive result in case a person has
a disease. We want to know the probability that there is indeed a
disease, given that the test says positive. Useful for finding false
positives. Problem 2: we know the probability P(ADJ|DET) that some
word w2 is an ADJ, given that the previous word w1 is a DET. We find
a new word w. We don't know its category; it might be a DET. We do
know that the following word is an ADJ. We would therefore like to
know the reverse, i.e. P(DET|ADJ).
Slide 28
Deriving Bayes' rule from the multiplication rule Given the symmetry
of intersection (A & B = B & A), the multiplication rule can be written
in two ways: P(A & B) = P(A|B) × P(B) and P(A & B) = P(B|A) × P(A).
Bayes' rule involves the substitution of one equation into the
other, to replace P(A & B): P(B|A) = P(A|B) × P(B) / P(A).
Slide 29
Deriving P(A) Often, it's not clear where P(A) should come from:
we start out from conditional probabilities! Given that we have two
sets of outcomes of interest, A and B, P(A) can be derived from the
following observation: P(A) = P(A & B) + P(A & ¬B), i.e. the events in
A are made up of those which are only in A (but not in B) and those
which are in both A and B.
Slide 30
Finding P(A) -- I [Venn diagram of overlapping sets A and B.] An
outcome in A must be in one or the other (A alone, or both A and B),
since A is composed of these two sets.
Slide 31
Finding P(A) -- II Step 1: applying the addition rule over the two
disjoint parts of A: P(A) = P(A|B) × P(B) + P(A|¬B) × P(¬B). Step 2:
substituting into Bayes' equation to replace P(A): P(B|A) =
P(A|B) × P(B) / (P(A|B) × P(B) + P(A|¬B) × P(¬B)).
Slide 32
Summary This ends our first foray into the rules of probability:
addition rule; subtraction & multiplication rules; conditional
probability; Bayes' theorem.
Slide 33
Next up Probability distributions; random variables; basic
information theory.