|
Chain rule |
 
|
-
Prob(A,B) = Prob(A|B) * Prob (B)
Prob(A,B) = Prob(B|A) * Prob (A)
-
Prob(A,B,C) = Prob(C|A,B) * Prob (A,B)
Prob(A,B,C) = Prob(C|A,B) * Prob(B|A) * Prob (A)
Prob(A,B,C) = Prob(A) * Prob(B|A) * Prob(C|A,B)
-
Prob(A,B,C,D) = Prob(D|A,B,C) * Prob (A,B,C)
Prob(A,B,C,D) = Prob(D|A,B,C) * Prob(A) * Prob(B|A) * Prob(C|A,B)
Prob(A,B,C,D) = Prob(A) * Prob(B|A) * Prob(C|A,B) *
Prob(D|A,B,C)
Prob(A1,..., An) = Prob(A1) * Prob(A2|A1) * ...
Prob(An|A1,A2, ... , An-1)
|
Word
Sequences |
 
|
We abbreviate word sequences w1, w2, w3,..., wn as:
We now look at an n-word word sequence
w1n as a joint event
consisting of word 1 occuring in the first position,
word 2 occurring in the second position,
and so on.
This lets us use the chain rule:
-
Prob(w1,..., wn) = Prob(w1) * Prob(w2|w1) * ...
Prob(wn|w1,w2, ... , wn-1)
Using our abbreviation:
-
Prob(w1n) = Prob(w1) * Prob(w2|w1) * ...
Prob(wn|w1n-1)
|
|
Sparseness |
 
|
The problem with this formulation of the probability can
be seen by looking at the last term:
This is the probability of the last word given the entire sequence of
words before it.
How would we compute such a thing?
It's easy!
To do this right we need some corpus large
enough to give us a representative sampling of
of the n-word strings beginning with w1n-1,
of which there is hopefully a representative sample
of w1n.
The problem is that there isn't enough data in the world to
get such representative samples
in most cases for even moderately small n.
Consider a very small n. Consider Shakespeare.
- Word token Count: 884,647
- Word form Types: 29,066 (including lots of proper nouns)
- Number of bigram types = 29,0662 = 844 million
- Number of bigram tokens: 884,647
In any corpus of this size, we're very unlikeley
to see most of the rarer bigrams.
Now consider Shakerspeare and trigrams:
- Word token Count: 884,647
- Word form Types: 29,066 (including lots of proper nouns)
- Number of trigram types = 29,0663 = 25,636,000,000,000
- Number of trigram tokens: 884,647
We get only a vanishing small sample of the entire
space of trigrams. We're likely to encounter
only the most common ones.
|
|
Approximation |
 
|
Since we can't compute the probability we want directly,
we approximate:
- Prob(wn|w1n-1)=app
Prob(wn|wn-1)
Why is this a reasonable approximation?
- |wn-1n|÷ |wn-1|
=app
|w1n|÷ |w1n-1|
Take a special case:
What we're saying for
this special case is that:
- |long distance call| ÷ |long distance| =app |distance call| ÷ |distance|
This is is exactly right when adding long in front of
distance neither
makes call more nor less likely. It has no effect. The
occurrence of that that word call is completely independent of
of the occurrence of long two words back.
So what our approximation claims is that the strongest conditioning
effect on what comes next is from the last word. To the extent that
the word two words back has a conditioning effect we're wrong.
In our example, it seems plausible that long is a factor
in making the word call more likely to come next.
What would be a better approximation?
- |wn-1n|÷ |wn-1|
=app
|w1n|÷ |w1n-1|
Look TWO words back. Now we see long. We still have other possible continuations:
But we make much less likely continuations like:
- distance himself from the opposition
But wait if we have:
- collect long distance ___
we're even more likely to have the word call occur
next (running and
runner are made much less likely for example).
The longer the history we look at, the more precise
our approximation will be, until in the limit,
it is no longer an approximation, because we look at the entire
history.
|
Bigram
Model
|
 
|
A probability model that looks one word back
to try to predict the next word is called a bigram model.
|
Trigram
Model
|
 
|
A probability model that looks two words back
to try to predict the next word is called a trigram model.
|
Berkeley
restaurant
project
|
 
|
A query system for finding Berkeley restaurants, providing
a small corpus of queries about food and restaurants
in Berkeley.
- I'm looking for cantonese food.
- I'd like to eat dinner someplace nearby.
- Tell me about Chez Panisse.
|
the basic
idea
|
 
|
We count. For each word, we count up occurrences of following words
and array them in a bigram table.
We also total up the total number of occurrences of each word,
| I | 3437 |
| want | 1215 |
| to | 3256 |
| eat | 938 |
| Chinese | 213 |
| food | 1506 |
| lunch | 459 |
We compute probabilities.
- P(wn | wn-1) =
|wn-1wn| ÷ |wn |
- P(to | want) = |want to| ÷ |want| = 786 ÷ 1215 = .65
Now we can build up tables with probabilities instead of counts.
|
Note that most of the cells in our original count tables will be zero.
Most of the time our bigram model assigns probability zero
to a potential following word:
Add-one
Smoothing
basic idea
|
 
|
We add one to every cell of
this table
We get this table
We recompute our our total occurrences:
-
I 3437 +1616 =5053
-
want 1215 + 1616 = 2931
-
to 3256 + 1616 = 4872
-
eat 938 + 1616 = 2554
-
Chinese 213 + 1616 = 1829
-
food 1506 + 1616 = 3122
-
lunch 459 + 1616 = 2075
Now we recompute the probabilities:
- P(wn | wn-1) =
|wn-1wn| ÷ |wn |
- P(food | want) = |want to| ÷ |want| = 1 ÷ 2931 = .0003
- P(to | want) = |want to| ÷ |want| = 787 ÷ 2931 = .27
This gives us this bigram probability table.
Compare this one.
Some things to notice:
- The events that used to be zeroes don't all have the same probability.
- All the events in the same row that were zeros in the old model get
the same probability in the new model.
- ALL the non-zero probabilities went down.
- Sometimes the change doesn't look very large
- P(eat | I)[.0038 -> .0028]
- P(I | to)[.00092 -> .00082]
- Some very predictable events became less predictable:
- P(to|want)[.65-> .22]
- P(food|Chinese) [.56 -> .066]
- Other probabilities changed by large factors.
- P(lunch|Chinese) [.0047 -> .0011]
- P(food|want) [.0066 -> .0032]
- Likelihood ratios changed
- old model: P(I|lunch) = 4 * P(food|lunch)
- new model: P(I|lunch) = 2.5 * P(food|lunch)
Conclusion: Increasing the zero
probabilities from zero to a small number was good,
but the effect on the non-zero probabilities
was not always good. We're blurring our original model.
- We've assigned too much probability to the zeros,
with the result that sharply predictable events [P(to|want)] became much
less so, and some moderately rare events became very rare.
- We've indiscriminately assigned probability to various zeros,
with the result that sharply predictable events [P(to|want)] became much
less so, and some moderately rare events became very rare. We'd
lie to distribute probabilities to the zeroes so that events that
are highly predictable in the old model remain so in
the new model.
- We want a model that changes the existing model less, but still steals
away some probability to assign to the zero events.
|
Witten Bell
Discounting.
|
 
|
If we're going to assign the probability to zero-events, the
probabilities of others has to go down.
Why? Because the probability of all the possible events we're looking
at must add up to 1.
Take the case of want:
- Count before smoothing: 1215
- Count after smoothing: 1215 + 1616 = 2931
- Number of word types not seen to follow want (estimating):
Top 4 words (to, a, some Thai)
= .75 of the probability mass
304 occurrences to distribute among 1612 types
a minimum of 1308 never-before seen types
- This means that, in the model, following want,
almost half of the probability mass is
reserved for unseen events, 1308 events each
of which has the probability 1/1308.
- 1308 & divide 2931 = .45
- Which means the probability of all the previously seen
words has to go down precipitously (1.0 -> .55)
It's easy to see what the extreme case would be.
Suppose the word to always followed the word
want in our corpus but that
want was
a much rare word. Then we'd still have pretty good evidence that
to was extremely likely after want. Our
initial model would assign probability 1.
What would happen with add-one smoothing?
- Count before smoothing: 100
- Count after smoothing: 100 + 1616 = 1716
- Number of word types not seen to follow want: 1615.
- probability for unseen events: 1615 / 1716 = .94
- p(to|want) after smoothing = .06
|