|
Witten-Bell
|
In Witten-Bell we relativize the amount of
discounting to each word's promiscuity.
Which we will measure directly in terms of
how many distinct word types follow it.
We deal with the case of conditional
probabilities.
Some more notation. For the special case of bigrams:
Partitioning the vocabulary relative to w
| Z(w) | = | Number of word types not seen after w |
| T(w) | = | Number of word types seen after w |
| V | = | Z(w) + T(w) |
Just to make clear what this notation
means, lets'
rephrase add-one discounting (conditional distribution)
using the new notation:
Probability reserved
for zeros
|
Z(w)
|
|
C(w) + V
|
Probability reserved
for each non-zero
p(w2 | w1)
|
C(w1w2) + 1
|
|
C(w1) + V
|
|
Total Prob
|
Z(w1) + C(w1) + T(w1)
|
|
C(w1) + V
|
Note the claim that the total probability is 1 depends on:
- T(w1) = Sumw2: w2 somewhere follows w1 1
- V = Z(w) + T(w)
- Sumw2 C(w1w2) = C(w1)
Witten-Bell discounting
is very similar, but with
a wrinkle. Relativize amount of discounting
to T(w1)
Probability reserved
for zeros
|
T(w1)
|
|
C(w1) + T(w1)
|
Probability reserved
for non-zeros
p(w2 | w1)
|
C(w1w2)
|
|
C(w1) + T(w1)
|
|
Total Prob
|
C(w1) + T(w1)
|
|
C(w1) + T(w1)
|
Observation:
A word followed by few types (relative to its frequency) gets
a small amount of probability mass reserved for zeros.
Consider the case of Chinese in the Berkeley Restaurant
corpus.
| Z(Chinese) | 20 |   |
| C(Chinese) | 213 |   |
Prob mass
for zeroes | 20 |
0.094
|
| 213 + 20 |
|
PMLE(food | Chinese)
|
120
|
0.563
|
|
213
|
|
PLap(food | Chinese)
|
121
|
0.066
|
|
213 + 1616
|
|
PWB(food | Chinese)
|
120
|
0.515
|
| 213 + 20 |
|
 
|
|
Good-Turing
|
Basic idea very simple, implementation not.
Instead using relative frequency r/N, we
use an "adusted frequency" r*/N. For
events seen before the adjusted
probability is:
| Descriptive Name | Abbreviation | Notation |
Adjusted Count | Discounted
Prob |
| Good-Turing | GT |
PGT(w1...wn), if
C( w1...wn)= r |
(r+1) Nr + 1 |
(r+1) Nr + 1 |
| Nr |
Nr* N |
First off, we're sorting events into frequency bins.
There is a bin for events of frequency
r, a bin for events of frequency r+1.
And so on.
Let's look at the amount of probability mass reserved
for the r bin and compare it to the mass reserved for the r+1
bin for both MLE and GT.
We'll use the notation
FPMGT(r+1) (frequency probability mass)
for the amount of probability mass reserved for
the r+1 bin for the GT method and
FPMMLE(r+1)
for the amount reserved for the MLE method.
|
frequency
|
FPMMLE
|
FPMGT
|
 
|
Simplified
FPMGT
|
|
r
|
r * Nr
|
Nr * (r+1)* Nr + 1
|
=
|
(r+1) * Nr + 1
|
|
N
|
N * Nr
|
N
|
|
r + 1
|
(r + 1) * Nr+1
|
Nr+1 * (r+2) * Nr + 2
|
=
|
(r+2) * Nr + 2
|
|
N
|
N * Nr + 1
|
N
|
So what we're doing is playing a shell game. We
still the (lesser) probability mass reserved
for the r+1 bin
and use it for the the r bin.
We use the r bin's mass for
the r-1 bin.
And so on, passing the buck down to 0.
|
r
|
FPMMLE
|
FPMGT
|
|
n
|
FPMMLE(n)
|
FPMMLE(n+1)
|
|
n -1
|
FPMMLE(n -1)
|
FPMMLE(n)
|
|
n - 2
|
FPMMLE(n-1)
|
FPMMLE(n-2)
|
|
........
|
|
2
|
FPMMLE(2)
|
FPMMLE(3)
|
1
|
FPMMLE(1)
|
FPMMLE(2)
|
0
|
FPMMLE(0)=0
|
FPMMLE(1)
|
This works because each prob mass on the RHS is
less than the mass on the left side.
For events with count= 0, we would have:
| Descriptive Name | Abbreviation | Notation | Adjusted Count | FPMGT(0) |
| Good-Turing | GT |
PGT(w1...wn) |
1 * N1 |
N1 |
| N0 |
N |
So we would estimate the probability of events of
count 0 by using the probability mass reserved for
the events of count 1. And distributing that mass uniformly,
each zero event would receive probability N1/(N0N).
But that isn't quite what happens for events
of frequency 0. We still need the probabilities
to add to 1. The only way to guarantee this
is just to use whatever's left over when we get to 0.
Then what we have for the 0 bin is:
| 1 - | Infinity |   |
| Sum | FPMGT(r) |
| r = 1 |   |
In practice this will be approximately equal to N1/N:
This shell game however cannot be played by just
taking the frequency counts directly from
the data. Some frequencies of
frequencies for bigrams
in Austen:
| Frequency | Nr |
| 1 | 138741 |
| 2 | 25413 |
| 3 | 10531 |
| 4 | 5997 |
| 5 | 3565 |
| 6 | 2486 |
|
.....
|
| 1264 | 1 |
|
| Frequency | Nr |
| 1265 | 0 |
| 1266 | 0 |
|
.....
|
0
|
| 1366 | 1 |
|
.....
|
0
|
| 1917 | 1 |
|
.....
|
| 2507 | 1 |
|
Our model assumes that every frequency is represented.
But in fact as we move up the frequency scale, gaps begin to occur.
There are no bigrams with frequencies between 1264 and 1366.
And of course there no bigrams with any frequencies
over 2507, because that is the frequency for
the most frequent bigram.
Three solutions:
- Find some function S
that smooths the
above table. For every frequency r,
S(r)
is the expected number of types of
frequency r+1.
- Do GT estimation only for low frequency
items (say occurring 5 times, or less). Use the Nr+1 found
in the data for those.
- Combine 1 and 2. Use Nr+1 in the data
until there is no significant difference between the
Nr+1 value in the data and r+1
computed by S. ((Simple Good-Turing: Gale
and Sampson 1995).
|
 
|