Statistical Methods in Computational Linguistics

Discounting

The Problem with Add-One (Laplace's Law)

Notation
Conventions
(Ch. 6)

  1. N: size of corpus [count tokens]
  2. V: size of vocabulary [count types]
  3. B: parameter size of model [count type sequences]
    Chunk size B
    unigram V
    bigram V2
    trigram V3
    ngram Vn
  4. r: C(w1...wn) is the count (also, "frequency") of ngram chunk w1...wn
  5. Nr: The number of ngrams with count r. Typically very high for low r, r=0, r=1, r=2, ... The various Nr are called frequencies of frequencies or count counts.
(1)
Laplace's
Estimate

    C(w1...wn)+ 1
PLap(w1...wn) =
    N + B

(2)
Sparse Data

Numbers from Church and Gale (1991a):

  1. N: 44 million words of AP newswire
  2. V: 400, 653 [maintaining case distinctions, hyphenated words, lots of names!]
  3. B: 1.6 * 1011 bigrams [gasp!]
  4. B is much bigger than N. This means we are adding 1 to the numerator and a REAlLY BIG NUMBER to the numerator, which means we are dramatically reducing probabilities. Laplace's method a non-starter.
      All the seen data counts get squeezed into a tiny, tiny portion of our probability mass.
 
Adding less
than 1

Lidstone's law of Succession. Add some lambda (normally between 0 and 1):
PLid(w1...wn) = C(w1...wn)+ lambda

N + (lambda * B)
Now you can correct for the disparity on size between B and N.

The smaller lambda is, the less probability mass you steal away from unseen events.

The case of lambda = 1/2 is called Jeffrey Perks Law or Expected Likelihood Estimation (ELE).

 
Estimates
so far

    Descriptive Name Abbreviation Notation Formula
    Maximum Likelihood
    Estimate
    MLE PMLE(w1...wn) C(w1...wn)
    N
    Lidstone's Law   PLid(w1...wn) C(w1...wn)+ lambda
    N + (lambda * B)
    Laplace's Law   PLap(w1...wn) C(w1...wn)+ 1
    N + B
    Jeffreys-Perks Law
    Expected Likelihood Estimate
    ELE PELE(w1...wn) C(w1...wn)+ 1/2
    N + (1/2 * B)
 
Estimates
for conditional
probabilities

Descriptive Name Abbreviation Notation Formula
Maximum Likelihood
Estimate
MLE PMLE(wn| w1...wn-1) C(w1...wn)
C(w1...wn-1)
Lidstone's Law   PLid(wn| w1...wn-1) C(w1...wn)+ lambda
C(w1...wn-1) + (lambda * V)
Jeffreys-Perks Law
Expected Likelihood Estimate
ELE PELE(wn| w1...wn-1) C(w1...wn)+ 1/2
C(w1...wn-1) + (1/2 * V)

Adding 1/2
still bad

ELE still gives dramatic reductions in probabilities for seen data. From the text, results for words following was in Persuasion:

    Rank Word MLE ELE
    1 not 0.065 0.036
    2 a 0.052 0.030
    3 the 0.033 0.019
    4 to 0.031 0.017
      ......  
    1482 inferior 0 0.00003
 
Witten-Bell

In Witten-Bell we relativize the amount of discounting to each word's promiscuity. Which we will measure directly in terms of how many distinct word types follow it. We deal with the case of conditional probabilities.

Some more notation. For the special case of bigrams:

    Partitioning the vocabulary relative to w
    Z(w) = Number of word types not seen after w
    T(w) = Number of word types seen after w
    V = Z(w) + T(w)

Just to make clear what this notation means, lets' rephrase add-one discounting (conditional distribution) using the new notation:

    Probability reserved
    for zeros
    Z(w)
    C(w) + V
    Probability reserved
    for each non-zero
    p(w2 | w1)
    C(w1w2) + 1
    C(w1) + V
    Total Prob Z(w1) + C(w1) + T(w1)
    C(w1) + V
Note the claim that the total probability is 1 depends on:
  1. T(w1) = Sumw2: w2 somewhere follows w1 1
  2. V = Z(w) + T(w)
  3. Sumw2 C(w1w2) = C(w1)

Witten-Bell discounting is very similar, but with a wrinkle. Relativize amount of discounting to T(w1)

    Probability reserved
    for zeros
    T(w1)
    C(w1) + T(w1)
    Probability reserved
    for non-zeros
    p(w2 | w1)
    C(w1w2)
    C(w1) + T(w1)
    Total Prob C(w1) + T(w1)
    C(w1) + T(w1)
Observation: A word followed by few types (relative to its frequency) gets a small amount of probability mass reserved for zeros.

Consider the case of Chinese in the Berkeley Restaurant corpus.
Z(Chinese) 20  
C(Chinese) 213  
Prob mass
for zeroes
20 0.094
213 + 20
PMLE(food | Chinese) 120 0.563
213
PLap(food | Chinese) 121 0.066
213 + 1616
PWB(food | Chinese) 120 0.515
213 + 20

 
Good-Turing

Basic idea very simple, implementation not. Instead using relative frequency r/N, we use an "adusted frequency" r*/N. For events seen before the adjusted probability is:

    Descriptive Name Abbreviation Notation Adjusted
    Count
    Discounted
    Prob
    Good-Turing GT PGT(w1...wn), if
    C( w1...wn)= r
    (r+1) Nr + 1 (r+1) Nr + 1
    Nr Nr* N

First off, we're sorting events into frequency bins. There is a bin for events of frequency r, a bin for events of frequency r+1. And so on.

Let's look at the amount of probability mass reserved for the r bin and compare it to the mass reserved for the r+1 bin for both MLE and GT. We'll use the notation

    FPMGT(r+1) (frequency probability mass)
for the amount of probability mass reserved for the r+1 bin for the GT method and
    FPMMLE(r+1)
for the amount reserved for the MLE method.
    frequency FPMMLE FPMGT   Simplified
    FPMGT
    r r * Nr Nr * (r+1)* Nr + 1 = (r+1) * Nr + 1
    N N * Nr N
    r + 1 (r + 1) * Nr+1 Nr+1 * (r+2) * Nr + 2 = (r+2) * Nr + 2
    N N * Nr + 1 N
So what we're doing is playing a shell game. We still the (lesser) probability mass reserved for the r+1 bin and use it for the the r bin. We use the r bin's mass for the r-1 bin. And so on, passing the buck down to 0.

r FPMMLE FPMGT
n FPMMLE(n) FPMMLE(n+1)
n -1 FPMMLE(n -1) FPMMLE(n)
n - 2 FPMMLE(n-1) FPMMLE(n-2)
........
2 FPMMLE(2) FPMMLE(3)
1 FPMMLE(1) FPMMLE(2)
0 FPMMLE(0)=0 FPMMLE(1)
This works because each prob mass on the RHS is less than the mass on the left side.

For events with count= 0, we would have:

    Descriptive Name Abbreviation Notation Adjusted
    Count
    FPMGT(0)
    Good-Turing GT PGT(w1...wn) 1 * N1 N1
    N0 N
So we would estimate the probability of events of count 0 by using the probability mass reserved for the events of count 1. And distributing that mass uniformly, each zero event would receive probability N1/(N0N).

But that isn't quite what happens for events of frequency 0. We still need the probabilities to add to 1. The only way to guarantee this is just to use whatever's left over when we get to 0. Then what we have for the 0 bin is:
1 - Infinity  
Sum FPMGT(r)
r = 1  
In practice this will be approximately equal to N1/N:

This shell game however cannot be played by just taking the frequency counts directly from the data. Some frequencies of frequencies for bigrams in Austen:

    Frequency Nr
    1 138741
    2 25413
    3 10531
    4 5997
    5 3565
    6 2486
    .....
    1264 1
    Frequency Nr
    1265 0
    1266 0
    ..... 0
    1366 1
    ..... 0
    1917 1
    .....
    2507 1
Our model assumes that every frequency is represented. But in fact as we move up the frequency scale, gaps begin to occur. There are no bigrams with frequencies between 1264 and 1366. And of course there no bigrams with any frequencies over 2507, because that is the frequency for the most frequent bigram.

Three solutions:

  1. Find some function S that smooths the above table. For every frequency r, S(r) is the expected number of types of frequency r+1.
  2. Do GT estimation only for low frequency items (say occurring 5 times, or less). Use the Nr+1 found in the data for those.
  3. Combine 1 and 2. Use Nr+1 in the data until there is no significant difference between the Nr+1 value in the data and r+1 computed by S. ((Simple Good-Turing: Gale and Sampson 1995).