Statistical Methods in Computational Linguistics

Cross-Entropy

Basic Definitions

Entropy

    H(p,m) = - Sumx p(x) log p(x)
The sum of the log probs of the model p, weighted by the real probs p.
 
Entropy
of a Language

    H(p) = Limc -> Inf -1/c Sumw in L p(w) log p(w)
(1)
Intuition

Here c is the size of a chunk in our language model:
unigram c = 1
bigram c = 2
trigram c = 3
  ...
ngram c = n

Other ideas:

  1. Note that this calculation is done for a language. Not for a corpus. The "Sum" is over all c-gram types in the language.
  2. The assumption is that as c gets big enough, the chunks are independent. For English, maybe this can't be. Or maybe "c" needs to get to be the size of say Remembrance of Things Past before the chunks really approach independence.
  3. We divide by "c" to get a per-word measure.
 
Cross-Entropy

    H(p,m) = - Sumx p(x) log m(x)
The sum of the log probs of the model m, weighted by the real probs p.
 
Cross-Entropy
of a Language

H(L,m) = Limc -> Inf -1 / c Sumw in L p(w) log m(w)
  = Limc -> Inf Sumw in L -1 / c p(w) log m(w)

(2)
Intuition
  1. Again, "c" is the size of chunks.
  2. Again this calculation is done for a language. Not for a corpus. The "Sum" is over all c-gram types in the language.
 
Log-Probability
of Corpus

Consider the following Limit:

    LimcN -> Inf -1 / (cN) log m(w1cN)
(3)
Intuition
  1. This calculation is done for a corpus, not a language.
  2. Again, "c" is the size of chunks.
  3. N is the number of chunks in our corpus.
  4. cN is thus the total number of words in the corpus. We divide by cN to get a per word measure.
  5. m(w1cN) is just the probability of the corpus according to the model.
 

Equivalence of Cross-entropy of a Language and Log-probability of a corpus

Magical
Theorem

It turns out that the right hand side of equation (2) and the expression in (3) are equal. We now show why.

 
Independence
of Chunks

We can use our independence assumption to rewrite Expression (3) as follows:

    = LimcN -> Inf - 1 / (cN) log (Pii=w1N m(i) )
    = LimcN -> Inf -1 / (cN) Sumi=w1N log m(i)
This just multiplies together the probabilities of the N chunks in our corpus, because they're independent. The log of that product is just the sum of the logs of the factors.
(4)
Corpus
Probability
by Type

In the summing part of (4) we summed over every c-gram token in the corpus. To make the coimparison to (2) clearer, we reformulate (4) to sum over every c-gram type, using the frequencies:

    = LimcN -> Inf -1 / (cN) Sumi=w1N log m(i)
    = LimcN -> Inf -1 / (cN) Sumw in L log m(w) f(w)
    = LimcN -> Inf Sumw in L -1 / (cN) log m(w) f(w)
Here f(w) is the frequency of chunk type w in the corpus. We go from a sum over all N c-gram tokens in the corpus to a sum over all the c-gram types in the language.
(5)
Chunk
Contributions

Now that we have both (2) and (5) expressed as sums over chunk types w, we can ask what one one chunk type w (a c-gram type) contributes.

 
Chunk
Contribution:
Language
Model

According to (2), we have as w's contribution:

    - 1/c p(w) log m(w)
(6)
Chunk
Contribution:
Corpus
Probability

According to (5), we have as w's contribution

    - 1/(cN) log m(w) f(w)
It turns out (6) and (7) are equal. And therefore the limits in (2) and (5) are equal. We now show this.
(7)
Representative
Corpus

We bring in the ergodic assumption. As cN gets larger and larger, the corpus is more and more likely to become representative. In a representative corpus, frequencies reflect the true probability model. That is:

    f(w) / N = p(w)
    Relative
    Frequency
    = Probability
    f(w) = N * p(w)
    Frequency = Expected
    frequency
 
Applying
Representativeness

Using our assumption about frequency, we substitute for f(w) in (7):

    = - 1/(cN) log m(w) * f(w)
    = - 1/(cN) log m(w) * p(w) * N
    = - 1/c log m(w) * p(w)
But this is exactly what w contributes in (6). And this happens for every chunk type summed over in (2) and (5), so the the limits in (2) and (5) are equal.
(8)

Cross-entropy: Cost of using the wrong model

You can think of cross-entropy of m as measuring the cost of using m instead of using the true model p. It turns out that the cross entropy is always greater than the entropy, that is, average amount of information using any model is always greater than or equal to the average information using the right model.
Size of Cross-entropy

The following is true:

    H(m,p) >= H(p)
(9)
The cross-entropy is always greater than or equal to the entropy. Equality occurs only when m = p.

Your textbook cites the following result, which you have been asked to prove as an extra-credit exercise:

    H(m,p) = H(p) + D(p||m)
It turns out that D(p||m) (Kullback-Leibler divergence) is always positive, so that proves the inequality in (9).
Kullback-
Leibler
Divergence
D(q||r)

Kullback-Leibler Divergence is defined as follows:

    D(q||r) = - Sumy q(y) log r(y)
    q(y)
Observation

The proof that D(q || r) is always positive is fairly simple. Lillian Lee (97) cites a very nice proof due to Elizabeth Thompson (Green 96). First let's note a fact about ln (the "natural log" function: log to the base of e, approximately equal to 2.718). For any positive z,

    ln z =< z-1
This can be seen by inspection. For z >= 1:
    z ln z z - 1
    1 0 0
    2 0.693 1
    3 1.099 2
    4 1.386 3
Note that ln z grows much more slowly than z, so ln z is never going to catch up to z - 1.

For z, 0 < z =< 1/e:

    z ln z z - 1
    1/e = 1/(2.718) -1.0 -0.6321
    1/3 -1.099 -0.667
    1/4 -1.386 -0.75
    1/64 -4.1589 -0.9844
    0 - Infinity -1
Here, z-1 is always greater than -1, and ln z is always less than -1.

For z, 1/e < z =< 1, ln z and z - 1 converge but only finally meet at z=1.

    z ln z z - 1
    1 0 0
    15/16 -0.0645 -0.0625
    7/8 -.1335 -0.1250
    3/4 -.2877 -0.25
    1/2 -0.693 -0.5
    2/5 = 1/(2.5) -0.9163 -0.6
Proof

The proof that D(q||r) is always positive follows fairly straightforwardly from the observation:

    -D(q||r) =   Sumy q(y) log r(y)  
    q(y)
      = 1 Sumy q(y) ln r(y)  
    ln 2 q(y)
      =< 1 Sumy q(y) ( r(y) - 1 )
    ln 2 q(y)
      = 1 Sumy r(y) - Sumyq(y)
    ln 2
      = 1 (1 - 1)
    ln 2
      = 0
So we've shown -D(q||r) is always less than or equal to something that is equal to 0. That is, -D(q||r) is always negative; so D(q||r) is always positive.
(10)
Intuition

KL is a measure of the dissimilarity of two distributions. It's dissimilarity rather than similarity because D(q||r)=0 just in case q = r.

This follows from our observation and our proof in (10). D(q||r) = 0 exactly when

    ln r(y) = r(y) - 1
    q(y) q(y)
But our observation shows this can only happen where, for each y:
    r(y) = 1
    q(y)

KL is sometimes thought of as the distance between two distributions. But it's not symmetric. So it's better just to think of it as a dissimilarity measure.

Getting back to cross-entropy, what we've shown is the following:

    The cross-entropy of a model and the true distribution is equal to the sum of the true entropy and the "dissimilarity measure" of the true distribution and the model.