|
Entropy
|
H(p,m) = - Sumx p(x) log p(x)
The sum of the log probs of the model p,
weighted by the real probs p.
|
 
|
Entropy
of a Language
|
H(p) = Limc -> Inf -1/c Sumw in L p(w) log p(w)
|
(1)
|
|
Intuition
|
Here c is the size of a chunk in our language model:
| unigram | c = 1 |
| bigram | c = 2 |
| trigram | c = 3 |
|   | ... |
| ngram | c = n |
Other ideas:
- Note that this calculation is done for a language.
Not for a corpus. The "Sum" is over all c-gram types in the language.
- The assumption is that as c gets big enough, the chunks
are independent. For English, maybe this
can't be. Or maybe "c" needs to get to be the size of
say Remembrance of Things Past before the chunks
really approach independence.
- We divide by "c" to get a per-word measure.
|
 
|
|
Cross-Entropy
|
H(p,m) = - Sumx p(x) log m(x)
The sum of the log probs of the model m,
weighted by the real probs p.
|
 
|
Cross-Entropy
of a Language
|
|
H(L,m) |
= |
Limc -> Inf -1 / c Sumw in L p(w) log m(w)
|
|
  |
= |
Limc -> Inf Sumw in L -1 / c p(w) log m(w)
|
|
(2)
|
|
Intuition
|
- Again, "c" is the size of chunks.
- Again this calculation is done for a language.
Not for a corpus. The "Sum" is over all c-gram types in the language.
|
 
|
Log-Probability
of Corpus
|
Consider the following Limit:
LimcN -> Inf -1 / (cN) log m(w1cN)
|
(3)
|
|
Intuition
|
- This calculation is done for a corpus,
not a language.
- Again, "c" is the size of chunks.
- N is the number of chunks in our corpus.
- cN is thus the total number of words in the corpus. We divide
by cN to get a per word measure.
- m(w1cN)
is just the probability of the corpus
according to the model.
|
 
|
Magical
Theorem
|
It turns out that the right hand side of equation
(2) and the expression in (3) are
equal. We now show why.
|
 
|
Independence
of Chunks
|
We can use our independence assumption
to rewrite Expression (3) as follows:
|
=
|
LimcN -> Inf - 1 / (cN) log (Pii=w1N
m(i) )
|
|
=
|
LimcN -> Inf -1 / (cN) Sumi=w1N
log m(i)
|
This just multiplies together the probabilities of the N chunks
in our corpus, because they're independent. The log of that product
is just the sum of the logs of the factors.
|
(4)
|
Corpus
Probability
by Type
|
In the summing part of (4) we summed
over every c-gram token in the corpus.
To make the coimparison to (2) clearer,
we reformulate (4) to sum over
every c-gram type, using
the frequencies:
|
=
|
LimcN -> Inf -1 / (cN) Sumi=w1N
log m(i)
|
|
=
|
LimcN -> Inf -1 / (cN) Sumw in L
log m(w) f(w)
|
|
=
|
LimcN -> Inf Sumw in L -1 / (cN)
log m(w) f(w)
|
Here f(w) is the frequency of chunk type w in the corpus.
We go from a sum over all N c-gram tokens in
the corpus to a sum over all the c-gram types in the language.
|
(5)
|
Chunk
Contributions
|
Now that we have both
(2) and (5) expressed as sums over chunk types w,
we can ask what one one chunk type w (a c-gram type) contributes.
|
 
|
Chunk
Contribution:
Language
Model
|
According to (2), we have as w's contribution:
- 1/c p(w) log m(w)
|
(6)
|
Chunk
Contribution:
Corpus
Probability
|
According to (5), we have as w's contribution
- 1/(cN) log m(w) f(w)
It turns out (6) and (7) are equal.
And therefore the limits in (2) and (5) are equal.
We now show this.
|
(7)
|
Representative
Corpus
|
We bring in the ergodic assumption.
As cN gets larger and larger,
the corpus is more and more likely to become representative.
In a representative corpus,
frequencies reflect the true probability model.
That is:
|
f(w) / N
|
=
|
p(w)
|
Relative
Frequency
|
=
|
Probability
|
|
|
|
f(w)
|
=
|
N * p(w)
|
|
Frequency
|
=
|
Expected frequency
|
|
|
|
 
|
Applying
Representativeness
|
Using our assumption about frequency,
we substitute for f(w) in (7):
|
=
|
- 1/(cN) log m(w) * f(w)
|
|
=
|
- 1/(cN) log m(w) * p(w) * N
|
|
=
|
- 1/c log m(w) * p(w)
|
But this is exactly what w contributes in (6).
And this happens for every chunk type summed over
in (2) and (5), so the the limits in (2) and (5) are equal.
|
(8)
|
|
Size of
Cross-entropy
|
The following is true:
H(m,p) >= H(p)
|
(9)
|
|
The cross-entropy is always greater than or equal to the
entropy. Equality occurs only when m = p.
Your textbook cites the following result, which you have been
asked to prove as an extra-credit exercise:
H(m,p) = H(p) + D(p||m)
It turns out that D(p||m) (Kullback-Leibler divergence) is always positive,
so that proves the inequality in (9).
|
Kullback-
Leibler
Divergence
D(q||r)
|
Kullback-Leibler Divergence is defined as follows:
|
D(q||r)
|
=
|
- Sumy q(y) log
|
r(y)
|
|
q(y)
|
|
|
|
Observation
|
The proof that D(q || r) is always positive is fairly simple.
Lillian Lee (97) cites a very nice proof due to
Elizabeth Thompson (Green 96). First let's note a
fact about ln (the "natural log" function: log to
the base of e, approximately
equal to 2.718). For any positive z,
ln z =< z-1
This can be seen by inspection. For z >= 1:
|
z
|
ln z
|
z - 1
|
|
1
|
0
|
0
|
|
2
|
0.693
|
1
|
|
3
|
1.099
|
2
|
|
4
|
1.386
|
3
|
Note that ln z grows
much more slowly than z,
so ln z is
never going to catch up
to z - 1.
For z, 0 < z =< 1/e:
|
z
|
ln z
|
z - 1
|
|
1/e = 1/(2.718)
|
-1.0
|
-0.6321
|
|
1/3
|
-1.099
|
-0.667
|
|
1/4
|
-1.386
|
-0.75
|
|
1/64
|
-4.1589
|
-0.9844
|
|
0
|
- Infinity
|
-1
|
Here, z-1 is always greater than -1, and
ln z is always less than -1.
For z, 1/e < z =< 1,
ln z and z - 1 converge
but only finally meet at z=1.
|
z
|
ln z
|
z - 1
|
|
1
|
0
|
0
|
|
15/16
|
-0.0645
|
-0.0625
|
|
7/8
|
-.1335
|
-0.1250
|
|
3/4
|
-.2877
|
-0.25
|
|
1/2
|
-0.693
|
-0.5
|
|
2/5 = 1/(2.5)
|
-0.9163
|
-0.6
|
|
|
Proof
|
The proof that D(q||r) is always positive follows
fairly straightforwardly
from the observation:
|
-D(q||r)
|
=
|
 
|
Sumy q(y) log
|
r(y)
|
 
|
|
q(y)
|
|
 
|
=
|
1
|
Sumy q(y) ln
|
r(y)
|
 
|
|
ln 2
|
q(y)
|
|
 
|
=<
|
1
|
Sumy q(y) (
|
r(y)
|
- 1 )
|
|
ln 2
|
q(y)
|
 
|
=
|
1
|
Sumy r(y)
|
-
|
Sumyq(y)
|
|
ln 2
|
 
|
=
|
1
|
(1
|
-
|
1)
|
|
ln 2
|
 
|
=
|
0
|
So we've shown -D(q||r) is always less than or
equal to something that is equal to 0.
That is, -D(q||r) is always negative; so
D(q||r) is always positive.
|
(10)
|
|
Intuition
|
KL is a measure of the dissimilarity
of two distributions. It's dissimilarity
rather than similarity because D(q||r)=0
just in case q = r.
This follows from our observation and
our proof in (10). D(q||r) = 0 exactly when
| ln | r(y) | = | r(y) | - 1 |
| q(y) |
q(y) |
But our observation shows this can only happen where, for each y:
KL is sometimes thought of as the distance
between two distributions. But it's not
symmetric. So it's better just to think of it as
a dissimilarity measure.
Getting back to cross-entropy, what we've shown
is the following:
The cross-entropy of a model and the true distribution
is equal to the sum of the true entropy and the
"dissimilarity measure" of the true distribution
and the model.
|