Statistical Methods in Computational Linguistics

Probability Intro

Joint Distributions

Two distributions

Consider the sample space of cars. Imagine a universe with only three makes of cars and only three colors. We are interested in two separate distributions over the sample space of cars.

The first is the COLOR distribution, with three events: being red, being yellow, and being green. The second is the MAKE distribution, with three events: being a Jag, being a BMW, and being a VW.

Here are some statistics.

               Green     Red  Yellow   Total
    VW                                      50
    BMW                                     30
    Jag                                     20
    Total         60      20      20     100

As yet we don't know the details of how color and make correlate.

Random Variables

A random variable is just a function that assigns numbers to the outcomes in a sample space (here, to individual cars). We have two distributions we're interested in, MAKE and COLOR, and neither of them takes numbers as values, but that doesn't really matter. We can just DEFINE random variables in terms of the things we're interested in:

    MAKE(x)  = 1 if x is a VW
               2 if x is a BMW
               3 if x is a Jag

    COLOR(x) = 1 if x is green
               2 if x is yellow
               3 if x is red

So we have two random variables, MAKE and COLOR, with respective ranges:
  1. Range(MAKE) = {1, 2, 3}
  2. Range(COLOR) = {1, 2, 3}
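
As a minimal sketch, the two random variables can be written as ordinary functions. Representing a car as a (make, color) pair of strings is an assumption made only for this illustration, and the dictionary names are hypothetical.

```python
# Assumed encoding for illustration: a car is a (make, color) pair of strings.
MAKE_CODES = {"VW": 1, "BMW": 2, "Jag": 3}
COLOR_CODES = {"green": 1, "yellow": 2, "red": 3}


def MAKE(car):
    """Random variable MAKE: map a car to the number coding its make."""
    make, _color = car
    return MAKE_CODES[make]


def COLOR(car):
    """Random variable COLOR: map a car to the number coding its color."""
    _make, color = car
    return COLOR_CODES[color]


car = ("BMW", "green")
print(MAKE(car), COLOR(car))  # 2 1
```

The range of each function is {1, 2, 3}, matching Range(MAKE) and Range(COLOR) above.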

The probability mass function (pmf) p for a random variable X gives the probability that X takes each of its values. Let p be the pmf for COLOR. Let's take 1, the value signifying "green":

    p(1) = p(COLOR=1) = P({ x in CARS | x is green})

Probability mass functions are defined over the ranges of random variables. Here are some things we know about the distributions of MAKE and COLOR, using frequentist estimates for the probabilities:

  1. p(COLOR=1 [green]) = 60/100 = .6 [where there is no possibility of confusion, we write p(green) = .6]
  2. p(COLOR=2 [yellow]) = 20/100 = .2
  3. p(MAKE=1 [VW]) = 50/100 = .5
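
The frequentist estimates above are just counts divided by the total. Here is a small sketch using the marginal counts from the statistics table. The helper name pmf_from_counts is made up for this example, and values are keyed by name rather than by the codes 1, 2, 3 for readability.

```python
from fractions import Fraction

# Marginal counts from the statistics table (100 cars in total).
color_counts = {"green": 60, "red": 20, "yellow": 20}
make_counts = {"VW": 50, "BMW": 30, "Jag": 20}


def pmf_from_counts(counts):
    """Relative-frequency estimate: count / total for each value."""
    total = sum(counts.values())
    return {value: Fraction(n, total) for value, n in counts.items()}


p_color = pmf_from_counts(color_counts)
p_make = pmf_from_counts(make_counts)

print(float(p_color["green"]))  # 0.6
print(float(p_make["VW"]))      # 0.5
print(sum(p_color.values()))    # 1
```

Fraction keeps the arithmetic exact, so the check that a pmf sums to 1 does not depend on floating-point rounding.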

Notice that it's kind of annoying having to go by way of the 1,2,3 for MAKE and COLOR, given we're interested in make and color. Frequently we just notate this directly and write:

    p(COLOR=green)
That is, we replace the number with the property it picks out.

We have two distributions, MAKE and COLOR, completely defined in fact, but we don't know the JOINT DISTRIBUTION, p(MAKE=x,COLOR=y) [read "the joint probability that the make is x AND the color is y"].

Joint Distribution

Here's one possible version of p(MAKE,COLOR):

    Statistics determining COLOR and MAKE in which the two variables are independent

               Green     Red  Yellow   Total
    VW            30      10      10      50
    BMW           18       6       6      30
    Jag           12       4       4      20
    Total         60      20      20     100

Note that we can turn this directly into a probability table just by dividing all the numbers by 100, the total number of cars. The 9 probabilities in the chart then add up to 1.

    Probabilities for joint, independent p(MAKE,COLOR)

               Green     Red  Yellow
    VW           .30     .10     .10
    BMW          .18     .06     .06
    Jag          .12     .04     .04

This is actually a very special kind of pmf called a joint distribution of independent variables. We'll get back to why below.
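
The step from the count table to the probability table can be sketched as follows. The nested-dictionary layout is an assumption chosen for this illustration.

```python
from fractions import Fraction

MAKES = ["VW", "BMW", "Jag"]
COLORS = ["green", "red", "yellow"]

# Counts from the independent table; rows are makes, columns are colors.
counts = {
    "VW":  {"green": 30, "red": 10, "yellow": 10},
    "BMW": {"green": 18, "red": 6,  "yellow": 6},
    "Jag": {"green": 12, "red": 4,  "yellow": 4},
}

total = sum(counts[m][c] for m in MAKES for c in COLORS)  # 100 cars

# Joint pmf: divide every cell by the total number of cars.
joint = {(m, c): Fraction(counts[m][c], total) for m in MAKES for c in COLORS}

print(float(joint[("BMW", "green")]))  # 0.18
print(sum(joint.values()))             # 1
```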

Note that there is only one distribution of colors and makes consistent with the two distributions we started out with in which the two variables are independent. But there are numerous joint distributions consistent with the original facts in which the variables are not independent. Here are two others.

    Distribution A

               Green     Red  Yellow   Total
    VW            30       5      15      50
    BMW           20       9       1      30
    Jag           10       6       4      20
    Total         60      20      20     100

    Distribution B

               Green     Red  Yellow   Total
    VW            30      10      10      50
    BMW           20       5       5      30
    Jag           10       5       5      20
    Total         60      20      20     100
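
One quick way to confirm that Distributions A and B are both consistent with the original facts is to recompute their row and column totals. This is only a sketch: the list-per-row layout and the helper name marginals are assumptions for the example.

```python
MAKES = ["VW", "BMW", "Jag"]
COLORS = ["green", "red", "yellow"]

# Each row lists counts in the order green, red, yellow.
dist_a = {"VW": [30, 5, 15], "BMW": [20, 9, 1], "Jag": [10, 6, 4]}
dist_b = {"VW": [30, 10, 10], "BMW": [20, 5, 5], "Jag": [10, 5, 5]}


def marginals(table):
    """Return (make totals, color totals) for a make-by-color count table."""
    make_totals = {m: sum(table[m]) for m in MAKES}
    color_totals = {
        c: sum(table[m][i] for m in MAKES) for i, c in enumerate(COLORS)
    }
    return make_totals, color_totals


print(marginals(dist_a))
print(marginals(dist_a) == marginals(dist_b))  # True
```

Both tables give the make totals 50, 30, 20 and the color totals 60, 20, 20, even though their interior cells differ.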

Conditional Probability

    VW ∩ green      the set of cars that are both VWs and green
    |VW ∩ green|    the number of cars that are both VWs and green

The above tables give several kinds of probability information.

  1. JOINT distribution. p(MAKE,COLOR). Example: p(MAKE=VW,COLOR=green). The joint probability that a car is a VW and green.

    Frequentist estimate (relative frequency):

      |VW ∩ Green| / |CARS| (= .3 in Distribution A)
    The number of cars that are both VWs and green relative to the total number of cars.
  2. CONDITIONAL distributions given a color. p(MAKE | COLOR=green). p(MAKE | COLOR=red). p(MAKE | COLOR=yellow).

    Example: p(MAKE=VW | COLOR=green). The probability that a car is a VW given that it's green.

    Frequentist estimate:

      |VW ∩ Green| / |Green| (= .5 in Distribution A)
    The number of cars that are both VWs and green relative to the number of green cars.
  3. CONDITIONAL distributions given a make. p(COLOR | MAKE=VW). p(COLOR | MAKE=BMW). p(COLOR | MAKE=Jag).

    Example: p(COLOR=green | MAKE=VW). The probability that a car is green given that it's a VW.

    Frequentist estimate:

      |VW ∩ Green| / |VW| (= .6 in Distribution A)
    The number of cars that are both VWs and green relative to the number of VWs.
  4. MARGINAL distributions. p(COLOR). p(MAKE). These are the distributions we started with. Each is recovered from the joint distribution by summing over the other variable:

      p(MAKE=x) = sum over all colors y of p(MAKE=x,COLOR=y)
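
The four kinds of estimate in the list above can be computed directly from the counts of Distribution A. The function names below are hypothetical, chosen for this sketch only.

```python
from fractions import Fraction

MAKES = ["VW", "BMW", "Jag"]
COLORS = ["green", "red", "yellow"]

counts = {  # Distribution A
    ("VW", "green"): 30, ("VW", "red"): 5, ("VW", "yellow"): 15,
    ("BMW", "green"): 20, ("BMW", "red"): 9, ("BMW", "yellow"): 1,
    ("Jag", "green"): 10, ("Jag", "red"): 6, ("Jag", "yellow"): 4,
}
n_cars = sum(counts.values())


def joint(make, color):
    """p(MAKE=make, COLOR=color): cell count over all cars."""
    return Fraction(counts[(make, color)], n_cars)


def p_make(make):
    """Marginal p(MAKE=make): sum the joint over all colors."""
    return sum(joint(make, c) for c in COLORS)


def make_given_color(make, color):
    """p(MAKE=make | COLOR=color): cell count over cars of that color."""
    n_color = sum(counts[(m, color)] for m in MAKES)
    return Fraction(counts[(make, color)], n_color)


def color_given_make(color, make):
    """p(COLOR=color | MAKE=make): cell count over cars of that make."""
    n_make = sum(counts[(make, c)] for c in COLORS)
    return Fraction(counts[(make, color)], n_make)


print(joint("VW", "green"))             # 3/10
print(make_given_color("VW", "green"))  # 1/2
print(color_given_make("green", "VW"))  # 3/5
print(p_make("VW"))                     # 1/2
```

The only thing that changes between the joint and the two conditional estimates is the denominator: all cars, green cars, or VWs.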

We can actually talk about nine different probability distributions now, one joint distribution, two marginal distributions, and six conditional distributions.

  1. p(MAKE). Sample space: set of cars (size: 100)
  2. p(COLOR). Sample space: set of cars (size: 100)
  3. p(MAKE,COLOR). Sample space: set of cars (size: 100)
  4. p(MAKE|COLOR=green). Sample space: set of green cars (size: 60)
  5. p(MAKE|COLOR=yellow). Sample space: set of yellow cars (size: 20)
  6. p(MAKE|COLOR=red). Sample space: set of red cars (size: 20)
  7. p(COLOR|MAKE=VW). Sample space: set of VWs (size: 50)
  8. p(COLOR|MAKE=BMW). Sample space: set of BMWs (size: 30)
  9. p(COLOR|MAKE=Jag). Sample space: set of Jags (size: 20)
So, for example, we think of p(MAKE|COLOR=green) as the MAKE distribution restricted to the sample space of green cars.

p(X|Y) is not a pmf

Note a couple of missing elements from our list of distributions:

    p(MAKE | COLOR)
    p(COLOR | MAKE)
Although these look like pmfs, they are not. Note that for each color x,
    p(MAKE | COLOR=x)
defines a pmf that adds up to 1. For example:
    p(MAKE=VW | COLOR=green) + p(MAKE=BMW | COLOR=green) + p(MAKE=Jag | COLOR=green) = 1
But unless we fix a color we don't have something that adds up to 1. So the notation
    p(MAKE | COLOR)
doesn't tell you enough to pick out a probability function that adds up to 1. The same goes for p(COLOR | MAKE) unless we fix a make.

Note that, confusingly,

    p(COLOR, MAKE)
is a pmf. Probabilities are being assigned to (color, make) pairs, and these do add up to 1.
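
The contrast can be checked numerically. In this sketch, which uses Distribution A and illustrative variable names, the full table of conditional values p(MAKE=m | COLOR=c) sums to 3, one unit of probability per color, while the joint table sums to 1.

```python
from fractions import Fraction

MAKES = ["VW", "BMW", "Jag"]
COLORS = ["green", "red", "yellow"]

counts = {  # Distribution A
    ("VW", "green"): 30, ("VW", "red"): 5, ("VW", "yellow"): 15,
    ("BMW", "green"): 20, ("BMW", "red"): 9, ("BMW", "yellow"): 1,
    ("Jag", "green"): 10, ("Jag", "red"): 6, ("Jag", "yellow"): 4,
}
n_cars = sum(counts.values())
color_totals = {c: sum(counts[(m, c)] for m in MAKES) for c in COLORS}

# Joint table: one pmf over all nine (make, color) pairs.
joint = {pair: Fraction(n, n_cars) for pair, n in counts.items()}

# Conditional table: p(MAKE=m | COLOR=c) for every pair.
make_given_color = {
    (m, c): Fraction(counts[(m, c)], color_totals[c])
    for m in MAKES
    for c in COLORS
}

# Fixing a color gives a pmf over makes that sums to 1.
per_color_sums = {
    c: sum(make_given_color[(m, c)] for m in MAKES) for c in COLORS
}
for c in COLORS:
    print(c, per_color_sums[c])

print(sum(make_given_color.values()))  # 3
print(sum(joint.values()))             # 1
```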
Chain Rule

The following relationship is called the chain rule:

    p(COLOR=x,MAKE=y) = p(COLOR=x|MAKE=y) * p(MAKE=y)
It's completely symmetric as to which variable is given:
    p(COLOR=x,MAKE=y) = p(MAKE=y|COLOR=x) * p(COLOR=x)
Why should this be true? In a sense you can't prove it, not without a lot of assumptions about what a probability is. But you can see why it might be a reasonable axiom.

It's true if you look at probabilities in a purely frequentist way:

  1. p(green, VW) = |green ∩ VW| / |CARS|
  2. p(green|VW) = |green ∩ VW| / |VW|
  3. p(VW) = |VW| / |CARS|
So, on the frequentist interpretation, it just works out to be true, because the |VW| terms cancel:
    |green ∩ VW| / |CARS| = (|green ∩ VW| / |VW|) * (|VW| / |CARS|)
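
The frequentist argument can be replayed with actual sets. This sketch builds an explicit list of 100 cars matching Distribution A (an assumption made for illustration) and treats the events "green" and "VW" as sets of car indices.

```python
from fractions import Fraction

counts = {  # Distribution A
    ("VW", "green"): 30, ("VW", "red"): 5, ("VW", "yellow"): 15,
    ("BMW", "green"): 20, ("BMW", "red"): 9, ("BMW", "yellow"): 1,
    ("Jag", "green"): 10, ("Jag", "red"): 6, ("Jag", "yellow"): 4,
}

# Explicit sample space: one (make, color) entry per car.
cars = []
for (make, color), n in counts.items():
    cars.extend([(make, color)] * n)

# Events are sets of car indices.
green = {i for i, (make, color) in enumerate(cars) if color == "green"}
vw = {i for i, (make, color) in enumerate(cars) if make == "VW"}

p_green_and_vw = Fraction(len(green & vw), len(cars))  # |green ∩ VW| / |CARS|
p_green_given_vw = Fraction(len(green & vw), len(vw))  # |green ∩ VW| / |VW|
p_vw = Fraction(len(vw), len(cars))                    # |VW| / |CARS|

print(p_green_and_vw == p_green_given_vw * p_vw)  # True
```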

Independent Distributions

Note an important special case of the chain rule. We call two random variables X and Y independent if, for every value x of X and every value y of Y:

    p(X=x | Y=y) = p(X=x)
In other words, knowing the value of Y has no effect on the probability of any value of X.

Note the special case of the chain rule that holds for independent distributions.

    p(x,y)= p(x|y) * p(y) = p(x) * p(y)
Exercise: Verify that the table labeled as independent above really is a joint distribution of independent variables.
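
One possible way to mechanize this kind of check is sketched below. The function is_independent is a hypothetical helper that tests p(x,y) = p(x) * p(y) for every cell. It is applied here to Distribution A, which leaves the exercise table for the reader to try.

```python
from fractions import Fraction


def is_independent(table):
    """Check p(x,y) == p(x) * p(y) for every cell of a count table.

    table maps (make, color) pairs to counts.
    """
    total = sum(table.values())
    makes = {m for m, _ in table}
    colors = {c for _, c in table}
    for m in makes:
        for c in colors:
            p_joint = Fraction(table[(m, c)], total)
            p_m = Fraction(sum(table[(m, c2)] for c2 in colors), total)
            p_c = Fraction(sum(table[(m2, c)] for m2 in makes), total)
            if p_joint != p_m * p_c:
                return False
    return True


dist_a = {
    ("VW", "green"): 30, ("VW", "red"): 5, ("VW", "yellow"): 15,
    ("BMW", "green"): 20, ("BMW", "red"): 9, ("BMW", "yellow"): 1,
    ("Jag", "green"): 10, ("Jag", "red"): 6, ("Jag", "yellow"): 4,
}

print(is_independent(dist_a))  # False
```

Distribution A fails because, for example, p(VW, red) = .05 while p(VW) * p(red) = .5 * .2 = .1.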
Bayes' Law

Recall the two versions of the chain rule:

  1. p(COLOR=x,MAKE=y) = p(COLOR=x|MAKE=y) * p(MAKE=y)
  2. p(COLOR=x,MAKE=y) = p(MAKE=y|COLOR=x) * p(COLOR=x)

From this we can immediately conclude Bayes' Law:

    p(COLOR=x|MAKE=y) * p(MAKE=y) = p(MAKE=y|COLOR=x) * p(COLOR=x)
This is often written in the following form:
    p(COLOR=x|MAKE=y) = p(MAKE=y|COLOR=x) * p(COLOR=x) / p(MAKE=y)
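
As a worked check with the numbers from Distribution A, Bayes' Law recovers p(COLOR=green | MAKE=VW) from p(MAKE=VW | COLOR=green) and the two marginals. The variable names are illustrative only.

```python
from fractions import Fraction

# Counts from Distribution A needed for this one example.
n_cars = 100
n_vw = 50
n_green = 60
n_vw_and_green = 30

p_vw = Fraction(n_vw, n_cars)                         # p(MAKE=VW)
p_green = Fraction(n_green, n_cars)                   # p(COLOR=green)
p_vw_given_green = Fraction(n_vw_and_green, n_green)  # p(MAKE=VW | COLOR=green)

# Bayes' Law: p(green | VW) = p(VW | green) * p(green) / p(VW)
p_green_given_vw = p_vw_given_green * p_green / p_vw

print(p_green_given_vw)                                    # 3/5
print(p_green_given_vw == Fraction(n_vw_and_green, n_vw))  # True
```

The result, 3/5, agrees with the direct frequentist estimate |VW ∩ green| / |VW| = 30/50.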