|
General Case |
  |
A Probability modelthe sample space. |
|---|---|---|
|
Language Model |
  |
A language model assigns probabilities to ALL sequences of words. |
|
Order of a Model |
  |
|
|
Parameters of a model |
  |
The Parameters of a model are the basic numbers you have to know in order to compute the probabilities of all the events in the sample space. The number of parameters in a model is often an indication of how practical/useful it is. Too big a number means it's impractical. |
|
Examples of Parameter |
  |
Consider a unigram model with a vocabulary of 100. This has 100 parameters. How many words should you collect to estimate these 100 probabilities? Will 100 do? Will 1000 do? What does it really depend on? Minimally, you would need to try to get at least 1 occurrence of each word (assuming no vocabulary item has probability 0). But in any corpus in which some event occurs just once, there's an excellent chance that you haven't got the probability of that event right. Consider a pure pentagram model, building in the assumption that the probability of a word occurrence depends only on the previous 4 words. This has 10 Billion parameters for a vocabulary size of 100. What's a good sample size for this many parameters? Consider doing a pentagram model of _Tom_Sawyer_. What would happen if you started randomly generating given this model? |
|
Finite- State Grammars Review |
  |
Restrict rules so that non-terminals must occur at the beginning or at the end of a rule. Then we have a Finite-State Grammar. Finite-state grammars can be parsed using a particularly simple kind of computational model called finite-state automaton. These are more restricted than context-free grammars, hence a proper subset of tehm. They are easier to process and to parse. We'll come back to them later. |
|
Finite- State Automata(FSA) |
  |
A Finite-state automaton is a kind of machine that will accept the languages described by finite-state grammars. The simplicity and naturalness of the computationalk model suggest there is something significant and natural about the class of finite-state languages. Most computer languages fall in this class of languages. A Finite-state automaton consists of
|
| State | Final? | Initial? | Observation | Next State |
|---|---|---|---|---|
| 0 | No | Yes | b | 1 |
| 1 | Yes | No | a | 1 |
|
Markov Models |
  |
A Markov model may be thought of as an FSA with probabilities added. The probabilites leaving any given state must sum to 1. Ngram models may be represented with Markov models. There are three states in a bigram model with a vocabulary of size of 3, one for each element of the vocabulary:
|
|---|
| State | Observation | Next State | Probability |
|---|---|---|---|
| a | b | b | Pr(b|a) |
| a | a | Pr(a|a) | |
| b | a | a | Pr(a|b) |
| b | b | Pr(b|b) |
| State | Observation | Next State | Probability |
|---|---|---|---|
| aa | b | ab | Pr(b|aa) |
| a | aa | Pr(a|aa) | |
| ba | b | ab | Pr(b|ba) |
| a | aa | Pr(a|ba) | |
| bb | a | ba | Pr(a|bb) |
| b | bb | Pr(b|bb) | |
| ab | a | ba | Pr(a|ab) |
| b | bb | Pr(b|ab) |
|
Markov Chains |
  |
The Markov models for bigrams and trigrams (or for any ngrams) are called Markov Chains. A Markov Chain is a Markov model for which there is at most one path through the model for any given input. |
|---|---|---|
|
Hidden Markov Models |
  |
Hidden Markov Models (HMMs) have (possibly) more than one path through the machine for any given input. The probability of an input according to the machine is the sum of the probabilities of the paths that accept it. |
|
Viterbi Algorithm |
  |
This is an algorithm for finding the most probable path through an HMM. |
|
Probability Model |
  |
Let's assign tags to parts of speech with the following probability model.
|
Here is our probability model, built from a training corpus.
| Example |   |
Consider the following input to our tagger:
|
|---|
| start |   | ground |   | ground | |
| [s] |   | V |   | N |   |
|   | Pr(V|[s])* Pr(ground|V) |   | Pr(N|V) * Pr(ground|N) |   | |
|   | (.5 * .3) = .15 |   | (.9 * .4) = .36 |   |
|
Using the Probability Model |
  |
For our example, according to this probability model we calculate the joint probability of these words with these tags to be:
|
|---|
| States |
Start V = Verb N = Noun |
|||
|---|---|---|---|---|
| State | Observation | Transition |
Transition Probability |
|
| [s] | ground | [s] => V | Pr(V | [s])   * | Pr(ground | V) = |
| .5   * | .3 | |||
| .15 | ||||
| [s] => N | Pr(N | [s])   * | Pr(ground| N) = | ||
| .5   * | .4 | |||
| .20 | ||||
| control | [s] => N | .5 * .3 = .15 | ||
| [s] => V | .5 * .3 = .15 | |||
| station | [s] => N | .5 * .3 = .15 | ||
| [s] => V | .5 * .4 = .20 | |||
| V | ground | V => N | Pr(N | V)   * | Pr(ground | N) |
| .9   * | .4 | |||
| .36 | ||||
| V => V | Pr(V | V)   * | Pr(ground | V) | ||
| .1   * | .3 | |||
| .03 | ||||
| control | V => N | .9 * .3 = .27 | ||
| V => V | .1 * .29 = .029 | |||
| station | V => N | .9 * .3 = .27 | ||
| V => V | .1 * .41 = .041 | |||
| N | ground | N => N | Pr(N | N)   * | Pr(ground | N) |
| .5   * | .4 | |||
| .20 | ||||
| N => V | Pr(V | N)   * | Pr(ground | V) | ||
| .5   * | .3 | |||
| .15 | ||||
| control | N => N | .5 * .3 = .15 | ||
| N => V | .5 * .29 = .145 | |||
| station | N => N | .5 * .3 = .15 | ||
| N => V | .5 * .41 = .205 | |||
A Viterbi calculation for our HMM. Consider the input
| V |   | (.5 * .3) = .15 |   |   |
| N |   | (.5 * .4) = .20 |   |   |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
Red cell: State V at time t=2 (two places to come from):
From [s] at t=0 to V at t=1 From V at t=1 to V at t=2
Product
.15 .029 .00435
The most probable of these two paths is the one from N with path probability
.029.
From [s] at t=0 to N at t=1 From N at t=1 to V at t=2
Product
.20 .145 .029
| V |   | (.5 * .3) = .15 | (.20 * .145) = .029 |   |
| N |   | (.5 * .4) = .20 |   |   |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
Yellow cell: State N at time t=2 (two places to come from):
From [s] at t=0 to V at t=1 From V at t=1 to N at t=2
Product
.15 .27 .0405
The most probable of these two paths is the one from V with path probability
.0405
From [s] at t=0 to N at t=1 From N at t=1 to N at t=2
Product
.20 .15 .03
| V |   | (.5 * .3) = .15 | (.20 * .145) = .029 |   |
| N |   | (.5 * .4) = .20 | (.15 * .27) = .0405 |   |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
Moving on to t=3.
| V |   | (.5 * .3) = .15 | (.20 * .145) = .029 |   |
| N |   | (.5 * .4) = .20 | (.15 * .27) = .0405 |   |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
Red cell: State V at time t=3 (two places to come from):
From [s] at t=0 to V at t=2 From V at t=2 to V at t=3
Product
.029 .041 .00119
The most probable of these two paths is the one from N with path probability
.0083.
From [s] at t=0 to N at t=2 From N at t=2 to V at t=3
Product
.0405 .205 .0083
| V |   | (.5 * .3) = .15 | (.20 * .145) = .029 | (.0405 * .205) = .0083 |
| N |   | (.5 * .4) = .20 | (.15 * .27) = .0405 |   |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
Yellow cell: State N at time t=3 (two places to come from):
From [s] at t=0 to V at t=2 From V at t=2 to N at t=3
Product
.029 .27 .0078
The most probable of these two paths is the one from V with path probability
.0078.
From [s] at t=0 to N at t=2 From N at t=2 to N at t=3
Product
.0405 .15 .006
| V |   | (.5 * .3) = .15 | (.20 * .145) = .029 | (.0405 * .205) = .0083 |
| N |   | (.5 * .4) = .20 | (.15 * .27) = .0405 | (.029 * .27) = .0078 |
| Start | 1.0 |   |   |   |
|   | [s] | ground | control | station |
| t=0 | t=1 | t=2 | t=3 |
We end up computing two best paths that cover the input "ground control station" (Orange = red + yellow): But since the path that ends up in state V has a higher path probability, this is the winner.