Language and Computers


Probabilistic Models

General
Case
 

A Probability modelthe sample space.

Language
Model
 

A language model assigns probabilities to ALL sequences of words.

Order
of a Model
 

  • zero-order. Random model. Generate elements of the vocabulary randomly.
  • first-order. Unigram model. Generate elements of vocabulary consistent with individual word frequencies.
      Equivalent to the (incorrect) assumption that word occurrences are independent of each other.
  • Second-order Bigram model. Generate words consistent with word-pair frequencies.
      Equivalent to the (incorrect) assumption that a word occurrence depends only on immediately preceeding word.
      • Consider environment "a case of __"
      • Compare Prob(beer|case of) with Prob(mice|case of).
      • Consider that phrases like "of mice and men" in "the best-laid plans of mice and men" might increase Prob(mice|of).
  • Third-order Trigram model. Generate words consistentv with word-triple frequencies.
      Equivalent to the (incorrect) assumption that a word occurrence depends only on immediately preceeding two words.
Parameters
of a model
 

The Parameters of a model are the basic numbers you have to know in order to compute the probabilities of all the events in the sample space.

The number of parameters in a model is often an indication of how practical/useful it is. Too big a number means it's impractical.

Examples
of Parameter
 

Consider a unigram model with a vocabulary of 100. This has 100 parameters.

How many words should you collect to estimate these 100 probabilities?

Will 100 do? Will 1000 do? What does it really depend on?

Minimally, you would need to try to get at least 1 occurrence of each word (assuming no vocabulary item has probability 0). But in any corpus in which some event occurs just once, there's an excellent chance that you haven't got the probability of that event right.

Consider a pure pentagram model, building in the assumption that the probability of a word occurrence depends only on the previous 4 words.

This has 10 Billion parameters for a vocabulary size of 100.

What's a good sample size for this many parameters?

Consider doing a pentagram model of _Tom_Sawyer_.

What would happen if you started randomly generating given this model?

Finite-
State Grammars
Review
 

Restrict rules so that non-terminals must occur at the beginning or at the end of a rule.

Then we have a Finite-State Grammar. Finite-state grammars can be parsed using a particularly simple kind of computational model called finite-state automaton.

These are more restricted than context-free grammars, hence a proper subset of tehm. They are easier to process and to parse. We'll come back to them later.

Finite-
State
Automata(FSA)
 

A Finite-state automaton is a kind of machine that will accept the languages described by finite-state grammars.

The simplicity and naturalness of the computationalk model suggest there is something significant and natural about the class of finite-state languages.

Most computer languages fall in this class of languages.

A Finite-state automaton consists of

  1. A finite set of states, one of which is designated as an initial state and and some set of which are designated as final states.
  2. Labeled, directed arcs between states. A labeled directed arc has an input state and an output state and a label. The label designates a vocabulary element in the language. The interpretation of an arc is that when the automaton is in the input state and that vocabulary item is obesrved the automaton may transition to the output state.
  3. This being the case an FSA can be determined by a table.

    Sheep Language FSA
    State Final? Initial? Observation Next State
    0 No Yes b 1
    1 Yes No a 1
Markov
Models
 

A Markov model may be thought of as an FSA with probabilities added. The probabilites leaving any given state must sum to 1.

Ngram models may be represented with Markov models.

There are three states in a bigram model with a vocabulary of size of 3, one for each element of the vocabulary:

  1. a, b, c
  2. The states encode the input just seen. THat is, to build a bigram model we need one state corresponding to each possible first word of a bigram: one state for each word in the vocabulary.
  3. The labels on the links between the states represent the next word seen. With a vocabluary of 3, each state can thus have three arcs leaving it; one is always a loop. So observing an a means taking a transition labeled a (a is the current input) to a state named a (a is now the last word seen).
  4. Summing up: 3 states, 9 transitions. For each ordered state pair, there is one transition. One for example, for [a,b], another for [b, a]. The 9 transitions correspond to the 9 bigrams in a bigram model for a vocabulary of 2. [ab and ba are different bigrams, with different probabilities.]
  5. The probability for each transition is the conditional probability for a bigram. The probability on the transition from state b to state a is Pr(a | b).
  6. Consider the case of the trigram. For a trigram Markov model we need states that encode the last TWO words seen. So for a vocabulary of two, a and b, we need four states aa, ab, ba and bb. Each state will have two states leaving it, one for each of the two words in the vocabulary. From aa we can only to aa and ab, because we must go to a state where the first of the two letters is "a". The probability on the transition from aa to ab is P(b | aa). Four states times 2 exiting transitions each is8 transitions, one for each trigram in a language with a vocabulary of 2.
  7. The probability of an input is the product of the transition probabilities that accept it. THis product of all the transitions on a single path through the machine is called the path probability.

    Bigram Markov Model
    State Observation Next State Probability
    a b b Pr(b|a)
    a a Pr(a|a)
    b a a Pr(a|b)
    b b Pr(b|b)

    Trigram Markov Model
    State Observation Next State Probability
    aa b ab Pr(b|aa)
    a aa Pr(a|aa)
    ba b ab Pr(b|ba)
    a aa Pr(a|ba)
    bb a ba Pr(a|bb)
    b bb Pr(b|bb)
    ab a ba Pr(a|ab)
    b bb Pr(b|ab)
Markov
Chains
 

The Markov models for bigrams and trigrams (or for any ngrams) are called Markov Chains.

A Markov Chain is a Markov model for which there is at most one path through the model for any given input.

Hidden
Markov
Models
 

Hidden Markov Models (HMMs) have (possibly) more than one path through the machine for any given input.

The probability of an input according to the machine is the sum of the probabilities of the paths that accept it.

Viterbi
Algorithm
 

This is an algorithm for finding the most probable path through an HMM.

Probability
Model
 

Let's assign tags to parts of speech with the following probability model.

    Prob(w1,n,t1,n)= Pii=1 P(wi | t i) * P(ti | t i-1)
That is, the joint probability of a word sequence n words long and a tag sequence n tags long is equal to the product of the probabilities of each word wi given its tag ti times the probability of tag ti following tag ti-1. We assume each string starts with a dummy word start labeled with a dummy tag [s]. So the above formula is always defined because t0 is always [s].

Here is our probability model, built from a training corpus.

Example  

Consider the following input to our tagger:

  • Start ground ground
There are 4 different state sequences that will accept this input:
  1. [s] V V
  2. [s] V N
  3. [s] N V
  4. [s] N N
These correspond to 4 different assignments of part-of-speech tags to the two input words (ground ground). Here is one assignment, which will correspond to one path through an HMM.

    start   ground   ground
    [s]   V   N  
      Pr(V|[s])* Pr(ground|V)   Pr(N|V) * Pr(ground|N)  
      (.5 * .3) = .15   (.9 * .4) = .36  
Using the
Probability
Model
 

For our example, according to this probability model we calculate the joint probability of these words with these tags to be:

  • .15 * .36 = .0540
When we have an HMM, this will the product of the transition probabilities for one path through the HMM. To find the most likely assignment of tags we need to consider all paths through the HMM and find the most probable. This is what the Viterbi algorithm does.

A Viterbi calculation for our HMM. Consider the input

    ground control station

Red cell: State V at time t=2 (two places to come from):

    Path through V
    From [s] at t=0 to V at t=1 From V at t=1 to V at t=2 Product
    .15 .029 .00435
    Path through N
    From [s] at t=0 to N at t=1 From N at t=1 to V at t=2 Product
    .20 .145 .029
The most probable of these two paths is the one from N with path probability .029.

Yellow cell: State N at time t=2 (two places to come from):

    Path through V
    From [s] at t=0 to V at t=1 From V at t=1 to N at t=2 Product
    .15 .27 .0405
    Path through N
    From [s] at t=0 to N at t=1 From N at t=1 to N at t=2 Product
    .20 .15 .03
The most probable of these two paths is the one from V with path probability .0405

Moving on to t=3.

Red cell: State V at time t=3 (two places to come from):

    Path through V
    From [s] at t=0 to V at t=2 From V at t=2 to V at t=3 Product
    .029 .041 .00119
    Path through N
    From [s] at t=0 to N at t=2 From N at t=2 to V at t=3 Product
    .0405 .205 .0083
The most probable of these two paths is the one from N with path probability .0083.

Yellow cell: State N at time t=3 (two places to come from):

    Path through V
    From [s] at t=0 to V at t=2 From V at t=2 to N at t=3 Product
    .029 .27 .0078
    Path through N
    From [s] at t=0 to N at t=2 From N at t=2 to N at t=3 Product
    .0405 .15 .006
The most probable of these two paths is the one from V with path probability .0078.

We end up computing two best paths that cover the input "ground control station" (Orange = red + yellow): But since the path that ends up in state V has a higher path probability, this is the winner.