Linguistics 696 (to be Linguistics 681)

Statistical Methods in Computational Linguistics

Motivations

Zipf's Law:

Language has a lot of rare phenomena: New words in English text
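
To make the rare-phenomena point concrete, here is a minimal sketch that ranks word types by frequency and measures how many types occur exactly once. The function names `rank_frequency` and `hapax_ratio` and the toy sentence are illustrative assumptions, not part of the course materials; a real check of Zipf's Law would need a real corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

def hapax_ratio(tokens):
    """Share of word types that occur exactly once in the sample."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Toy sample (an assumption for illustration only).
toy_text = "the cat sat on the mat and the dog sat on the log".split()

for rank, word, count in rank_frequency(toy_text):
    print(rank, word, count)
print("hapax ratio:", hapax_ratio(toy_text))
```

Even in this tiny sample most word types occur only once; on real text, new word types keep turning up as the sample grows.
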
Negative Data Issues:

Probabilistic models work with pruning, because the spaces searched are so large.

Hypotheses that receive little support (predicting low frequency phenomena) can be eliminated.
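
The pruning idea can be sketched as a simple threshold on relative support. This is only one of many possible pruning schemes; the function name `prune`, the `min_share` parameter, the frame labels, and the counts are all invented for illustration and are not corpus data.

```python
def prune(support, min_share):
    """Keep only hypotheses whose share of the total support
    is at least min_share."""
    total = sum(support.values())
    if total == 0:
        return {}
    return {h: n for h, n in support.items() if n / total >= min_share}

# Invented toy counts (NOT real data): how often each hypothetical
# frame was observed with some verb.
toy_support = {"V NP PP[to]": 90, "V NP": 9, "V NP NP": 1}

kept = prune(toy_support, min_share=0.05)
print(kept)
```

With these toy numbers the frame with a 1% share is dropped and the other two survive.
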

Consider the problem of explaining exceptions to productive English valence patterns:

  1. * John donated the library the book.
  2. John faxed Bob the report.
  3. * John whispered Mary a story.
  4. John told Mary a story.
  5. John baked Mary a cake.
  6. * John iced Mary the cake. [= put icing on the cake]

Goldberg (94) uses this kind of argumentation, relativizing the counting to a theory of markedness of contexts:

  1. Certain contexts have an unmarked choice associated with them:
    1. Sally gave him a brand new VW. (unmarked)
    2. Sally gave a brand new VW to him.
    3. Sally gave that to a charming young man. (unmarked)
    4. Sally gave the charming young man that.
  2. When a construction doesn't show up in unmarked context, that's a strike against it.
  3. This account must be frequency-based to be workable.
  4. A big step forward on the synonymy theory:
    1. * John donated the library the book.
    2. John donated the book to the library.
    like faxed Mary the report.

Corpus-based Linguistics

J. R. Firth. "You shall know a word by the company it keeps."

Lumpers versus splitters.

Once you start looking at corpora, the splitters have a point. The very gross distributional categories of generative grammar (Noun[num=sing, count=mass, relational=yes]) don't determine the collocational company a word keeps:

  1. mother of invention
  2. theft of services
  3. paucity of __
  4. the ignorance of __
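
One way to look at the company a word keeps is to count what fills the slot after a noun plus "of". The sketch below does this over a plain token list; the function name `collocates_after`, its parameters, and the toy sentence are assumptions made for illustration, and a real study would run over a corpus.

```python
from collections import Counter

def collocates_after(tokens, head, link="of"):
    """Count the words X found in the pattern `head link X`."""
    counts = Counter()
    for i in range(len(tokens) - 2):
        if tokens[i] == head and tokens[i + 1] == link:
            counts[tokens[i + 2]] += 1
    return counts

# Toy token list (an assumption for illustration only).
toy_tokens = ("necessity is the mother of invention and theft of services "
              "is a crime and mother of pearl is shiny").split()

print(collocates_after(toy_tokens, "mother"))
print(collocates_after(toy_tokens, "theft"))
```

Two nouns with the same gross category features can show quite different fillers for the same slot, which is the splitters' point.
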

Methodological Qualms

Are grammaticality intuitions really that important?

Examples in the textbook on p. 10, from van Riemsdijk and Williams (1986):

    What did Sally whisper that she had secretly read?
BUT

It's folly to believe that corpus-based work will relieve us of the need to make theoretical assumptions.

At least if we work on meaning.

Geoffrey Sampson, a rabid anti-Chomskian, distributes a corpus called Susanne, which is a balanced Brown-like corpus, with TREES.

Where did all those trees come from?

  1. Which car do you want to win?
  2. Which car do you wanna win?
Structure seems to be indispensable. Studying syntax is studying which structures license meaning.
Non-categorical phenomena
Computational issues

A language system must be able to disambiguate:

  1. Our company is training workers.