Department of Linguistics and Oriental Languages

Computational Linguistics

Computational Syntax and Semantics

Statistical Machine Translation and Word-Sense Disambiguation

Linguistics 582


Traditionally, this course has been taught as the second semester of a two-semester introduction to computational linguistics, focusing on parsing and semantic interpretation. This semester, however, I want to try something a little different in order to take advantage of some truly remarkable resources that have been made available to the community. I refer, for example, to the materials for the Workshop on Statistical MT, which have been accumulating for several years now under the shared-task workshop model that has proven quite successful in advancing research in domains like speech recognition.

MT, like speech recognition, has traditionally been a difficult area in which to do research because of the very high barriers that had to be crossed just to get started, let alone to be competitive. But now, thanks to a number of government-funded efforts that have spun off open source systems (mostly Verbmobil and to some extent TIDES), there are very sophisticated statistical MT systems available whose internals you can actually inspect. Perhaps even more importantly, a huge amount of free parallel corpus data has been released in a number of different languages, along with free online tools for producing more.

The key point is that these resources include no bilingual dictionaries and no hand-aligned data, two of the most expensive components of legacy systems. What you get for free is a system, lots of parallel data, and lots of language modeling data for building ngram models of the target language.

Thus, for example, I downloaded the Moses system, the European Parliament data, and the ngram data used for this year's workshop, and in a few days (most of that consumed by training time for the MT system) I had a French-to-English system that produced this translation for the first sentence of the development test corpus:

    nous savons très bien que les traités actuels ne suffisent pas
    et qu' il sera nécessaire à l' avenir de développer une structure plus efficace
    et différente pour l' union , une structure plus constitutionnelle
    qui indique clairement quelles sont les compétences des états membres
    et quelles sont les compétences de l' union .

rendered as:

    we know very well that the current treaties are not enough
    and that it will be necessary in future to develop a more effective and different structure
    for the union , a constitutional structure , which clearly indicates what are the responsibilities of the member states and what are the competences of the union .

Clearly, statistical MT has made great strides in the last decade. More importantly for the purposes of this class, the development of such systems has been a paradigm case of the new empirically based research model in computational linguistics, with successful applications of many of the leading ideas. These ideas include all of the following, which will be part of the subject matter of the course:

  1. A disciplined testing and development scheme requiring clearly separated training data, development data (for log-linear model tuning), development test data, and test data.
  2. Ngram language models (5-gram max) with backoff.
  3. Iterative Maximum Likelihood Estimation (MLE) algorithms.
  4. Minimum error rate algorithms.
  5. Modeling of phrasal alignments (built up from IBM-like models of word-to-word alignments).
  6. Modeling of word-class information.
  7. A good (automatic) evaluation methodology with well-established links to human evaluation. (This is critical to item 4, and it is a very difficult area, as we will see.)
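
Items 3 and 5 in the list above can be made a little more concrete with a toy example. The following is a minimal sketch, assuming a simplified IBM Model 1 (no NULL word, no smoothing) trained by expectation-maximization, which is one iterative MLE procedure for word-to-word translation probabilities. The function name `train_ibm_model1` and the two-sentence corpus are hypothetical illustrations; they are not taken from Moses or from the workshop materials.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(e | f) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Simplifying assumptions: no NULL source word, no smoothing.
    """
    target_vocab = {e for _, tgt in pairs for e in tgt}
    # Uniform start: every target word is equally likely given any source word.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # t[(e, f)] approximates P(e | f)

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, f) links
        total = defaultdict(float)  # expected number of links leaving f

        # E-step: share each target word among the source words of its sentence.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac

        # M-step: re-estimate t(e | f) from the expected counts.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


# Hypothetical toy corpus: "la" co-occurs with "the" in both sentence pairs.
toy_pairs = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm_model1(toy_pairs)
for e in ["the", "house", "flower"]:
    print(f"t({e} | la) = {t[(e, 'la')]:.3f}")
```

Because "la" is the only source word shared by both pairs, the expected counts shift toward pairing it with "the" as the iterations proceed, even though no dictionary or hand alignment was supplied. Phrase-based systems build on word alignments of this general kind, but with considerably more machinery than this sketch shows.
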

The result of this line of research is not just a case study in how to do statistical language modeling; it is also a treasure trove of linguistic data. The phrase models produce outputs like the following for millions of words:
    être divisé sur ||| divided over ||| (0) (0) (0) (1) ||| (0,1,2) (3) ||| 0.0666667 1.64111e-07
Each such entry includes information about bidirectionally induced alignments and their probabilities.
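
For anyone who wants to explore such entries programmatically, here is a minimal sketch of reading one line, assuming only what the sample shows: five fields separated by `|||`, two of which hold parenthesized groups of indices and the last of which holds numeric scores. The function name `parse_phrase_entry` and the dictionary keys are hypothetical labels chosen for illustration; in particular, the sketch does not assume which alignment field corresponds to which translation direction, or what each score means.

```python
import re


def parse_phrase_entry(line):
    """Split one '|||'-delimited phrase-table line into its fields."""
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    source, target, align_a, align_b, scores = fields

    def parse_alignment(text):
        # "(0,1,2) (3)" -> [[0, 1, 2], [3]]; an empty "()" gives [].
        groups = re.findall(r"\(([^)]*)\)", text)
        return [[int(i) for i in g.split(",") if i.strip()] for g in groups]

    return {
        "source": source.split(),
        "target": target.split(),
        "alignment_1": parse_alignment(align_a),  # third field, direction not assumed
        "alignment_2": parse_alignment(align_b),  # fourth field, direction not assumed
        "scores": [float(s) for s in scores.split()],
    }


sample = "être divisé sur ||| divided over ||| (0) (0) (0) (1) ||| (0,1,2) (3) ||| 0.0666667 1.64111e-07"
entry = parse_phrase_entry(sample)
print(entry["source"], entry["target"])
print(entry["alignment_1"], entry["alignment_2"], entry["scores"])
```

Keeping the alignment groups as lists of integers makes it easy to count, for example, how often a phrase on one side is linked to more than one position on the other.
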

Obviously, part of the task an MT system performs is sense disambiguation, though, as the statistical systems show, that task may be performed quite indirectly. We will look at some classical and recent work in sense disambiguation, and at using SMT methods for sense disambiguation tasks. Evaluation issues will rear their heads again.

The course requirements do not include programming experience, although programming projects will be encouraged for those so inclined. There will be several assignments (see the course outline for examples), a presentation, and a final course project. See the course syllabus for weights and the Experiments page for project ideas.

The required text for the class will be Jurafsky and Martin, Speech and Language Processing. We will be focusing on the excellent revision of the MT chapter (Chapter 24) in the 2nd edition, available at the textbook website. Many of the other readings will be papers and tutorials freely available online (see the course outline).

Website

http://www-rohan.sdsu.edu/~gawron/parsing

Prerequisites and Grading

Prerequisite: Some computer science or some linguistics; preferably Ling 581.

Grading will be based on exercises/projects, a presentation, and a final project.

Place and Time

Tu Th 11:00-12:15
SH-348
Storm Hall

Contact Info

Mailing address:
Jean Mark Gawron
Department of Linguistics and Oriental Languages
San Diego State University
5500 Campanile Drive
San Diego, CA 92182-7727
Telephone: (619) 594-0252
Office Hours: Tu Th 12:30-1:45, BAM 321

