Traditionally this course has been taught as the second semester
of a two-semester intro to computational linguistics,
focusing on issues of parsing and semantic interpretation.
However, this semester I want to try something a little different
in order to take advantage of some truly remarkable resources
that have been made available to the community. I refer,
for example, to
the materials for the Workshop
on Statistical MT, which have been accumulating for several years
now, under the shared task workshop model that has proven quite successful
in advancing research in domains like
speech recognition.
MT, like speech recognition, has traditionally
been a difficult area in which to do research
because of very high barriers that
needed to be crossed just to start up,
let alone to be competitive. But now, thanks
to a number of government-funded efforts
which have spun off open source systems
(mostly Verbmobil and to some extent TIDES),
there are very sophisticated Statistical
MT systems available which you can
really peek at the insides of. Perhaps
even more importantly, a huge amount
of free parallel corpus data
has been released in a number of different
languages, along with free
online tools for producing more.
The key point is that these resources include
no bilingual dictionaries and no hand-aligned
data, two of the most expensive pieces
of legacy systems. What you get for free is a system,
lots of parallel data,
and lots of language modeling data
to build (ngram) models for the target language.
Thus, for example, I downloaded the
Moses system, the European parliament data,
and the ngram data used for this year's workshop and in a few
days (most of that consumed in training
time for the MT system),
I had a French-to-English system that produced this translation
for the first sentence of the development test
corpus:
nous savons très bien que les traités actuels ne suffisent pas
et qu' il sera nécessaire à l' avenir de développer une structure plus efficace
et différente pour l' union , une structure plus constitutionnelle
qui indique clairement quelles sont les compétences des états membres
et quelles sont les compétences de l' union .
rendered as
we know very well that the current treaties are not enough
and that it will be necessary in future to develop a more effective and different structure
for the union , a constitutional structure , which clearly indicates what are the responsibilities of the member states
and what are the competences of the union .
Clearly, statistical MT has made great strides in the
last decade. More importantly for the purposes of this
class, the development of such
systems has been a paradigm case of
the new empirically-based research model
in computational linguistics, with
successful applications of many
of the leading ideas. They include
all of the following, which will be parts of the subject matter
of the course:
- A disciplined testing and development
scheme requiring clearly separated training data,
development data (for log-linear model tuning),
development test data, and test data.
- Ngram language models (5-gram max) with backoff
- Iterative Maximum Likelihood Estimation (MLE) algorithms
- Minimum error rate algorithms
- Modeling of phrasal alignments (built
up from IBM-like models of word-to-word alignments).
- Modeling word-class information
- A good (automatic) evaluation methodology
with well-established links to human evaluation. [This is
critical to the minimum error rate algorithms above, and a very difficult area, as we will see.]
The result is not just a case study of
how to do statistical language modeling;
it is also a treasure-trove of linguistic
data. The phrase models produce outputs
like the following for millions of words:
être divisé sur ||| divided over ||| (0) (0) (0) (1) ||| (0,1,2) (3) ||| 0.0666667 1.64111e-07
Each such entry includes information about bidirectionally induced alignments
and their probabilities.
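For those inclined to poke at this data directly, here is a minimal Python sketch that splits one such entry into its parts. It assumes only what the example line shows: five fields separated by `|||`, holding the source phrase, the target phrase, two alignment fields (one per direction), and a list of scores. The function name `parse_phrase_entry` and the dictionary keys are my own illustrative choices, not part of any toolkit, and the exact meaning of each alignment field and score should be checked against the documentation of the system that produced the table.

```python
def parse_phrase_entry(line):
    """Split one '|||'-delimited phrase-table line into its fields.

    Assumed layout (taken from the example entry, not from a spec):
    source ||| target ||| alignment ||| alignment ||| scores
    """
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    source, target, align_a, align_b, scores = fields

    def parse_alignment(text):
        # "(0,1,2) (3)" -> [[0, 1, 2], [3]]; an empty group "()" -> []
        groups = []
        for group in text.split():
            inner = group.strip("()")
            groups.append([int(i) for i in inner.split(",")] if inner else [])
        return groups

    return {
        "source": source.split(),
        "target": target.split(),
        "alignment_a": parse_alignment(align_a),  # first alignment field
        "alignment_b": parse_alignment(align_b),  # second alignment field
        "scores": [float(s) for s in scores.split()],
    }


entry = parse_phrase_entry(
    "être divisé sur ||| divided over ||| (0) (0) (0) (1) "
    "||| (0,1,2) (3) ||| 0.0666667 1.64111e-07"
)
print(entry["source"], entry["target"], entry["scores"])
```

The sketch deliberately does no consistency checking between the phrases and the alignment groups; it only exposes the fields so they can be counted, filtered, or sorted by score.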
Obviously part of the task an MT system performs
is sense-disambiguation, though that
task may be performed quite indirectly,
as the statistical systems show. We will
look at some classical and recent work in
sense disambiguation and we will look
at using SMT methods for sense disambiguation
tasks. Evaluation issues will rear their
heads again.
The course requirements do not include programming experience,
although programming projects will be encouraged for those
so inclined. There will be several assignments
(see course
outline for examples), a presentation, and a final course project.
See Course syllabus for weights
and Experiments page
for project ideas.
The required text for the class will be Jurafsky and Martin, Speech and
Language Processing. We will be focusing
on the excellent revision
of the MT chapter (Chapter 24) in the 2nd edition, available
at
the textbook website. Many of the other readings
will be papers and tutorials freely available online (see
Course outline).
http://www-rohan.sdsu.edu/~gawron/parsing
Prerequisite: Some computer science or
some linguistics; preferably Ling 581.
Grading will be based on exercises/projects, a
presentation, and a final project.
Tu Th 11:00-12:15
SH-348
Storm Hall
Mailing address:
Jean Mark Gawron
Department of Linguistics and Oriental Languages
San Diego State University
5500 Campanile Drive
San Diego, CA 92182-7727
Telephone: (619) 594-0252
Office Hours: Tu Th 12:30-1:45, BAM 321