(Un)natural Language Processing

How programmers are using statistical approaches to help computers learn

by Paul Crider
Graphic by Matthew Mattozzi

[Title graphic: a parse tree of the article’s subtitle, with each word labeled by its syntactic category. Key to the tags: sbarq = question sentence; whadvp = wh-moved adverb phrase; s = subject; wrb = wh-adverb (question word); no = noun object; vp = verb phrase; nns = plural noun; vbg = verb/gerund/present participle; np = noun phrase; jj = adjective; to = to; vb = verb.]
Nearly every one of us handily solved
the problems of natural language processing (NLP) shortly after we mastered
bipedal walking. Perhaps we can’t deconstruct and label everything we say, but we
do have an intuitive grasp of how to put
a sentence, or several sentences, together.
The researchers in Dan Klein’s research
group at UC Berkeley are attempting a
much more daunting task: teaching this
skill to computers. Percy Liang, a graduate
student in the Klein lab, compares making
machines understand human language to
making machines fly. Airplanes fly without flapping their wings, even though that method comes so naturally to birds; both, however, must obey the laws of aerodynamics.
In the case of NLP, researchers must figure out how to replace children’s natural
advantages in language learning, such as
context, with the computers’ strengths in
statistical analyses.
The field of NLP was born with a bang
of optimism when the first experiments
yielded encouraging results. The SHRDLU
program (whose name comes from an arrangement of keys), developed by Terry
Winograd at MIT in the late 1960s, was
capable of understanding and obeying a
limited set of commands given in human
language. Users could ask the program, in
English, to perform simple tasks like “Place
the small, blue pyramid atop the red cube,”
and the program would follow through or
ask for clarification. The demonstration was
an impressive initial step in showing that
computers can be taught to understand human language.
There was also a good deal of excitement about the possibility of machines
translating languages; early researchers reasoned that if a machine could understand
one human language, it could probably understand another just as well, and therefore
convert between the two. Here again, the
first attempts were hopeful, as the machines
had no trouble translating simple sentences.
Slav Petrov, another student in the Klein lab,
chuckled at the grandiose ambitions this inspired: “There were some first promising experiments and people were saying, ‘Oh, we’ll
build computers that can translate from any
given language into English, and we’ll just
record everything going on in Russia, [and]
we’ll be able to build the perfect spy. We’ll
process everything in a few days.’”
The field sobered up when experiments
on complex sentences with real-life context
yielded less inspiring results. Sentences involving more actions and more objects than
those available in SHRDLU’s world are harder to understand, and the meaning of even
simple sentences can change depending on
emphasis, tone, and context. Researchers
came to accept that the “perfect spy” would not be as easy to create as they had first hoped. Rather, current research is focused on developing programs for specific tasks, such as machine translation (using computers to translate sentences from one language into another), parsing (grammatically dissecting sentences), and speech recognition (“translating” audio waves into text). “The only way the field is able to make progress is to focus on these well-defined tasks,” says Liang. “Even machine translation isn’t that well defined, because there are many possible translations, and how do you evaluate that? But that’s [better] than saying ‘Let’s build a machine that understands human language!’”

[Figure: The Klein lab’s machine translation learning algorithm, accompanying a simple “phrase-based” translation from French to English (“les chats aiment” / “cats like”; “le poisson frais” / “fresh fish”). The learning algorithm performs competitively when compared with other phrase-based translation programs. It also has the advantage of being able to translate equally well between many different languages. Graphic adapted from Klein lab website by Matthew Mattozzi.]

The modern approach to natural language processing uses the techniques of machine learning, a subfield of artificial intelligence research that uses statistical methods to extract trends and correlations automatically from data. Klein, who found himself in the field as a logical consequence of his high school passions for math, computers, and languages, describes statistical modeling as a two-step process. First, break down complex processes into a set of interacting variables, and then describe how these variables probabilistically influence each other. While traditional logic rigidly links variables together, machine learning techniques take advantage of “a kind of softer version of logic,” Klein explains. “[The programs] say there are a bunch of variables and those variables can be anything; they can be the words in your speech recognizer or the trees in your translation procedure or whatever it is. You have those same variables, except now you relate them in ways that are more complex than ‘A causes B.’ You say here are the ways A could be, here are the ways B could be, and here are the statistical dependencies between them.”

To take the weather, for example, the probability of rain might be predicted by comparing how often it rains while old men complain about their aching knees with how often it rains in the absence of aching knees. This statistical dependence can then be combined with similar observations about clouds and barometric pressure. The Klein lab’s statistical models use massive data sets to improve their predictions. “The statistical paradigm is giving lots of data and examples to this thing called a learning algorithm which will then have learned how to perform a certain task,” as Liang describes it. Most of their algorithms are “trained” on standard data sets in which the correct “answer” has been indicated by human researchers. A small portion of the data is set aside to test the efficacy of the trained algorithm.

Talk like a man

Chances are, you’ve had a conversation with a machine in the recent past, perhaps in order to change the radio station on your car stereo or during a phone call to track a flight. Despite the success of such speech recognition applications, much work remains to be done. The Klein group is currently studying phone recognition, which is similar to speech recognition, but the goal is to recognize smaller units of sound called phones—the bits of sound that combine to make syllables, like the t sound in the syllable ta. Petrov explains, “When you want to say a word, you go through a sequence of steps of pronouncing each of the sounds, or phones. While you’re in this mental state [thinking about saying a particular word], your mouth goes through the steps of saying this word. What you observe are these sounds. We capture them in a certain way, the frequencies and climaxes, and we infer what that mental state was [the word you intended to say].”

To study this problem, Petrov used a Hidden Markov model. In a Hidden Markov model, the system under study exists in one of many possible states. Each state can give rise to a few possible observations, each with a different probability. The states also transition into one another with different probabilities. For example, since many words contain the st combination (state, study, system), the probability of the s state changing into the t state is much higher than the probability of s changing into f (words like “sphere” are rarer, at least in English).

Petrov’s model starts by associating each phone with one mental state, and then induces structure by splitting phones into two or more states or merging multiple phones into one mental state. The splits and mergers are kept if they increase the likelihood of observing the acoustic data. The process is repeated over and over again, and the result is a phone transcription of what the model “heard.”

This method differs from other approaches in the simplicity of its initial model. Human phoneticians have known for a long time that a “one phone, one state” model was too simplistic to describe real speech, so most models begin with complex structure (splitting phones into subphones) and go from there. Petrov’s model starts simply, and adds complexity as it learns from the data. The model performs competitively with these other models when checked against speech which has been manually annotated with the true phone sequence.
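The hidden Markov machinery Petrov describes can be sketched compactly. The example below is not the lab’s recognizer: the phone states, transition and emission probabilities, and the discrete “acoustic” symbols are all invented for illustration, and real systems work with continuous acoustic features and many more states. It shows only the core inference step, using the standard Viterbi algorithm to recover the most likely hidden sequence of phones from a sequence of observations.

```python
# Toy hidden Markov model for phone recognition (illustrative only).
# States are phones; the observation symbols stand in for discretized
# acoustic frames. All probabilities below are invented for the example.

states = ["s", "t", "f"]

# Transition probabilities: "st" is common in English words (state, study,
# system), so s -> t is given much more probability than s -> f.
trans = {
    "s": {"s": 0.2, "t": 0.7, "f": 0.1},
    "t": {"s": 0.4, "t": 0.3, "f": 0.3},
    "f": {"s": 0.4, "t": 0.4, "f": 0.2},
}

# Emission probabilities: each phone tends to produce a characteristic
# acoustic symbol, but noisily.
emit = {
    "s": {"hiss": 0.7, "burst": 0.2, "hum": 0.1},
    "t": {"hiss": 0.1, "burst": 0.8, "hum": 0.1},
    "f": {"hiss": 0.5, "burst": 0.1, "hum": 0.4},
}

start = {"s": 0.5, "t": 0.3, "f": 0.2}


def viterbi(observations):
    """Return the most probable sequence of phone states for the observations."""
    # best[i][state] = (probability, previous state) of the best path so far
    best = [{s: (start[s] * emit[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for cur in states:
            prob, prev = max(
                (best[-1][p][0] * trans[p][cur] * emit[cur][obs], p) for p in states
            )
            column[cur] = (prob, prev)
        best.append(column)

    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


print(viterbi(["hiss", "burst", "hiss"]))  # -> ['s', 't', 's']
```

Petrov’s split-and-merge refinement would then grow or shrink this state space, splitting or merging phone states and keeping a change only when it increases the likelihood the model assigns to the acoustic training data.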
What is a sentence, anyway?
Even if computers can understand some
spoken words and respond to specific verbal
commands, they have a ways to go before
they truly understand the language. Computational linguistics, a sister of NLP, employs similar statistical methods to address
linguistic problems, such as what makes a
valid sentence. This is a tricky problem. Linguists have never been able to devise rules
that completely specify what constitutes an
admissible sentence, so naturally it is difficult to consistently diagram sentence structure—particularly for complex sentences
involving many clauses, like this one. This
process is called parsing, and while computers are quite good at parsing programming
languages, they have a hard time dealing
with the ambiguities of human languages.
The goal for researchers working on this
type of problem is to write an algorithm that
will take a sentence as input and deliver a
parse tree as an output.
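To make that input-output contract concrete, here is a minimal CKY-style parser over a toy, hand-written grammar. The grammar, lexicon, and example sentence are invented for illustration and have nothing to do with the Klein lab’s parser or the categories it learns; the sketch only shows a sentence going in and a bracketed parse tree coming out.

```python
# A miniature CKY parser: takes a sentence, returns a bracketed parse tree.
# The grammar below is a toy, hand-written fragment in Chomsky normal form.

# Binary rules: (left child, right child) -> parent
binary_rules = {
    ("NP", "VP"): "S",
    ("DT", "NN"): "NP",
    ("JJ", "NN"): "NP",
    ("VBZ", "NP"): "VP",
}

# Lexical rules: word -> part-of-speech tag
lexicon = {
    "the": "DT",
    "cat": "NN",
    "eats": "VBZ",
    "fresh": "JJ",
    "fish": "NN",
}


def cky_parse(words):
    n = len(words)
    # chart[i][j] maps a category to a subtree spanning words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        tag = lexicon[word]
        chart[i][i + 1][tag] = (tag, word)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left_cat, left_tree in chart[i][k].items():
                    for right_cat, right_tree in chart[k][j].items():
                        parent = binary_rules.get((left_cat, right_cat))
                        if parent and parent not in chart[i][j]:
                            chart[i][j][parent] = (parent, left_tree, right_tree)
    return chart[0][n].get("S")


def show(tree):
    """Render a tree as a bracketed string, e.g. (S (NP ...) (VP ...))."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return "(%s %s)" % tree
    return "(%s %s)" % (tree[0], " ".join(show(child) for child in tree[1:]))


print(show(cky_parse("the cat eats fresh fish".split())))
# (S (NP (DT the) (NN cat)) (VP (VBZ eats) (NP (JJ fresh) (NN fish))))
```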
A central problem of parsing is figuring out how to cluster the words of a
sentence into separate categories, like
noun phrases and verb phrases, and then
how to subdivide these clusters further.
This problem is especially complicated
because, despite the age of the field, linguists are still uncertain of how many
such categories there should be in any language. To deal with this uncertainty, Klein’s
group allows their model to split categories
to form new, more specific subdivisions as
it learns from data. Researchers first allow
their model to use any number of categories, permitting the introduction of new
categories as the program learns from the
data. As the algorithm proceeds, fewer new
categories need to be added, and eventually the total number of categories becomes static. An analogy might be made with music: borrow a stranger’s music collection and start dividing random songs into genres and sub-genres. There might be any number of genres, but as you proceed, you will tend to introduce fewer and fewer new ones.

The future of NLP: History?

Natural language processing even has something to say about history. Alex Bouchard-Côté, a graduate student in the Klein group, has developed a probabilistic model of linguistic evolution. Using sets of cognates (words in different languages that have descended from the same word in their ancestral language), he describes a model where languages are arranged in trees, with the ancestor language at the top of the tree. A phoneme (sound) change is made along a branch of the tree according to probability rules that are dependent on the surrounding context: an e sound may change into an i sound, but it is extremely unlikely to change into a k sound.

This model can tease out details of lost languages based on shared characteristics of their modern successors. If four out of five daughter languages end a certain word with an s, then the word in the parent language also probably ended in an s, while if only one daughter language ends with s then the parent language probably did not. Bouchard-Côté trained his model on the well-studied Romance languages, with Latin as the “correct answer” ancestor, but he is excited about the prospect of using this method as a “Jurassic Park for ancient languages.” For instance, this type of model could be used to reconstruct the sounds of ancient Sanskrit from modern Hindi and Urdu.

The comparison to Jurassic Park is especially apt, as Bouchard-Côté’s approach is similar to that used by evolutionary biologists. Klein compares this work to “the kind of problem that biologists worry about in reconstructing ancient genomes, or ancient forms of proteins. Except our words are much shorter, and the processes that alter them are much, much more complex.” Biologists may beg to differ on that last point, but at least concur that, unlike a velociraptor, ancient Sanskrit is not likely to rip out your entrails.
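The core of the reconstruction step in the sidebar above can be caricatured in a few lines of code. The sketch below is not Bouchard-Côté’s model: it ignores the tree structure and the context-dependent sound-change rules, and the substitution probabilities are invented. It only illustrates the idea that the ancestral sound is the one that best explains what the daughter languages show.

```python
# Caricature of ancestral-sound reconstruction (illustrative only).
# A real model works over whole language trees with context-dependent
# sound-change probabilities; here we simply pick the ancestral word-final
# sound that best explains the daughter languages.
# "-" stands for "no final sound"; all probabilities are invented.

change_prob = {
    # ancestor: probability of each surface form in a daughter language
    "s": {"s": 0.80, "z": 0.15, "-": 0.05},
    "z": {"z": 0.70, "s": 0.25, "-": 0.05},
    "-": {"-": 0.90, "s": 0.05, "z": 0.05},
}


def reconstruct_final_sound(daughter_finals):
    """Return the ancestral final sound that maximizes the likelihood
    of the observed word-final sounds in the daughter languages."""
    best, best_likelihood = None, 0.0
    for ancestor, surface in change_prob.items():
        likelihood = 1.0
        for observed in daughter_finals:
            likelihood *= surface.get(observed, 0.01)
        if likelihood > best_likelihood:
            best, best_likelihood = ancestor, likelihood
    return best


# Four of five daughter languages keep a final "s", one drops it:
print(reconstruct_final_sound(["s", "s", "s", "s", "-"]))   # -> s
# Only one daughter language shows the "s":
print(reconstruct_final_sound(["-", "-", "-", "-", "s"]))   # -> -
```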
The syntactic categories described by
the model won’t surprise many linguists;
you may remember many of them from a
grammar class. The model found categories like subject noun phrases, direct objects, and prepositional phrases. “When we
looked at what was being learned, it was
many of the things linguists have proposed
before. The system learned to distinguish
[them],” says Petrov.
Using completely different methodologies, the humans and the machines
draw similar conclusions about language
structure. But the results are not identical.
Petrov’s parser doesn’t distinguish between
syntax (sentence structure and grammatical
rules) and semantics (what sentences actually mean). Klein explains, “Some linguists
draw a clear distinction between syntax and
semantics as two different things. But when
a parser is automatically trying to fit a grammar to data, often the syntax and the semantics are conflated.”
Some supervision required
The Klein group’s parser, which can be
found online, is the best-performing parser
of its kind. Because of its general, purely
data-driven approach, it has the advantage of being language agnostic: the same
program learns to parse Chinese as well as it
does English. According to Liang, it may
be possible to make the parser even more
robust by removing humans from the
process entirely. Most machine translation has been supervised, that is, learning
algorithms have been trained on data that
has already been labeled with the correct
answers. “You have the learning algorithm
which you give these training examples.
Well, where do these training sentences
come from? Someone has to manually label
these sentences,” Liang says. “It took two
years for some people to hammer out those
sentences. So one thing I’m interested in is,
can you do without the supervision?”
The question of supervision also comes
into play for coreference resolution, or the
problem of determining what words in different contexts and documents refer to the
same thing. For instance, this might be finding what pronouns and nouns across several documents all refer to George W. Bush.
These will probably include instances of
“he”, plus words like “President” or “Commander-in-Chief.” The model is given the
specific entities it’s looking for (like Bush),
and a set of parameters describing the entity
(like gender, number, or whether the entity
is a location or a person), and then the model attempts to find and label the references
to that entity.
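A drastically simplified version of that matching step might look like the sketch below. The entity description, mentions, attribute scores, and threshold are all invented, and the Klein group’s systems learn statistical models from data rather than using hand-set scores; the sketch only illustrates how agreement on properties like gender, number, and entity type can be turned into a decision about whether a mention refers to the entity.

```python
# Toy compatibility scoring for coreference resolution (illustrative only).
# The entity description, mentions, scores, and threshold are all invented.

entity = {"name": "George W. Bush", "gender": "male", "number": "singular",
          "type": "person", "aliases": {"bush", "president", "commander-in-chief"}}

mentions = [
    {"text": "he", "gender": "male", "number": "singular", "type": "person"},
    {"text": "the President", "gender": None, "number": "singular", "type": "person"},
    {"text": "they", "gender": None, "number": "plural", "type": "person"},
    {"text": "Yale University", "gender": None, "number": "singular", "type": "organization"},
]


def compatibility(mention, entity):
    """Score how well a mention's attributes agree with the entity's."""
    score = 0.0
    for attribute in ("gender", "number", "type"):
        value = mention[attribute]
        if value is None:
            continue                      # unknown attributes neither help nor hurt
        score += 1.0 if value == entity[attribute] else -2.0
    # Lexical evidence: the mention contains one of the entity's known aliases.
    if any(alias in mention["text"].lower() for alias in entity["aliases"]):
        score += 2.0
    return score


for mention in mentions:
    verdict = "refers to" if compatibility(mention, entity) >= 2.0 else "does not refer to"
    print(f'"{mention["text"]}" probably {verdict} {entity["name"]}')
```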
Coreference resolution is interesting
to computational linguists for two reasons.
First, humans clearly do it very well, so the
problem has theoretical value. But there
are practical applications as well. Suppose
you’re trying to find the answer to a specific
question—for example, “Where did George
Bush go to college?”—and your document
has a sentence containing “he attended Yale
University.” You need to know what “he”
refers to in order to find the answer.
Traditionally, coreference resolution has
been a supervised task: words like ‘President’ in the training data would already be
tagged as referring to George W. Bush. Apparently this labeling is not altogether necessary. Klein explains, “We were able to get
state-of-the-art performance from a fully unsupervised system that never sees examples
of the correct answers. It just uses a statistical model that is fit to the data and it’s able
to induce the right structure.”
One of the best known applications
of NLP, machine translation, uses a similar
strategy. In traditional machine translation,
rules are coded by linguists, which the machine then uses to translate text. Needless
to say, human linguists are more expensive
than statistical methods, so it’s worthwhile
to try to automate this process. In statistical machine translation, models are trained
on bilingual texts, such as French and
English transcripts of meetings of the European Union. The model then generates
best-guess sentences and compares them to
a reference translation, choosing the best
sentence based on a score associated with
the parameters of the model, such as phrase
length and word order distortion, as well
as the presence of common phrases (the
French “il y a” should probably be translated as “there is” instead of “it there has”
or “there has”). As the algorithm proceeds,
more parameters are learned and their relative importance is reweighted. Of course,
any scoring process like this is complicated
by the fact that there is no single correct
translation; style can account for a number
of perfectly legitimate translations.
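The scoring idea can be sketched as a small weighted combination of features. The phrase table, feature weights, and candidate translations below are invented and far cruder than a real system’s learned parameters; the sketch only shows how a candidate that uses the idiomatic phrase pair (“il y a” as “there is”) ends up outscoring the word-for-word alternative.

```python
import math

# Toy scoring of candidate translations (illustrative only).
# Real phrase-based systems use large learned phrase tables, language
# models, and tuned weights; everything below is invented.

# Phrase translation probabilities, e.g. "il y a" should come out as
# "there is", not the word-for-word "it there has".
phrase_table = {
    ("il y a", "there is"): 0.8,
    ("il y a", "it there has"): 0.05,
    ("un chat", "a cat"): 0.9,
}

weights = {"phrase": 1.0, "length": 0.5, "distortion": 0.7}


def score(candidate, phrases_used, source_len, distortion):
    """Weighted log-linear score of one candidate translation."""
    phrase_score = sum(math.log(phrase_table[p]) for p in phrases_used)
    # Penalize translations whose length differs a lot from the source.
    length_penalty = -abs(len(candidate.split()) - source_len)
    # Penalize reordering of the source phrases ("distortion").
    distortion_penalty = -distortion
    return (weights["phrase"] * phrase_score
            + weights["length"] * length_penalty
            + weights["distortion"] * distortion_penalty)


source = "il y a un chat"
candidates = [
    ("there is a cat", [("il y a", "there is"), ("un chat", "a cat")], 0),
    ("it there has a cat", [("il y a", "it there has"), ("un chat", "a cat")], 0),
]

best = max(candidates,
           key=lambda c: score(c[0], c[1], source_len=len(source.split()), distortion=c[2]))
print(best[0])   # -> "there is a cat"
```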
The techniques used in machine translation and coreference resolution are combined in another project of the Klein group:
building bilingual lexicons. When two texts
are translations of each other, building a lexicon between the languages as in machine
translation is relatively straightforward; the
difficulty arises because the grammars are
different. A new idea is to try this with texts
that are not direct translations, using statistical models to correlate words in the two
languages. “One of the things we’re working
on is trying to do that without these very
carefully aligned translations,” says Klein.
“We are just looking at a big collection of
Chinese data and a big collection of English
data, even though they’re not translations of
each other, and we’re trying to put the words
of the languages into alignment anyway, on
the basis of what words they occur with or
maybe how they’re spelled or various kinds
of features.”
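A bare-bones sketch of that idea appears below. The two tiny “corpora,” the seed dictionary, and the candidate list are invented, and the sketch uses only one of the signals Klein mentions (the words a word occurs with), leaving out spelling and other features; real lexicon-induction models are far richer. It pairs words across languages by comparing the contexts they appear in, mapped through a small seed dictionary.

```python
from collections import Counter
import math

# Toy bilingual lexicon induction from non-parallel text (illustrative only).
# Words are paired across languages by the similarity of the contexts they
# appear in, with context words mapped into English via a tiny seed dictionary.

english = "the cat drinks milk . the dog eats meat .".split()
spanish = "el gato bebe leche . el perro come carne .".split()

# A small seed dictionary lets us compare context words across languages.
seed = {"el": "the", "bebe": "drinks", "come": "eats", ".": "."}


def context_vector(corpus, word, translate=None):
    """Count neighboring words, optionally mapping them into English."""
    counts = Counter()
    for i, token in enumerate(corpus):
        if token != word:
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus):
                neighbor = corpus[j]
                if translate is not None:
                    neighbor = translate.get(neighbor, neighbor)
                counts[neighbor] += 1
    return counts


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_translation(spanish_word):
    source = context_vector(spanish, spanish_word, translate=seed)
    candidates = ["cat", "dog", "milk", "meat"]
    return max(candidates, key=lambda w: cosine(source, context_vector(english, w)))


print(best_translation("gato"))    # -> "cat"
print(best_translation("carne"))   # -> "meat"
```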
The ultimate goal of natural language
processing research is to build a “system
which can deeply understand language, textual information, and can read and understand and interact based on that understanding,” says Klein. While no one in the Klein
group thinks that goal is close at hand, they
do think that some surprising things could
be learned along the way. Could beauty in
language be amenable to statistical treatment? Klein initially appears amused at the
notion, but then, thinking of studies done
on physical attractiveness, he reflects, “You
can use statistical methods to predict anything, provided you have a way of analyzing
your raw input.”
Paul Crider is a graduate student in physical
chemistry.
Want to know more? Check out the Klein Lab website (and play with the parser): nlp.cs.berkeley.edu

[Photo: Members of the Klein lab. From left to right: John DeNero, Aria Haghighi, Percy Liang, Alex Bouchard-Côté, Slav Petrov, and Dan Klein. Photo by Annaliese Beery.]