
A Memory-Based Model of
Syntactic Analysis:
Data Oriented Parsing
Remko Scha, Rens Bod, Khalil Sima’an
Institute for Logic, Language and Computation
University of Amsterdam
Outline of the lecture

Introduction

Disambiguation

Data Oriented Parsing

DOP1 computational aspects and
experiments

Memory Based Learning framework

Conclusions
Introduction

Human language cognition:


Modern linguistics



Analogy-based processes on a store of past experiences
Set of rules
Language processing algorithms
Performance model of human language
processing


Competence grammar as a broad framework for
performance models.
Memory / Analogy - based language processing
The Problem of Ambiguity
Resolution



Every input string has an unmanageably
large number of analyses
Uncertain input – generate guesses and
choose one
Syntactic disambiguation might be a side
effect of semantic one
The Problem of Ambiguity
Resolution

Frequency of occurrence of lexical item
and syntactic structures:



People register frequencies
People prefer analyses they have already
experienced over constructing new ones
More frequent analyses are preferred to less
frequent ones
From Probabilistic Competence Grammars to Data-Oriented Parsing



Probabilistic information derived from
past experience
Characterization of the possible
sentence-analyses of the language
Stochastic Grammar

Define : all sentences, all analyses.

Assign : probability for each

Achieve : preference that people display
when they choose sentence or analyses.
Stochastic Grammar

These predictions are limited

Platitudes and conventional phrases

Allow redundancy

Use Tree Substitution Grammar
Stochastic Tree Substitution Grammar

Set of elementary trees

Tree rewrite process

Redundant model

Statistically relevant phrases

Memory based processing model
Memory based processing model


Data oriented parsing approach:

Corpus of utterances – past experience

STSG to analyze new input
In order to describe a specific DOP model

A formalism for representing utterance analyses

An extraction function

Combination operations

A probability model
A Simple Data Oriented Parsing
Model: DOP1


Our corpus for DOP1: an imaginary corpus of two
trees
Possible sub trees:





t consists of more than one node
t is connected
except for the leaf nodes of t, each node in t has the
same daughter-nodes as the corresponding node in T
Stochastic Tree Substitution Grammar – set of
sub trees
Generation process – composition:

A ∘ B : B is substituted on the leftmost non-terminal
leaf node of A
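The composition operation can be sketched in code; the tuple representation of trees and the helper names below are our own assumptions, not part of the original model:

```python
# Sketch of the DOP composition operation "A o B": B is substituted
# on the leftmost non-terminal leaf node of A. Trees are nested
# tuples (label, child1, ...); a tuple with no children is an open
# substitution site, and a plain string is a word.

def leftmost_site(tree, path=()):
    """Return the path to the leftmost open substitution site, or None."""
    label, *children = tree
    if not children:
        return path                      # open non-terminal leaf
    for i, child in enumerate(children):
        if isinstance(child, tuple):
            found = leftmost_site(child, path + (i,))
            if found is not None:
                return found
    return None

def compose(a, b):
    """Substitute subtree b at the leftmost open site of a.

    The site's label must equal the root label of b, otherwise the
    composition is undefined.
    """
    site = leftmost_site(a)
    if site is None:
        raise ValueError("no open substitution site in a")

    def replace(tree, path):
        if not path:
            if tree[0] != b[0]:
                raise ValueError("root label mismatch")
            return b
        label, *children = tree
        i = path[0]
        children[i] = replace(children[i], path[1:])
        return (label, *children)

    return replace(a, site)

# Example: S -> NP VP with both children open, filled left to right.
s = ('S', ('NP',), ('VP',))
s = compose(s, ('NP', 'she'))
s = compose(s, ('VP', ('V', 'saw'), ('NP',)))
# The NP inside the VP is still an open substitution site.
```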
Example of sub trees
DOP1 - Imaginary corpus of two
trees
[Figure: the two corpus trees — "she wanted the dress on the rack" and "she saw the dress with the telescope"]
Derivation and parse #1
[Figure: a derivation and the resulting parse of "She saw the dress with the telescope"]
Derivation and parse #2
[Figure: a second derivation and the resulting parse of "She saw the dress with the telescope"]
Probability Computations:

Probability of substituting a sub tree t on a
specific node:

  P(t) = r(t) / Σ_{t′ : root(t′) = root(t)} r(t′)

Probability of a derivation:

  P(D) = P(t1 ∘ … ∘ tn) = ∏_i P(ti)

Probability of a parse tree:

  P(T) = Σ_{D derives T} P(D)
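A minimal sketch of these three probability computations, assuming a toy inventory of subtrees with the frequencies and roots named below (all names hypothetical, not from the corpus in the slides):

```python
# Sketch of the DOP1 probability model over a toy subtree inventory.
from math import prod

freq = {"t1": 2, "t2": 1, "t3": 1}           # r(t): corpus frequency
root = {"t1": "S", "t2": "S", "t3": "NP"}    # root label of each subtree

def p_subtree(t):
    """P(t) = r(t) / sum of r(t') over subtrees t' with the same root."""
    total = sum(f for u, f in freq.items() if root[u] == root[t])
    return freq[t] / total

def p_derivation(subtrees):
    """P(D) = product of the substitution probabilities P(t_i)."""
    return prod(p_subtree(t) for t in subtrees)

def p_parse(derivations):
    """P(T) = sum of P(D) over all derivations D that derive T."""
    return sum(p_derivation(d) for d in derivations)

# t1 and t2 share root S, so P(t1) = 2/3 and P(t2) = 1/3;
# t3 is the only NP-rooted subtree, so P(t3) = 1.
```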
Computational Aspects of DOP1
 Parsing
 Disambiguation

Most Probable Derivation

Most Probable Parse
 Optimizations
Parsing

Chart-like parse forest

Derivation forest


Elementary tree t as a context-free rule:
root(t) —> yield(t)
Label each phrase with its syntactic category
and its full elementary tree
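Reading off the context-free rule root(t) —> yield(t) from an elementary tree can be sketched as follows (the tuple encoding of trees is our own assumption):

```python
# Sketch: treat an elementary tree t as the context-free rule
# root(t) -> yield(t), as used when building the parse forest.
# Trees are nested tuples; a tuple with no children is an open
# substitution site, a plain string is a word.

def tree_to_rule(tree):
    """Return (root_label, tuple of frontier labels/words)."""
    label, *children = tree
    if not children:
        return label, (label,)
    frontier = []

    def collect(t):
        if isinstance(t, str):
            frontier.append(t)           # terminal word
        else:
            lab, *kids = t
            if not kids:
                frontier.append(lab)     # open substitution site
            else:
                for k in kids:
                    collect(k)

    for c in children:
        collect(c)
    return label, tuple(frontier)

# The subtree S -> NP (VP -> V "saw" NP) with two open NP sites
# becomes the rule S -> NP saw NP.
t = ('S', ('NP',), ('VP', ('V', 'saw'), ('NP',)))
```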
Elementary trees of an example
STSG
[Figure: the five elementary trees, numbered 0–4]
Derivation forest for the string abcd
Derivations and parse trees for the string abcd
Disambiguation

The derivation forest defines all
derivations and parses

Most likely parse must be chosen

MPP in DOP1

MPP vs. MPD
Most Probable Derivation

Viterbi algorithm:


Eliminate low-probability sub-derivations in a
bottom-up fashion
Select the most probable sub-derivation at
each chart entry and eliminate the other
sub-derivations with that root node.
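The elimination step can be sketched as follows, assuming each chart entry is a list of (root, sub-derivation, probability) triples (a data layout of our own, not from the slides):

```python
# Sketch of the Viterbi elimination step: at each chart entry, keep
# only the most probable sub-derivation per root node and discard
# the rest.

def viterbi_prune(entry):
    """entry: list of (root_label, sub_derivation, probability)."""
    best = {}
    for root, deriv, p in entry:
        if root not in best or p > best[root][1]:
            best[root] = (deriv, p)
    return [(root, d, p) for root, (d, p) in best.items()]

# Two competing A-rooted sub-derivations: only the more probable
# one (d1) survives; the single B-rooted one is kept as-is.
entry = [("A", "d1", 0.06), ("A", "d2", 0.02), ("B", "d3", 0.04)]
pruned = viterbi_prune(entry)
```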
Viterbi algorithm


Two derivations for abc
If P(d1) > P(d2), eliminate d2 (the derivation on the right)
Algorithm 1 – Computing the probability
of most probable derivation





Input: an STSG with start symbol S, elementary trees R, and probabilities P
Elementary trees in R are in CNF
A —> tH : a rule with root A, derived from tree t, rewriting to the sequence of labels H
<A, i, j> : non-terminal A in chart entry (i, j) after parsing the input W1,...,Wn
P_MPD : probability of the MPD of the input string W1,...,Wn
Algorithm 1 – Computing the probability
of most probable derivation
The Most Probable Parse


Computing the MPP in an STSG is NP-hard
Monte Carlo method




Sample derivations
Observe frequent parse tree
Estimate parse tree probability
Random-first search
The algorithm
 Law of Large Numbers

Algorithm 2:
Sampling a random derivation

for length := 1 to n do
  for start := 0 to n - length do
    for each root node X in chart-entry (start, start + length) do:
      1. select at random a tree from the distribution of
         elementary trees with root node X
      2. eliminate the other elementary trees with root node X
         from this chart-entry
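The loop above can be sketched in code; the chart layout, keyed by (start, end, root), is our own assumption:

```python
import random

# Sketch of Algorithm 2 (sampling a random derivation): the chart maps
# (start, end, root) to a list of (elementary_tree, probability) pairs;
# for every entry we keep one tree per root node, drawn at random from
# the (renormalized) distribution, eliminating the rest.

def sample_derivation(chart, n, rng=None):
    rng = rng or random.Random(0)
    sampled = {}
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            for (s, e, root), candidates in chart.items():
                if (s, e) != (start, end):
                    continue
                trees = [t for t, _ in candidates]
                probs = [p for _, p in candidates]
                # select one tree; the others are eliminated
                sampled[(s, e, root)] = rng.choices(trees, weights=probs)[0]
    return sampled

# Toy chart for a 2-word span with two competing X-rooted trees.
chart = {(0, 2, 'X'): [('tree-a', 0.7), ('tree-b', 0.3)]}
picked = sample_derivation(chart, 2)
```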
Results of Algorithm 2

Random derivation for the whole sentence

First guess for MPP

Compute the size of the sampling set


Probability of error: the probability that the most
frequent sampled parse is not the MPP

Upper bound on the error probability
(0 = index of the MPP, i = index of parse i,
N = number of sampled derivations)

No unique MPP – ambiguity
Reminder
V[X] = E[X²] − (E[X])²
0 ≤ P[X] ≤ 1
σ²(X) = V[X]
Conclusions – lower bound for N

Lower bound for N:

Pi is the probability of parse i

B is the estimate of Pi from frequencies in N samples

Var(B) = Pi*(1-Pi)/N

Since Pi*(1-Pi) <= 1/4 : Var(B) <= 1/(4*N)

s = sqrt(Var(B)) -> s <= 1/(2*sqrt(N))

Hence N >= 1/(4*s^2)

N >= 100 -> s <= 0.05
Algorithm 3:
Estimating the parse probabilities




Given a derivation forest of a sentence and a
threshold σm for the standard error:
N := the smallest integer larger than 1/(4·σm²)
repeat N times:
  sample a random derivation from the derivation forest
  store the parse generated by this derivation
for each parse i:
  estimate its conditional probability given the
  sentence by pi := #(i) / N
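Algorithm 3 can be sketched as follows; `sample_parse` is a hypothetical stand-in for "sample a random derivation from the derivation forest and return the parse it generates", since no real forest is built here:

```python
import math
import random
from collections import Counter

# Sketch of Algorithm 3 (estimating the parse probabilities) by
# Monte Carlo sampling, given a maximum allowed standard error.

def estimate_parses(sample_parse, sigma_max, rng=None):
    rng = rng or random.Random(0)
    # N := the smallest integer larger than 1/(4 * sigma_max^2)
    n = int(1 / (4 * sigma_max ** 2)) + 1
    counts = Counter(sample_parse(rng) for _ in range(n))
    # conditional probability of parse i given the sentence: #(i) / N
    return {parse: c / n for parse, c in counts.items()}, n

# A toy forest whose random derivations yield parse T1 twice as
# often as T2; sigma_max = 0.05 requires roughly 100 samples.
probs, n = estimate_parses(lambda r: r.choice(['T1', 'T1', 'T2']),
                           sigma_max=0.05)
```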
Complexity of Algorithm 3



Assumes a value for the maximum allowed
standard error
Samples a number of derivations that is
guaranteed to achieve that error
The number of samples needed is quadratic
in the inverse of the chosen error (N ∝ 1/σ²)
Optimizations




Sima’an : MPD in linear time in STSG
size
Bod : MPP on small random corpus of
sub trees
Sekine and Grishman : use only sub
trees rooted with S or NP
Goodman : a different polynomial-time algorithm
Experimental Properties of DOP1

Experiments on the ATIS corpus





MPP vs. MPD
Impact of fragment size
Impact of fragment lexicalization
Impact of fragment frequency
Experiments on SRI-ATIS and OVIS

Impact of sub tree depth
Experiments on ATIS corpus

ATIS = Air Travel Information System

750 annotated sentence analyses

Annotated as part of the Penn Treebank

Purpose: compare the accuracy obtained
with undiluted DOP1 against that obtained
with restricted STSGs
Experiments on ATIS corpus

Divide into training and test sets






90% = 675 in training set
10% = 75 in test set
Convert training set into fragments and
enrich with probabilities
Test set sentences parsed with sub trees
from the training set
MPP was estimated from 100 sampled
derivations
Parse accuracy = % of MPPs that are identical
to the test-set parses
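The accuracy metric can be written as a one-line sketch (the function and variable names are ours):

```python
# Sketch of the parse-accuracy metric: the fraction of most probable
# parses (MPPs) that are identical to the corresponding test-set parses.

def parse_accuracy(mpps, gold_parses):
    """mpps and gold_parses are parallel lists of parse trees."""
    return sum(m == g for m, g in zip(mpps, gold_parses)) / len(gold_parses)

# Two of three predicted parses match the gold parses exactly.
acc = parse_accuracy(['t1', 't2', 't3'], ['t1', 't2', 'tx'])  # 2/3
```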
Results

On 10 random training / test splits
of ATIS:

Average parse accuracy = 84.2%

Standard deviation = 2.9 %
Impact of overlapping fragments
MPP vs. MPD


Can the MPD achieve parse accuracies similar to
the MPP?
Can the MPD do better than the MPP?





Overlapping fragments
Accuracy obtained by the MPD on the test set: 69%
Compared with the accuracy achieved by the MPP on
the test set: 69% vs. 85%
Conclusion:
overlapping fragments play an important role
in predicting the appropriate analysis of a
sentence
The impact of fragment size


Large fragments capture more
lexical/syntactic dependencies than small
ones.
The experiment:

Use DOP1 with restricted maximum depth

Max depth 1 -> DOP1 = SCFG

Compute the accuracies both for MPD and
MPP for each max depth
Impact of fragment size
Impact of fragment lexicalization



Lexicalized fragment
More words -> more lexical
dependencies
Experiment:

Different version of DOP1

Restrict max number of words per fragment

Check accuracy for MPP and MPD
Impact of fragment lexicalization
Impact of fragment frequency



Frequent fragments contribute more
large fragments are less frequent than
small ones but might contribute more
Experiment:



Restrict frequency to min number of
occurrences
No other restrictions
Check accuracy for MPP
Impact of fragment frequency
Experiments on SRI-ATIS and OVIS

Employ MPD because the corpus is bigger

Tests performed on DOP1 and SDOP

Use set of heuristic criteria for selecting the
fragments:


Constraints on the form of sub trees
 d - upper bound on depth
 n – number of substitution sites
 l – number of terminals
 L – number of consecutive terminals
Apply constraints on all sub trees besides those
with depth 1
Experiments on SRI-ATIS and OVIS

d ≤ 4, n ≤ 2, l ≤ 7, L ≤ 3

DOP(i)

Evaluation metrics:




Recognized
Tree Language Coverage – TLC
Exact match
Labeled bracketing recall and precision
Experiments on SRI-ATIS



13,335 syntactically annotated utterances
Annotation scheme originating from the Core
Language Engine system
Fixed parameters except sub tree bound:




n ≤ 2, l ≤ 4, L ≤ 3
Training set – 12335 trees
Test set – 1000 trees
Experiment:

Train and test with different depth upper bounds
(takes more than 10 days for DOP(4) !!! )
Impact of sub tree depth
SRI-ATIS
Experiments on OVIS corpus

10000 syntactically and semantically
annotated trees

Both annotations treated as one

More non terminal symbols



Utterances are answers to questions in a
dialog -> short utterances (avg. length 3.43)
Sima’an’s results use sentences with at least
2 words, avg. length 4.57
n ≤ 2, l ≤ 7, L ≤ 3
Experiments on OVIS corpus

Experiment:

Check different sub tree depth

1,3,4,5

Test set with 1000 trees

Train set with 9000 trees
Impact of sub tree depth - OVIS
Summary of results

ATIS:





Accuracy of parsing is 85%
Overlapping fragments have impact on
accuracy
Accuracy increases as fragment depth
increases both for MPP and MPD
Optimal lexical maximum for ATIS is 8
Accuracy decreases if lower bound of
fragment frequency increases (for MPP)
Summary of results

SRI-ATIS:



Availability of more data is more crucial
to the accuracy of the MPD.
Depth has an impact
Accuracy improves when using memory based
parsing (DOP(2)) rather than an SCFG (DOP(1))
Summary of results

OVIS:


Recognition power isn’t affected by depth
No big difference in exact-match means and
standard deviations between DOP1(1) and
DOP1(4)
DOP: probabilistic recursive MBL



Relationship between present DOP
framework and Memory Based Learning
framework
DOP extends MBL to deal with
disambiguation
MBL vs. DOP

Flat or intermediate descriptions vs.
hierarchical ones
Case Based Reasoning - CBR

Case Based learning

Lazy learning: does not generalize at training time

Lazy generalization

Classify by means of a similarity function

We refer to this paradigm as MBL

CBR vs. other variants of MBL

Task concept

Similarity function

Learning task
The DOP framework and CBR

CBR method



An extraction function – retrieve units

Combination operations – reuse and revision
Missing in DOP:


A formalism for representing utterance analyses - case description language
Similarity function
Extend CBR:

A probability model
The DOP model defines a CBR system for natural language analysis
DOP1 and CBR methods









DOP1 as extension to CBR system
<string,tree> = classified instance
Retrieve sub trees and construct tree
Sentence = instance
Tree = class
Set of sentences = instance space
Set of trees – class space
Frontier , SSF , <str , st >
Infinite runtime case-base containing
instance-class-weight triples:
<SSF,subtree,probability>
DOP1 and CBR methods

Task and similarity function:


Task = disambiguation
Similarity function:
Parsing -> a recursive string-matching procedure
Ambiguity -> computing probabilities and selecting
the highest


Conclusion:

DOP1 is a lazy probabilistic recursive CBR
classifier
DOP vs. other MBL approaches in NLP


K-NN vs. DOP
Memory Based Sequence Learning





DOP – a stochastic model for computing
probabilities
MBSL – ad hoc heuristics for computing scores
DOP – a globally based ranking strategy over
alternative analyses
MBSL – a locally based one
Different generalization power
Conclusions






Memory Based aspects of DOP model
Disambiguation
Probabilities to account for frequencies
DOP as probabilistic recursive Memory
Based model
DOP1 - properties, computational
aspects and experiments.
DOP and MBL - differences