String Kernels for HPSG Parse Selection

The Leaf Projection Path View of
Parse Trees: Exploring String
Kernels for HPSG Parse Selection
Kristina Toutanova, Penka Markova,
Christopher Manning
Computer Science Department
Stanford University
Motivation: the task
 Input: a sentence, e.g. “I would like to meet with you again on Monday”
 Task: classify it to one of the possible parses
 We focus on discriminating among the parses
Motivation: traditional
representation of parse trees

 Features are pieces of local rule productions, with grand-parenting.
[figure: local rule productions around “to meet … on …”]
 When using plain context-free rules, most features make no reference to the input string – naive for a discriminative model!
 Lexicalization with the head word introduces more connection to the input.
Motivation: traditional
representation of parse trees

All-subtrees representation: features are (a restricted kind of) subtrees of the original tree.
 One must choose features or discount larger trees.
General idea: representation
 Trees are lists of leaf projection paths:
 The non-head path is included in addition to the head path
 Each node is lexicalized with all words dominated by it
 Trees must be binarized
 This provides a broader view of tree contexts
 It increases the connection to the input string (words)
 It captures non-head dependencies, as in “more careful than his sister” (Bod 98)
General idea: tree kernels

 For many ML algorithms, only a kernel (a similarity measure) between trees is necessary.
 We measure the similarity between two trees by the similarity between the projection paths of the words/POS tags the trees share.
General idea: tree kernels from
string kernels

 Measures of similarity between sequences (strings) have been developed for many domains.
[figure: SIM compares the projection path of “meet” (S, VP, VP-NF, …) in one tree with its path in another tree]
 We use string kernels between projection paths and combine them into a tree kernel via a convolution.
 This gives rise to interesting features and to more global modeling of the syntactic environment of words.
Overview



 HPSG syntactic analyses representation
 Illustration of the leaf projection paths representation
 Comparison to the traditional rule representation (experimental results)
 Tree kernels from string kernels on projection paths (experimental results)
HPSG tree representation:
derivation trees
 HPSG – Head-Driven Phrase Structure Grammar: a lexicalized, unification-based grammar. We use the ERG grammar of English.
 Node labels are rule names such as head-complement and head-adjunct.
 The inventory of rules is larger than in traditional HPSG grammars.
 Full HPSG signs can be recovered from the derivation trees using the grammar.
[derivation tree figure for “let us plan on that”, with nodes IMPER, HCOMP, LET_V1 (let), US (us), PLAN_ON_V2 (plan), ON (on), THAT_DEIX (that)]
 We use annotated derivation trees as the main representation for disambiguation.
HPSG tree representation:
annotation of nodes
 Nodes are annotated with the value of the synsem.local.cat.head feature; its values form a small set of part-of-speech tags.
[figure: the derivation tree with annotated nodes, e.g. IMPER : verb, HCOMP : verb, HCOMP : prep*]
HPSG tree representation:
syntactic word classes
 Our representation heavily uses word classes to back off from words (word types / lexical item ids).
[figure: word classes of the example words – let → v_sorb, us → n_pers_pro, plan → v_empty_prep_intrans, on → p_reg, that → n_deictic_pro]
 The word classes are around 500 types in the HPSG type hierarchy.
 They encode detailed syntactic information, including e.g. subcategorization.
Leaf projection paths
representation
[figure: projection paths in the example tree – e.g. for “let”, head path START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END and non-head path START END; for “plan”, a shorter head path up to its maximal projection plus a non-head path continuing to the root]
•The tree is represented as a list of paths from the words to the top.
•The paths are keyed by words and corresponding word classes.
•The head and non-head paths are treated separately.
•Local rules can be recovered by annotating nodes with sister and parent categories.
•We now extract features from this representation for discriminative models.
Overview



 HPSG syntactic analyses representation
 Illustration of the leaf projection paths representation
 Comparison to the traditional rule representation (experimental results)
 Tree kernels from string kernels on projection paths (experimental results)
Machine learning task setup
 Given m training sentences with candidate parses: (s_i, φ(t_{i,1}), …, φ(t_{i,p_i}))
 Sentence s_i has p_i possible analyses, and t_{i,1} is the correct analysis.
 Learn a parameter vector w and choose, for a test sentence, the tree t with the maximum score w · φ(t).
 Linear models, e.g. (Collins 00)
Choosing the parameter vector
 min_w ½ w · w + C Σ_{i,j} ξ_{i,j}
subject to
 ∀i, ∀j ≠ 1: w · (φ(t_{i,1}) − φ(t_{i,j})) ≥ 1 − ξ_{i,j}
 ∀i, ∀j ≠ 1: ξ_{i,j} ≥ 0
 This follows previous formulations (Collins 01; Shen and Joshi 03).
 We solve this problem using SVMLight for ranking.
 For all models we extract all features from the kernel’s feature map and solve the problem with a linear kernel.
The leaf projection paths view
versus the context free rule view

Goals:
 Compare context-free rule models to projection path models
 Evaluate the usefulness of non-head paths
Models
 Projection paths:
 Bi-gram model on projection paths (2PP)
 Bi-gram model on head projection paths only (2HeadPP)
 Context-free rules:
 Joint rule model (J-Rule)
 Independent rule model (I-Rule)
The leaf projection paths view
versus the context free rule view


 2PP has as features bi-grams from the projection paths.
 Features of 2PP including the node HCOMP, for the example tree:

plan (head path):
[v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head]
[v_empty_prep_intrans, HCOMP, END, head]

on (non-head path):
[p_reg, START, HCOMP, non-head]
[p_reg, HCOMP, HCOMP, non-head]

that (non-head path):
[n_deictic_pro, HCOMP, HCOMP, non-head]
[n_deictic_pro, HCOMP, HCOMP, non-head] (this bi-gram occurs twice on the path)
The leaf projection paths view
versus the context free rule view


 I-Rule has as features edges of the tree, annotated with the word class of the child and head vs. non-head information.
 Features of I-Rule including the node HCOMP, for the example tree:

[v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head]
[p_reg, HCOMP, HCOMP, non-head]
[v_empty_prep_intrans, HCOMP, HCOMP, non-head]
Comparison results

 Redwoods corpus: 3829 ambiguous sentences; average sentence length 7.8 words, average ambiguity 10.8 parses.
 10-fold cross-validation; we report exact match accuracy.

Model      Accuracy
2HeadPP    80.14
J-Rule     80.99
I-Rule     81.07
2PP        82.70

 Non-head paths are useful (13% relative error reduction from the head-only model).
 The bi-gram model on projection paths performs better than a very similar local rule based model.
Overview



 HPSG syntactic analyses representation
 Illustration of the leaf projection paths representation
 Comparison to the traditional rule representation (experimental results)
 Tree kernels from string kernels on projection paths (experimental results)
String kernels on projection
paths




 We looked at a bi-gram model on projection paths (2PP).
 This is a special case of a string kernel (an n-gram kernel).
 We could use more general string kernels on projection paths – existing ones that handle non-contiguous substrings or more complex matching of nodes.
 It is straightforward to combine them into tree kernels.
Formal representation of parse
trees
 A tree t is represented as a list of keyed strings: t → [(key_1, x_1), …, (key_m, x_m)]
 For “let” in the example tree:
key_1 = let (head),        x_1 = “START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END”
key_2 = v_sorb (head),     x_2 = x_1
key_3 = let (non-head),    x_3 = “START END”
key_4 = v_sorb (non-head), x_4 = x_3
Tree kernels using string
kernels on projection paths
t → [(key_1, x_1), …, (key_m, x_m)]
t′ → [(key′_1, x′_1), …, (key′_n, x′_n)]

K_P((key, x), (key′, x′)) = K(x, x′), if key = key′
K_P((key, x), (key′, x′)) = 0, otherwise

K_T(t, t′) = Σ_{i=1..m} Σ_{j=1..n} K_P((key_i, x_i), (key′_j, x′_j))
String kernels overview

 Define string kernels by their feature map Φ from strings to vectors indexed by feature indices σ.
 Example: 1-gram kernel, for the path “END IMPER HCOMP HCOMP LET_V1 START”:
Φ_END = 1, Φ_IMPER = 1, Φ_HCOMP = 2, Φ_LET_V1 = 1, Φ_START = 1
Repetition kernel

 General idea: improve on the 1-gram kernel by better handling repeated symbols.
[figure: “He eats chocolate from Belgium with fingers.” – the head path of “eats” under high attachment is (NP, PP, PP, NP)]
 Rather than the feature for PP having twice as much weight, there should be a separate feature indicating that there are two PPs.
 The feature space is indexed by strings σ = a…a, a ∈ Σ (repetitions of a single symbol).
 Two discount factors: λ_1 for gaps and λ_2 for letters.
Φ_PP(⟨NP, PP, PP, NP⟩) = 1, Φ_{PP,PP}(⟨NP, PP, PP, NP⟩) = .5, if λ_1 = λ_2 = .5
The Repetition kernel versus
1-gram and 2-gram
Kernel      Features   Accuracy
1-gram      44,278     82.21
Repetition  52,994     83.59
2-gram      104,331    84.15

 The Repetition kernel achieves a 7.8% error reduction from the 1-gram kernel.
Other string kernels



 So far: 1-gram, 2-gram, Repetition.
 Next: allow general discontinuous n-grams → restricted subsequence kernel.
 Also: allow partial matching → wildcard kernel, which allows a wild-card character in the n-gram features; the wildcard matches any character.
 (Lodhi et al. 02; Leslie and Kuang 03)
Restricted subsequence kernel

 Parameters: k – maximum size of the feature n-gram; g – maximum span in the string; λ_1 – gap penalty; λ_2 – letter penalty.
 Example, for the path “END IMPER HCOMP HCOMP LET_V1 START”, when k = 2, g = 5, λ_1 = .5, λ_2 = 1:
Φ_END = 1, Φ_IMPER = 1, Φ_HCOMP = 2, Φ_LET_V1 = 1, Φ_START = 1
Φ_{END,IMPER} = 1, Φ_{IMPER,HCOMP} = 1 + .5 = 1.5, Φ_{IMPER,START} = .125, …
Φ_{END,START} = 0 (its span exceeds g)
Varying the string kernels on
word class keyed paths
Kernel (features)          Accuracy
1-gram (13K)               81.43
2-gram (37K)               82.70
subseq (2,3,.50,2) (81K)   83.22
subseq (2,3,.25,2) (81K)   83.48
subseq (2,4,.50,2) (102K)  83.29
subseq (3,5,.25,2) (416K)  83.06

 Increasing the amount of discontinuity or adding larger n-grams did not help.
Adding word keyed paths
 We fixed the kernel for word-keyed paths to 2-gram + Repetition.

Kernel              word classes   word classes + words
subseq (2,3,.5,2)   83.22          84.96
subseq (2,3,.25,2)  83.48          84.75
subseq (2,4,.5,2)   83.29          84.4

 The best previous result from a single classifier is 82.7 (mostly local rule based). Our relative error reduction is 13%.
Other models and model
combination


 Many features are available in the HPSG signs.
 A single model is likely to over-fit when given too many features.
 To better use the additional information, we train several classifiers and combine them by voting.

Best single model: 84.96    Model combination: 85.4

 The best previous result from voting classifiers is 84.23% (Osborne & Baldridge 04).
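The combination step can be sketched as an unweighted majority vote over each classifier's chosen parse (a simplification; the actual combination scheme may weight the voters):

```python
from collections import Counter

def combine_by_voting(choices):
    """Majority vote over the parses chosen by the individual classifiers;
    ties go to the parse voted for by the earliest-listed classifier."""
    return Counter(choices).most_common(1)[0][0]

winner = combine_by_voting(["t17", "t3", "t17"])
```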
Conclusions and future work
Summary
 We presented a new representation of parse trees leading to a tree kernel.
 It allows the modeling of more global tree contexts as well as greater lexicalization.
 We demonstrated gains from applying existing string kernels on projection paths, and from new kernels useful for the domain (the Repetition kernel).
 The major gains were due to the representation.
Future work
 Other sequence kernels better suited for the task.
 Feature selection: which words / word classes deserve better modeling of their leaf paths.
 Other corpora.
