The Leaf Projection Path View of Parse Trees: Exploring String Kernels for HPSG Parse Selection
Kristina Toutanova, Penka Markova, Christopher Manning
Computer Science Department, Stanford University

Motivation: the task
"I would like to meet with you again on Monday"
• Input: a sentence
• Classify it into one of the possible parses
• Focus: discriminating among parses

Motivation: traditional representation of parse trees
• Features are pieces of local rule productions, with grand-parenting
[Figure: local rule fragments around "to meet" and "meet on"]
• When using plain context-free rules, most features make no reference to the input string – naive for a discriminative model!
• Lexicalization with the head word introduces more connection to the input

Motivation: traditional representation of parse trees
• All-subtrees representation: features are (a restricted kind of) subtrees of the original tree
• Must choose features or discount larger trees

General idea: representation
• Trees are lists of leaf projection paths
• The non-head path is included in addition to the head path
• Each node is lexicalized with all words dominated by it
• Trees must be binarized
• Provides a broader view of tree contexts
• Increases the connection to the input string (words)
• Captures examples of non-head dependencies, as in "more careful than his sister" (Bod 98)

General idea: tree kernels
• Often only a kernel (a similarity measure) between trees is necessary for ML algorithms
• Measure the similarity between trees by the similarity between projection paths of common words/POS tags in the trees

General idea: tree kernels from string kernels
• Measures of similarity between sequences (strings) have been developed for many domains
[Figure: similarity (SIM) between two projection paths of "meet": S VP VP-NF VP VP-NF VP-NF VP-NF VP-NF vs. S VP VP-NF VP-NF VP-NF VP VP-NF VP-NF]
• Use string kernels between projection paths and combine them into a tree kernel via a convolution
• This gives rise to interesting features and more global modeling of the syntactic environment of words

Overview
• HPSG syntactic analyses: representation
• Illustration of the leaf projection paths representation
• Comparison to traditional rule representation: experimental results
• Tree kernels from string kernels on projection paths: experimental results

HPSG tree representation: derivation trees
• HPSG (Head-Driven Phrase Structure Grammar): a lexicalized, unification-based grammar; we use the ERG grammar of English
• Node labels are rule names such as head-complement and head-adjunct
• The inventory of rules is larger than in traditional HPSG grammars
• Full HPSG signs can be recovered from the derivation trees using the grammar
[Figure: derivation tree for "let us plan on that" with nodes IMPER, HCOMP, LET_V1 (let), US (us), PLAN_ON_V2 (plan), ON (on), THAT_DEIX (that)]
• We use annotated derivation trees as the main representation for disambiguation

HPSG tree representation: annotation of nodes
• Annotate each node with the value of synsem.local.cat.head
• Its values are a small set of part-of-speech tags
[Figure: the same tree with annotations IMPER : verb, HCOMP : verb (three nodes), HCOMP : prep*]

HPSG tree representation: syntactic word classes
• Our representation heavily uses word classes (word types / lexical item ids) to back off from words
[Figure: let → v_sorb, us → n_pers_pro, plan → v_empty_prep_intrans, on → p_reg, that → n_deictic_pro]
• There are around 500 word class types in the HPSG type hierarchy
• They encode detailed syntactic information, including e.g. subcategorization
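The annotated derivation trees described above can be pictured as a small tree data structure. Below is a minimal sketch, assuming hypothetical class and field names (DerivNode, head_pos, word_class, head_child); it is not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of holding an annotated derivation
# tree in memory; all class and field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DerivNode:
    label: str                        # rule name (e.g. "HCOMP") or lexical id (e.g. "PLAN_ON_V2")
    head_pos: Optional[str] = None    # value of synsem.local.cat.head, e.g. "verb" or "prep*"
    word: Optional[str] = None        # surface word, for leaf nodes only
    word_class: Optional[str] = None  # lexical type, e.g. "v_empty_prep_intrans"
    children: List["DerivNode"] = field(default_factory=list)
    head_child: int = 0               # index of the head daughter among the children

# The "plan on that" fragment of the derivation tree for "let us plan on that":
plan = DerivNode("PLAN_ON_V2", "verb", word="plan", word_class="v_empty_prep_intrans")
on = DerivNode("ON", "prep*", word="on", word_class="p_reg")
that = DerivNode("THAT_DEIX", word="that", word_class="n_deictic_pro")
on_that = DerivNode("HCOMP", "prep*", children=[on, that], head_child=0)
plan_on_that = DerivNode("HCOMP", "verb", children=[plan, on_that], head_child=0)
```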
Leaf projection paths representation
[Figure: the annotated tree with the projection path of "let" highlighted; its head path is "START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END" and its non-head path is empty ("START END")]
• The tree is represented as a list of paths from the words to the top.
• The paths are keyed by words and the corresponding word classes.
• The head and non-head paths are treated separately.

Leaf projection paths representation
[Figure: the same tree with the head and non-head projection paths of "plan" (word class v_empty_prep_intrans) highlighted]

Leaf projection paths representation
[Figure: the annotated tree for "let us plan on that"]
• Local rules can still be recovered by annotating nodes with sister and parent categories.
• We now extract features from this representation for discriminative models.

Overview
• HPSG syntactic analyses: representation
• Illustration of the leaf projection paths representation
• Comparison to traditional rule representation: experimental results
• Tree kernels from string kernels on projection paths: experimental results

Machine learning task setup
• Given $m$ training sentences $(s_i, (t_{i,1}, \ldots, t_{i,p_i}))$
• Sentence $s_i$ has $p_i$ possible analyses, and $t_{i,1}$ is the correct analysis
• Learn a parameter vector $w$ and choose, for a test sentence, the tree $t$ with the maximum score $w \cdot \Phi(t)$
• Linear models, e.g. (Collins 00)

Choosing the parameter vector
$$\min_w \; \tfrac{1}{2}\,\|w\|^2 + C \sum_{i,j} \xi_{i,j}$$
$$\forall i, \forall j > 1: \; w \cdot (\Phi(t_{i,1}) - \Phi(t_{i,j})) \ge 1 - \xi_{i,j}, \qquad \xi_{i,j} \ge 0$$
• Previous formulations: (Collins 01), (Shen and Joshi 03)
• We solve this problem using SVMLight for ranking
• For all models we extract all features from the kernel's feature map and solve the problem with a linear kernel
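The constraints above only involve difference vectors $\Phi(t_{i,1}) - \Phi(t_{i,j})$, so the ranking problem reduces to a margin problem over those differences. The sketch below is an illustrative subgradient-descent version of the stated objective, not the SVMLight ranking solver the slides use; function names and the optimizer settings are assumptions.

```python
import numpy as np

def train_ranking_svm(sentences, C=1.0, epochs=100, lr=0.01):
    """Illustrative subgradient descent on
        0.5 * ||w||^2 + C * sum_{i,j} max(0, 1 - w . (phi(t_i1) - phi(t_ij))).
    `sentences` is a list of lists of feature vectors; index 0 of each inner
    list is the correct parse."""
    dim = len(sentences[0][0])
    w = np.zeros(dim)
    # difference vectors phi(t_i,1) - phi(t_i,j) for every incorrect parse j
    diffs = [np.asarray(parses[0], float) - np.asarray(p, float)
             for parses in sentences for p in parses[1:]]
    for _ in range(epochs):
        grad = w.copy()                  # gradient of the 0.5 * ||w||^2 term
        for d in diffs:
            if w @ d < 1.0:              # margin violated: hinge term is active
                grad -= C * d
        w -= lr * grad
    return w

def select_parse(w, parses):
    """Return the index of the tree with the maximum score w . phi(t)."""
    return int(np.argmax([w @ np.asarray(p, float) for p in parses]))
```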
The leaf projection paths view versus the context free rule view
Goals:
• Compare context-free rule models to projection path models
• Evaluate the usefulness of non-head paths
Models:
• Projection paths: bi-gram model on projection paths (2PP); bi-gram model on head projection paths only (2HeadPP)
• Context-free rules: joint rule model (J-Rule); independent rule model (I-Rule)

The leaf projection paths view versus the context free rule view
• 2PP has as features bi-grams from the projection paths.
Features of 2PP that include the node HCOMP (the HCOMP : verb spanning "plan on that"):
[Figure: the annotated tree for "let us plan on that"]
• plan (head path): [v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head], [v_empty_prep_intrans, HCOMP, END, head]
• on (non-head path): [p_reg, START, HCOMP, non-head], [p_reg, HCOMP, HCOMP, non-head]
• that (non-head path): [n_deictic_pro, HCOMP, HCOMP, non-head] (this bi-gram occurs twice on that's non-head path)

The leaf projection paths view versus the context free rule view
• I-Rule has as features edges of the tree, annotated with the word class of the child and head vs. non-head information.
Features of I-Rule that include the node HCOMP:
• [v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head]
• [p_reg, HCOMP, HCOMP, non-head]
• [v_empty_prep_intrans, HCOMP, HCOMP, non-head]
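To make the 2PP features just shown concrete, here is a small self-contained sketch of bi-gram extraction from a single projection path. The helper name and the reconstructed path sequences are illustrative, not taken from the authors' code.

```python
# Sketch of 2PP feature extraction: bi-grams over one leaf projection path,
# keyed by the word class of the leaf and by head / non-head status.

def path_bigrams(word_class, path, headness):
    """`path` is the node-label sequence from the leaf upward, wrapped in
    START/END markers; returns one (word_class, lower, upper, headness)
    feature per adjacent pair of labels."""
    return [(word_class, lower, upper, headness) for lower, upper in zip(path, path[1:])]

# Head projection path of "plan" in the tree for "let us plan on that":
print(path_bigrams("v_empty_prep_intrans", ["START", "PLAN_ON_V2", "HCOMP", "END"], "head"))
# -> [('v_empty_prep_intrans', 'START', 'PLAN_ON_V2', 'head'),
#     ('v_empty_prep_intrans', 'PLAN_ON_V2', 'HCOMP', 'head'),
#     ('v_empty_prep_intrans', 'HCOMP', 'END', 'head')]

# Non-head projection path of "on" (the nodes above its maximal projection HCOMP:prep*):
print(path_bigrams("p_reg", ["START", "HCOMP", "HCOMP", "IMPER", "END"], "non-head"))
# -> includes ('p_reg', 'START', 'HCOMP', 'non-head') and ('p_reg', 'HCOMP', 'HCOMP', 'non-head')
```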
Comparison results
• Redwoods corpus: 3829 ambiguous sentences; average sentence length 7.8 words; average ambiguity 10.8 parses per sentence
• 10-fold cross-validation; we report exact match accuracy

Model     Accuracy (%)
2HeadPP   80.14
J-Rule    80.99
I-Rule    81.07
2PP       82.70

• Non-head paths are useful (13% relative error reduction over head paths only)
• The bi-gram model on projection paths performs better than a very similar local rule based model

Overview
• HPSG syntactic analyses: representation
• Illustration of the leaf projection paths representation
• Comparison to traditional rule representation: experimental results
• Tree kernels from string kernels on projection paths: experimental results

String kernels on projection paths
• We looked at a bi-gram model on projection paths (2PP); this is a special case of a string kernel (the n-gram kernel)
• We could use more general string kernels on projection paths, including existing ones that handle non-contiguous substrings or more complex matching of nodes
• It is straightforward to combine them into tree kernels

Formal representation of parse trees
• A tree is a list of keyed projection paths: $t = [(key_1, x_1), \ldots, (key_m, x_m)]$
[Figure: the annotated tree for "let us plan on that"]
• Example: $key_1$ = let (head), $x_1$ = "START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END"
• $key_2$ = v_sorb (head), $x_2 = x_1$
• $key_3$ = let (non-head), $x_3$ = "START END"
• $key_4$ = v_sorb (non-head), $x_4 = x_3$

Tree kernels using string kernels on projection paths
• $t = [(key_1, x_1), \ldots, (key_m, x_m)]$, $t' = [(key'_1, x'_1), \ldots, (key'_n, x'_n)]$
• $K_P((key, x), (key', x')) = K(x, x')$ if $key = key'$, and $0$ otherwise
• $K_T(t, t') = \sum_{i=1}^{m} \sum_{j=1}^{n} K_P((key_i, x_i), (key'_j, x'_j))$

String kernels overview
• Define string kernels by their feature map from strings to vectors indexed by feature indices
• Example: the 1-gram kernel
• $\Phi(\text{END IMPER HCOMP HCOMP LET\_V1 START}) = \{\text{END}: 1, \text{IMPER}: 1, \text{HCOMP}: 2, \text{LET\_V1}: 1, \text{START}: 1\}$

Repetition kernel
• General idea: improve on the 1-gram kernel by better handling repeated symbols
• Example: "He eats chocolate from Belgium with fingers." The head path of "eats" under high attachment is (NP, PP, PP, NP)
• Rather than the feature for PP having twice as much weight, there should be a separate feature indicating that there are two PPs
• The feature space is indexed by strings $a \ldots a$ of a repeated symbol $a$, with two discount factors: $\lambda_1$ for gaps and $\lambda_2$ for letters
• $\Phi_{PP}(NP, PP, PP, NP) = 1$ and $\Phi_{PP,PP}(NP, PP, PP, NP) = 0.5$ if $\lambda_1 = \lambda_2 = 0.5$

The Repetition kernel versus 1-gram and 2-gram
Kernel       Features   Accuracy (%)
1-gram       44,278     82.21
Repetition   52,994     83.59
2-gram       104,331    84.15
• The Repetition kernel achieves a 7.8% error reduction over the 1-gram kernel

Other string kernels
• So far: 1-gram, 2-gram, Repetition
• Next: allow general discontinuous n-grams (restricted subsequence kernel)
• Also: allow partial matching (wildcard kernel, which allows a wild-card character in the n-gram features; the wildcard matches any symbol)
• (Lodhi et al. 02; Leslie and Kuang 03)

Restricted subsequence kernel
• Parameters: k (maximum size of the feature n-gram), g (maximum span in the string), λ1 (gap penalty), and λ2 (letter penalty)
• Example with k = 2, g = 5, λ1 = .5, λ2 = 1:
$\Phi(\text{END IMPER HCOMP HCOMP LET\_V1 START}) = \{\text{END}: 1, \text{IMPER}: 1, \text{HCOMP}: 2, \text{LET\_V1}: 1, \text{START}: 1, (\text{END}, \text{IMPER}): 1, (\text{IMPER}, \text{HCOMP}): 1 + 0.5 = 1.5, (\text{IMPER}, \text{START}): 0.125, \ldots, (\text{END}, \text{START}): 0\}$
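Below is a brute-force sketch of the restricted subsequence feature map just described. Real implementations use dynamic programming (Lodhi et al. 02); this enumeration of index tuples is only meant to reproduce the small example values above, and the function name is an illustrative assumption.

```python
# Brute-force restricted subsequence feature map: k = max n-gram size,
# g = max span, lam1 = gap penalty, lam2 = letter penalty.
from itertools import combinations
from collections import defaultdict

def subseq_feature_map(symbols, k, g, lam1, lam2):
    phi = defaultdict(float)
    for n in range(1, k + 1):                       # feature n-gram sizes 1..k
        for idx in combinations(range(len(symbols)), n):
            span = idx[-1] - idx[0] + 1
            if span > g:                            # occurrence too spread out
                continue
            gaps = span - n                         # positions skipped inside the span
            phi[tuple(symbols[i] for i in idx)] += (lam1 ** gaps) * (lam2 ** n)
    return dict(phi)

path = ["END", "IMPER", "HCOMP", "HCOMP", "LET_V1", "START"]
phi = subseq_feature_map(path, k=2, g=5, lam1=0.5, lam2=1.0)
print(phi[("HCOMP",)])                 # 2.0   (two occurrences)
print(phi[("IMPER", "HCOMP")])         # 1.5   (one adjacent occurrence + one with a gap)
print(phi[("IMPER", "START")])         # 0.125 (three skipped positions: 0.5 ** 3)
print(phi.get(("END", "START"), 0.0))  # 0.0   (span 6 exceeds g = 5)
```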
Varying the string kernels on word class keyed paths
Kernel                   Features   Accuracy (%)
1-gram                   13K        81.43
2-gram                   37K        82.70
subseq (2, 3, .50, 2)    81K        83.22
subseq (2, 3, .25, 2)    81K        83.48
subseq (2, 4, .50, 2)    102K       83.29
subseq (3, 5, .25, 2)    416K       83.06
• Increasing the amount of discontinuity or adding larger n-grams did not help

Adding word keyed paths
• The kernel for word keyed paths is fixed to 2-gram + Repetition
Kernel (word class keyed)   Word classes   Word classes + words
subseq (2, 3, .50, 2)       83.22          84.96
subseq (2, 3, .25, 2)       83.48          84.75
subseq (2, 4, .50, 2)       83.29          84.40
• Best previous result from a single classifier: 82.7 (mostly local rule based); relative error reduction is 13%

Other models and model combination
• Many features are available in the HPSG signs; a single model is likely to over-fit when given too many features
• To better use the additional information, train several classifiers and combine them by voting (a minimal voting sketch follows the conclusions below)
• Best single model: 84.96; model combination: 85.4
• Best previous result from voting classifiers: 84.23% (Osborne and Baldridge 04)

Conclusions and future work
Summary:
• We presented a new representation of parse trees leading to a tree kernel
• It allows the modeling of more global tree contexts as well as greater lexicalization
• We demonstrated gains from applying existing string kernels on projection paths and from new kernels useful for the domain (the Repetition kernel)
• The major gains were due to the representation
Future work:
• Other sequence kernels better suited for the task
• Feature selection: which words / word classes deserve better modeling of their leaf paths
• Other corpora
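As noted in the model-combination slide above, here is a minimal sketch of combining several classifiers' per-sentence parse choices by voting. The function name and the tie-breaking rule are assumptions for illustration, not the authors' combination scheme.

```python
from collections import Counter

def vote(choices):
    """`choices` holds one predicted best-parse index per classifier for a
    single sentence; return the parse picked by the most classifiers, breaking
    ties in favour of the classifier listed first."""
    counts = Counter(choices)
    return max(counts, key=lambda parse: (counts[parse], -choices.index(parse)))

# Three hypothetical classifiers voting over the parses of one sentence:
print(vote([2, 0, 2]))  # -> 2
```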