Semantic Role Labelling Using Chunk Sequences

Ulrike Baldewein, Katrin Erk, Sebastian Padó (Saarland University, Saarbrücken)
Detlef Prescher (University of Amsterdam)

1. Representation for Classification

Usual choice: classify constituents.

Classify words?
- Intuition: one argument = one constituent
- Not available in this task
- Too fine-grained

Classify chunks?
- Data analysis: not always the right level
- 34% of arguments span more than one chunk
- 13% of arguments do not respect chunk boundaries

Chunk sequences as classification instances
- Sequences of chunks and chunk parts
- Adaptive level of structure: "potential constituents"
- Example: [NP Britain's] [NP manufacturing industry] [VP is transforming] [NP itself] [VP to boost] [NP exports]
  ARG0 = NP_NP, V = VP[VBG], ARG1 = NP, ARG2 = VP_NP

Frequency-based chunk sequence filtering (see the sketch after Section 4)
- Filter 1: use only sequence types which realise arguments in the training set
  - 1089 types, Zipf-distributed: NP (23,000), S (5,000), ..., NP_PP_NP_PP_NP_VP_PP_NP_NP (1)
- Filter 2: use only frequent sequence types (f(s) > 10)
- Examine the material between sequence and target: divider sequences
  - Also Zipf-distributed: empty divider (14,000), NP (10,000), ...
  - Similar to the "path" feature
- Filter 3: use only sequences with a frequent divider (f(d) > 10)
- Filter 4: use only sequences co-occurring frequently with some divider (f(s,d) > 5)

Results of filtering
- Leaves 87 sequence types (was 1089)
- 43,777 tokens in the development set (about 1 sequence per word)
- 8,698 of these are proper arguments (about 20%)
- Bad news: the representation loses 16% of proper arguments

2. Features

- "Shallow features": simple properties
- "Higher-level features": syntactic properties (mostly heuristic)
- "Divider features": shallow and higher-level properties of the dividers

EM-based clustering (see the sketch after Section 4)
- Measures the fit between objects y1 (predicate:argument slot) and y2 (sequence), e.g. how well does NP fit give:A1?
- y1 and y2 are independent given the cluster: p(y1, y2) = Σ_c p(c) p(y1|c) p(y2|c)
- EM derives the clusters from the training data
- Intention: generalisation within clusters
- Resulting features: e.g. "most likely argument slot for this sequence for this predicate"

3. Procedure

- Filter sequences from the training set
- Compute features for sequence tokens and their dividers (training + development + test set)
- Estimate a maximum entropy model on the training set (a two-step sketch follows Section 4)
- Classify sequences from the development / test set
- Recover semantic parses

Two-step classification procedure
- Classifier 1 (argument recognition): binary decision about argumenthood, all argument classes conflated into ARG
- Classifier 2 (argument labelling): considers only sequences assigned ARG by step 1

4. Classification Result: Sequence Chart

- Example sentence: "The man with the beard sleeps"
- Overlapping candidate sequences receive competing label distributions, e.g. A0 (70%) / A1 (20%), NOLABEL (70%) / AM-MOD (25%), A0 (60%) / NOLABEL (40%), A0 (65%) / A1 (25%)
- Need to find the optimal "semantic parse" of argument labels

Semantic parse recovery (see the sketch below)
- Find the most probable semantic parse p = (l1, l2, ...)
- Step 1: beam search with a simple probability model under an independence assumption: P_bs(l1, l2, ...) = Π_i P_c(l_i)
- Step 2: re-estimation with global considerations (e.g. avoid patterns like [A0 A0]), using counts from the training set: P(l1, l2, ...) = P_bs(l1, l2, ...) · P_tr(l1, l2, ...)
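As an illustration of the frequency-based filtering in Section 1, here is a minimal Python sketch. It assumes the sequence, divider, and co-occurrence counts have already been extracted from the training set; the function and parameter names (filter_sequences, min_seq, ...) are illustrative rather than taken from the original system, and the thresholds simply mirror the cut-offs given on the slides.

```python
from collections import Counter

# Hypothetical counts extracted from the training set:
#   seq_counts[s]       = how often sequence type s realises an argument
#   div_counts[d]       = how often divider type d occurs
#   pair_counts[(s, d)] = how often sequence s co-occurs with divider d
def filter_sequences(seq_counts: Counter,
                     div_counts: Counter,
                     pair_counts: Counter,
                     min_seq: int = 10,
                     min_div: int = 10,
                     min_pair: int = 5) -> set:
    """Return the set of sequence types kept by filters 1-4."""
    # Filters 1+2: the sequence type must realise arguments and be frequent.
    frequent_seqs = {s for s, f in seq_counts.items() if f > min_seq}
    # Filter 3: only dividers that are themselves frequent.
    frequent_divs = {d for d, f in div_counts.items() if f > min_div}
    # Filter 4: keep a sequence only if it co-occurs frequently enough
    # with at least one frequent divider.
    kept = set()
    for (s, d), f in pair_counts.items():
        if s in frequent_seqs and d in frequent_divs and f > min_pair:
            kept.add(s)
    return kept
```

With the counts reported on the slides, this kind of thresholding reduces the 1089 sequence types to 87, at the cost of losing 16% of the proper arguments.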
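The EM-based clustering of Section 2 fits the mixture p(y1, y2) = Σ_c p(c) p(y1|c) p(y2|c). The following is a generic EM sketch for this kind of two-view mixture, assuming y1 (predicate:argument slot) and y2 (sequence type) have been mapped to integer ids in [0, n); it is not the original implementation, and the cluster count, iteration count, and helper names are illustrative.

```python
import numpy as np

def em_cluster(pairs, n_y1, n_y2, n_clusters=16, iters=50, seed=0):
    """EM for p(y1, y2) = sum_c p(c) p(y1|c) p(y2|c).

    `pairs` is a list of (y1, y2) integer-id pairs observed in training.
    """
    rng = np.random.default_rng(seed)
    p_c = np.full(n_clusters, 1.0 / n_clusters)
    p_y1 = rng.dirichlet(np.ones(n_y1), size=n_clusters)   # p(y1|c)
    p_y2 = rng.dirichlet(np.ones(n_y2), size=n_clusters)   # p(y2|c)
    y1_idx = np.array([a for a, _ in pairs])
    y2_idx = np.array([b for _, b in pairs])

    for _ in range(iters):
        # E-step: posterior over clusters for each observed pair.
        post = p_c[None, :] * p_y1[:, y1_idx].T * p_y2[:, y2_idx].T
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the three distributions from soft counts.
        p_c = post.sum(axis=0) / len(pairs)
        for c in range(n_clusters):
            p_y1[c] = np.bincount(y1_idx, weights=post[:, c], minlength=n_y1)
            p_y2[c] = np.bincount(y2_idx, weights=post[:, c], minlength=n_y2)
        p_y1 /= p_y1.sum(axis=1, keepdims=True)
        p_y2 /= p_y2.sum(axis=1, keepdims=True)
    return p_c, p_y1, p_y2

def fit(y1, y2, p_c, p_y1, p_y2):
    """How well does sequence y2 fit slot y1, e.g. NP vs. give:A1?"""
    return float(np.sum(p_c * p_y1[:, y1] * p_y2[:, y2]))
```

A feature such as "most likely argument slot for this sequence for this predicate" can then be read off by maximising fit(y1, y2) over the candidate slots y1 for a fixed sequence y2.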
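For the two-step procedure of Section 3, here is a compact sketch with scikit-learn's LogisticRegression standing in for the maximum entropy model (the original system used a dedicated MaxEnt learner). The feature matrices, the NOLABEL convention, and all function names are assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression

def train_two_step(X_train, labels_train):
    """Train the two classifiers on sequence-token feature vectors."""
    # Step 1: argument recognition -- all argument classes conflated to ARG.
    is_arg = [0 if lab == "NOLABEL" else 1 for lab in labels_train]
    recogniser = LogisticRegression(max_iter=1000).fit(X_train, is_arg)
    # Step 2: argument labelling, trained only on the true arguments.
    arg_rows = [i for i, a in enumerate(is_arg) if a == 1]
    labeller = LogisticRegression(max_iter=1000).fit(
        X_train[arg_rows], [labels_train[i] for i in arg_rows])
    return recogniser, labeller

def classify(recogniser, labeller, X):
    """Only sequences recognised as ARG in step 1 are labelled in step 2."""
    arg_mask = recogniser.predict(X) == 1
    labels = ["NOLABEL"] * X.shape[0]
    for i, lab in zip(arg_mask.nonzero()[0], labeller.predict(X[arg_mask])):
        labels[i] = lab
    return labels
```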
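Finally, a rough sketch of the semantic parse recovery in Section 4: a beam search under the independence assumption P_bs(l1, l2, ...) = Π_i P_c(l_i), with an optional whole-pattern score playing the role of the re-estimation step (e.g. penalising duplicate core arguments such as [A0 A0]). It deliberately ignores span-overlap constraints and other details of the original procedure; all names are illustrative.

```python
from math import log

def recover_parse(chart, beam_width=10, pattern_logprob=None):
    """Pick a label for every candidate sequence in the chart.

    `chart` is a list of candidate sequences, each represented as a dict
    mapping a label (including NOLABEL) to its classifier probability.
    `pattern_logprob`, if given, maps a full label list to a training-set
    based log score used to rescore the surviving beams.
    """
    beams = [([], 0.0)]                          # (labels so far, log-prob)
    for candidate in chart:
        extended = []
        for labels, score in beams:
            for label, prob in candidate.items():
                if prob > 0.0:
                    extended.append((labels + [label], score + log(prob)))
        extended.sort(key=lambda b: b[1], reverse=True)
        beams = extended[:beam_width]
    if pattern_logprob is not None:              # global re-estimation step
        beams = [(ls, s + pattern_logprob(ls)) for ls, s in beams]
        beams.sort(key=lambda b: b[1], reverse=True)
    return beams[0]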
5. Results (Development Set)

                     Precision   Recall   F-score
Upper bound              100      83.3     90.9
Step 1 (ARG only)        77.3     60.1     66.1
Final                    64.9     41.6     50.7

- The upper bound is given by the chunk sequences lost in filtering
- But filtering is necessary: with only sequence-frequency filtering (filters 1 and 2):
  - Good news: only 9% of arguments are lost (vs. 16% now)
  - Bad news: 127,000 sequences (vs. 44,000 now)
  - Argument recognition becomes much more difficult: the F-score with the same features is only 0.38

The two steps have different profiles
- Argument identification: shallow and divider features important
- Argument labelling: shallow and higher-level features important

Clustering features unsuccessful
- They increase precision at the cost of recall
- The feature "most probable label for sequence" was successful in a SENSEVAL-3 model
- The largest problem is recall

What I talked about... and more
- Chunk sequences for SRL
- Adaptive representation with "higher-level" features
- Recall problem: filtering loses proper arguments
- EM-based features promising, but currently not helpful

Since submission
- MaxEnt vs. a memory-based learner: virtually the same result

Left to do
- Detailed error analysis
- More intelligent filtering
- Better features