Constructing a Valence Lexicon
for a Treebank of German
Erhard W. Hinrichs, Kathrin Beck
{eh, kbeck}@sfs.uni-tuebingen.de
University of Tübingen
Seminar für Sprachwissenschaft
Germany
12/13/10 Erhard Hinrichs, Kathrin Beck CLARA Course on Treebank Annotation

The TüBa-D/Z Treebank
German newspaper corpus:
- data source: die tageszeitung (taz)
- ca. 36 000 sentences
- semi-automatic annotation
Annotation scheme:
- context-free backbone
- PS grammar + predicate argument structure
- topological fields
Example sentence (translation): ‘But there would be intelligent solutions which do not cost money.’

Other Valence Lexica
- PropBank (Palmer et al. 2005): additional layer of semantic roles in the Penn Treebank
- FrameNet (Baker et al. 1998): based on frame semantics
- Prague valency lexicon PDT-VALLEX (Hajič et al. 2003): created on the basis of the Prague Dependency Treebank

The TüBa-D/Z Valence Lexicon
The valence lexicon:
- constructed in lockstep with the development of the TüBa-D/Z
- the numbers of verb lemmas and valence frames correspond to the number of sentences currently in the TüBa-D/Z
- 4896 distinct verb lemmas
- 8013 valence frames (total)
- 717 distinct valence frames
Example entry of a polysemous verb:
einsetzen:
=======
ON [einsetzen] OA
  Bsp: Wir haben Computer eingesetzt
  ‘We used the computer.’
  (R4-5603)
ON [einsetzen] OA FOPP (für, gegen)
  Bsp: Wir setzen uns für eine Feuerpause ein
  ‘We supported a cease fire.’
  (R4-3126)
  Bsp: Gegen den Widerstand setzt der Senat Polizeiknüppel ein
  ‘Against the resistance the senate used billy clubs.’
  (R4-27058)
ON [einsetzen]
  Bsp: Schneefall hatte eingesetzt
  ‘Snowfall had set in.’
  (R4-2903)
ON [einsetzen] OA PRED
  Bsp: Gourmetköche setzen sie als Garnitur ein
  ‘Gourmet cooks used it as garnish.’
  (R4-17034)
ON [einsetzen] OD OA
  Bsp: Man setzt den Pflanzen neue Gene ein
  ‘One inserts new genes into the plants.’
  (N5-37382)
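The frame lines of the entry above follow a regular pattern: grammatical function labels around the bracketed lemma, optionally followed by a parenthesised list. As a minimal sketch of how such a line could be read into a data structure, the Python below parses one frame line. The names Frame and parse_frame are hypothetical and not part of the TüBa-D/Z tooling, and the sketch assumes that the parenthesised list holds prepositions, as in the FOPP example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    """One valence frame of a verb lemma (hypothetical structure)."""
    lemma: str
    labels: List[str]
    prepositions: List[str] = field(default_factory=list)


def parse_frame(line: str) -> Frame:
    """Parse a frame line such as 'ON [einsetzen] OA FOPP (für, gegen)'."""
    lemma_match = re.search(r"\[([^\]]+)\]", line)
    if lemma_match is None:
        raise ValueError(f"no bracketed lemma in: {line!r}")
    lemma = lemma_match.group(1)
    # Remove the bracketed lemma; what remains are labels and the optional list.
    rest = line[:lemma_match.start()] + " " + line[lemma_match.end():]

    prepositions: List[str] = []
    paren_match = re.search(r"\(([^)]*)\)", rest)
    if paren_match is not None:
        # Assumption: the parenthesised list contains prepositions.
        prepositions = [p.strip() for p in paren_match.group(1).split(",") if p.strip()]
        rest = rest[:paren_match.start()] + " " + rest[paren_match.end():]

    return Frame(lemma=lemma, labels=rest.split(), prepositions=prepositions)


frame = parse_frame("ON [einsetzen] OA FOPP (für, gegen)")
print(frame.lemma, frame.labels, frame.prepositions)
```

Keeping the prepositions separate from the labels makes it possible to count frames either with or without the preposition distinction.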

Grammatical Function Labels
Inventory of grammatical function labels used in the valence lexicon:
- coincides with the edge labels used in the syntactic annotation
- corresponds directly to syntax

Label   Description
ON      nominative object (incl. subject clauses)
OG      genitive object
OD      dative object
OA      accusative object
OS      sentential object
OPP     obligatory prepositional object
FOPP    facultative prepositional object
OADVP   adverbial object
OADJP   adjectival object
PRED    predicate
OV      verbal object
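Because the lexicon uses a closed inventory of labels, a frame can be checked against it mechanically. The sketch below stores the inventory from the table above as a lookup table and reports any token that is not a known label. The names LABELS and unknown_labels are illustrative assumptions, not part of the lexicon's tooling.

```python
# Label inventory copied from the table above.
LABELS = {
    "ON": "nominative object (incl. subject clauses)",
    "OG": "genitive object",
    "OD": "dative object",
    "OA": "accusative object",
    "OS": "sentential object",
    "OPP": "obligatory prepositional object",
    "FOPP": "facultative prepositional object",
    "OADVP": "adverbial object",
    "OADJP": "adjectival object",
    "PRED": "predicate",
    "OV": "verbal object",
}


def unknown_labels(frame: str) -> list:
    """Return the whitespace-separated tokens of a frame that are not in LABELS."""
    return [token for token in frame.split() if token not in LABELS]


print(unknown_labels("ON OD OA"))  # []
print(unknown_labels("ON OX"))     # ['OX']
```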

Quantitative Analysis I
Accession rates for frames, verb lemmas, and their combinations in ranges of 5000 sentences:
[Figure: x-axis: number of annotated sentences (0 to 40 000); y-axis: count (0 to 9000); series: number of frames, number of verb lemmas, combined. Percentage labels shown in the figure: 17.4%, 10.4%, 10.0%, 9.0%, 8.5%, 5.9%, 33.9%]
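An accession rate of this kind can be computed by counting, for each range of 5000 sentences, how many items occur for the first time in that range. The following is a minimal sketch under stated assumptions: the input format (sentence number, lemma, frame), the 1-based sentence numbering, the function names, and the toy data are all hypothetical and do not reproduce the authors' procedure or figures.

```python
def accession_counts(observations, range_size=5000):
    """Count first occurrences of (lemma, frame) pairs per sentence range.

    observations: iterable of (sentence_number, lemma, frame) tuples,
    with sentence numbers starting at 1 (an assumption of this sketch).
    """
    ordered = sorted(observations)
    if not ordered:
        return []
    n_ranges = (ordered[-1][0] - 1) // range_size + 1
    counts = [0] * n_ranges
    seen = set()
    for sentence_number, lemma, frame in ordered:
        if (lemma, frame) not in seen:
            seen.add((lemma, frame))
            counts[(sentence_number - 1) // range_size] += 1
    return counts


def as_percentages(counts):
    """Express each range's new items as a share of all items found."""
    total = sum(counts)
    return [100 * c / total for c in counts]


# Invented toy observations, for illustration only.
toy = [
    (12, "einsetzen", "ON OA"),
    (4800, "einsetzen", "ON"),
    (5200, "einsetzen", "ON OA"),      # already seen, not counted again
    (7300, "machen", "ON OA"),
    (12001, "einsetzen", "ON OD OA"),
]
counts = accession_counts(toy)
print(counts)                  # [2, 1, 1]
print(as_percentages(counts))  # [50.0, 25.0, 25.0]
```

The same function covers lemmas only or frames only if the key in the seen set is reduced to one of the two components.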

Quantitative Analysis II
Distribution of valence frames over sentence number range (r) for the 15 verb lemmas with the highest number of valence frames:
[Figure: y-axis: valence frames per verb lemma (0 to 16); x-axis: lemma (stehen, sein, haben, finden, tun, sprechen, sagen, geben, sehen, nehmen, lassen, halten, denken, schreiben, machen); series: r 5000, r 10 000, r 15 000, r 20 000, r 25 000, r 30 000, r 35 000, r 40 000]

Quantitative Analysis III
Number of distinct valence frames:
- 717 distinct valence frames (including prepositions)
- The frequency of occurrence for a specific valence frame ranges from 2243 (ON OA) down to
  3 (36 distinct valence frames)
  2 (67 distinct valence frames)
  1 (488 distinct valence frames)
Top 30 list of valence frames
[Figure: bar chart of frequency (axis 0 to 2500) per valence frame. Frames in descending order of frequency as read from the chart:]
 1. ON OA
 2. ON
 3. ON (PASSIV)
 4. ON OS
 5. ON OD OA
 6. ON OD
 7. ON OA PRED
 8. OA (INFINITIV)
 9. ON PRED
10. ON OD OS
11. ON PRED (PASSIV)
12. ON OA FOPP (mit)
13. ON OA OD
14. ON OA OS
15. ON OPP (auf)
16. ON OADVP
17. ON OA OPP (in)
18. ON OD (PASSIV)
19. ON OA FOPP (in)
20. ON OA FOPP (zu)
21. ON OPP (in)
22. ON FOPP (in)
23. EMPTY
24. ON FOPP (mit)
25. ON OADJP
26. ON OA FOPP (an)
27. ON FOPP (über)
28. ON OPP (mit)
29. ON OA FOPP (auf)
30. ON FOPP (an)
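The counts of frames occurring once, twice, or three times form a frequency spectrum: for each frequency, the number of distinct frames that have it. As a rough sketch, the Python below derives such a spectrum from a list of frame occurrences and then works out the share of frames that occur only once, using the slide's figures (488 of 717). The function name and the toy list are assumptions for illustration.

```python
from collections import Counter


def frequency_spectrum(frames):
    """Map each frequency to the number of distinct frames with that frequency."""
    frame_frequency = Counter(frames)          # frame -> number of occurrences
    return Counter(frame_frequency.values())   # frequency -> number of frames


# Invented toy data: one list item per occurrence of a frame.
toy_frames = ["ON OA", "ON OA", "ON", "ON OA", "ON OD OA", "ON"]
print(frequency_spectrum(toy_frames))

# Share of distinct frames that occur exactly once, from the slide's figures.
hapax_share = 100 * 488 / 717
print(round(hapax_share, 1))  # 68.1
```

On the slide's figures, roughly two thirds of the distinct frames are attested only once, which is the long tail that the Top 30 chart leaves out.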

Quantitative Analysis IV
Valence frame count per verb lemma and frequency count:

Verb lemma                     Valence frames per verb lemma   Frequency count
machen                         16                              1
schreiben                      15                              1
denken, halten                 14                              2
lassen, nehmen, sehen          13                              3
geben, sagen, sprechen, tun    12                              4
finden, haben, sein, stehen    11                              4
entscheiden … wissen           10                              9
bleiben … verpflichten         9                               6
bekommen … ziehen              8                               15
anfangen … zahlen              7                               25
abstimmen … wünschen           6                               33
anbieten … zwingen             5                               85
abfahren … zustimmen           4                               146
abgeben … zweifeln             3                               347
abbrechen … zutreffen          2                               921
aalen … zwitschern             1                               3294

Distribution of the 4896 verb lemmas (total) by number of frames:
67.3% (3294 verb lemmas): 1 frame
18.8% (921 verb lemmas): 2 frames
7.1% (347 verb lemmas): 3 frames
3.0% (146 verb lemmas): 4 frames
1.7% (85 verb lemmas): 5 frames
1.8% (88 verb lemmas): 6-10 frames
0.3% (15 verb lemmas): more than 10 frames
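The table and the percentage breakdown can be cross-checked against each other and against the lexicon totals. The short sketch below copies the (frames per lemma, number of lemmas) pairs from the table, sums them, and recomputes the shares. The variable and function names are illustrative only.

```python
# (valence frames per verb lemma, number of lemmas) copied from the table above
ROWS = [
    (16, 1), (15, 1), (14, 2), (13, 3), (12, 4), (11, 4),
    (10, 9), (9, 6), (8, 15), (7, 25), (6, 33),
    (5, 85), (4, 146), (3, 347), (2, 921), (1, 3294),
]

total_lemmas = sum(n for _, n in ROWS)
total_frames = sum(k * n for k, n in ROWS)


def share(lo, hi):
    """Percentage of lemmas that have between lo and hi frames (inclusive)."""
    lemmas = sum(n for k, n in ROWS if lo <= k <= hi)
    return round(100 * lemmas / total_lemmas, 1)


print(total_lemmas)  # 4896
print(total_frames)  # 8013
print(share(1, 1), share(2, 2), share(3, 3), share(4, 4), share(5, 5))  # 67.3 18.8 7.1 3.0 1.7
print(share(6, 10), share(11, 16))  # 1.8 0.3
```

The sums reproduce the 4896 distinct verb lemmas and the 8013 valence frames (total) reported for the lexicon, so the table is internally consistent with those figures.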

Conclusion and Future Work
Current state of work:
- TüBa-D/Z: ca. 40 000 sentences
- Valence Lexicon:
  4947 distinct verb lemmas
  8139 valence frames (total)
  755 distinct valence frames
Integration with other resources of German (e.g. GermaNet):
Benefits:
- opportunity to clarify the intended sense of a verb by matching verb senses with valence frames
- empirical verification of the correlation between distinct valence frames and sense distinctions

Thank you for your attention

Quantitative Analysis V
Correlation of lemma frequency with the number of valence frames per verb:
Top 20 correlation of lemma frequency and valence frame count per verb

Lemma     Lemma frequency   Valence frame count per verb
sein      10009             11
werden    6545              7
haben     5766              11
können    2164              6
sollen    1418              6
müssen    1373              5
wollen    1294              8
geben     1021              12
sagen     922               12
machen    801               16
kommen    668               10
lassen    626               13
gehen     562               10
stehen    475               11
sehen     462               13
bleiben   409               9
dürfen    379               5
heißen    364               10
wissen    364               10
finden    361               11
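One way to quantify the relationship in this table is a Pearson correlation coefficient over the 20 (lemma frequency, valence frame count) pairs. The sketch below is an illustration only: the slides do not say which measure the authors used, so the choice of Pearson's r is an assumption, as are the names TOP20 and pearson.

```python
import math

# (lemma, lemma frequency, valence frame count) copied from the table above
TOP20 = [
    ("sein", 10009, 11), ("werden", 6545, 7), ("haben", 5766, 11),
    ("können", 2164, 6), ("sollen", 1418, 6), ("müssen", 1373, 5),
    ("wollen", 1294, 8), ("geben", 1021, 12), ("sagen", 922, 12),
    ("machen", 801, 16), ("kommen", 668, 10), ("lassen", 626, 13),
    ("gehen", 562, 10), ("stehen", 475, 11), ("sehen", 462, 13),
    ("bleiben", 409, 9), ("dürfen", 379, 5), ("heißen", 364, 10),
    ("wissen", 364, 10), ("finden", 361, 11),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


frequencies = [freq for _, freq, _ in TOP20]
frame_counts = [count for _, _, count in TOP20]
r = pearson(frequencies, frame_counts)
print(f"Pearson r over the top 20 lemmas: {r:.2f}")
```

For this sample the coefficient should come out small in magnitude: the most frequent lemmas include auxiliaries and modal verbs such as werden, können, and müssen, which have comparatively few frames.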

Quantitative Analysis VI
Top 100 correlation of lemma frequency and valence frame count:
[Figure: y-axis: relative frequency (0 to 100); x-axis: lemma (labelled ticks: sein, sollen, sagen, gehen, dürfen, erklären, halten, spielen, gelten, leben, glauben, scheinen, ziehen, brauchen, erreichen, fragen, einsetzen, tragen, verstehen, übernehmen, bestätigen, unterstützen, anbieten, verlassen, ausgehen); series: LF (lemma frequency), VFC (valence frame count), Linear (VFC)]
- weak correlation