Annotating Clause Boundary Labels to Japanese Corpora

2/17/2015
Takehiko Maruyama
(University of Oxford / NINJAL)
17th February 2015
East Asian Linguistics Seminar
Contents
•Introduction:
–Multiple clause linkage structure in Japanese
•Data:
–Corpora: CSJ, BCCWJ
–Annotating clause boundaries
•Result:
–Distribution of clauses and their combinations
•Discussion:
–Mechanism of multiple clause linkage structure
•Application to Old Japanese:
–Multiple clause linkage structure in OCOJ
Clause Linkage System in Japanese
•Japanese subordinate clauses
–Conjugation forms of (auxiliary) verbs
[ [ Taro ga warai ] Hanako ga naita ]
Taro laughed and Hanako cried.
–Conjunctive particles
[ [ Taro ga waratta kara ] Hanako ga naita ]
Taro laughed, so Hanako cried.
•Final boundaries of clauses and sentences can be
distinguished morpho-syntactically
–Cf. English
Taro laughed and Hanako cried.
Taro laughed. And Hanako cried.
Bad Sentence with Multiple Clause Linkage
• 僕たちはバスをおりて、長い階段を上がって、動物園に
向かったんだけど、動物園の門にはライオンの絵がか
いてあって、とても大きくてびっくりしたけど、中に入ると
最初はパンダがいて、その先にはリスやコアラがいたり、
おもしろい鳴き声を出す鳥とかがたくさんいて、祐二君
が「動物園ってすごい楽しいね」と言っているうちに、お
昼ごはんを食べる時間になって、キリンの長い首を見
ながらお弁当を食べて、それから大きなゾウを見て、長
い鼻でエサを拾いあげるようにして食べるのを見て、す
ごいおもしろかったけれど、途中から雨が強くなって、
早く学校に帰ることになったから、ちょっと残念だった。
Composition by elementary school student
Clause Linkage System in Japanese
•Subordinate clauses can be linked circularly
–A sentence can be infinitely long (cf. embedded sentence)
Taro ga warai Hanako ga naita kara Jiro ga okotte …
Taro laughed and Hanako cried, so Jiro got angry and…
–A series of clause linkages within a sentence:
“Multiple clause linkage structure”
• Multiple clause linkage structure tends to be avoided in
Prescriptive Grammar
–In elementary schools, such long sentences are called
“pointless / bad sentences” (だらだら文 ・ 悪文)
–Nagano (1969) “Disorders and confusions of authors’
thoughts generate pointless sentences”
Clause Linkages in Spontaneous Speech
• 私が住んでいたところは団地 : の二階でして (F えーと)
その前は大きな (F えー) 明治 : 道路が走っていたんで
すけれども || 団地と道路の間にはこう団地の庭みたい
な感じで (F えーと) || 道路の手前に木がたくさん生えて
いたので || (F えーと) || (F ま) 鳥が || (D つつ) 飛び出し
たと してもすぐには道路に出ないで その : 木 (D ん) (D
き) 木の辺りに引っ掛かってるかな : という || (F えー) 感
じでしたので まず二階からこう || 木を || 木のどの辺にい
るかという || のを当たり付けて || 当たりを付けると言う
か (F まー) || 探してみて || すぐには見つからなかったの
で しょうがない (D ぐ) ので (D すす) すぐに外に飛び出
しまして || (F えーとー)...
(Corpus of Spontaneous Japanese:S02M0076)
Clause Linkages in Written Text
• 何年度かの初島レースで徹夜で舵を引いて明け方ファースト・フィ
ニッシュし、後の片付けはクルーにまかせて家に飛んで 帰り昼近く
まで仮眠をとった後迎えにきたサッカー仲間の車で横浜にいき、外
人クラブとの試合で私も一点ゴールを決めて大勝し、帰り道には
当時流行りだしていたバッティングセンターで小一時間ボールを
打って、その後ハーバーから戻っていたクルーたちとマージャンし
て馬鹿勝ちし、「いったい石原さんて何なんだ」、とぼやかれて悦に
入っていたこともあったが、その次の年あたりに海で酷い目に会い
生まれて初めて体力の限界を覚らされたものでした。
(Balanced Corpus of Contemporary Written Japanese: OB6X_00101)
(『老いてこそ人生』 石原慎太郎著、幻冬舎)
Then…
•How are subordinate clauses linked to form
multiple clause linkage structures in spoken and
written Japanese?
•Why do speakers / writers produce
bad sentences in their speech / text?
Research Questions
•What types of subordinate clauses appear in
Japanese spontaneous speech and written text,
and how are they combined?
–Distribution of clause types and their combinations
–Corpus-based study
• Surveying large corpora of spoken and written Japanese
• Identifying various types of clauses in the corpora
•What is the mechanism of generating multiple
clause linkage structures?
–“Utterance production in real time” cf. Levelt (1989)
–“Dynamic Rewriting Rule” by Kondo (2005)
Corpora
•CSJ: Corpus of Spontaneous Japanese (2004)
『日本語話し言葉コーパス』
–Mainly monologue
–651 hours
–7.52 million words
•BCCWJ: Balanced Corpus of Contemporary
Written Japanese (2011)
『現代日本語書き言葉均衡コーパス』
–Various types of written text
–100 million words
–172,675 samples
CSJ
•651 hours, 7.52M words of spontaneous speech
–90% for monologue
• APS (Academic Presentation Speech): formal
• SPS (Simulated Public Speaking): casual
–10% for dialogue
•Rich annotations
–Transcription, Morphological information, Clause
boundary label, Dependency and Discourse structure,
Segment label, Intonation label, Speakers’ info...
•Aims
–To develop automatic speech recognition systems
–Linguistic study of spontaneous speech
2-way Transcription System
–Basic Transcription
–Pronunciation Transcription
Morphological Information
•Morphologically analyzed data
XML encoding
•All the annotations are encoded in XML files
BCCWJ
•Balanced corpus for general purpose
•100 million words
–Sampled randomly from various written text published
during 1976 - 2005 (-2009)
•Registers
–Books, Magazines, Newspapers, Web Documents,
Whitepapers, Textbooks, Blog, Law, Verse...
•Aims
–Vocabulary survey, Grammatical study, Lexicography...
–Japanese language education
–Natural language processing
XML encoding
<?xml version="1.0" encoding="UTF-8"?>
<sample sampleID="LBe2_00005" version="1.0" type="fixedLength">
<article articleID="LBe2_00005_F001">
<paragraph>
<sentence>やがて、後<sampling type="start" />燕は漢人の<ruby rubyText="ひょう">馮</ruby><ruby rubyText="ばつ">跋</ruby>に乗っ取られてしまいます。</sentence>
<sentence>西暦四〇九年のことですが、この翌年前記の南燕が東晋の<ruby rubyText="りゅう">劉</ruby><ruby rubyText="ゆう">裕</ruby>によって、ほろぼされてしまいました。</sentence>
</paragraph>
<paragraph>
<sentence>四〇九年には、いろいろなことがおこっています。</sentence>
<sentence>さしもの拓跋珪も、この年、思わぬことで、あろうことか息子の一人、<ruby rubyText="たく">拓</ruby><ruby rubyText="ばつ">跋</ruby><ruby rubyText="しょう">紹</ruby>によって殺されました。</sentence>
</paragraph>
–A sample starts here
–A character randomly chosen in a page
–Figures, old Japanese are omitted
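As a rough illustration of how such XML-encoded samples could be read, the sketch below pulls out the plain text of each sentence and the ruby readings with Python's standard library. The embedded string is an abridged stand-in modelled on the excerpt above (closing tags are added so that it parses), and the function names `sentences` and `ruby_readings` are hypothetical, not part of any BCCWJ tool.

```python
import xml.etree.ElementTree as ET

# Abridged stand-in for a BCCWJ sample; closing tags are added here (assumption)
# so that the fragment is well-formed on its own.
SAMPLE = """<sample sampleID="LBe2_00005" version="1.0" type="fixedLength">
<article articleID="LBe2_00005_F001">
<paragraph>
<sentence>やがて、後<sampling type="start" />燕は漢人の<ruby rubyText="ひょう">馮</ruby><ruby rubyText="ばつ">跋</ruby>に乗っ取られてしまいます。</sentence>
<sentence>四〇九年には、いろいろなことがおこっています。</sentence>
</paragraph>
</article>
</sample>"""


def sentences(xml_text):
    """Return the plain text of every <sentence>, ignoring inline tags."""
    root = ET.fromstring(xml_text)
    return ["".join(s.itertext()).strip() for s in root.iter("sentence")]


def ruby_readings(xml_text):
    """Return (base character, reading) pairs taken from <ruby> elements."""
    root = ET.fromstring(xml_text)
    return [(r.text, r.get("rubyText")) for r in root.iter("ruby")]


if __name__ == "__main__":
    for s in sentences(SAMPLE):
        print(s)
    print(ruby_readings(SAMPLE))
```

The readings live in the `rubyText` attribute, so joining the text nodes yields the running text without the furigana.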
Morphological Information
•Morphologically analyzed data
少納言 Shonagon / 中納言 Chunagon
•BCCWJ concordance program
What is “Annotation”?
Corpus Annotations
•Annotations: adding (non-)linguistic information
to linguistic entities in a corpus
•Various types of annotations to various levels of
linguistic expressions
Level      Annotations
Discourse  Rhetorical structure, Anaphora
Sentence   Sentence boundaries, Dependency structure,
           Predicate-Argument structure, Speech act, Intonation
Clause     Clause boundaries, Dependency structure
Phrase     Syntactic feature, Semantic role
Word       Lemma, Part-of-Speech, Named entity, Word sense, Accent
Phoneme    Segment
Annotating Clause Boundaries
•Adding “Clause Boundary Labels”
kyoo ohanasi sasete itadaku naiyoo na n desu keredomo /keredomo/
(F e:tto:) (F ma) tokuni mezurasii koto de wa nai to <Quote>
omou n desu ga /ga/
(F ano:) jibun wa (F ano) karada wa (F e:) moto kara tuyoi hoo dewa
nakatta no desu ga /ga/
iwayuru byooki rasii byooki toyu: koto wa sita koto ga naku te /te/
(F sono) (F e) (D i) ikkagetu hodo zutto netakiri toyu:ka <suspend>
ie ni ori masi te /te/
(F ano) byooki wo site ori masita mono de <Copula>
jibun ni totte wa umare te <te>
hajimete no koto datta node <node>
(F e:) sono koto ni tui te ohanasi sasete itadaki tai to <Quote>
omoi masu [EOS]
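The labelled transcript above can be turned into a label sequence per sentence with a few lines of code. The following is a minimal sketch, assuming that every clause unit sits on its own line and ends with one label written as /x/, <x> or [x]; the names `split_label` and `label_sequences` are made up for this illustration and are not the CBAP implementation.

```python
import re

# A trailing label in any of the three notations used on the slide.
LABEL = re.compile(r"\s*(?:/([^/\s]+)/|<([^<>\s]+)>|\[([^\[\]\s]+)\])\s*$")


def split_label(line):
    """Split one line into (clause text, label); label is None if absent."""
    m = LABEL.search(line)
    if m is None:
        return line.strip(), None
    label = next(g for g in m.groups() if g is not None)
    return line[:m.start()].strip(), label


def label_sequences(lines):
    """Group labels into one list per sentence, closing a sentence at EOS."""
    sequences, current = [], []
    for line in lines:
        _, label = split_label(line)
        if label is None:
            continue
        current.append(label)
        if label == "EOS":
            sequences.append(current)
            current = []
    return sequences


# The last four lines of the example above.
LINES = [
    "jibun ni totte wa umare te <te>",
    "hajimete no koto datta node <node>",
    "(F e:) sono koto ni tui te ohanasi sasete itadaki tai to <Quote>",
    "omoi masu [EOS]",
]

if __name__ == "__main__":
    print(label_sequences(LINES))
```

Each inner list corresponds to one "sentence" in the sense used for CSJ here, that is, a stretch of clauses closed by [EOS].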
“CBAP”
•CBAP: Clause Boundary Annotation Program
(Maruyama et al. 2004)
–Detecting and annotating clause boundaries
–Using morphologically analyzed data
–139 types of clauses
• Subordinate (102): Conditional (23), Reason (8), Time (21),
Manner (12), misc (38)
• Complementary (10): Complementary (2), Quotation (5),
Indirect question (3)
• Adnominal (15)
• Coordinate (12)
Data Statistics
•“Sentences” in (the part of) CSJ and BCCWJ
–CSJ: Clause Boundary Labels [EOS]
–BCCWJ: the extent of sentence-tags ended by [。!?]
Corpus  Register       # files  # sentences    # words
CSJ     APS (formal)        70        5,389    191,591
CSJ     SPS (casual)       107        4,494    164,096
BCCWJ   Book                83        8,780    204,050
BCCWJ   Magazine            86        9,342    202,268
BCCWJ   Newspaper          340       11,898    308,504
Total                                39,903  1,070,509
Target of Clause Boundary Labels
•Major clause boundaries, classified by 5 types
Clause types   Clause Boundary Labels
EOS            EOS
Coordination   ga, keredomo, keredo, kedomo, kedo, si
Reason         kara, node
Conditional    tara, taraba, to, nara, naraba, reba
misc           Continuative forms of a verb and copula, te, quotation, toyu:
Distribution of CBLs
Clause type   CBL           APS     SPS    Book  Magazine  Newspaper
EOS           EOS         5,624   5,476   8,606     9,237      7,713
Coordinate    ga          1,027     672     716       552        496
Coordinate    keredomo      382     800      14         2          1
Coordinate    kedomo        108     328       0         0          0
Coordinate    keredo          8      37      15        26          4
Coordinate    kedo           43     584      26        62         10
Coordinate    si             54     230     108        90         21
Reason        kara           78     261     307       185         69
Reason        node          310     735     150       164         40
Conditional   tara(ba)       60     303     184       172         29
Conditional   to            546     691     438       365        265
Conditional   nara(ba)        3       9      42        53         11
Conditional   reba          153     225     450       288        178
misc          ContinueF     556     277   1,908     1,837      2,023
misc          Copula        347     769     448       408        408
misc          te          2,884   3,903   2,122     1,625      1,080
misc          Quote       1,006   1,577   1,130       732        881
misc          toyu:       1,454   1,163     445       267        150
Total                    14,645  18,038  17,110    16,065     13,379
Adjusted frequency: 200,000 words in each register
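The adjustment to 200,000 words can be read as a simple proportional rescaling. The sketch below assumes exactly that (raw count times 200,000, divided by the number of words in the register); the function name `adjusted_frequency` is hypothetical. As a sanity check it treats the number of sentences in each BCCWJ register from the Data Statistics table as the raw EOS count, which is an assumption, and reproduces the EOS figures reported for Book, Magazine and Newspaper.

```python
def adjusted_frequency(raw_count, register_words, base=200_000):
    """Rescale a raw count to a register of `base` words (assumed method)."""
    return raw_count * base / register_words


# (number of sentences, number of words) per BCCWJ register, from Data Statistics.
BCCWJ = {
    "Book": (8_780, 204_050),
    "Magazine": (9_342, 202_268),
    "Newspaper": (11_898, 308_504),
}

if __name__ == "__main__":
    for register, (n_sentences, n_words) in BCCWJ.items():
        print(register, round(adjusted_frequency(n_sentences, n_words)))
```

Rounded to integers, these come out as 8,606, 9,237 and 7,713, the EOS values reported for the three written registers.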
Number of CBLs within a Sentence
[Figure: frequency (0 to 6,000) of sentences by # CBLs within a sentence (1 to 10, ~20, ~40) for APS, SPS, Book, Magazine and Newspaper; adjusted 10K sentences in each register]
CBL combinations (CSJ)
APS                          SPS
32.0%  EOS                   26.2%  EOS
 9.1%  te / EOS               5.9%  te / EOS
 3.8%  toyu: / EOS            3.6%  Quote / EOS
 3.3%  ga / EOS               1.8%  toyu: / EOS
 2.7%  Quote / EOS            1.7%  te / Quote / EOS
 2.2%  te / te / EOS          1.6%  Continue / EOS
 1.9%  Continue / EOS         1.4%  keredomo / EOS
 1.5%  te / toyu: / EOS       1.4%  to / EOS
 1.5%  Copula / EOS           1.4%  te / te / EOS
 1.5%  te / Quote / EOS       1.2%  ga / EOS
CBL Combinations (BCCWJ)
Book                     Magazine                  Newspaper
45.1%  EOS               53.7%  EOS                52.2%  EOS
 7.8%  te / EOS           7.8%  Cont / EOS         11.5%  Cont / EOS
 6.4%  Cont / EOS         6.3%  te / EOS            5.6%  te / EOS
 3.5%  Quote / EOS        2.9%  Quote / EOS         3.6%  Quote / EOS
 2.4%  ga / EOS           2.4%  ga / EOS            2.7%  ga / EOS
 1.8%  to / EOS           2.0%  Copula / EOS        2.6%  Copula / EOS
 1.8%  Copula / EOS       1.7%  to / EOS            1.4%  to / EOS
 1.6%  reba / EOS         1.3%  te / Cont / EOS     1.2%  Cont / Cont / EOS
 1.2%  toyu: / EOS        1.3%  reba / EOS          1.1%  Cont / te / EOS
 1.2%  kara / EOS         1.0%  toyu: / EOS         1.0%  te / Cont / EOS
Controlled, common writing style in
published text?
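Rankings of this kind can be produced by counting how often each label sequence occurs among all sentences of a register. The following sketch shows one way to do so; the input sequences are toy data, not corpus figures, and `combination_shares` is a hypothetical helper name.

```python
from collections import Counter


def combination_shares(sequences, top=10):
    """Return the `top` most frequent CBL combinations with their share in %."""
    counts = Counter(" / ".join(seq) for seq in sequences)
    total = sum(counts.values())
    return [(combo, 100 * n / total) for combo, n in counts.most_common(top)]


# Toy input: one label sequence per sentence (invented for illustration).
TOY = [["EOS"]] * 5 + [["te", "EOS"]] * 3 + [["Cont", "EOS"]] * 2

if __name__ == "__main__":
    for combo, share in combination_shares(TOY):
        print(f"{share:5.1f}%  {combo}")
```

A sentence that contains no clause boundary other than its end is counted as the bare combination "EOS", which is why that row dominates every register.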
Number of CBLs within a Sentence
            ×6   ×7   ×8   ×9  ×10  ~×20  ~×40
Book        95   49   10   10    5     0     0
Magazine    54   16    5    3    2     1     0
Newspaper   35    4    2    2    0     1     0
APS: 321, 186, 41, 17, 37, 0
SPS: 465, 367, 185, 116, 95, 78, 220, 11
[APS and SPS: the column alignment of these values is not recoverable from the source]
Adjusted 10K sentences in each register
Spoken > Written
Casual > Formal
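Counts per 10K sentences such as those above could be obtained by binning the number of CBLs in each sentence and rescaling by the number of sentences. The sketch below is one possible reading: it assumes that "~20" covers 11 to 20 and "~40" covers 21 to 40, adds a ">40" bin that is not on the slide, and uses invented input values.

```python
from collections import Counter


def bin_label(n_cbls):
    """Map a number of CBLs to a bin (the bin edges are assumptions)."""
    if n_cbls <= 10:
        return str(n_cbls)
    if n_cbls <= 20:
        return "~20"
    if n_cbls <= 40:
        return "~40"
    return ">40"  # not on the slide; kept so that no sentence is dropped


def histogram_per_10k(cbl_counts, per=10_000):
    """Frequency of each bin, rescaled to `per` sentences."""
    total = len(cbl_counts)
    hist = Counter(bin_label(n) for n in cbl_counts)
    return {label: count * per / total for label, count in hist.items()}


if __name__ == "__main__":
    # Toy data: number of CBLs in each of five sentences.
    print(histogram_per_10k([1, 1, 2, 15, 30]))
```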
Clause Linkages in Spontaneous Speech
• 私が住んでいたところは団地 の二階でし EOS その前は大きな 明治 道路が
走っていたんです EOS 団地と道路の間にはこう団地の庭みたいな感じで
道路の手前に木がたくさん生えていたので EOS 鳥が 飛び出したと しても
すぐには道路に出ないで その 木 木の辺りに引っ掛かってるかな という
感じでしたので EOS 二階からこう 木を 木のどの辺にいるかという のを
当たり付けて 当たりを付けると言うか 探してみて すぐには見つからなかった
ので しょうがない ので すぐに外に
If he writes as…
1. 私が住んでいたところは、団地の二階でした。
2. その前は大きな明治道路が走っていました。
3. 団地と道路の間は団地の庭のような感じで、道路の手前に木がたくさん生えていました。
4. ですから、鳥が飛び出したとしても、すぐには道路に出ずに、その木の辺りに引っ掛かってるかなと思いました。
5. そこで、まず二階から木のどの辺にいるかの当たりを付けて、. . .
Difference of Speech and Written Text
•Process of producing utterance (Levelt 1989)
–A speaker has to keep speaking continuously once s/he starts
–Dynamic processing in real time to produce a narrative
–Disfluencies (fillers, word fragments, repairs…)
•Process of writing texts
–A writer can take as much time as needed until s/he fixes the result
–Editing (copy & paste, delete, rewrite), proof-reading
–Typos
Constraints of continuity in spontaneous speech
generate multiple clause linkage structure
“Dynamic Rewriting Rule”
•Kondo (2005) “MCL in Early Middle Japanese”
–A main clause with subordinate clauses is often
rewritten into another subordinate clause dynamically
1. [ [hito sigekumo aranedo] + tabi kasanari keri ] (EOS)
2. [ [ [hito sigekumo aranedo] + tabi kasanari kereba ] + aruji
kikituku ] (EOS)
3. [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba ] + aruji
kikitukete ] + sono kayohiji ni yogoto ni hito wo suetu ] (EOS)
4. [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba ] + aruji
kikitukete ] + sono kayohiji ni yogoto ni hito wo suete ] +
mamorasekeri ] (EOS)
5. [ [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba ] + aruji
kikitukete ] + sono kayohiji ni yogoto ni hito wo suete ] +
mamorasekereba ] + ikedomo e awadu ] (EOS)
6. [ [ [ [ [ [ [hito sigekumo aranedo] + tabi kasanari kereba ] + aruji
kikitukete ] + sono kayohiji ni yogoto ni hito wo suete ] +
mamorasekereba ] + ikedomo e awade ] + kaerikeri ] (EOS)
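The step-by-step growth shown above can be mimicked mechanically: at each step the clause that was sentence-final is rewritten into its linking form, and a new final clause is attached. The sketch below is only an illustration of that idea, not Kondo's formalization; the function `rewrite_steps` and the (final form, linking form) pairing are assumptions made here, with the word forms taken from the slide.

```python
def rewrite_steps(first_subordinate, clauses):
    """Build the bracketed structure after each rewriting step.

    `clauses` is a list of (sentence-final form, linking form) pairs; the
    linking form is None for the clause that finally ends the sentence.
    """
    steps = []
    inner = f"[{first_subordinate}]"
    for final_form, linking_form in clauses:
        # The newest clause closes the sentence at this step.
        steps.append(f"[ {inner} + {final_form} ] (EOS)")
        if linking_form is not None:
            # Rewrite the former main clause into a subordinate clause.
            inner = f"[ {inner} + {linking_form} ]"
    return steps


CLAUSES = [
    ("tabi kasanari keri", "tabi kasanari kereba"),
    ("aruji kikituku", "aruji kikitukete"),
    ("sono kayohiji ni yogoto ni hito wo suetu",
     "sono kayohiji ni yogoto ni hito wo suete"),
    ("mamorasekeri", "mamorasekereba"),
    ("ikedomo e awadu", "ikedomo e awade"),
    ("kaerikeri", None),
]

if __name__ == "__main__":
    for i, step in enumerate(rewrite_steps("hito sigekumo aranedo", CLAUSES), 1):
        print(f"{i}. {step}")
```

Every step is a complete sentence in its own right, which is the point of the rule: the speaker can stop at any step or keep going by rewriting the last predicate.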
Narratives and Multiple Clause Linkage
•Kondo (2005)
–“Writing styles in Early Middle Japanese are related
to “narratives”, which must reflect spoken language.”
–“Speech is produced dynamically, combining phonetic
forms and semantic meaning simultaneously.”
•Mechanism of Multiple Clause Linkage
–MCL is a reflection of the nature of narratives, in which a
speaker/writer keeps telling a series of episodes
• Dynamic production of narratives in spontaneous speech
• “Lively” description styles in written text (bad sentences, or
effectively used by professional authors)
OCOJ
The Oxford Corpus of Old Japanese
オックスフォード大学上代日本語コーパス
•The Oxford Corpus of Old Japanese
–A comprehensively annotated corpus of Old Japanese
• Original text in Kanji, romanized phonemic transcription,
English translation, morphological information,
lemmatization, and grammatical and semantic roles of noun
phrases
Poetic texts
Kojiki kayo (古事記歌謡; 712)                    112 poems   2527 words
Nihon shoki kayo (日本書紀歌謡; 720)             133 poems   2444 words
Fudoki kayo (風土記歌謡; 730s)                    20 poems    271 words
Bussukoseki-ka (仏足石歌; after 753)              21 poems    337 words
Man'yoshu (万葉集; after 759)                   4685 poems  83706 words
Shoku nihongi kayo (続日本紀歌謡; 797)             8 poems    134 words
Jogu shotoku hoo teisetsu (上宮聖徳法王帝説)       4 poems     60 words
Non-poetic texts
Shoku nihongi Senmyo (続日本紀宣命)   approx. 14,000 words
Engishiki Norito (延喜式祝詞)         approx. 6,500 words
(tentative) Result of Annotation
•Shoku nihongi Senmyo (続日本紀宣命)
–A total of 14,306 words
–A total of 3,121 clause boundary labels were annotated.
CBL          Count    CBL        Count
te             727    te-namo       30
EOS            512    suru-o        22
Adnominal      498    domo          15
Quote          381    yueni          7
ContinueF      173    madeni         7
suru-mo         56    manimani       5
suru-ni         46    temo           4
Quote-namo      44    nagara         4
reba            42    made           4
Annotating CBLs to Senmyo text
Comparing Senmyo to CSJ/BCCWJ
•Adjusted 20,000 words for each register
CBL          SPS   Book  Senmyo
EOS          548    861     716
te           390    212   1,016
Quote        158    113     533
keredomo      80      1      99
ContinueF     28    191     242
reba          23     45      59
CBL combination (OCOJ)
•Identifying the clause boundaries
ametuti no muta nagaku topoku aratamu masiziki tune no
nori to tatetamapyeru wosukuni no nori mo katabuku koto
naku ugoku koto naku watariyukamu to namo
omoposimyesaku to noritamapu opomikoto wo moromoro
kikitamapeyo to noritamapu
(OCOJ:Senmyo 3)
ametuti no muta nagaku topoku aratamu masiziki /adnom/
tune no nori to tatetamapyeru /adnom/
wosukuni no nori mo katabuku /adnom/
koto naku /ContF/
ugoku koto naku /ContF/
watariyukamu to namo /Quote-tonamo/
omoposimyesaku to /Quote/
noritamapu /adnom/
opomikoto wo moromoro kikitamapeyo to /Quote/
noritamapu [EOS]
/adnom/adnom/adnom/ContF/ContF/Quote-tonamo/Quote/adnom/Quote/EOS/
CBL combination (OCOJ)
•Adjusted 20,000 words for each register
Senmyo                           Book
109  EOS                         388  EOS
 94  te / EOS                     67  te / EOS
 41  Quot / EOS                   55  Cont / EOS
 28  te / te / EOS                30  Quote / EOS
 15  te / Quot / Quot / EOS       21  ga / EOS
 14  Quot / Quot / EOS            15  to / EOS
 13  Cont / EOS                   15  Copula / EOS
 13  suruni / EOS                 14  reba / EOS
 11  te / Quot / EOS              10  toyu: / EOS
 11  te / te / te / EOS           10  kara / EOS
CBL combination (OCOJ)
•Adjusted 20,000 words for each register
Senmyo                           SPS
109  EOS                         144  EOS
 94  te / EOS                     32  te / EOS
 41  Quot / EOS                   20  NounQuot / EOS
 28  te / te / EOS                20  Quot
 15  te / Quot / Quot / EOS       13  Noun-/ EOS
 14  Quot / Quot / EOS            10  tetoyu: / EOS
 13  Cont / EOS                    9  te / Quot / EOS
 13  suruni / EOS                  9  Copula / EOS
 11  te / Quot / EOS               9  Interjection
 11  te / te / te / EOS            8  keredomo / EOS
[SPS labels in rows 3 to 6 and 9 are reproduced as they appear in the source, where they are garbled]
Prospect
•Is it possible to examine clause linkage structures
in a series of historical Japanese texts?
–Old Japanese (-794)
–Early Middle Japanese (8-12 C)
–Late Middle Japanese (12-17 C)
–Early Modern Japanese (17-19 C)
–Modern Japanese (19-20 C)
–Contemporary Japanese (20-21 C)
•Corpus-based diachronic studies of Japanese
–Development of new historical corpora
–Grammatical studies from various aspects
References
1. Kondo, Yasuhiro (2005) “Heian jidai go no fukushi setsu no setsu rensa koozoo ni tsuite”, Kokugo to kokubungaku, 82(11).
2. Levelt, W. J. M. (1989) Speaking: From Intention to Articulation. MIT Press.
3. Maruyama, Takehiko, Hideki Kashioka, Tadashi Kumano and Hideki Tanaka (2004) “Nihongo setsu kyookai kensyutsu puroguramu CBAP no kaihatsu to hyooka”, Sizen gengo syori, 11(3).
4. Maruyama, Takehiko (2014) “Gendai nihongo no tajuutekina setsu rensa koozoo ni tsuite”, Hanashi kotoba to kaki kotoba no setten, Hituji Shobo.
5. Nagano, Masaru (1969) Akubun no jiko shindan to chiryoo no jissai, Shibundo.