UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
<TranscriptSegment>
<TranscriptGUID>2</TranscriptGUID>
<AudioType start="970" end="1472">Clean</AudioType>
<Time start="970" end="1472" reasons=""/>
<Speaker id="1000" name="Homem" gender="M" known="F"/>
<SpeakerLanguage native="T">PT</SpeakerLanguage>
<TranscriptWList>
<W start="970" end="981" conf="0.765016" focus="F0" pos="S.">em</W>
<W start="982" end="997" conf="0.525857" focus="F0" pos="Nc">boa</W>
<W start="998" end="1049" conf="0.98280" focus="F0" punct=".” pos="Nc">noite</W>
<W start="1050" end="1064" conf="0.904695" focus="F0" pos="Td">os</W>
<W start="1065" end="1113" conf="0.974994" focus="F0" pos="Nc">centros</W>
<W start="1114" end="1121" conf="0.938673" focus="F0" pos="S.">de</W
<W start="1122" end="1173" conf="0.993847" focus="F0" pos="Nc">emprego</W>
<W start="1174" end="1182" conf="0.951339" focus="F0" pos="S.">em</W>
<W start="1183" end="1229" conf="0.999291" focus="F0" pos="Np">portugal</W>
<W start="1230" end="1283" conf="0.979457" focus="F0" pos="V.">continuou</W>
<W start="1284" end="1285" conf="0.967095" focus="F0" pos="Td">a</W>
<W start="1286" end="1345" conf="0.996321" focus="F0" pos="V.">registar</W>
<W start="1346" end="1399" conf="0.946317" focus="F0" pos="R.">menos</W>
<W start="1400" end="1503" conf=... focus="F0" punct=".” pos="V.">inscritos</W>
</TranscriptWList>
</TranscriptSegment>
Recovering Capitalization and Punctuation Marks
on Speech Transcriptions
Fernando Manuel Marques Batista
(Mestre)
Dissertação para obtenção do Grau de Doutor em
Engenharia Informática e de Computadores
Orientador:
Doutor Nuno João Neves Mamede
Júri
Presidente: Presidente do Conselho Científico do IST
Vogais: Doutor Mário Jorge Costa Gaspar da Silva
        Doutora Isabel Maria Martins Trancoso
        Doutor Nuno João Neves Mamede
        Doutora Dilek Hakkani-Tür
        Doutora Helena Sofia Andrade Nunes Pereira Pinto
Maio de 2011
Resumo
Esta tese aborda duas tarefas de anotação de meta-informação, que fazem parte do enriquecimento de transcrições de fala: maiusculização e recuperação de marcas de pontuação. Este estudo centra-se no processamento de notícias televisivas, envolvendo transcrições
manuais e automáticas. São comparados e analisados vários modelos de maiusculização,
concluindo-se que os modelos generativos capturam melhor a estrutura da língua escrita, enquanto que os modelos discriminativos são melhores para transcrições de fala e mais robustos
aos erros de reconhecimento. O impacto da dinâmica da língua é analisado, concluindo-se que
o desempenho da maiusculização é afectado pela distância temporal entre o material de treino
e teste. Em termos de pontuação, são analisadas as três marcas mais frequentes: ponto, vírgula,
e interrogação. As experiências iniciais usam informação local, que combina informação lexical
e acústica, para dar conta do ponto e da vírgula. As experiências mais recentes utilizam também
informação prosódica e estendem este estudo às interrogativas.
Grande parte do estudo é independente da língua, mas à língua Portuguesa foi dado um
destaque especial. A investigação realizada permitiu obter os primeiros resultados de avaliação, relativos às duas tarefas, para notícias televisivas em Português Europeu. Algumas experiências foram também replicadas para Inglês e Espanhol.
Abstract
This thesis addresses two important metadata annotation tasks, involved in the production
of rich transcripts: capitalization and recovery of punctuation marks. The main focus of this
study concerns broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, indicating that generative approaches
capture the structure of written corpora better, while the discriminative approaches are suitable
for dealing with speech transcripts, and are also more robust to ASR errors. The so-called language dynamics have been addressed, and results indicate that the capitalization performance
is affected by the temporal distance between the training and testing data. Regarding the punctuation task, this study covers the three most frequent marks: full stop, comma, and
question mark. Early experiments addressed full-stop and comma recovery, using local features,
and combining lexical and acoustic information. Recent experiments also combine prosodic
information and extend this study to question marks.
Much of the research conducted here is language independent, but a special focus is given
to the Portuguese language. This thesis provides the first evaluation results of these two tasks
over European Portuguese broadcast news data. Most experiments were also conducted over
English and Spanish.
Palavras Chave
Keywords
Palavras chave
Enriquecimento de transcrições de fala
Maiusculização automática
Pontuação automática
Segmentação automática de frases
Métodos generativos e discriminativos
Dinâmica da linguagem
Keywords
Rich transcription
Automatic capitalization
Automatic punctuation
Sentence boundary detection
Generative and discriminative methods
Language dynamics
Agradecimentos
Acknowledgements
Esta tese não teria sido possível sem o apoio e ajuda que recebi ao longo destes quatro anos.
Agradeço a todos os que me apoiaram e ajudaram.
Em primeiro lugar quero agradecer ao Professor Nuno Mamede pela sua orientação e
apoio. Muito agradeço a amizade e confiança prestadas desde os tempos do meu mestrado.
O meu obrigado por ter alocado recursos que permitiram rever e corrigir dados essenciais ao
meu trabalho.
Queria também agradecer ao Diamantino Caseiro que, agora nos Estados Unidos, muito
me ajudou nos primeiros tempos deste trabalho, com as suas oportunas sugestões. Não tenho
palavras para agradecer à Professora Isabel Trancoso a sua amizade, disponibilidade e ajuda
sempre pronta. A sua dedicação pessoal e as suas sábias e valiosas contribuições foram determinantes no desenvolvimento deste trabalho.
Agradeço a todos os meus colegas do laboratório de sistemas de língua falada (L2F) do
INESC-ID por toda a colaboração, apoio, camaradagem e excelente ambiente de trabalho que
me têm proporcionado. À Joana Paulo, Ricardo Ribeiro, David Matos, Luisa Coheur, Hugo
Meinedo e António Serralheiro pela sua longa amizade e apoio. Aos meus colegas Helena
Moniz, Hugo Meinedo, Thomas Pellegrini e Alberto Abad pela ajuda, estreita colaboração e
importantes contribuições para o meu trabalho. Um especial agradecimento ao Jorge Baptista
pela sua amizade e tempo dedicado à revisão deste documento. Agradeço também à Vera
Cabarrão o profissionalismo e tempo dedicado à revisão dos dados de fala com que trabalhei.
Agradeço aos meus colegas do DCTI do ISCTE-IUL pela sua camaradagem, apoio e excelente ambiente de trabalho que me têm proporcionado. Um agradecimento especial aos meus
colegas Ricardo Ribeiro, Luís Nunes, Tomás Brandão, João Baptista, Paulo Trezentos, Luís Cancela, Abílio Oliveira, Joaquim Esmerado, Alexandre Almeida, José Farinha, Marco Ribeiro, Luís
Botelho, Manuel Sequeira, Maria Albuquerque, José André, mas também a todos os outros dos
quais tenho sempre recebido o maior apoio.
Uma palavra de agradecimento para todos os restantes amigos que também, pela sua
amizade, foram catalisadores deste trabalho.
Queria também agradecer aos meus pais e aos meus irmãos, com os quais sempre pude
contar. Aos meus avós já falecidos, que recordo com muito carinho. Aos meus restantes familiares, tios, sogros, cunhados e primos.
Finalmente, um agradecimento muito especial à Susana, minha mulher, que com o seu
amor, abnegação e sacrifício, tornou possível a realização desta tese.
Muito obrigado a todos.
Lisboa, Março de 2011
Fernando Manuel Marques Batista
Contents

1 Introduction
  1.1 Emerging Interest in Rich Transcription
  1.2 Rich Transcription Integration
  1.3 Motivation and Goals
  1.4 Proposed Strategy
  1.5 Document Structure

2 State of the Art
  2.1 Related Work on Capitalization
  2.2 Related Work on Punctuation and Sentence Boundary Detection
  2.3 The Maximum Entropy Approach
    2.3.1 Application to Rich Transcription
    2.3.2 Large Corpora Issues
  2.4 Evaluation Metrics

3 Corpora
  3.1 Broadcast News Data
    3.1.1 Portuguese Corpus
    3.1.2 Spanish Corpus
    3.1.3 English Corpora
  3.2 Written Newspaper Data
    3.2.1 Portuguese Corpora
    3.2.2 Spanish Corpora
    3.2.3 English Corpora
  3.3 Speech Data Preparation
    3.3.1 Capitalization Alignment Issues
    3.3.2 Punctuation Alignment Issues
  3.4 Additional Prosodic Information
    3.4.1 Extracting the Pitch and Energy
    3.4.2 Adding Phone Information
    3.4.3 Marking the Syllable Boundaries and Stress
    3.4.4 Producing the Final XML File
  3.5 Speech Data Word Boundaries Refinement
    3.5.1 Post-processing rules
    3.5.2 Results
    3.5.3 Impact on acoustic models training
  3.6 Summary

4 Capitalization Recovery
  4.1 Capitalization Analysis Based in Written Corpora
    4.1.1 Capitalization Ambiguity
  4.2 Early Work Comparing Different Methods
    4.2.1 Description of the Generative Methods
    4.2.2 Comparative Results
    4.2.3 Results using Unlimited Vocabulary
  4.3 Impact of Language Dynamics
    4.3.1 Data Analysis
    4.3.2 Capitalization of Written Corpora
    4.3.3 Capitalization of Speech Transcripts
    4.3.4 Conclusions
  4.4 Capitalization Model Adaptation
    4.4.1 Baseline results
    4.4.2 Adaptation results
    4.4.3 Conclusions
  4.5 Recent Work on Capitalization
    4.5.1 Capitalization Results using a ME-based Approach
    4.5.2 Capitalization Results using an HMM-based Approach
    4.5.3 Capitalization Results using Conditional Random Fields
    4.5.4 Analysis of Feature Contribution
    4.5.5 Error Analysis and General Problems
  4.6 Extension to other Languages
    4.6.1 Analysis of the language variations over time
    4.6.2 Results
  4.7 Summary

5 Punctuation Recovery
  5.1 Punctuation Analysis
  5.2 Early Work using Lexical and Acoustic Features
    5.2.1 Features
    5.2.2 Sentence Boundary Detection
    5.2.3 Segmentation into Chunk Units, Delimited by Punctuation Marks
    5.2.4 Recovering full-stop and comma Simultaneously
  5.3 Extended Punctuation Module
    5.3.1 Improving full stop and comma Detection
    5.3.2 Extension to Question Marks
  5.4 Extension to other Languages
    5.4.1 Recovering Full-stop and Comma
    5.4.2 Detection of Question Marks
  5.5 Summary

6 Conclusions and Future Directions
  6.1 Overview
  6.2 Main Conclusions
  6.3 Contributions
  6.4 Future Work

Bibliography
Nomenclature

A Portuguese Text Normalization
  A.1 Date and time
  A.2 Ordinals
  A.3 Numbers
  A.4 Optional Expressions
  A.5 Money
  A.6 Abreviations
  A.7 Other
List of Figures

1.1 Integration of the RT modules in the recognition system.
1.2 Overall architecture of the L2F speech recognition system.
1.3 Excerpt of a transcribed text, with different markup conditions.
2.1 Block diagram of the capitalization and punctuation tasks.
2.2 Conversion of trigram counts into features.
2.3 Example of correct and incorrect slots.
3.1 Focus distribution in terms of speech duration for Portuguese BN.
3.2 Focus distribution in terms of speech duration for Spanish BN.
3.3 Excerpt of the LDC1998T28 manual transcripts.
3.4 Excerpt of the LDC2000S86 corpus.
3.5 Excerpt of the LDC2005T24 corpus (XML format).
3.6 Excerpt of the LDC2005T24 corpus (RTTM format).
3.7 Excerpt of the LDC2007S10 corpus.
3.8 Creating an XML containing all required information for further experiments.
3.9 Example of a transcript segment extracted from the AUT data set.
3.10 Capitalization alignment examples.
3.11 Pitch adjustment for unvoiced regions.
3.12 Integrating prosody information in the corpora.
3.13 PCTM file containing the phones/diphones produced by the ASR system.
3.14 PCTM file with monophones and marked with syllable boundary and stress.
3.15 Excerpt of one of the final XML files, containing prosodic information.
3.16 Improvement in terms of correct word boundaries, after post-processing.
3.17 Phone segmentation before and after post-processing.
3.18 Improvement in terms of correct word boundaries, after retraining.
4.1 The different capitalization classes and their distribution in the PUBnews corpus.
4.2 Number of words by frequency interval in the PUBnews corpus.
4.3 Distribution of the ambiguous words by word frequency interval.
4.4 Proportion of ambiguous words by word frequency interval.
4.5 Using the HMM-based tagger.
4.6 Using a WFST to perform capitalization.
4.7 Number of OOVs considering written corpora.
4.8 Proportion of OOVs considering speech transcripts.
4.9 Performance for different training periods.
4.10 Capitalization of written corpora, using forward and backwards training.
4.11 Automatic capitalization of speech transcripts, using forward retraining.
4.12 Comparing the capitalization results of manual and automatic transcripts.
4.13 Manual transcription results, using all approaches.
4.14 Analysis of each capitalization feature usefulness.
4.15 Vocabulary coverage on written newspaper corpora.
4.16 Vocabulary coverage for Broadcast News speech transcripts.
4.17 Forward and Backwards training results over written corpora.
4.18 Forward training results over spoken transcripts.
5.1 Punctuation marks frequency in Europarl.
5.2 Punctuation marks frequency in the ALERT-SR corpus (old version).
5.3 Punctuation marks frequency in the ALERT-SR corpus (revised version).
5.4 Converting time gap values into binary features using intervals.
5.5 Impact of each feature type in the SU detection performance.
5.6 Impact of each individual feature in the SU detection performance.
5.7 Impact of each feature type in the chunk detection performance.
5.8 Impact of each individual feature in the chunk detection performance.
5.9 Impact of lexical and acoustic features in the punctuation detection.
5.10 Impact of each individual feature in the punctuation detection performance.
5.11 Relation between the acoustic features and each punctuation mark.
List of Tables

3.1 Different parts of the Portuguese BN corpus.
3.2 Confusion matrix between the old to the new manual transcripts.
3.3 User annotation agreement for punctuation marks in the Portuguese BN corpus.
3.4 Spanish BN corpora properties.
3.5 English BN corpora properties.
3.6 Portuguese Newspaper corpora properties.
3.7 European Spanish written corpora properties.
3.8 English written corpora properties.
3.9 Capitalization alignment report.
3.10 Punctuation alignment examples.
3.11 Punctuation alignment report.
4.1 Different LM sizes when dealing with PUBnews corpus.
4.2 Capitalization results of the generative methods over the PUBnews corpus.
4.3 Capitalization results of the generative methods over the ALERT-SR corpus.
4.4 ME-based capitalization results using limited vocabulary.
4.5 ME-based capitalization results using unlimited vocabulary.
4.6 Using the first 8 subsets of each year for training.
4.7 Retraining from Jan. 1999 to Sep. 2004.
4.8 Evaluating with manual transcripts.
4.9 Retraining with manual and evaluating with automatic transcripts.
4.10 Baseline capitalization results produced using BaseCM.
4.11 Capitalization SER achieved for all different approaches.
4.12 Recent ME-based capitalization results for Portuguese.
4.13 Recent HMM-based capitalization results for Portuguese.
4.14 ME and CRF capitalization results for the PUBnews test set.
4.15 ME and CRF capitalization results for the force aligned transcripts test set.
4.16 ME and CRF capitalization results for the automatic speech transcripts test set.
4.17 Comparing two approaches for capitalizing the English Language.
5.1 Frequency of each punctuation mark in written newspaper corpora.
5.2 Frequency of each punctuation mark in broadcast news speech transcriptions.
5.3 Recovering sentence boundaries over the ASR output, using the APP segmentation.
5.4 Recovering sentence boundaries in the force aligned data.
5.5 Recovering sentence boundaries directly in the ASR output.
5.6 Recovering chunks over the ASR output, using only the APP segmentation.
5.7 Recovering chunk units in the force aligned data.
5.8 Recovering chunk units directly in the ASR output.
5.9 Punctuation mark replacements.
5.10 Recovering full-stop and comma in force aligned transcripts.
5.11 Recovering full-stop and comma in automatic transcripts.
5.12 Punctuation results over manual transcripts, combining prosodic features.
5.13 Punctuation performance over automatic transcripts, combining prosodic features.
5.14 Results for manual transcripts, bootstrapping from a written corpora model.
5.15 Results for automatic transcripts, bootstrapping from a written corpora model.
5.16 Recovering question marks using a written corpora model.
5.17 Performance results recovering the question mark in different corpora.
5.18 Punctuation results for English BN transcripts.
5.19 Punctuation results for English BN manual transcripts, adding prosody.
5.20 Punctuation results for English BN automatic transcripts, adding prosody.
5.21 Punctuation for force aligned transcripts, bootstrapping from a written corpora model.
5.22 Punctuation for automatic transcripts, bootstrapping from a written corpora model.
5.23 Recovering question marks using a written corpora based model.
5.24 Recovering the question mark, adding acoustic and prosodic features.
1 Introduction
TV stations, radio broadcasters, and other media organizations are now producing large
quantities of digital audio and video data on a daily basis. Automatic Speech Recognition
(ASR) systems can now be applied to such sources of information in order to enrich them with
additional information for applications, such as indexing, cataloging, subtitling, translation
and multimedia content production. The ASR output consists of raw text, often in lowercase, without any punctuation, with numbers written as words instead of symbols, and possibly containing different types of disfluencies. Even if such output is useful for applications like indexing and cataloging, other tasks, such as subtitling and multimedia content production, benefit from additional information. In general, enriching the speech recognition output aims at providing information that supports better human and machine processing.
Speech units do not always correspond to sentences, as established in the written sense.
They may, in fact, be quite flexible, elliptic, restructured, and even incomplete. Taking into
account this idiosyncratic behavior, the notion of utterance in Jurafsky and Martin (2009) or
sentence-like unit (SU) in (Strassel, 2004; Liu et al., 2006) is often used instead of sentence. Detecting positions where a punctuation mark is missing roughly corresponds to the task of detecting a SU, or finding the SU boundaries (roughly because, for instance, units delimited by commas often do not correspond to sentences). SU boundaries provide a basis for further natural language processing, and their impact on subsequent tasks has been analyzed in many
speech processing studies (Harper et al., 2005; Mrozinsk et al., 2006; Ostendorf et al., 2008).
The capitalization task, also known as truecasing (Lita et al., 2003), consists of assigning
to each word of an input text its corresponding case information, which sometimes depends
on its context. Proper capitalization can be found in many information sources, such as newspaper articles, books, and most web pages. Many computer applications, such as word processors and e-mail clients, perform automatic capitalization along with spell correction and grammar checking. One important aspect of capitalization concerns neologisms, which are frequently introduced into the language, as well as archaisms. These so-called language dynamics are relevant and must be taken into account when performing capitalization.
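As a minimal illustration of the task itself (a naive baseline sketch, not one of the models studied in this thesis), a truecaser can simply restore, for each lowercased word, the case form it most frequently takes in properly capitalized training text; the function names and toy data below are hypothetical.

```python
from collections import Counter, defaultdict

def train_most_frequent_case(capitalized_sentences):
    """Count the surface forms observed for each word in properly cased text."""
    forms = defaultdict(Counter)
    for sentence in capitalized_sentences:
        for token in sentence.split():
            forms[token.lower()][token] += 1
    # keep the most frequent surface form of every lowercased word
    return {word: counts.most_common(1)[0][0] for word, counts in forms.items()}

def truecase(lowercased_sentence, lexicon):
    """Restore case word by word; unseen words are left in lowercase."""
    return " ".join(lexicon.get(token, token) for token in lowercased_sentence.split())

# toy usage with hypothetical training sentences
lexicon = train_most_frequent_case([
    "A Ministra da Educação falou em Lisboa .",
    "O ministro visitou Lisboa na segunda-feira .",
])
print(truecase("a ministra visitou lisboa .", lexicon))
```

Such a unigram baseline ignores context, which is exactly where it fails for ambiguous words and for newly appearing ones; the generative and discriminative models compared later in this thesis address such cases.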
This thesis addresses two important metadata extraction (MDE) tasks that take part in the production of rich transcripts: recovering punctuation marks and capitalization. Both tasks are critical for the legibility of speech transcripts, and are now gaining increasing attention from
the scientific community. Consider for example the following extract “the crisis was expected last
may a man died...”. The extract does not provide any clue concerning the speaker pauses and
intonation, making it very difficult to decide what has been said: “The crisis was expected. Last
May a man died” or “The crisis was expected last May. A man died”. These last two versions of
the extract are easier to read and to process. Besides improving human readability, punctuation marks and capitalization provide important information for Parsing, Machine Translation
(MT), Information Extraction, Summarization, Named Entity Recognition (NER), and other
downstream tasks that are usually also applied to written corpora. Apart from the insertion
of punctuation marks and capitalization, enriching the speech recognition output covers other important activities, such as Speaker Identification and the Detection and Filtering of Disfluencies, which are outside the scope of this thesis.
1.1 Emerging Interest in Rich Transcription
The production of rich transcripts involves both speech-to-text (STT) technologies and metadata extraction technologies. The rich transcription (RT) process usually involves recovering structural information and creating metadata from that information.
The final enriched transcript contains all the metadata, which can then be used to enhance the
final recognition output. The following are the most relevant MDE tasks used in the production
of rich transcripts:
Speaker diarization: Covering sub-tasks such as “Who Speaks When” and “Who Said What”, this task consists of assigning the different parts of the speech to the corresponding speakers. It is important for many speech sources, such as telephone conversations and meetings.
Sentence segmentation: Often also referred to as Sentence Boundary Detection, this task consists
of identifying the SU (Sentence-like Unit) boundaries. When dealing with speech, the
notion of SU (Strassel, 2004) or utterance (Jurafsky and Martin, 2009) is often used instead
of “sentence”. More details concerning this task can be found in Section 2.2.
Punctuation recovery or detection: This task consists of identifying and inserting punctuation marks, which can be full stops, commas, question marks, exclamation marks, or other less common marks. It shares most of the strategies also used in the sentence segmentation task.
Capitalization: Also known as truecasing, this task consists of assigning the correct case to each word, when such information is unavailable.
Disfluency detection and filtering: Disfluencies are well-known phenomena, occurring especially in spontaneous speech. They are inevitable because humans make utterances while
thinking about what to say. Disfluencies are classified into two broad categories (Furui
and Kawahara, 2008): (1) fillers, with which speakers try to fill pauses while thinking or
to attract the attention of listeners (e.g., “um”, “uh”, “well”, “you know”); and (2) repairs,
including repetitions and restarts, with which speakers try to correct, modify, or abort
earlier statements.
While the speech-to-text core technologies have been developed for more than 30 years (Furui,
2005), metadata extraction/annotation technologies have only gained significant importance in recent years. For example, Jurafsky and Martin (2009), published in 2009, contains an entire section dedicated to this subject (Chapter 10 - Speech Recognition: Advanced Topics), while this topic was only briefly mentioned in the first edition of that book, published in 2000 (Jurafsky and Martin, 2000). Moreover, recent advances in Rich Transcription were the focus of a September 2010 special issue on “New Frontiers in Rich Transcription” in the IEEE Transactions on Audio, Speech, and Language Processing.
The Rich Transcription project, from the DARPA-sponsored EARS (Effective, Affordable, Reusable Speech-to-Text) program, was a five-year project with the aim of advancing the state
of the art in automatic rich transcription of speech. The Metadata Extraction and Modeling
task, described in the project, aims at introducing structural information into the ASR output,
as a good human transcriber would do, and includes the following topics: “Punctuation and
topic segmentation”, “Disfluency detection and clean-up”, “Semantic annotation”, “Dialogue
act modeling”, “Speaker recognition, segmentation, and tracking”, and “Annotation of speaker
attributes”.
The NIST RT evaluation series (http://www.nist.gov/speech/tests/rt/) is another important initiative that supports some of the goals of the DARPA EARS program, providing means to evaluate STT and MDE technologies. The main purpose of this initiative was to involve the community, by studying the STT/MDE integration and determining the MDE goals. Two other important goals consist of providing a state-of-the-art baseline and beginning the creation of flexible/extensible evaluation
paradigms (new formats and new software). These evaluation series cover some of the metadata extraction tasks, such as: Speaker Diarization; Edit Word Detection; Filler Word Detection;
and Sentence Boundary detection. The RT 2002 evaluation was the first of this RT evaluation series, covering STT tasks for three different speech sources (broadcast news, conversational telephone speech, and meetings) and speaker diarization for broadcast news and conversational
telephone speech. It is important to notice that by this time the “rich transcription” concept
was not yet completely established. In 2003, two evaluations were conducted: RT-03S, focusing on the STT tasks, but also covering the speaker diarization task “Who Spoke When”; and RT-03F, focusing on the MDE tasks, which included “Who Said What” speaker diarization, sentence boundary detection, and disfluency detection for broadcast news speech and conversational
telephone speech. Other languages besides English were covered in the 2004 evaluation series,
[Figure 1.1: Integration of the RT modules in the recognition system. The speech signal is processed by audio segmentation and speech recognition; rich transcription modules (diarization, punctuation, capitalization, disfluency detection and filtering) combine the speech transcription with signal features such as pitch and energy to produce an enriched transcription, used for close-captioning, multimedia content, and further automatic processing (indexing, retrieval, machine translation, summarization, named entity recognition).]
but after this period these evaluations focused again on the English language. After 2004, all evaluations focused on the English meeting domain, the STT task, and the MDE speaker diarization task. In 2005, two specific speaker diarization sub-tasks, under the scope of the meetings domain, were introduced: Speech Activity Detection, which consists of detecting when someone is talking; and Source Localization, which determines the 3D position of the person who is talking.
Nonetheless, it is important to notice that despite the emerging efforts in the RT scope, many
metadata extraction tasks are still not covered by these evaluation plans.
Most of the current research focuses on the English language. However, some initiatives have
also been reported for other languages in the last few years. For example, the ESTER evaluation
campaign is an important initiative for evaluating automatic broadcast news transcription systems for the French language (Gravier et al., 2004). ESTER is part of the EVALDA project and
focuses on the evaluation of rich transcription and indexing of broadcast news. The campaign
started in 2003 with a pilot evaluation, using a subset of the final corpus, and implemented
three categories of tasks: transcription, segmentation and information extraction.
1.2 Rich Transcription Integration
Most of the information required by the rich transcription modules is usually provided by
the speech transcript itself, produced by the speech recognition modules. However, additional
information may be extracted directly from the speech signal, or provided by other sources
of information, such as linguistic knowledge. Figure 1.1 summarizes the integration of rich
transcription modules in a generic recognition system. The raw recognition output is still useful for some applications (e.g., indexing and information retrieval), but the RT tasks provide
additional information for improved results in a broader set of applications.
Depending on the application, the RT modules may be required to work online. For example, on-the-fly subtitling for oral presentations or TV shows demands a very small delay
between the speech production and the corresponding transcript. In these systems, both the
[Figure 1.2: Overall architecture of the L2F speech recognition system. In the online system, the broadcast TV signal passes through jingle detection (JD), audio segmentation (APP), speech recognition (ASR), punctuation, and capitalization modules, under a control system and GUI, and feeds subtitling generation and a teletext server; the offline system adds topic segmentation and indexing, and summarization, updating an XML file for web publishing.]
computational delay and the number of words to the right of the current word that are required
to make a decision are important aspects to be taken into consideration. On the other hand,
an offline system can perform multiple passes on the same data, thus being able to produce
improved results. One of the goals behind this work consists of building a module for integration both in the on-the-fly subtitling system and in the offline multimedia content production system, which are currently in use.
The blocks that constitute the current architecture of the Spoken Language Laboratory (L2F) recognition system (Amaral et al., 2007) are illustrated in Figure 1.2. The system
follows the connectionist paradigm. The JD (Jingle Detection) module detects the beginning
and the end of the show, as well as the possible breaks. The selected blocks of speech are then
processed by the APP (Audio Pre-Processing or Audio Segmentation) module (Meinedo and
Neto, 2003), which splits the input into homogeneous speech segments based on the acoustic
properties of the speech signal, performs speaker clustering, identifies the speaker gender, and
classifies the speech according to the different focus conditions (noisy, clean, etc.). Audimus,
a large vocabulary continuous speech recognition module (Meinedo et al., 2008), is the ASR
module that processes each speech segment, previously identified by the APP module, and
produces the initial speech transcript. The punctuation and capitalization modules, developed
in the scope of this thesis, update the speech transcript data, by adding the punctuation and
capitalization information. The whole system can be used either for TV Close-captioning (online) or for producing multimedia content (offline). The online version of the system uses the
Subtitling Generator to send the final output to the Teletex Server. The speech recognition processing chain used for producing multimedia content corresponds to the offline version of the
system. An XML file (http://www.w3.org/XML/) is iteratively updated with information coming from each processing
module, and its final content contains all the information required for web publishing. The
Topic Segmentation and Indexing (Amaral and Trancoso, 2008), as well as the Summarization
(Ribeiro and Matos, 2007, 2008) modules, take part in the processing chain. Because this is an
offline chain, the BN show can be processed as a whole by each processing module in multiple
passes, thus producing improved results.
The first modules of this system, including punctuation and capitalization, are optimized
for on-line performance, given their deployment in the fully automatic subtitling system that
is running on the main news shows of the public TV channel in Portugal, since 2008 (Neto
et al., 2008). All the information is being used by the Selective Dissemination of Multimedia
Information system – SSNT (Amaral et al., 2007), which has been deployed since 2003, and is
now processing Portuguese broadcast news on a daily basis.
1.3 Motivation and Goals
Recent advances in ASR systems, together with the increase in computational resources, now make it possible to process the broadcast speech signals continuously produced by a number of broadcasters. The Portuguese recognition system, developed at the Spoken Language Laboratory (L2F), and described previously in Section 1.2, is a state-of-the-art
ASR system, now being applied to different domains of the Portuguese language. It has been
applied since the beginning of 2003 to the main TV news show, produced by the National Portuguese Broadcaster RTP. Nowadays, it is being used for processing the most important news
shows, produced by all Portuguese TV broadcasters: RTP, SIC and TVI. The system performs
two different tasks: live close-captioning, and multimedia content production for offline usage.
The original content produced by this system was still difficult to read and to process, mainly
because of the incorrect segmentation, and also because basic information, such as capital letters, was missing. Enriching the speech recognition output is an important asset for a speech recognition system that performs tasks like online captioning or produces multimedia content. Hence, the main motivation behind this thesis consists of producing enhanced transcripts that can be used in real-life speech recognition systems. More specifically, the main target is to correctly address the tasks of recovering punctuation marks and capitalization information in speech transcripts produced by an automatic speech recognition system. Accordingly, one of the expected outcomes is a prototype module, incorporating Rich Transcription tasks, for integration in the L2F recognition system.
Figure 1.3 shows an excerpt of a transcribed text, where the upper rectangle shows the original text, corresponding to the output of a recognition system, with a paragraph segmentation
based purely on the acoustic elements of the signal. The second rectangle introduces the capitalization information.
[Figure 1.3: Excerpt of a transcribed text, with different markup conditions. The figure shows five versions of the same Portuguese broadcast news excerpt, ranging from the raw ASR output, through versions enriched with capitalization, punctuation, and segmentation, to a reference version without recognition errors.]
The third rectangle shows the same text, enriched with capitalization,
punctuation, and the corresponding segmentation. The last two rectangles also show that this
excerpt cannot be seen exactly as written text, due to the presence of recognition errors and other phenomena that occur in speech, such as disfluencies. The last rectangle shows the perfect result, without recognition errors, and where the punctuation and capitalization were correctly assigned. The third rectangle shows the desired output for the modules developed in the scope of this thesis: whereas the output still contains a number of recognition errors, the punctuation and capitalization information is correctly assigned.
The output of a speech recognition system includes a broad set of lexical, acoustic and
prosodic information, such as time gaps between words, speaker clusters, and speaker gender, which can be combined to produce the best results. On the other hand, the speech signal is also an important source of information when certain features, such as pitch and energy, are not available in the recognition output. An important initial goal addressed by this thesis consisted of investigating and evaluating different punctuation and capitalization methods. An important requirement for a given method is its ability to combine all the available and relevant information. An additional outcome expected from this thesis is a better understanding of the individual contribution of each feature to the final performance in both tasks.
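As a concrete illustration of how a continuous acoustic cue, such as the time gap between two consecutive words, can be exposed to a classifier, one common option is to convert it into binary interval features (this conversion is illustrated later, in Figure 5.4). The thresholds below are hypothetical, not the intervals used in the experiments.

```python
# Hypothetical pause thresholds (in seconds); the intervals actually used in the
# experiments may differ.
PAUSE_THRESHOLDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0]

def pause_features(gap_seconds):
    """Turn a word-boundary time gap into sparse binary features."""
    return {f"gap>{threshold}": gap_seconds > threshold for threshold in PAUSE_THRESHOLDS}

# Example: a 0.6 second pause after the current word
print(pause_features(0.6))
```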
Current computational resources allow the manipulation of large-sized data, and the application of complex learning methods on such data. On the other hand, we are now witnessing
the mass production of written content on the web. Different Portuguese newspaper companies now make their news and last-minute news content freely available on the web. This written corpus constitutes an important resource for processing the Portuguese language, and it also provides important basic information for speech processing. Given that only
a limited set of manually labelled speech training data is now available for Portuguese, one
of the main goals of this thesis consisted of using additional sources of information whenever
possible, including large written corpora.
Finally, the Portuguese language is spoken by a large community in many countries around
the world, such as Brazil and many African countries. For that reason it would be important
that this work could be easily extended to such language varieties and even to other languages.
The research conducted in the scope of this thesis is, as much as possible, language independent, but a special focus is given to the specific problems of Portuguese. The extension to
other languages is restricted to a number of experiments concerning the English and Spanish
languages.
1.4 Proposed Strategy
This study considers both punctuation and capitalization tasks as two classification tasks,
thus sharing the same classification approach. Both tasks will be performed using the same
maximum entropy (ME) modeling approach, a discriminative approach, suitable for dealing
with speech transcripts, which includes both read and spontaneous speech, the latter being
characterized by more flexible linguistic structures and by adjustments to the communicative
situation (Blaauw, 1995). The use of a discriminative approach facilitates the combination of
different data sources and different features for modeling the data. It also provides a framework for learning with new data, while slowly discarding unused data, making it interesting
for problems that comprise language variations in time, such as capitalization. With this approach, the classification of an event is straightforward, making it interesting for on-the-fly
integration, with strict latency requirements.
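As a rough sketch of what such an event-by-event discriminative classifier looks like in practice, the code below uses scikit-learn's logistic regression, which is equivalent to a maximum entropy model, over an illustrative feature set; the features and toy events are hypothetical, not the ones engineered in this thesis.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each event is a word boundary described by lexical and acoustic cues;
# the label is the punctuation decision to be made at that boundary.
events = [
    ({"word": "died", "next_word": "the", "gap>0.5": True, "speaker_change": False}, "full_stop"),
    ({"word": "expected", "next_word": "last", "gap>0.5": False, "speaker_change": False}, "none"),
    ({"word": "said", "next_word": "so", "gap>0.5": False, "speaker_change": False}, "comma"),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform([features for features, _ in events])
y = [label for _, label in events]

# Multinomial logistic regression trained on such events plays the role of the ME model.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Classifying a new boundary is a single, cheap prediction, which is what makes the
# approach attractive for on-the-fly integration.
test_event = {"word": "died", "next_word": "the", "gap>0.5": True, "speaker_change": False}
print(model.predict(vectorizer.transform([test_event]))[0])
```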
The capitalization of a word depends mostly on the context where that word appears, and
can be regarded as a sequence labeling or a lexical ambiguity resolution problem. The Hidden
Markov Model (HMM) framework is a typical approach, used since the early studies, that can
be easily applied to such problems. That is because computational models for sequence labeling or lexical ambiguity resolution usually involve language models (LM) built from n-grams,
which can also be regarded as Markov models. For that reason, the capitalization experiments reported here include comparative results achieved using an HMM-based approach. Rather than comparing with other approaches, the punctuation experiments focus on the use of additional information sources and on the wide range of features provided by the speech data.
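For contrast, here is a hedged sketch of the generative, HMM-style alternative: capitalization as a search over case-sensitive word forms scored by an n-gram language model and decoded with Viterbi. The bigram probabilities below are toy numbers, not estimates from the corpora used in this thesis.

```python
import math

# Toy case-sensitive bigram log-probabilities (hypothetical values).
BIGRAM_LOGP = {
    ("<s>", "A"): math.log(0.6), ("<s>", "a"): math.log(0.4),
    ("A", "Ministra"): math.log(0.5), ("A", "ministra"): math.log(0.5),
    ("a", "Ministra"): math.log(0.2), ("a", "ministra"): math.log(0.8),
    ("Ministra", "falou"): math.log(0.9), ("ministra", "falou"): math.log(0.9),
}
UNSEEN_LOGP = math.log(1e-3)  # crude back-off score for unseen bigrams

def case_candidates(word):
    """Case variants considered for each lowercase token (a real system would use a lexicon)."""
    return {word, word.capitalize()}

def viterbi_capitalize(tokens):
    """Pick the sequence of case forms that maximizes the bigram LM score."""
    beams = {"<s>": (0.0, [])}  # best (log-prob, sequence) ending in each form
    for token in tokens:
        new_beams = {}
        for candidate in case_candidates(token):
            best = None
            for previous, (logp, sequence) in beams.items():
                score = logp + BIGRAM_LOGP.get((previous, candidate), UNSEEN_LOGP)
                if best is None or score > best[0]:
                    best = (score, sequence + [candidate])
            new_beams[candidate] = best
        beams = new_beams
    return max(beams.values())[1]

print(" ".join(viterbi_capitalize(["a", "ministra", "falou"])))
```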
1.5 Document Structure
This document presents the research developed under the scope of this thesis and points out possible future directions for the ongoing research. The document is structured as follows: Chapter 2 provides an overview of Rich Transcription, describing the current state of the art on specific tasks of this domain, the metrics currently in use, some of which were adopted for this document, and the approach adopted here for automatic punctuation and capitalization. Chapter 3 describes the different corpora used for training and testing, focusing on issues related to their treatment and on the feature extraction process. Chapter 4 presents the capitalization task,
and reports different experiments comparing different methods and analysing the impact of
language variation over time. Chapter 5 deals with the punctuation task and the corresponding experiments. Finally, Chapter 6 presents the conclusions and proposes a number of future
tasks to further extend the work here described.
2 State of the Art
Spoken language is similar to written text in many aspects, but is different in many others, mostly due to the way these communication methods are produced. Current ASR systems
focus on minimizing the WER (Word Error Rate), making few attempts to detect structural information which is available in written texts. Spoken language is also typically less organized
than textual material, making it a challenge to bridge the gap between spoken and written material. The text produced by a standard speech recognition system consists of raw single-case
words, without punctuation marks, with numbers written as text, and with many different
types of disfluencies. The representation format of this text, equivalent to SNOR (Standard
Normalized Orthographical Representation), is difficult to read and sometimes even hard to
understand because of the missing information (Jones et al., 2005b). Moreover, the missing
information, specifically punctuation, sentence boundary, and capitalization, is also important
for many types of automatic downstream processing, such as parsing, information extraction,
dialog act modeling, NER, summarization, and translation (Shriberg et al., 2000; Zechner, 2002;
Huang and Zweig, 2002; Kim and Woodland, 2003; Kahn et al., 2004; Niu et al., 2004; Ostendorf
et al., 2005; Jones et al., 2005a; Makhoul et al., 2005; Shriberg, 2005; Khare, 2006; Matusov et al.,
2006; Cattoni et al., 2007; Ostendorf et al., 2008). For example, Kahn et al. (2004) and Harper
et al. (2005) reveal that the parsing accuracy is strongly affected by sentence boundary detection errors. Makhoul et al. (2005) show that punctuation marks can improve the accuracy of
information extraction algorithms over speech transcripts. Other studies have shown that the
punctuation marks, or at least sentence boundaries, are important for machine translation (Matusov et al., 2006; Cattoni et al., 2007) and summarization (Zechner, 2002). The capitalization
information is also important for human readability and other tasks, such as parsing, machine
translation, and NER (Lita et al., 2003; Niu et al., 2004; Khare, 2006; Wang et al., 2006).
The rich transcription process usually involves recovering structural information, which has been the focus of many studies (Heeman and Allen, 1999; Kim et al., 2004),
and the creation of metadata from that information. Liu et al. (2005) presents an overview
of the research on metadata extraction, in the scope of the Rich Transcription project, from
the DARPA EARS program. The paper focuses on the detection of structural information in
the word stream, covering four main tasks: Sentence Unit detection, Edit word detection, Filler
word detection, and Interruption point detection. Speaker diarization, excluded from the scope
of this thesis, is overviewed by Reynolds and Torres-Carrasquillo (2005). Chen et al. (2006) and Soltau et al. (2005) describe the advances in the IBM speech recognition technology during the
EARS program. Liu et al. (2004, 2006) describe the ICSI-SRI-UW system for extraction of metadata, also previously developed as part of the EARS Rich Transcription program. The system
includes sentence boundary detection, filler word detection, and detection/correction of disfluencies. The paper reports results on the NIST Rich Transcription metadata tasks. Strassel et al.
(2003) describes the efforts at LDC (Linguistic Data Consortium) to create shared resources for
improved speech-to-text technology, motivated by the EARS program demands. The DARPA
GALE program (http://projects.ldc.upenn.edu/gale/) is an ongoing project, also involving LDC, whose goal is to develop and apply computer software technologies to absorb, analyze and interpret huge volumes of speech
and text in multiple languages. The LDC now provides quick transcription specifications for
a number of languages, including Arabic, Chinese, and English, and will collect transcripts for large volumes of speech conforming to such specifications. The specification elements include
accurate transcription of sentences (segmentation), sentence type identification, standardized
punctuation, and orthography.
A fair question concerning punctuation and capitalization is whether the ASR system can be adapted
for dealing with both tasks, instead of creating additional modules. The work reported by Kim
and Woodland (2004) addresses this question by proposing and evaluating two methods: i)
adapting the ASR system for dealing with both punctuation and capitalization, by duplicating
each vocabulary entry with the possible capitalized forms, modeling the full stop with silence,
and training with capitalized and punctuated text, and ii) using a rule-based named entity
tagger and punctuation generation. The paper shows that the first method produces worse
results, due to the distorted and sparser language model, thus suggesting the separation of the
punctuation and capitalization tasks from the speech recognition system.
A number of studies consider punctuation and capitalization recovery as two connected
tasks and perform them concomitantly (Mikheev, 1999, 2002; Baldwin and Joseph, 2009; Gravano et al., 2009; Lu and Ng, 2010). For example, Stevenson and Gaizauskas (2000) perform a
number of experiments using human annotators, and conclude that case information is an important feature for sentence boundary detection. This is in line with Baldwin and Joseph (2009), who conclude that these two tasks are highly interdependent, and that getting one of the two tasks right makes the other considerably easier. Nevertheless, each one of these
tasks presents its own challenges, and therefore they are treated individually by most of the reported work. The remainder of this section overviews the related work on each one of the two
rich transcription tasks, describes the approach adopted for this study, considering a number
of requirements, and presents the performance evaluation metrics adopted here.
1 http://projects.ldc.upenn.edu/gale/
2.1 Related Work on Capitalization
The capitalization task, also known as truecasing (Lita et al., 2003; Jurafsky and Martin,
2009), consists of rewriting each word of an input text with its proper case information given
its context. Many languages distinguish between uppercase and lowercase letters; however, capitalization does not apply to languages that do not use Latin, Greek or Cyrillic scripts, such as Chinese, Thai, Japanese, Arabic, Hindi, and Hebrew. Proper capitalization can
be found in many information sources, such as newspaper articles, books, and most of the web
pages. Besides improving the readability of texts, capitalization provides important semantic
clues for further text processing tasks. Different practical applications benefit from automatic
capitalization as a preprocessing step: many computer applications, such as word processing
and e-mail clients, perform automatic capitalization along with spelling and grammar checking; and when dealing with speech recognition output, automatic capitalization provides
relevant information for automatic content extraction, Named Entity Recognition, and Machine
Translation. For example, Kubala et al. (1998) perform NER over speech data and conclude that
the performance is affected by the lack of punctuation and capitalization information, especially
when dealing with proper names, given that most entities involve capitalized words.
In applications where capitalization is expected, a typical approach consists of modifying the process that usually relies on case information in order to suppress the need for that information (Brown and Coden, 2002; Manning et al., 2008). NE extraction on ASR output, a core task in the DARPA-sponsored Broadcast News workshops, is a good example of such an approach. Bikel et al. (1997) describes an HMM-based (Baum and Petrie, 1966) NE extraction
system that performed well in these circumstances. When trained with data previously converted to lowercase, the system performance was almost as good as when tested with mixed
case data. An alternative to modifying the applications for lowercase input is to first recover the capitalization information, which can also benefit a number of other applications
that use case information. Concerning this subject, Niu et al. (2004) propose an alternative two-step approach to Named Entity tagging. Instead of excluding the case-related features, which
is the traditional approach, the authors introduce a preprocessing module designed to restore
case information. Based on the observation that the size of a training corpus is often a more
important factor than the complexity of the model, the authors use a bigram HMM, trained
using a large corpus (7M words) of case sensitive documents, concluding that this approach
(i) outperforms the feature exclusion approach for Named Entity tagging, (ii) leads to limited
degradation for semantic parsing and relationship extraction, (iii) reduces system complexity,
and (iv) has wide applicability: the restored text can feed both statistical model and rule-based
systems. Also concerning this subject, Kim and Woodland (2004) describe a Capitalization
Generation system used in their NER system. The authors show that using a rule-based NE tagger and punctuation generation is better than duplicating each vocabulary entry with the possible capitalized forms, modeling the full stop with silence, and training the ASR with capitalized
and punctuated text. The performance measures include Precision, Recall and SER (Slot Error Rate); only three different capitalization classes are considered, and the
first word of each sentence is used in the evaluation.
None of the capitalization results reported in this thesis consider the first word of a sentence in the evaluation, because its capitalized form depends mostly on its position in the sentence. However, the connection between case information and punctuation has been reported
by a number of studies. Mikheev (1999, 2002) presents an approach to the disambiguation of
capitalized words, consisting of a cascade of different simple positional heuristics, but only
where capitalization is expected, such as the first word of the sentence or after a period. This
study was performed over written corpora where the capitalization is provided. Another study
recovering capitalization for punctuated texts using heuristics is reported by Brown and
Coden (2002), where a series of techniques and heuristics are evaluated. The co-dependence of
case information and punctuation was also recently investigated by Baldwin and Joseph (2009)
and Gravano et al. (2009). In both studies, punctuation and case information are restored simultaneously in English texts. Baldwin and Joseph (2009) explore multi-class SVMs, and Gravano et al. (2009) use purely text-based n-gram language models.
Capitalization can be viewed as a lexical ambiguity resolution problem, where each word
has different graphical forms (Yarowsky, 1994; Gravano et al., 2009). A pilot example of such an approach is reported by Yarowsky (1994), who presents a statistical procedure for lexical ambiguity resolution, based on decision lists, that achieved good results when applied to accent
restoration in Spanish and French. The capitalization and accent restoration problems can be
treated using the same methods, given that a different accentuation can be regarded as a different word form. Capitalization can also be viewed as a specialized spelling correction by
considering different capitalization forms as spelling variations (Lita et al., 2003). Finally, the
capitalization problem may also be seen as a sequence tagging problem, where each lowercase word is associated with a tag that describes its capitalization form (Lita et al., 2003; Kim and
Woodland, 2004; Chelba and Acero, 2004; Khare, 2006). This corresponds to a classification
problem that can be addressed with a vast number of approaches, such as POS taggers and HMM-based,
ME-based, SVM-based, MEMM-based (McCallum et al., 2000), and CRF-based (Lafferty et al.,
2001) classifiers.
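To make the tagging view concrete, the following minimal sketch (in Python; the tag names and helper functions are hypothetical, not the tag set actually used in this thesis) maps written word forms to capitalization tags and applies a predicted tag back to a lowercase word:

def tag_of(written_form):
    # Derive an illustrative capitalization tag from a written word form.
    if written_form.islower():
        return "LOWER"
    if written_form.isupper():
        return "ALL_UPPER"
    if written_form == written_form.capitalize():
        return "FIRST_UPPER"
    return "MIXED"          # e.g. "McGyver"; usually resolved via a lexicon

def apply_tag(lower_word, tag, lexicon):
    # Rewrite a lowercase word according to its predicted capitalization tag.
    if tag == "FIRST_UPPER":
        return lower_word.capitalize()
    if tag == "ALL_UPPER":
        return lower_word.upper()
    if tag == "MIXED":
        return lexicon.get(lower_word, lower_word.capitalize())
    return lower_word

print(tag_of("Lisbon"))                    # FIRST_UPPER
print(apply_tag("nato", "ALL_UPPER", {}))  # NATO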
Lita et al. (2003) build a trigram language model (LM) with pairs (word, tag), estimated
from a corpus with case information, and then use dynamic programming to disambiguate
over all possible tag assignments on a sentence. This paper reports experiments that reveal a
positive impact of the capitalization in named entity recognition, automatic content extraction,
and machine translation.
Chelba and Acero (2004) study the impact of using increasing amounts of training data as
well as a small amount of adaptation. This work uses an approach based on Maximum Entropy
Markov Models (MEMM). A large written newspaper corpus (WSJ) is used for training and the
test data consists of Broadcast News (BN) data (CNN and ABC prime time).
Khare (2006) investigates the usefulness of joint learning for the tasks of NER and Capitalization Generation, and further looks for feature sets that do or do not help the joint task. This is achieved by using Dynamic Conditional Random Fields (DCRFs) as models for experiments with the two tasks. The joint model is compared both with simple systems for each task that do not use the other task, and with traditional pipeline systems that perform the two tasks sequentially. The paper concludes that neither joint learning nor NER tagging helps capitalization. Nevertheless, errors made in capitalization are crucial for NER, and improving capitalization improves NER significantly.
Other related work includes a bilingual capitalization model for capitalizing Machine Translation (MT) outputs using Conditional Random Fields (CRFs), reported by Wang
et al. (2006). This work exploits case information both from source and target sentences of the
MT system, producing better performance than a baseline capitalizer using a trigram language
model. Another truecasing module that works inside a machine translation system is presented
by Agbago et al. (2005). The module is used in a Portage system and combines an n-gram language model, a case mapping model, and a specialized language model for unknown words. The module achieves an 80% relative error rate reduction over the baseline, which uses unigrams only.
Also concerning machine translation, Stüker et al. (2006) describe a set of evaluation systems
that participated in the TC-STAR 2006 (Technology and Corpora for Speech to Speech Translation) evaluation. The capitalization of the recognition output is performed in a post-processing
stage, after the actual decoding procedure, and before the punctuation. The process relies on a
4-gram language model, built both from the transcriptions and the final text editions.
A recent work performing experiments on large corpora using different n-gram orders is
reported by Gravano et al. (2009). This paper is of particular interest not only because of the
high complexity of the applied models, where the n-gram order varies from n = 3 to n = 6, but
also because of the large amount of training data, which varies from 58 million to 55 billion tokens. The paper concludes that using larger training data sets leads to increasing improvements
in performance, but increasing the n-gram order does not significantly improve capitalization
results. However, it seems that capitalization results consider the first word of each sentence,
whose capitalization is highly dependent on the assigned punctuation, implying that these
two tasks cannot be measured separately.
Most of the words and structures of a language are not subject to substantial diachronic changes. However, the frequent introduction of new words (neologisms) and the decay of others over time (archaisms) introduce dynamics into the lexicon. Being part
of the Language Dynamics emerging field (Wichmann, 2008), this problem has been addressed
for Portuguese BN in the work of Martins et al. (2007b), which proposes a daily adaptation of
the vocabulary and language model to the topic of current news, based on texts daily available
on the Web. Also concerning this subject, Mota and Grishman (2008, 2009) analyze the relation
between corpora variation over time and NER performance, showing that, as the time gap between training and test data increases, the performance of a named entity tagger based on co-training (Blum and Mitchell, 1998; Collins and Singer, 1999) also decreases. These studies
have shown that, as the time gap between corpora increases, the similarity between the corpora and the names shared between those corpora also decreases. The language adaptation
problem concerning capitalization has been addressed by Batista et al. (2008d,c), concluding
that the capitalization performance is influenced by the training data period. All these studies
emphasize the relation between named entities and capitalized words, showing that they are
influenced by time variation effects.
2.2 Related Work on Punctuation and Sentence Boundary Detection
When dealing with conversational speech, the notion of utterance or sentence-like unit (SU) is often used instead of “sentence” (Strassel, 2004; Liu et al., 2006; Jurafsky and Martin, 2009). An SU may correspond to a grammatical sentence, or can be semantically complete but smaller
than a sentence. Detecting an SU consists of finding its boundaries, and roughly corresponds to
the task of detecting positions where a punctuation mark is missing. The problem of sentence boundary detection is connected to the punctuation recovery problem, especially with
respect to predicting sentence boundary punctuation like full-stops, question marks, and exclamation marks (Shieber and Tao, 2003). Nevertheless, this problem is distinct from the sentence
boundary disambiguation problem, where the punctuation is provided and the task consists
of deciding whether or not it marks a sentence boundary (Palmer and Hearst, 1994, 1997;
Reynar and Ratnaparkhi, 1997).
Despite being originally used mostly for marking breaths, punctuation is nowadays used for marking structural units, thereby helping to disambiguate meaning and to provide cues to the coherence of the written text (Kowal and O’Connell, 2008). Inserting punctuation marks into spoken texts is a way of bringing such texts closer to written texts, keeping in mind that speech data is linguistically structured. Even so, a punctuation mark may assume a different behavior in speech; for example, a sentence in spontaneous speech does not always correspond to a
sentence in written text. A large number of punctuation marks can be considered for spoken
texts, including: comma; period or full stop; exclamation mark; question mark; colon; semicolon; and
quotation marks. However, most of these marks rarely occur and are quite difficult to insert or
evaluate. Hence, most of the available studies focus on full stop and comma, which have higher
corpus frequencies. A number of studies also consider the question mark, but most of them have
not yet shown promising results (Christensen et al., 2001). Previous work on other punctuation
marks, such as exclamation marks, is rarely found in the literature.
Sentence boundary detection has gained increasing attention during recent years, and it
has been part of the NIST rich transcription evaluations. It provides a basis for further natural
language processing, and its impact on subsequent tasks has been recently analyzed in many
speech processing studies (Harper et al., 2005; Mrozinsk et al., 2006; Ostendorf et al., 2008). The
detection of sentence boundaries is one of the main structural events annotated in the DARPA
EARS rich transcription program. This topic is addressed in Liu et al. (2005), where prosody is
shown to be more helpful for broadcast news than for conversational speech, and recognition
errors are shown to affect the performance significantly.
Hidden punctuation and sentence boundaries are considered by Shriberg (2005) the first of the four main properties of spontaneous speech that impose challenges for spoken language applications, on the basis that these properties violate assumptions
often applied in automatic processing technology. Different approaches have been reported to
address the punctuation recovery problem. Computational models for detecting punctuation
marks and sentence boundaries in speech typically involve a combination of N-gram language
models and prosodic classifiers. The HMM framework is a common approach, used since the
early studies for similar problems, that allows combining different knowledge sources. More
recently, other model types have been used successfully, such as Maximum Entropy (ME) models and Conditional Random Fields (CRFs).
The general HMM framework for detecting sentence boundaries, combining lexical and
prosodic cues, has been reported by a number of studies (Gotoh and Renals, 2000; Shriberg
et al., 2000; Liu et al., 2006; Stolcke and Shriberg, 1996; Stolcke et al., 1998). A similar approach
was also used for punctuation recovery by Kim and Woodland (2001) and Christensen et al.
(2001). One of the first studies on sentence segmentation of speech is reported by Stolcke and
Shriberg (1996). It uses an n-gram model, based on linguistic features and turn markers. Later,
Stolcke et al. (1998) study a combined approach for detecting sentence boundaries and four
classes of disfluencies on spontaneous, automatically transcribed telephone speech. The system
combines prosody, modeled by decision trees, and n-gram language models. The study demonstrated that model combination yields significantly better results than just using individual
models. Gotoh and Renals (2000) present an approach for identifying sentence boundaries in
broadcast speech transcripts, based on FSM (Finite State Machines). This work concludes that
a model estimated from pause duration information outperforms an n-gram language model
based on textual information, but the combination of the two models achieves even better results. Shriberg et al. (2000) combine prosodic cues with word-based approaches, showing that
the prosodic model alone performs on par with, or better than, word-based statistical language
models. The paper concludes that prosodic models capture language-independent boundary
indicators.
A multi-pass linear fold algorithm for sentence boundary detection in spontaneous speech
that uses prosodic features has been proposed by Wang and Narayanan (2004). This study
focuses on the relation between sentence boundaries and their correlates, pitch breaks and pitch
durations, covering their local and global structural properties. Detecting sentence boundaries
was also addressed by Liu et al. (2006), who report state-of-the-art results according to the NIST
RT-04F evaluation. Besides the common HMM approach, the usage of maximum entropy and
conditional random fields is also investigated, with experiments conducted on both broadcast news
data and conversational telephone speech. The system described combines information from
different types of textual knowledge sources, with information from a prosodic classifier. The
paper reports that discriminative models usually outperform the generative models.
The ICSI+ sentence segmentation system (Zimmermann et al., 2006) works both on English
and Mandarin, and is a result of a joint effort involving ICSI, SRI and UT Dallas. The system
uses an HMM approach for exploiting lexical information, and maximum entropy and boosting
classifiers to exploit lexical and prosodic features, speaker changes and syntactic information.
The methodology uses prosodic features, including pitch-based and energy-based features, and
is significantly better than a baseline system based on words and pause features. The paper
concludes that the pause duration in between words is a very important feature for this task,
as well as features derived from speaker turns coming from the speaker diarization system.
The work reported by Favre et al. (2008) moves towards the use of long-distance dependencies
in combination with local features. The authors construct an initial hypothesis lattice using
local features, and then assign syntactic language model scores to the candidate sentences.
The resulting system, which combines global syntactic scores with local scores, outperforms the
popular HMM model for sentence segmentation.
In terms of punctuation marks, the comma is the most frequent, but it is also the most problematic, because it serves many different purposes and is used in different syntactic contexts.
It can be used, for example, to separate elements in a series (e.g., They can read, write, and execute); to separate two independent clauses joined by a coordinating conjunction (e.g., They
have completed the tasks, but some will have to be repeated); to separate long independent constructions; to set off or enclose certain adverbs (e.g., therefore, nevertheless); to enclose parenthetical
words and phrases within a sentence; to separate each group of three digits in representing
large numbers (in English texts), or as the decimal separator (in Portuguese texts); to separate elements in dates and in geographical names; and also to prevent misreading, by separating words that
might otherwise be misread as closely related (e.g., After the bear had eaten the zookeeper cleaned
its cage vs. After the bear had eaten, the zookeeper cleaned its cage). Punctuation marks are closely
related to syntactic and semantic properties. Thus, the presence/absence of a comma in specific locations may influence the grammatical judgments of the SUs. As synthesized by Duarte (2000), commas should not be placed: i) between the subject and the predicate; ii) between the verb and its arguments; iii) between the antecedent and a restrictive relative clause; nor iv) before the copulative conjunction e/and. Then again, commas should separate: i) adverbial subordinate clauses, such
as participial or gerundive ones; ii) appositive modifiers; iii) parenthetical constituents; iv) anteposed constituents; v) asyndetically coordinated constituents; and vi) vocatives.
Concerning the question mark, European Portuguese (EP), like other languages, has different interrogative types (Mateus et al., 2003): yes/no questions (total/global interrogatives), alternative2 questions, wh- questions (partial questions) and tag questions. A yes/no question requests a yes or
no answer (Estão a ver a diferença?/Can you see the difference?). In EP they generally present the
same syntactic order as a statement, contrary to English, which may encode the yes/no interrogative with an auxiliary verb and subject inversion. An alternative question presents two or more
hypotheses (Acha que vai facilitar ou vai ainda tornar mais difícil?/Do you think that it will make
it easier or will it make it even harder?) expressed by the disjunctive conjunction ou/or. A wh-question has a wh- interrogative pronoun or adverb, such as o que/what, quem/who, quando/when,
onde/where, etc., corresponding to what is being asked about (Qual é a ideia?/What is the idea?). In
a tag question, an interrogative clause is added to the end of a statement (Isto é fácil, não é?/This
is easy, isn’t it?).
The published literature on intra-sentence punctuation recovery is quite limited. Beeferman et al. (1998) describe a lightweight method for automatically inserting intra-sentence punctuation marks into text. This method relies on an HMM with trigram probabilities, built using
solely lexical information, and it uses the Viterbi algorithm for classification. The paper focuses
on the comma restoration problem, and presents a qualitative evaluation based on user satisfaction. It concludes that the system’s performance is qualitatively higher than sentence accuracy
rate would otherwise indicate. The use of syntactic information for tackling the comma restoration problem is reported by Shieber and Tao (2003) and Favre et al. (2009). Shieber and Tao (2003)
show improved results over the use of lexical information alone, after replicating the Beeferman et al. (1998) trigram-based model. Favre et al. (2009) analyse the impact of the syntactic
features on other subsets of features, and conclude that syntactic cues can help characterize large syntactic patterns such as appositions and lists, which are not necessarily marked by
prosody. All these papers assume sentence boundaries as given, which is not the case in the
experiments reported in this thesis, where no punctuation whatsoever is assumed, including
sentence boundaries. For that reason, a direct comparison of these studies with the work here
reported is quite limited.
Kim and Woodland (2001) generate the punctuation simultaneously with the speech recognition output, and the multiple hypotheses are re-scored using prosodic information. They
conclude that prosodic information alone outperforms the use of lexical information, but the
best results are achieved by combining all information, since that information is complementary. Their experiments report a small WER reduction (0.2%). Besides the HMM framework,
Christensen et al. (2001) present an alternative approach to automatic punctuation of speech transcripts, based on MLPs (multi-layer perceptrons), that also models prosodic features. Despite the different test sets, they consider their results similar to the results reported by Kim and Woodland (2001). Both studies consider full stops, commas and question marks, but the latter does not
discriminate between the three punctuation marks.
2 In the literature, alternative questions may not be considered a type of interrogative, but rather a subtype. For the sake of distinguishing alternative questions from disjunctive declarative clauses, we included the alternatives as well.
Huang and Zweig (2002) describe an ME-based method for inserting punctuation marks
into spontaneous conversational speech. This work views the punctuation task as a tagging
task, where words are tagged with the appropriate punctuation. The ME tagger uses both lexical and prosodic features; the Switchboard corpus released by LDC is used for training and
Hub5-2000 evaluation data for testing (about 20% WER). This work covers three punctuation
marks: comma, period and question mark. The best results on the ASR output are achieved using bigram-based features and by combining lexical and prosodic features (full stop: 73% Precision and 65% Recall; comma: 77% Precision and 74% Recall; and question mark: 64% Precision and 14% Recall).
Stüker et al. (2006) describes the ISL machine translation system used in the TC-STAR 2006
evaluation. In this system, the output is enriched with punctuation marks, or as the authors
call them, boundary marks, by means of a case-sensitive, 4-gram language model and hardcoded rules based on pause duration. In this system, the punctuation is performed after the
capitalization. Also concerning Machine Translation, Cattoni et al. (2007) report a system that
recovers punctuation directly over confusion networks. This paper compares three different
ways of inserting punctuation and concludes, by evaluation over manual transcriptions, that
the best results are achieved when the training corpus includes punctuation marks in both languages, which means that the translation is performed from punctuated input to punctuated
output.
Lu and Ng (2010) propose an approach based on CRFs, which jointly performs sentence boundary and sentence type prediction, as well as punctuation prediction, on speech utterances.
Evaluations have been performed on English and Chinese transcribed conversational speech
texts. The authors use an empirical evaluation and conclude that their method outperforms an
approach based on linear-chain conditional random fields and other previous approaches.
Other recent studies have shown that the best performance for the punctuation task is
achieved when prosodic, morphologic and syntactic information are combined (Liu et al., 2006;
Ostendorf et al., 2008; Favre et al., 2009).
2.3 The Maximum Entropy Approach
Most of the experiments described in this document treat the punctuation and capitalization tasks as two classification tasks, thus sharing the same approach. The approach is based on logistic regression classification models, which correspond to maximum entropy classification for independent events, first applied to natural language problems by Berger et al. (1996). A maximum entropy model estimates the conditional probability of the
events given the corresponding features. Let us consider the random variable y ∈ C that can
take k different values, corresponding to the classes c_1, c_2, ..., c_k. The maximum entropy model
is given by the following equation:
    P(c|d) = (1 / Z_λ(F)) × exp( ∑_i λ_ci f_i(c, d) )

Z_λ(F) is a normalizing term, used to make the exponential a true probability; it is determined by the requirement that ∑_{c∈C} P(c|d) = 1, and is given by:

    Z_λ(F) = ∑_{c′∈C} exp( ∑_i λ_c′i f_i(c′, d) )
f_i are feature functions corresponding to features defined over events, and f_i(c, d) is the feature defined for a class c and a given observation d. The index i indicates different features, each of which has associated weights λ_ci, one for each class. The ME model is estimated by finding the parameters λ_ci with the constraint that the expected values of the various feature functions match the averages in the training data. These parameters ensure the maximum entropy of the distribution and also maximize the conditional likelihood ∏_i P(y^(i)|d^(i)) of the
training samples. Decoding is conducted for each sample individually and the classification is
straightforward, making it interesting for on-the-fly usage.
Maximum entropy is a probabilistic classifier, a generalization of Boolean classification,
that provides probability distributions over the classes. The single-best class corresponds to
the class with the highest probability, and is given by:
    ĉ = argmax_{c∈C} P(c|d) = argmax_{c∈C} [ exp( ∑_i λ_ci f_i(c, d) ) / ∑_{c′∈C} exp( ∑_i λ_c′i f_i(c′, d) ) ]
This approach provides a clean way of expressing and combining different aspects of the
information. This is especially useful for the punctuation task, given the broad set of lexical,
acoustic and prosodic features that can be used.
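As a toy illustration of that combination (the feature names and weights below are invented for the example, not the actual feature set of this work), the maximum entropy formulation above can score punctuation classes from a mix of lexical and acoustic indicator features:

import math

# Invented per-class weights λ_ci for two binary features; illustration only.
weights = {
    "none":      {"word=the": 0.8,  "pause>0.3s": -0.5},
    "full_stop": {"word=the": -1.2, "pause>0.3s": 1.4},
    "comma":     {"word=the": -0.4, "pause>0.3s": 0.6},
}

def classify(active_features):
    # P(c|d) = exp(sum_i λ_ci f_i(c,d)) / Z, with binary features f_i.
    scores = {c: math.exp(sum(w.get(f, 0.0) for f in active_features))
              for c, w in weights.items()}
    z = sum(scores.values())                    # normalizing term Z
    return {c: s / z for c, s in scores.items()}

probs = classify({"word=the", "pause>0.3s"})
best = max(probs, key=probs.get)                # single-best class (argmax)
print(best, probs)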
Another interesting property of this method concerns feature selection, a central problem in
Machine Learning, which consists of finding the best subset of features for a given problem. Most
methods for feature selection are based on information theory measures, like mutual information
and entropy. Maximum entropy solves this problem naturally by finding the entropy of each
feature, which means that “more features never hurt!”.
[Figure 2.1: Block diagram of the capitalization and punctuation tasks. The left side shows the training process (corpus, PoS tagging, capitalization and punctuation features, ME training of the capitalization and punctuation models); the right side shows classification of an ASR transcript with the ME classifiers and a capitalization lexicon, producing a punctuated and capitalized transcript.]
2.3.1 Application to Rich Transcription
Figure 2.1 illustrates the classification approach for both tasks, where the left side of the
picture represents the training process using a set of predefined features, and the right side
corresponds to classification using previously trained models. An updated lexicon containing
the capitalization of new and mixed-case words (e.g., “McGyver” is a mixed-case word) can be used as a complement for producing the final capitalization form, since
it contains the corresponding written form. Notice, however, that our evaluation results involve the classification only. As shown in the figure, capitalization comes first in the classification pipeline, thus producing suitable information for feeding a part-of-speech tagger.
Subsequently, part-of-speech information is used to aid in detecting the punctuation marks, corresponding to SU boundaries. The capitalization of the first word of each sentence is assigned
in a post-processing step, based on the previously detected SU boundaries.
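The ordering just described can be summarized by the hypothetical driver below; each helper is a trivial stand-in for the trained models and tools, not the actual system interface:

# Illustrative only: trivial stand-ins for the real modules, to show the order
# capitalization -> POS tagging -> punctuation -> sentence-initial uppercasing.
def capitalize(tokens):
    proper = {"lisbon", "portugal"}                 # toy capitalization lexicon
    return [t.capitalize() if t in proper else t for t in tokens]

def pos_tag(tokens):
    return ["Np" if t[:1].isupper() else "X" for t in tokens]   # toy tagger

def punctuate(tokens, tags):
    return tokens[:-1] + [tokens[-1] + " ."]        # toy SU boundary at the end

def uppercase_sentence_starts(tokens):
    return [tokens[0].capitalize()] + tokens[1:]

def rich_transcription(asr_tokens):
    capitalized = capitalize(asr_tokens)
    tags = pos_tag(capitalized)
    punctuated = punctuate(capitalized, tags)
    return uppercase_sentence_starts(punctuated)

print(" ".join(rich_transcription(["the", "meeting", "took", "place", "in", "lisbon"])))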
The maximum entropy models described in these experiments are trained using the MegaM
tool (Daumé III, 2004), which uses an efficient implementation of conjugate gradient (for binary
problems) and limited memory BFGS (for multiclass problems) for training the ME models.
The MegaM tool includes an option for predicting results from previously trained models. Nevertheless, by the time these experiments started, it was not prepared to deal with a stream of
data, producing results only after completely reading the input. An on-the-fly prediction tool was therefore created, which uses the models in their original format and overcomes this problem.
Trigram counts        Class        Weight          Features
w1 w2 w3 ⇒ count1     class(w2)    WEIGHT=count1   W:w2 PB:w1 w2 NB:w2 w3 T:w1 w2 w3
w2 w3 w4 ⇒ count2     class(w3)    WEIGHT=count2   W:w3 PB:w2 w3 NB:w3 w4 T:w2 w3 w4
w3 w4 w5 ⇒ count3     class(w4)    WEIGHT=count3   W:w4 PB:w3 w4 NB:w4 w5 T:w3 w4 w5
...                   ...          ...             ...

Figure 2.2: Conversion of trigram counts into features.
2.3.2 Large Corpora Issues
This approach requires all information to be expressed in terms of features, causing the
resultant data file to become several times larger than the original one. Capitalization models, for example, are usually trained using large written corpora, which contain the required
capitalization information. On the other hand, the memory required for training with this approach increases with the size of the corpus (number of observations). The MegaM tool, used
in our experiments, requires the training to be performed on a single machine using all
the training data in a single step. This constitutes a training problem, making it difficult to
use large corpora for training. Two different training strategies are proposed here to deal with
these memory limitations and minimize this problem:
N-gram counts based strategy is based on the fact that, in MegaM, scaling an event by its number of occurrences is equivalent to using multiple occurrences of that event. Accordingly, our strategy for using large training corpora consists of counting all n-gram occurrences in the training data and then using those counts to produce the corresponding input features. Figure 2.2 illustrates this process considering trigram counts (see also the sketch after this list). The class of each word corresponds to the type of capitalization observed for that word. Each trigram provides a feature vector concerning the middle word, namely: W (current word), PB (previous bigram), NB (next bigram), and T (trigram containing the three words). By pruning the less frequent n-grams if necessary, this strategy allows the usage of large corpora sets. Higher-order n-grams can be used; however, it is not possible to produce all the desirable representations from n-gram counts: for example, sentences containing fewer than n words are discarded, which may lead to degraded results.
Retraining strategy The memory problem can also be solved by splitting the corpus into several subsets, and then iteratively retraining with each one separately. The first subset is
used for training the first ME model, which is then used to provide initial values for the
weights of the next iteration over the next subset. This process goes on, comprising several epochs, until all subsets are used. Although the final ME model contains information
from all corpora subsets, events occurring in the latest training sets gain more importance in the final model. As the training is performed with the new data, the old models
are iteratively adjusted to the new data. This approach provides a clean framework for
Ref: w1 w2   w3 w4 w5 . w6 w7 , w8 w9 w10 .
Hyp: w1 w2 . w3 w4 w5   w6 w7 . w8 w9 w10 .
(one insertion after w2, one deletion after w5, one substitution after w7, one correct final slot)

Ref: here is an Example of a Big capitalization SER
Hyp: here Is an example of a BIG capitalization SER
(one insertion "Is", one deletion "example", one substitution "BIG", one correct slot "SER")

Figure 2.3: Example of correct and incorrect slots.
language dynamics adaptation: i) new events are automatically considered in the new
models; ii) the final discriminative model collects information from all corpora subsets;
iii) with time, unused events slowly decrease in weight (Batista et al., 2008d,c). By sorting the trained model by the relevance of each feature and limiting the number of features kept in each model, it is possible to reduce the computational resources needed for the next training stage, without much impact on the results (a minimal sketch of this retraining scheme is also given below).
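The sketch below illustrates the n-gram counts based strategy (the helper follows the feature naming of Figure 2.2, but it is only an illustration of the conversion, not the exact tool used to feed MegaM):

from collections import Counter

def trigram_events(tagged_sentences):
    # Count (trigram, class-of-middle-word) pairs and emit one weighted event
    # per pair, with features W, PB, NB and T as in Figure 2.2. Illustrative only.
    counts = Counter()
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        classes = [c for _, c in sent]
        for i in range(len(sent) - 2):
            counts[(tuple(words[i:i + 3]), classes[i + 1])] += 1
    for (w1, w2, w3), cls in counts:
        yield {"class": cls,
               "weight": counts[((w1, w2, w3), cls)],
               "features": [f"W:{w2}", f"PB:{w1} {w2}", f"NB:{w2} {w3}", f"T:{w1} {w2} {w3}"]}

corpus = [[("the", "lower"), ("lisbon", "first_upper"), ("council", "lower")]]
for event in trigram_events(corpus):
    print(event)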
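Likewise, the retraining strategy can be sketched with scikit-learn's SGDClassifier and partial_fit standing in for MegaM's warm-started retraining (the corpus subsets and features below are random toy placeholders):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy stand-in for iterative retraining: the model is updated on one corpus
# subset at a time, so events in later subsets gain more weight in the end.
classes = np.array([0, 1])                 # e.g. "no boundary" vs. "full stop"
model = SGDClassifier(loss="log_loss")

rng = np.random.default_rng(0)
subsets = [(rng.normal(size=(200, 5)), rng.integers(0, 2, 200)) for _ in range(3)]

for X, y in subsets:                       # one retraining stage per subset
    model.partial_fit(X, y, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))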
2.4 Evaluation Metrics
Throughout this chapter, several evaluation metrics have been quoted, which will now be
described in detail. Precision, Recall, F-measure, and Slot Error Rate (SER) (Makhoul et al.,
1999) are defined in equations (2.1) to (2.4). All these metrics are based on slots, which, for the punctuation task, correspond to the occurrences of punctuation marks in the corpus, and, for the capitalization task, correspond to all words not written in lowercase form.
    Precision = C / Hyp = C / (C + S + I)                           (2.1)

    Recall = C / Ref = C / (C + S + D)                              (2.2)

    F-measure = (2 × Precision × Recall) / (Precision + Recall)     (2.3)

    SER = total slot errors / Ref = (I + D + S) / (C + D + S)       (2.4)
In the equations, C is the number of correct slots; I is the number of insertions (spurious slots / false acceptances); D is the number of deletions (missing slots / false rejections); S is the number of substitutions (incorrect slots); Ref is the number of slots in the reference; and Hyp is the number of slots in the hypothesis. The first three performance metrics are guaranteed to assume values between 0 and 1, but that is not the case for the SER, which can assume values
greater than 100%. Both examples presented in Figure 2.3 achieve 33% Precision, 33% Recall,
and 33% F-measure. However, the SER is 100%, which may be a more meaningful measure,
given that the number of slot errors in the example is greater than the number of correct ones.
F-measure is a way of having a single value for measuring all types of errors simultaneously
but, as reported by Makhoul et al. (1999), “this measure implicitly discounts the overall error
rate, making the systems look like they are much better than they really are”.
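The snippet below (a small reference implementation of equations (2.1) to (2.4), not code from the thesis) reproduces the values quoted for the Figure 2.3 examples, which contain one correct slot, one insertion, one deletion and one substitution:

def slot_metrics(C, I, D, S):
    # Precision, Recall, F-measure and SER from slot counts (Makhoul et al., 1999).
    precision = C / (C + S + I)            # C / Hyp
    recall = C / (C + S + D)               # C / Ref
    f_measure = 2 * precision * recall / (precision + recall)
    ser = (I + D + S) / (C + D + S)        # slot errors / Ref
    return precision, recall, f_measure, ser

print(slot_metrics(C=1, I=1, D=1, S=1))    # (0.33..., 0.33..., 0.33..., 1.0)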
These performance metrics are widely used by the scientific community, but a number of
variations in their usage can be found. For example, Kim and Woodland (2001) assume that
correctly placed but wrongly identified punctuation marks count as half an error, therefore
improving the performance of all the metrics here presented.
The challenge of obtaining more meaningful performance measures for MDE scoring has been taken up by Ostendorf and Hillard (2004). The authors analyse the performance measures for MDE, specifically for SU detection, and propose event-based statistical significance
measures. A study conducted by Liu and Shriberg (2007) shows the advantages of curves over
a single metric for sentence boundary detection. Other metrics, based on performance curves,
have also been proposed, and can also be used for more adequate analysis. The ROC (Receiver
Operating Characteristic) curve has been used for this purpose; it basically consists of plotting the false alarm rate on the horizontal axis, while the correct detection rate is plotted on the vertical axis. Martin et al. (1997) propose the DET (Decision Error Tradeoff) curve, a variant of the
ROC curve, where error rates are plotted in both axes.
Most of the results presented in the scope of this thesis include the standard metrics: Precision, Recall, F-measure, and Slot Error Rate. However, the SER is the preferred metric for performance evaluation. The SER for punctuation corresponds to the NIST error
rate for sentence boundary detection, which is defined as the sum of the insertion and deletion
errors per number of reference sentence boundaries (Liu and Shriberg, 2007).
3 Corpora
This chapter describes the most relevant data involved in the work performed in the scope of this thesis, most of which is reported in this document. Preliminary experiments used only the data from a Portuguese broadcast news corpus, but the small size of the corpus soon demanded new data sources. This applies essentially to the capitalization task, where the lexical features are of extreme importance. Although speech transcripts and written corpora are quite different in many aspects, both types of corpora share important information concerning punctuation marks and capitalization. For that reason, even if the main goal is to deal
with BN speech transcripts, large written corpora containing punctuation and capitalization information for each word and the corresponding context were also used as a way of improving
the punctuation and capitalization models.
The work described in this document has been performed using tools and corpora under
development. For example, the recognition system was upgraded several times during this
work, either for correcting existing bugs or for providing improved results. This imposes serious problems, since experiments with different tools or data versions may not be directly
compared. Consequently, a great number of previous experiments were repeated several times
along with the ongoing experiments, some of them taking several weeks to be completed. Unless otherwise stated, the data properties described in this chapter were recently calculated in
order to reflect the most recent data.
Recently, some of the work performed for Portuguese has also been ported to other languages. One important task concerning this goal was to set up the different corpora sets for each one of the languages involved. That is a difficult task, mostly because of the different formats and annotation criteria. So far, our experiments cover three different languages: Portuguese, Spanish and English. The remainder of this chapter describes the BN and the written newspaper data for each of the languages considered, as well as the pre-processing steps applied to each of the corpora sets.
Most of the experiments performed in the scope of this thesis use either written newspaper corpora or broadcast news corpora, described in this chapter. Nevertheless, other corpora were also used for specific experiments and, in such cases, these resources are described only in the context of such experiments.
Usage (Name)    Recording period              Duration   Words   WER
Train           Oct., Nov. 2000               46.5h      480k    14.1
Development     Dec. 2000                     6.5h       67k     19.6
Eval            Jan. 2001                     4.5h       48k     19.4
Test (JEval)    Oct. 2001                     13.5h      136k    17.9
Test (RTP07)    May, June, Sep., Oct. 2007    4.8h       49k     19.1
Test (RTP08)    June, July 2008               3.7h       39k     25.4

Table 3.1: Different parts of the Portuguese BN corpus.
3.1 Broadcast News Data
Besides words, speech data transcripts typically contain additional information coming from
the speech signal, including information concerning the background noise, speaker gender,
and other metadata. Manual transcripts, which provide the reference data, may include information concerning sections to be excluded, punctuation, capitalization and other phenomena,
such as indication of foreign languages and disfluencies. The reference is usually divided into
segments, with information about the start and end locations in the signal file, speaker id,
speaker gender, and focus conditions. Automatic transcripts, produced by speech recognition
systems, usually include the time period corresponding to each word and other information,
such as confidence measures. All our automatic transcripts were produced by the APP (Audio Pre-Processing) and speech recognition modules, thus including the following information:
speaker cluster, speaker gender, background speech conditions (clean/noise/music), and the
confidence score for each word. Due to a recent upgrade, the recognition system now also
provides confidence scores for other information, such as speaker cluster, and speaker gender.
3.1.1 Portuguese Corpus
The Portuguese speech corpus (henceforth ALERT-SR corpus) is a European Portuguese
Broadcast News corpus, originally collected for training and testing the speech recognition and
topic detection systems, in the scope of the ALERT European project (Neto et al., 2003; Meinedo
et al., 2003). The original corpus includes two different evaluation sets: Eval and JEval, the
latter having been collected with the purpose of a “joint evaluation” among all project partners.
This corpus was recently complemented with two collections of 11 BN shows from the same
public TV channel (RTP). Table 3.1 presents details for each part of the corpus, where duration
values represent the duration of all speech sequences (silences not included). The reported
WER (Word Error Rate) values were calculated for the recognition system as it was in May 2010, but notice that these values change from time to time. The RTP08 test set was collected with an 8-day time span between consecutive BN shows. The corpus includes two other subsets (Pilot
and 11March), which were not used for the experiments here described.
The manual orthographic transcription process follows the LDC Hub4 (Broadcast Speech) transcription conventions1, and includes information such as punctuation marks, capital letters, and special marks for proper nouns and acronyms. Each segment in the corpus is marked as: planned speech with or without noise (F40/F0); spontaneous speech with or without noise (F41/F1); telephone speech (F2); speech mixed with music (F3); non-native speaker (F5); any other speech (FX). Figure 3.1 shows the corpus distribution by focus condition, revealing that most of the corpus consists of planned speech (F0+F40), but it also contains a large percentage (35%) of spontaneous speech (F1+F41).

[Figure 3.1: Focus distribution in terms of speech duration for Portuguese BN.]
3.1.1.1 Manual revision of the corpus
The manual orthographic transcripts of this corpus were recently revised by an expert linguist, thereby removing many inconsistencies in terms of punctuation marks that affected our
previous results. The previous version of this corpus was manually transcribed by different
annotators, who did not follow consistent criteria in terms of punctuation marks. The revision process focused mostly on correcting punctuation marks and on adding disfluency annotation (Moniz, 2006), which were not previously annotated. Table 3.2 shows the number
of differences between the old reference data and the new, in terms of the punctuation marks
and disfluency annotation. For example, the table shows that, whereas about 40k full-stops
were kept from the older version to the newer, about 3.9k were replaced by commas, 233 were
replaced by question marks, and another 1574 were simply removed by the expert. Most of the differences concern comma, and they are often due to different criteria when marking disfluencies. Since our previous data had no disfluency identification and no objective criteria were applied to deal with this, the annotators often delimited the disfluency sequences with commas or other punctuation marks. Moreover, they also applied a naive criterion of mapping a silent pause to a comma, even when that was ungrammatical, i.e., did not respect the syntactic structure. For example, about 41k commas were kept from the old to the new version, but about 31k were removed, and another 19k were simply added in the new corpus version. The question mark is mostly consistent, and results concerning other punctuation marks are less significant given their lower frequency in the corpus. The bottom line of the table reports statistics for the newly added disfluency boundaries, revealing that about 15k disfluencies are now marked in the corpus.
1 http://www.ldc.upenn.edu/Projects/Corpus_Cookbook/transcription/broadcast_speech/english/conventions.html
Punctuation before revision (columns) vs. punctuation after revision (rows):

         none       .       ,      ?     !    ...     :     ;     "    <>
none        -    1574   31429     52    67    249    11     6   202     0
.        1239   40064    1856     33   172     63    14     5     0     0
,       18828    3892   41435     83    74     36    62    27    16     0
?         118     233      66   2017     3     11     2     1     0     0
!          14      13      18      0    64      0     0     0     0     0
...       139      30      43      1     1    172     0     0     0     0
:         110      83     140      0     4      0   104     8     1     0
;          63      52      54      0     0      0     0    11     0     0
"           0       0       0      0     0      0     0     0     0     0
<>      14576       6      36      0     0      7     0     0     3     0

Table 3.2: Confusion matrix between the old and the new manual transcripts.
                  .      ,      ?      !    ...      :      ;    All punctuation
Cohen's kappa  0.890  0.557  0.870  0.259  0.372  0.323  0.092             0.705

Table 3.3: User annotation agreement for the punctuation marks in the Portuguese BN corpus, in terms of Cohen's kappa values.
Using the previous differences, Cohen’s kappa values (Carletta, 1996) have been calculated for each punctuation mark, allowing us to assess the agreement between the original and the revised version. Table 3.3 shows the corresponding results for all corpora, revealing that the most consistent punctuation marks are the full stop and the question mark, and confirming the strong disagreement concerning the comma, for the reasons discussed above: inconsistent disfluency marking and the naive pause-to-comma criterion, which often introduced a comma between the subject and the predicate. Results concerning other punctuation marks are less significant, given their lower frequency in the corpus.
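As an aside, the following snippet shows how such agreement values can be computed from a confusion matrix (the 2×2 counts are invented, not taken from Table 3.2); Cohen's kappa compares the observed agreement with the agreement expected by chance:

import numpy as np

def cohens_kappa(confusion):
    # Cohen's kappa from a square confusion matrix of annotation decisions.
    m = np.asarray(confusion, dtype=float)
    total = m.sum()
    p_observed = np.trace(m) / total
    p_expected = (m.sum(axis=0) * m.sum(axis=1)).sum() / total ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Invented 2x2 example: a mark present/absent in the old vs. the new version.
print(round(cohens_kappa([[80, 10], [5, 105]]), 3))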
              #shows   duration   #words
Train             19      13.6h     152k
Development        3       2.2h      25k
Test               3       1.9h      22k

Table 3.4: Spanish BN corpora properties.

[Figure 3.2: Focus distribution in terms of speech duration for Spanish BN.]
3.1.2 Spanish Corpus
This Spanish corpus was recently created at L2F/VoiceInteraction2 (Martinez et al., 2008; Meinedo et al., 2010). Table 3.4 shows some properties of this corpus, which contains about 25 broadcast news shows from the national Spanish TV station (TVE). The training data contains about 14h of usable speech, and the evaluation data contains about 2h. The focus distribution is illustrated in Figure 3.2, revealing a very large percentage of planned speech (73%) when compared with Portuguese broadcast news. Only about 11% of the corpus consists of spontaneous speech.
The manual transcripts of this corpus are available in the TRS format. The conversion to
other formats follows the same strategy already adopted for the Portuguese corpus.
                            LDC corpora set                                Total
Subset        1998T28   2000S86   2000S88   2005T24   2007S10    duration   #words
Train             94%                           80%                 79.1h     829k
Development        6%                           10%                  6.3h      66k
Test                        100%      100%      10%      100%        9.3h      98k
WER             15.1%      23.7%     25.5%    16.1%     20.9%

Table 3.5: English BN corpora properties.
3.1.3 English Corpora
The English BN corpus used in our experiments combines five different English BN corpora subsets, available from the Linguistic Data Consortium (LDC). Table 3.5 shows details of this corpus. From the corpus LDC1998T28 (HUB4 1997 BN training data), about 94% was used for training and the rest for development. The first 80% of the LDC2005T24 corpus (RT-04 MDE Training Data Text/Annotations) was used for training, 10% for development and the last 10% for evaluation. The evaluation data also includes the LDC corpora LDC2000S86 (HUB4 1998 BN evaluation), LDC2000S88 (HUB4 1999 BN evaluation), LDC2005T24 (MDE RT04, only the last 10% were used), and LDC2007S10 (NIST RT03 evaluation data). The training data contains about 81h (transcribed speech only), which is almost twice the size of the Portuguese BN training data.
Dealing with such corpora demanded normalization strategies, specifically adapted for each corpus. They have been produced in different time periods, encoded with different annotation criteria, and are available in different formats as well. Besides, they were built for different purposes, which makes them even more heterogeneous. One important step for dealing with these corpora consisted of converting the original format of each one into a common format. The chosen common format was the STM (Segment Time Mark) format, which is easy to process and understand, and can easily map all the information required for our experiments. These corpora contain portions of overlapped speech but, in order to correctly use our recognition system, only one speaker was kept for such segments.
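To make the target format concrete, the snippet below prints one invented STM record; only the field layout (waveform id, channel, speaker, start and end times, an optional label, and the transcript text) follows the usual STM conventions, and all values are placeholders:

# Invented example of a single STM (Segment Time Mark) record; layout only.
fields = ["show_19980101_1830", "1", "spk_0042", "112.48", "118.02",
          "<o,f0,male>", "good evening and welcome to tonight's news"]
print(" ".join(fields))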
3.1.3.1 1997 Hub-4 Broadcast News Speech Corpus
This corpus contains a total of 97 hours of recordings from radio and television news broadcasts, gathered between June 1997 and February 1998. It has been prepared to serve as a supplement to the 1996 Broadcast News Speech collection (consisting of over 100 hours of similar
recordings). However, the 1996 BN speech collection includes neither punctuation nor capitalization information and was therefore excluded from our experiments.
The manual transcripts are available under two different distributions: LDC1998T28 and
2 http://www.voiceinteraction.pt/
<time sec=50.017>
{breath} We don’t know yet what it is the judge thinks the penalty should be, but it
will not be death. Only the jury could have made that call. {breath} So first, we go to
^Denver, and here is _A_B_C’s ^Bob ^Jamieson. ^Bob?
...
<time sec=836.976>
{breath} In an interview with _C_N_N, he praises ^Americans and he calls for a thoughtful
dialog between the people of the two countries. {breath} For the nearly twenty years
since ^Americans were held hostage in ^Iran, going soft on ^Iran {breath} has been taboo
in ^American politics. And so, it is a tough question for President ^Clinton.
<time sec=853.232>
How to respond {breath} to a friendly voice {breath} from ^Teheran? {breath} We check in
first at the White House. Here’s _A_B_C’s ^John ^Donvan.
</turn>
...
<turn speaker=Joseph_Lieberman spkrtype=male startTime=974.151 endTime=986.845>
<time sec=974.151>
^Iran has a serious ballistic missile development program that probably within less than
a year %uh will %uh threaten our troops in the ^Middle ^East and our allies there.
</turn>
Figure 3.3: Excerpt of the LDC1998T28 manual transcripts.
LDC1998E11, and the speech data is available under the LDC1998S71 distribution. The primary
motivation for the LDC1998T28 collection is to provide additional training data for the DARPA
"HUB-4" Project on continuous speech recognition in the broadcast domain. Transcripts have
been made of all recordings in this publication, manually time aligned to the phrasal level, and
annotated to identify boundaries between news stories, speaker turn boundaries, and gender
information about the speakers. The released version of the transcripts is in SGML format, comparable to the format used in the 1996 Broadcast News Speech transcripts; accompanying documentation and an SGML DTD file are included with the transcription release. The LDC1998E11 distribution also contains transcripts from this corpus, annotated with Named Entities according to the new 1998 Hub-4 guidelines, and available in UTF format. Nonetheless, only the LDC1998T28 collection was used, which serves our goals well. The corresponding evaluation corpus is available under the LDC catalog LDC2002S11. However, the manual transcripts
use no punctuation marks and no capitalization information, and therefore could not be used
in our experiments.
Figure 3.3 shows an excerpt of the original content of this corpus, which is available in
the SGML format. This content was converted into the standard STM format by means of the
bn_filt_sgml97 tool3 .
3 This tool was adapted from the tool bn_filter, which was supposed to be included in the distribution, but was instead provided by Thomas Pellegrini, a colleague from L2F, who found it elsewhere.
<utf dtd_version="utf-1.0" audio_filename="h4e_98_1.sph" language="english" version="4"
version_date="981118" scribe="Reconciled">
<bn_episode_trans program="unk" air_date="unk">
<background type="other" startTime="0.0" level="low">
<Section startTime="0.004438" endTime="85.505438" type="report">
<Turn startTime="0.004438" endTime="13.638313" spkrtype="male" dialect="native"
speaker="David_Brancaccio" mode="planned" fidelity="high">
{breath The guardians of the electronic stock market <b_enamex
TYPE="ORGANIZATION">@NASDAQ<e_enamex> <contraction e_form="[who=>who][’ve=>have]">who’ve
been burned by past ethics questions, are moving to
head off
<time sec="6.839750">
market fraud by toughening the rules for companies that want to be listed on the
exchange. {breath
<time sec="11.182000">
{breath Marketplace’s <b_enamex TYPE="PERSON">^Philip ^Boroff<e_enamex> reports.
</Turn>
<Turn startTime="13.638313" endTime="49.432875" spkrtype="male" dialect="native"
speaker="Philip_Boroff" mode="planned" fidelity="high">
As part of the proposals, penny stocks will be eliminated from <b_enamex
TYPE="ORGANIZATION">@NASDAQ<e_enamex>.
<time sec="17.968750">
{breath These trade for literally <b_numex TYPE="MONEY">pennies<e_numex>.
...
<time sec="314.280751">
about a seventh of the <b_enamex TYPE="LOCATION">_U_S<e_enamex> market.
Figure 3.4: Excerpt of the LDC2000S86 corpus.
3.1.3.2  1998 and 1999 Hub-4 Evaluation Corpora
The LDC2000S86 distribution contains the English evaluation test material used in the
1998 DARPA/NIST Continuous Speech Recognition Broadcast News HUB4 English Benchmark Test, administered by the NIST (National Institute of Standards and Technology) Spoken Natural Language Processing Group. The LDC2000S88 publication contains the English
evaluation test material used in the 1999 NIST Broadcast News Transcription Evaluation, also
administered by the NIST Spoken Natural Language Processing Group.
Each distribution contains about 1.5 hours of broadcast news speech. The transcription
data from these corpora is available in the UTF format, which is the same format as the Hub-4
English Compendium. Figure 3.4 contains an excerpt of the LDC2000S86 transcripts. It was
converted into the STM format by means of an adapted version of the utf_filt tool.
3.1.3.3  MDE RT-04 Training Data
This corpus was created by LDC to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The manual transcripts of this corpus are available under the LDC distribution LDC2005T24. The corresponding speech data is available under LDC2005S16 (MDE RT-04 Training Data Speech).
...
<Annotation id="ea980129-split001:1:E1W" type="token" start="ea980129-split001:1:2z"
end="ea980129-split001:1:30">
<Feature name="_SU*">ea980129-split001:1:EJH</Feature>
<Feature name="_next">ea980129-split001:1:E1X</Feature>
<Feature name="_segment*">ea980129-split001:1:E19</Feature>
<Feature name="_sn">86</Feature>
<Feature name="_speaker*">ea980129-split001:1:E1</Feature>
<Feature name="language"></Feature>
<Feature name="punctuation">period</Feature>
<Feature name="text">misleading</Feature>
</Annotation>
<Annotation id="ea980129-split001:1:EJH" type="SU" start="ea980129-split001:1:2z"
end="ea980129-split001:1:30">
<Feature name="_et">ea980129-split001:1:E1W</Feature>
<Feature name="_sn">86</Feature>
<Feature name="_speaker*">ea980129-split001:1:E1</Feature>
<Feature name="_st">ea980129-split001:1:E1W</Feature>
<Feature name="difficultToAnnotate">false</Feature>
<Feature name="type">statement</Feature>
</Annotation>
<Annotation id="ea980129-split001:1:E1X" type="token" start="ea980129-split001:1:31"
end="ea980129-split001:1:32">
<Feature name="_next">ea980129-split001:1:E1Y</Feature>
<Feature name="_segment*">ea980129-split001:1:E19</Feature>
<Feature name="_sn">87</Feature>
<Feature name="_speaker*">ea980129-split001:1:E1</Feature>
<Feature name="language"></Feature>
<Feature name="text">There</Feature>
...
Figure 3.5: Excerpt of the LDC2005T24 corpus (XML format).
This data set was previously released to the EARS MDE community as LDC2004E31. The LDC2005S16 release remaps some of the original annotations to new MDE elements, in order to support better annotation consistency.
The corpus includes 20h of CTS (Conversational Telephone Speech) from the Switchboard
corpus, and 20h of Broadcast News shows (23 shows) from the Hub-4 Broadcast News Corpus.
Only the BN portion of the corpus was used for experiments described in this document. The
original data is available in XML format, and can be further converted into the RTTM format, by
means of the ag-to-rttm script, included in the distribution. Figures 3.5 and 3.6 show excerpts
of this corpus in XML and RTTM formats, respectively. Each RTTM file was converted into the
STM format using a tool specially created for this purpose and developed in the scope of this
thesis.
3.1.3.4  2003 NIST Rich Transcription Evaluation Data
This corpus, distributed under the LDC2007S11 reference, contains the test material used
in the 2003 Rich Transcription Spring and Fall evaluations, administered by the NIST Speech
Group. The Spring evaluation (RT-03S), conducted in March-April 2003, focused on
SPEAKER ea980129 1 87.987 3.054 <NA> <NA> Peter_Jennings <NA>
SU ea980129 1 87.987 3.054 <NA> statement Peter_Jennings <NA>
LEXEME ea980129 1 87.987 0.611 We’ll lex Peter_Jennings <NA>
LEXEME ea980129 1 88.598 0.611 take lex Peter_Jennings <NA>
LEXEME ea980129 1 89.209 0.610 A lex Peter_Jennings <NA>
LEXEME ea980129 1 89.819 0.611 Closer lex Peter_Jennings <NA>
LEXEME ea980129 1 90.430 0.611 Look lex Peter_Jennings <NA>
SEGMENT ea980129 1 91.041 9.111 <NA> <NA> spkr_1 <NA>
SPEAKER ea980129 1 91.041 10.427 <NA> <NA> spkr_1 <NA>
SU ea980129 1 91.041 4.795 <NA> statement spkr_1 <NA>
LEXEME ea980129 1 91.041 0.480 From lex spkr_1 <NA>
LEXEME ea980129 1 91.521 0.479 A. alpha spkr_1 <NA>
LEXEME ea980129 1 92.000 0.480 B. alpha spkr_1 <NA>
LEXEME ea980129 1 92.480 0.479 C. alpha spkr_1 <NA>
LEXEME ea980129 1 92.959 0.480 News lex spkr_1 <NA>
...
Figure 3.6: Excerpt of the LDC2005T24 corpus (RTTM format).
Speech-to-Text (STT) tasks for broadcast news speech and conversational telephone speech in three
languages: English, Mandarin Chinese, and Arabic. That evaluation also included one Metadata Extraction (MDE) task, speaker diarization, for broadcast news speech and conversational
telephone speech in English. The Fall evaluation (RT-03F), conducted in October 2003, focused on MDE tasks, including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection, and disfluency detection for broadcast news speech and conversational
telephone speech in English. Surprisingly, the LDC2007S10 distribution does not contain the reference data for the MDE tasks, as it should. To complement this corpus, it would also
have been interesting to use the LDC2004T12 distribution, corresponding to the RT-03 MDE Training
Data Text and Annotations, but unfortunately that corpus was not available.
The original information is provided in the TYP format, which can be read by the Transcriber speech tool. This tool is used for segmenting, labeling and transcribing speech (Barras
et al., 2001). Transcriber was also used to produce a TRS file, which was then used for producing
the adopted STM format. Figure 3.7 presents an excerpt of the data in its original format.
3.2  Written Newspaper Data
Written corpora contain important information for the two rich transcription tasks addressed in this thesis. In terms of capitalization, they provide information concerning the context in which capitalized words appear. In terms of punctuation, their importance is less obvious, but they can also be used to improve punctuation models for speech data.
In order to bring the written corpora closer to the speech transcripts, we have performed
automatic normalizations. This process transforms much of the original text into word sequences that could also be found in a speech transcript, but without recognition errors. Each
corpus requires a specific normalization tool, which depends not only on the language, but also on the corpus itself.
...
<t 83.813> <<female, Gillian_Findlay>>
It was an historic day today, ^Peter, {breath}
<b 85.882>
and already Mr. ^Sharon says he’s taken a phone call of congratulations from President
^Bush, {breath}
<b 90.714>
who, he says, reminds him of a tour that Mr. ^Sharon gave him years ago in ^Israel.
{breath}
<b 95.263>
At the time Mr. ^Bush said to him, you know one day I will be prime minister and you
will be p- -- %uh, I will be president and you will be prime minister. {breath}
...
<t 271.373> <<female, Betsy_Stark>>
But today the owner of that toy store told ~A ~B ~C News
<b 274.703>
she sees no signs of a slowdown in her business.
<t 277.633> <<female, spkr_4>>
January was a very good month. We still are growing. %um
<b 281.719>
{breath} {lipsmack}
<b 282.508>
Seems to be, so far, so good.
Figure 3.7: Excerpt of the LDC2007S10 corpus.
              PUBnews                             LMnews
              data period           #words        data period           #words
Train         Jan 1999 - Sep 2004   151 M         Jan 2005 - Jun 2009   41 M
Development   Oct 2004 - Nov 2004   2.2 M         Jul 2009 - Jul 2009   1.8 M
Test          Nov 2004 - Dec 2004   2.2 M         Aug 2009 - Nov 2009   1.7 M
Table 3.6: Portuguese Newspaper corpora properties.
This is the most difficult part of the corpora setup. The following subsections provide more details about each of the written corpora and describe the normalization steps performed.
3.2.1  Portuguese Corpora
Two written corpora subsets were used in the context of this thesis, both collected from the
Web. Table 3.6 summarizes the properties of each one of these corpora.
The first newspaper corpus is a subset of a larger collection, consisting of editions of the
Portuguese “Público” newspaper collected at INESC-ID Lisboa between 1995 and 2004. During 1998, the scripts used for this task became obsolete,
and only the first three months of that year were collected. For that reason, the subset used in
our experiments – PUBnews – covers only the period from 1999 to 2004, containing about 150
million words. It was split into subsets of about 2 million words each, resulting in 72 subsets
(between 10 and 14 per year). The last two subsets are used for development and evaluation, respectively.
              subsets   data period             #words
Train         37        Jan 2003 - Nov 2008     75 M
Development   1         Jan 2009 - Feb 2009     2 M
Test          1         Feb 2009 - April 2009   2 M
Table 3.7: European Spanish written corpora properties.
Each subset is referred to by the day corresponding to the latest data it contains.
The LMnews corpus consists of online text, collected daily from the web, corresponding to
last-minute news published by the “Público” newspaper. Collection started in October
2008, during the execution of this thesis, but the content of this corpus covers the period from 2005 to
2009, since older data was also available. In October 2009, the newspaper company changed
the publication format of the data, making it very difficult to continue collecting it. Together
with this corpus, we have also collected last-minute news from the TSF agency. That data is
still being collected for language model adaptation of the speech recognition system, but it was
not used in the scope of this study.
Early experiments used an old Portuguese normalization tool (NUVEM), created at INESC
several years ago. Since then, unsuccessful efforts have been made to create an alternative normalization tool, to be used by other people within the group and by the speech recognition
system itself. Recently, also in the scope of this thesis, the initial normalization tool was deeply
revised, and it has been used in the latest experiments. The normalization now includes
several modules – units, numbers, money, dates, time, and known expressions – and deals with other
types of expressions found in real corpora. Some modules of the tool can optionally be activated, such as the expansion of abbreviations and the separation of punctuation. Annex A shows
some examples of the output of this normalization tool.
3.2.2  Spanish Corpora
The Spanish written corpus consists of online editions of the Spanish newspaper “El País”,
collected since 2003. Table 3.7 summarizes the properties of this corpus. It was normalized
using the normalization tool also in use in the speech recognition system. The normalization
rules have been adapted from the Portuguese normalization rules (Martinez et al., 2008).
3.2.3  English Corpora
The English written corpora correspond to the LDC1998T30 (North American News Text
Supplement). The LDC1998T30 contains data from three different sources: APWS (Associated
Press World Stream), NYT (New York Times), and LATWP (Los Angeles Times & Washington
Post). Table 3.8 summarizes the size of the corpus, after cleaning it and removing
problematic text (unknown characters, etc.).
Corpus Subset   Data Period         Words
                                    Train     Development   Test
APWS            Nov-94 to Apr-98    228 M     454 K         863 K
NYT             Jan-97 to Apr-98    211 M     574 K         1.2 M
LATWP           Sep-97 to Apr-98    16.4 M    711 K         769 K
Table 3.8: English written corpora properties.
Concerning the normalization of the English corpora, existing tools were used as much as possible. Most of the normalization work for English had already been performed by
Thomas Pellegrini. For this work, several existing normalization modules, including those previously developed by Thomas Pellegrini, were adapted and combined.
3.3  Speech Data Preparation
Besides the manual transcripts, for each corpus we also have the automatic transcripts
produced by the recognition system (Neto et al., 2008). More recently, the recognition system has also been used to produce automatic forced alignments for all corpora, except for
the Spanish one. All corpora were automatically annotated with part-of-speech information. The
morphological information was added to the Portuguese data using the morphological analyzer Palavroso (Medeiros, 1995), followed by the ambiguity resolver MARv (Ribeiro et al.,
2003, 2004), while the Spanish and English data were annotated using TreeTagger (Schmid,
1994), a language-independent part-of-speech tagger, using the dictionaries included with it.
The manual orthographic transcripts include punctuation marks and capitalization information, providing our reference data. Whereas the manual transcripts already contain reference punctuation marks and capitalization, this is not the case for the automatic transcripts. In
the context of this thesis, the required reference was produced by means of word alignments
between the manual and automatic transcripts, which is a non-trivial task, mainly because of
recognition errors. The alignment was performed using the NIST SCLite tool⁴, followed by
an automatic post-processing stage for correcting possible SCLite errors and for aligning special
words which can be written/recognized differently. The automatic post-processing stage makes it possible to overcome problems such as words like A.B.C. or C.N.N. appearing as single words in the
reference data, but being recognized as isolated letters.
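As an illustration of this post-processing idea, the following sketch (in Python) merges consecutive single-letter hypothesis tokens whenever the reference contains the corresponding acronym written as a single token. It is a minimal sketch under simplifying assumptions (a roughly one-to-one alignment elsewhere), not the actual tool developed for this thesis.

# Illustrative sketch (not the actual post-processing tool used in the thesis):
# merge consecutive single-letter hypothesis tokens (e.g. "a. b. c.") whenever
# the reference contains the corresponding acronym written as one token ("A.B.C.").

def merge_spelled_acronyms(ref_tokens, hyp_tokens):
    """Return a copy of hyp_tokens where spelled-out acronyms are re-joined."""
    result = []
    i = 0
    for ref in ref_tokens:
        letters = ref.replace(".", "")
        # Reference token looks like an acronym (A.B.C., C.N.N., ...)
        if ref.isupper() and "." in ref and len(letters) > 1:
            span = hyp_tokens[i:i + len(letters)]
            spelled = "".join(t.replace(".", "").upper() for t in span)
            if spelled == letters:          # hypothesis spelled it letter by letter
                result.append(ref)          # keep the single-token form
                i += len(letters)
                continue
        if i < len(hyp_tokens):
            result.append(hyp_tokens[i])
            i += 1
    result.extend(hyp_tokens[i:])           # trailing hypothesis words, if any
    return result

print(merge_spelled_acronyms(["from", "A.B.C.", "News"],
                             ["from", "a.", "b.", "c.", "news"]))
# ['from', 'A.B.C.', 'news']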
We have adopted the standard XML representation format for keeping all the information
required for further experiments. During an early stage of this work, we created two
XML data sets: MAN – built entirely from manual transcripts, where part-of-speech data was
added to each word; and AUT – built from the automatic transcripts, where part-of-speech data
was added to each word and reference information coming from the manual transcripts was also added.
⁴ Available from http://www.nist.gov/speech.
[Figure 3.8 – processing diagram: manually annotated speech transcriptions (TRS) are converted to STM and aligned with the ASR output (CTM/XML); excluded sections are marked, and the punctuation, capitalization and morphology (from a POS tagger) of the ASR output XML are updated.]
Figure 3.8: Creating an XML (Extensible Markup Language) file with all the required information for further experiments. The following file formats are used: CTM (time-marked conversation scoring), STM (segment time mark), and TRS (XML-based standard Transcriber).
The two data sets were then used for training and testing our punctuation and
capitalization models. Concerning capitalization, both data sets provide all the required
information. Nonetheless, the punctuation task makes use of important information, such as
pause durations, which is sometimes not available in the manual transcripts. For that reason,
the MAN data set was recently redefined to contain force-aligned transcripts, treated in the
same way as the automatic transcripts but without recognition errors. These two data sets
contain exactly the same type of information, which allows the same procedures and tools
to be applied to each of them.
Figure 3.8 illustrates the process of creating an XML file with the information required for all
further experiments. The resulting file includes APP/ASR output information: time intervals
to be ignored in scoring, focus conditions, speaker information, punctuation marks, the part-of-speech of each word, and word confidence scores. The input for the POS tagger corresponds
to the text extracted from the original transcript, segmented according to the acoustic segments previously identified by the APP module. Hence, a generic part-of-speech tagger that
processes written texts can be used to perform this task, taking the surrounding
words into account. Figure 3.9 shows a transcript segment, extracted from an AUT file, where the punctuation, part-of-speech, focus conditions and information concerning excluded sections were
updated with information coming from the manual transcripts. Whereas in the reference the exclusion and focus information are properties of a segment, in the speech recognition output
such information must be assigned to each word individually, because the reference segments differ from the segments given by the APP/ASR.
<TranscriptSegment>
<TranscriptGUID>13</TranscriptGUID>
<AudioType start="5022" end="5783" conf="0.686300">Clean</AudioType>
<Time start="5022" end="5783" reasons="" sns_conf="0.964000"/>
<Speaker id="52" id_conf="0.94" gender="F" gender_conf="0.92" known="T"/>
<SpeakerLanguage native="T">PT</SpeakerLanguage>
<TranscriptWordList>
<W start="5042" end="5053" conf="0.76" focus="F3" cap="Boa" pos="A.">boa</W>
<W start="5054" end="5095" conf="0.98" focus="F3" punct="." pos="Nc">noite</W>
<W start="5106" end="5162" conf="0.99" ... cap="Benfica" pos="Np">benfica</W>
<W start="5163" end="5169" conf="0.94" focus="F3" pos="Cc">e</W>
<W start="5170" end="5219" conf="0.96" ... cap="Sporting" pos="Np">sporting</W>
<W start="5220" end="5253" conf="0.99" focus="F3" pos="V.">estão</W>
<W start="5254" end="5280" conf="0.96" focus="F3" pos="S.">sem</W>
<W start="5281" end="5336" conf="0.99" ... punct="." pos="Nc">treinador</W>
<W start="5344" end="5370" conf="0.99" focus="F0" cap="José" pos="Np">josé</W>
<W start="5371" end="5399" conf="0.99" focus="F0" cap="Mourinho" pos="Np">mourinho</W>
<W start="5400" end="5441" conf="0.91" focus="F0" pos="V.+Pf">demitiu-se</W>
<W start="5442" end="5443" conf="0.86" focus="F0" pos="Pd">o</W>
<W start="5444" end="5498"
...
punct="." cap="Benfica" pos="Np">benfica</W>
<W start="5522" end="5568" conf="0.99" focus="F0" cap="Augusto" pos="Np">augusto</W>
<W start="5569" end="5604" conf="0.99" focus="F0" cap="Inácio" pos="Np">inácio</W>
<W start="5605" end="5631" conf="0.99" focus="F0" pos="V.">foi</W>
<W start="5632" end="5698" conf="0.99" focus="F0" pos="V.">demitido</W>
<W start="5699" end="5709" conf="0.98" focus="F0" pos="S.">do</W>
<W start="5710" end="5766"
...
punct="." cap="Sporting" pos="Np">sporting</W>
</TranscriptWList>
</TranscriptSegment>
Figure 3.9: Example of a transcript segment extracted from the AUT data set.
3.3.1  Capitalization Alignment Issues
The reference capitalization in automatic recognition transcripts is automatically produced
by aligning the manual and the automatic transcripts. The problem with creating a reference
capitalization is that a capitalization form must be assigned to each and every word in the recognition output. That means that if a mistake is made, the evaluation will reflect it. For a correctly recognized word, the capitalization can be assigned directly, but problems arise
from the recognition errors. Figure 3.10 shows examples of word alignments, extracted from
the SCLite output, where the misalignments were marked by colors. Some of the alignment problems
presented here are solved by the automatic post-processing stage, by looking at the words in the
neighborhood. For example, the capitalized word “Portugal” from the first example becomes
correctly capitalized. In fact, all the underlined words become capitalized after applying the
post-processing step.
If no information exists concerning the capitalization of a word, it is considered lowercase
by default. Therefore, any word inserted by the recognition system that does not exist in the
reference (insertion) will be kept lowercase. On the other hand, if a reference word was skipped
by the recognition system (deletion), nothing can be done about it. Anyway, most of the insertions and deletions consist of short functional words which usually appear in lowercase and
1)
REF: noutro processo também em Portugal que está junto que é um apenso dos autos
HYP: noutro processo também ** ******** portugal está junto que é um apenso nos alpes
2)
REF: O pavilhão desportivo do Colégio Dom Nuno Álvares Pereira
HYP: o pavilhão desportivo do ******* colégio dono novas pereira
3)
REF: A SAD a administração da SAD Luís Duque e Augusto Inácio
HYP: * lhe assada administração da *** sad luís duque augusto inácio
4)
REF: Esta noite em Gondomar o líder dos Social Democratas
HYP: esta noite em gondomar o líder dos ****** social-democratas
Figure 3.10: Capitalization alignment examples.
              Cor    Del    Ins     lowercase   Sclite   Compound   Corrected     Unsolved alignments              WER
                                    subs        probs    words      alignments    First cap   All upper   Other
Train         87%    1.9%   4.5%    5.5%        0.5%     0.1%       0.3%          0.6%        0.0%        0.0%     13.6%
Development   81%    2.5%   5.5%    8.4%        0.7%     0.1%       0.5%          0.9%        0.1%        0.0%     19.0%
Eval          81%    2.6%   5.5%    8.5%        0.5%     0.0%       0.4%          1.1%        0.1%        0.0%     19.4%
Jeval         82%    3.4%   4.4%    7.8%        0.6%     0.1%       0.5%          1.1%        0.1%        0.0%     18.1%
Rtp07         81%    2.2%   6.3%    8.4%        0.6%     0.1%       0.4%          1.0%        0.1%        0.0%     19.9%
Rtp08         76%    2.3%   10.3%   9.7%        0.5%     0.0%       0.5%          1.1%        0.1%        0.0%     26.8%
Table 3.9: Capitalization alignment report.
do not pose significant problems to the reference capitalization. Finally, if the words mismatch
but the reference word is lowercase, the word in the automatic transcript is kept in lowercase
and will not pose problems to the reference capitalization either. Most of the alignment problems arise from word substitution errors, where the reference word appears capitalized (not
lowercase). In this case, three different situations may occur: i) the two words have alternative
graphical forms, a not infrequent phenomenon in proper nouns, for example “Menezes” and
“Meneses”; ii) the two words are different but share the same capitalization, for
example “Andreia” and “André”; and iii) the two words have different capitalization forms,
for example “Silva” (proper noun) and “de” (of, from). The Levenshtein distance (Levenshtein,
1966) has been used to measure the difference between the two words. As the process is fully
automatic, we have decided to assign the same capitalization information whenever the distance was less
than 2. By doing this, capitalization assignments like the following were performed: Trabalhos
→ Trabalho; Espanyol → Espanhol; Millennium → Millenium; Claudia → Cláudia; Exigimos
→ Exigidos; Andámos → Andamos; Carvalhas → Carvalho; PSV → PSD; Tina → Athina. Notice that if the capitalization assignments of the above words were not performed, those words
would appear lowercase in the reference capitalization, which would not be correct.
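The following minimal sketch illustrates the Levenshtein-based capitalization transfer just described; the helper names and the exact handling of the capitalization classes are illustrative assumptions, not the thesis implementation.

# Illustrative sketch of the capitalization transfer just described (not the
# exact implementation used in this thesis): when a recognized word was
# substituted for a capitalized reference word, the reference capitalization
# pattern is copied to the hypothesis word if the two words are close enough
# in terms of Levenshtein distance.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def transfer_capitalization(ref_word, hyp_word, max_distance=1):
    """Return the hypothesis word with the reference capitalization, if close enough."""
    if ref_word.islower():
        return hyp_word                      # lowercase references never cause problems
    if levenshtein(ref_word.lower(), hyp_word.lower()) <= max_distance:
        if ref_word.isupper():
            return hyp_word.upper()          # acronyms: all uppercase
        return hyp_word.capitalize()         # proper nouns: first-letter capitalized
    return hyp_word                          # too different: keep it lowercase

print(transfer_capitalization("Espanhol", "espanyol"))   # Espanyol
print(transfer_capitalization("Silva", "de"))            # de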
Table 3.9 presents statistics concerning the capitalization assignment after the word alignment. The proportion of correct alignments is shown in the column Cor; Del and Ins correspond
to the proportions of deletions and insertions in the word alignment; lowercase subs corresponds to substitutions involving lowercase words, which do not pose problems to the reference capitalization.
1)
REF: ESTAMOS SEMPRE A DIZER À senhoria .
HYP: ******* ****** CALMO SEM PESARÁ senhoria *
2)
REF: no centro , O rio ceira ENCHEU de forma que A aldeia de CABOUCO ficou INUNDADA .
HYP: no centro * * rio ceira INÍCIO de forma que * aldeia de TEMPO ficou ******** *
3)
REF: é a primeira vez que isto LHE acontece ?
HYP: é a primeira vez que isto *** acontece *
4)
REF: sem PERCEBEREM , SEM LHES DIZEREM quais são as consequências desta política
HYP: sem ********** * RECEBEREM SELHO DIZER quais são as consequências desta política
5)
REF: ALIÁS , alguém DISSE , E EU ESTOU de acordo , que hoje não temos UM governo ,
HYP: HÁLIA ÀS alguém ***** * * ** INDÍCIOS de acordo * que hoje não temos O governo *
6)
REF: no segundo * **** , COLIN MONTGOMERY , JARMO SANDELIN , michael e laura
HYP: no segundo O QUAL NÃO COBRE E CRIAR UMA CÉLULA E michael e laura
Table 3.10: Punctuation alignment examples.
Corpus        full-stop (.)            comma (,)                question mark (?)         exclamation mark (!)
subset        Good   Ok     Bad        Good   Ok     Bad        Good   Ok     Bad         Good   Ok     Bad
Train         71.4%  24.7%  3.9%       76.1%  18.4%  5.5%       41.1%  44.3%  14.7%       –      –      –
Development   66.3%  28.8%  4.9%       66.2%  27.1%  6.7%       33.7%  40.6%  25.7%       –      –      –
Eval          63.9%  30.1%  6.0%       64.6%  27.6%  7.9%       27.5%  51.0%  21.6%       25.0%  75.0%  0.0%
Jeval         65.0%  30.7%  4.3%       65.8%  27.4%  6.8%       44.7%  37.5%  17.8%       33.3%  66.7%  0.0%
Rtp07         63.6%  29.2%  7.2%       65.0%  25.6%  9.5%       27.5%  44.1%  28.4%       33.3%  0.0%   66.7%
Rtp08         56.9%  30.9%  12.3%      59.8%  28.1%  12.1%      20.4%  49.0%  30.6%       19.4%  36.1%  44.4%
Table 3.11: Punctuation alignment report.
Corrected alignments shows the percentage of corrections performed during the
post-processing stage. The unsolved alignments correspond to unsuccessful alignments, involving
first-letter-capitalized words (e.g., proper nouns), all-uppercase words (e.g., acronyms), and other
types of capitalization (e.g., McDonald’s). The recognition WER (Word Error Rate) is shown in
the last column, revealing the proportion of recognition errors in the corpus at the time the alignment was performed.
3.3.2  Punctuation Alignment Issues
Like the reference capitalization, inserting the correct reference punctuation into the automatic transcripts, according to the manual transcripts, is not an easy task, but it faces
different challenges. The effect of the speech recognition errors is only relevant when they
occur in the neighbourhood of a punctuation mark. Table 3.10 shows punctuation alignment
examples extracted from the SCLite output. The recognition errors in the first three examples
do not pose problems to the reference punctuation, which means that they will provide a good
reference. However, the last three examples present more difficult challenges. The fourth and
fifth examples can still be solved in an acceptable manner and provide acceptable reference
data. The last example is very difficult to solve, even for a human annotator, and will provide
bad reference data. The REF data suggests the use of three commas, but considering only the
speech recognition output (HYP data) and the speech signal, would one use none, some, or
all three commas? And, if they are used, where should they be placed? Table 3.11 presents the
alignment summary for each punctuation mark, where the alignments are classified as good,
ok (acceptable), or bad. The worst alignments, concerning the question mark, are related to the
fact that the corresponding sentences consist mostly of spontaneous speech. The total number of exclamation marks in
all the corpora is only about forty; therefore, results concerning this punctuation mark are less
significant. The final alignment would benefit from manual correction, an issue to be addressed
in the future. Nevertheless, even an expert human annotator would find this task difficult and would sometimes not perform it coherently.
3.4  Additional Prosodic Information
The reference XML, described in Section 3.3, contains lexical and acoustic information,
which served as the data source for the initial experiments performed in the scope of this thesis.
Nevertheless, linguistic evidence shows that nuclear contour, boundary tones, energy slopes,
and pauses are crucial for delimiting sentence-like units across languages (Vassière, 1983). This
section describes the prosodic feature extraction process and the creation of a new data source
containing additional prosodic information.
3.4.1  Extracting the Pitch and Energy
Pitch (f0) and energy (E) are two important sources of prosodic information that can be
extracted directly from the speech signal. By the time our experiments were conducted, that
information was not available in the speech recognition system output. For that reason, we
extracted it directly from the speech signal, using the Snack toolkit (Sjölander
et al., 1998). Nevertheless, because this is an important starting point for the use of
prosody, in the future it will be directly available in the speech recognition output.
Both pitch and energy were extracted using the standard parameters, taken from the
wavesurfer tool configuration (Sjölander and Beskow, 2000). Energy was extracted using a
pre-emphasis factor of 0.97 and a 200 ms Hamming window, while pitch was extracted using the
ESPS method (auto-correlation).
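For illustration, the sketch below computes frame-level log-energy with the parameters mentioned above (pre-emphasis 0.97, 200 ms Hamming window). It does not use the Snack toolkit, and the 10 ms hop size is an assumption.

# Illustrative sketch of frame-based energy extraction with the parameters
# mentioned above (pre-emphasis 0.97, 200 ms Hamming window); this is not the
# Snack toolkit used in the thesis, and the 10 ms hop size is an assumption.
import numpy as np

def frame_energy_db(signal, sample_rate, win_ms=200, hop_ms=10, pre_emphasis=0.97):
    """Return log-energy (dB) per frame of a mono signal."""
    # Pre-emphasis: boost high frequencies before windowing.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(win)
    energies = []
    for start in range(0, len(emphasized) - win + 1, hop):
        frame = emphasized[start:start + win] * window
        energies.append(10.0 * np.log10(np.sum(frame ** 2) + 1e-10))
    return np.array(energies)

# Example with one second of synthetic audio at 16 kHz.
rate = 16000
tone = 0.1 * np.sin(2 * np.pi * 220 * np.arange(rate) / rate)
print(frame_energy_db(tone, rate)[:5])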
We have removed all the pitch values calculated for unvoiced regions, in order to eliminate
irregular values. This is performed in a phone-based analysis, by detecting all the unvoiced phones.
Figure 3.11: Pitch adjustment for unvoiced regions.
Figure 3.11 illustrates this process, where the original pitch values are represented by
dots and the resulting pitch is represented by the gray line.
3.4.2  Adding Phone Information
Audimus (Meinedo et al., 2008) is a hybrid automatic speech recognizer that combines
the temporal modeling capabilities of Hidden Markov Models with the pattern discriminative
classification capabilities of Multi-layer Perceptrons (MLP). Modeling context dependency is a
particularly hard problem in hybrid systems. For that reason, this speech recognition system
uses, in addition to monophone units modeled by a single state, multiple-state monophone
units, and a fixed set of phone transition units aimed at specifically modeling the most frequent intra-word phone transitions (Abad and Neto, 2008). That information was then converted into monophones by another tool, specially designed for that purpose. Still, the existing
information is insufficient for correctly assigning phone boundaries. For example, the phone
sequence “k k=u u”, presented in Figure 3.11, must be converted into the monophone sequence
“k u”, but the exact boundary between the first and the second phone can only be guessed. We
have used the midpoint of the phone transition. The speech recognition system could alternatively perform recognition based purely on monophones, but at the cost of an increased WER.
3.4.3  Marking the Syllable Boundaries and Stress
Another important step consisted of marking the syllable boundaries as well as the syllable
stress. This was achieved by means of a lexicon containing all the pronunciations of each word,
marked with syllable boundaries and stress. For the Portuguese BN, a set of syllabification
rules was designed and applied to the lexicon⁵. The rules account fairly well for the canonical
pronunciation of native words, but they still need improvement for words of foreign origin.
Regarding the English language, most of the lexicon content was created from the CMU
dictionary (version 0.7a). The phone sequence for the unknown words was provided by the
text-to-phone CMU/NIST tool addttp4⁶, and the stressed syllables were marked using the tool
tsylb2 (Fisher, 1996), which uses an automatic phonological-based syllabication algorithm.
3.4.4  Producing the Final XML File
After extracting and calculating the above information, all data was merged into a single data source, which provides all the required information for later use. The existing Data
Type Definition (DTD) (Section 3.3) has been upgraded in order to accommodate the additional
prosodic information.
Figure 3.12 illustrates the processing steps involved. The pitch and energy values are
extracted from the speech signal. A Gaussian mixture model (GMM) classifier is then used
to automatically detect speech/non-speech regions, based on the energy⁷. Both the pitch and the
speech/non-speech values are used to adjust the boundaries of the acoustic phone transitions,
generally known as diphones (Section 3.5 contains further details). An excerpt of the PCTM
input file, produced by the speech recognition system and containing the whole phone/diphone
sequence, is shown in Figure 3.13. This PCTM file is modified with the new unit boundaries, and then used to produce another file (in the same format) containing only monophones.
The monophone units are used both for removing the pitch values from unvoiced regions and
for producing a new PCTM file containing the syllable/phone information. Figure 3.14 presents
an excerpt of the resulting information, where the syllable boundaries and stress are marked.
A final XML file combines all the previous information together with pitch and energy
statistics for each unit. Figure 3.15 shows an excerpt from one of these files, containing one
transcript segment. Information concerning words, syllables and phones can be found in the
file, together with pitch, energy and duration information. For each unit of analysis we have
calculated the minimum, maximum, average, median and slope, both for pitch and energy.
Pitch slopes were calculated after converting the pitch differences into semitone values.
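The following sketch illustrates the kind of per-unit statistics just described (minimum, maximum, average, median and slope, with pitch slopes on a semitone scale); the least-squares slope and the semitone reference are assumptions, not necessarily the exact formulas used.

# Illustrative sketch of the per-unit pitch/energy statistics mentioned above
# (minimum, maximum, average, median and slope); the semitone conversion and
# the least-squares slope are reasonable assumptions, not the exact formulas
# used in the thesis.
import numpy as np

def semitones(f0_values, reference_hz=1.0):
    """Convert pitch values (Hz) to a semitone scale relative to a reference."""
    f0 = np.asarray(f0_values, dtype=float)
    return 12.0 * np.log2(f0 / reference_hz)

def unit_stats(values, times, to_semitones=False):
    """Return min/max/avg/median/slope of a contour over one unit (word, syllable or phone)."""
    v = np.asarray(values, dtype=float)
    t = np.asarray(times, dtype=float)
    contour = semitones(v) if to_semitones else v
    slope = np.polyfit(t, contour, 1)[0] if len(t) > 1 else 0.0  # least-squares line
    return {"min": v.min(), "max": v.max(), "avg": v.mean(),
            "med": np.median(v), "slope": slope}

# Pitch contour (Hz) sampled every 10 ms within a syllable.
print(unit_stats([251.2, 259.1, 266.0, 270.3, 271.9],
                 [0.00, 0.01, 0.02, 0.03, 0.04], to_semitones=True))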
⁵ Work performed by Isabel Trancoso and Helena Moniz, in cooperation with Hugo Meinedo.
⁶ Work of Thomas Pellegrini.
⁷ In cooperation with Alberto Abad.
[Figure 3.12 – processing diagram: energy and pitch are extracted from the speech signal; a GMM classifier produces a speech/non-speech (SNS) segmentation; diphone boundaries in the PCTM file are adjusted and monophones are produced; pitch is adjusted, syllables are marked using the lexicon, and per-unit statistics are added to the existing XML (excluded regions, focus conditions, punctuation, capitalization, morphology) to produce the final XML.]
Figure 3.12: Integrating prosody information in the corpora.
Recent efforts have been made to also include, in the XML files, additional valuable information available
in the manual transcripts. The most recent XML files include information
concerning filled pauses and other marks (inspirations, disfluency boundaries) available in the
manual annotations, now aligned with the corresponding speech signal.
3.5  Speech Data Word Boundaries Refinement
Duration of silent pauses is one of the most important features for detecting punctuation
marks, or at least sentence boundaries (Gotoh and Renals, 2000). Even though they may not
be directly converted into punctuation, silent pauses are in fact a basic cue for punctuation
and speaker diarization (Chen, 1999; Kim and Woodland, 2001). The durations of phones and
silent pauses are automatically provided by a large vocabulary continuous speech recognition module, Audimus (Meinedo et al., 2008).
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.00 0.27 interword-pause
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.27 0.01 L-m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.28 0.01 m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.29 0.04 m=u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.33 0.01 u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.34 0.02 u~=j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.36 0.01 j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.37 0.03 j~=t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.40 0.01 t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.41 0.02 t=u
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.43 0.01 u
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.44 0.01 u+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.45 0.01 L-b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.46 0.02 b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.48 0.01 b+R
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.49 0.02 L-o~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.51 0.05 o~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.56 0.05 o~+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.61 0.02 L-d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.63 0.02 d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.65 0.06 d=i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.71 0.04 i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.75 0.01 i=A
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.76 0.01 A
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.77 0.01 A+R+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.78 0.06 interword-pause
Figure 3.13: PCTM file containing the phones/diphones produced by the ASR system.
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.000 0.270 interword-pause
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.270 0.040 "m
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.310 0.040 u~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.350 0.035 j~
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.385 0.035 #t
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.420 0.030 u+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.450 0.040 "b
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.490 0.120 o~+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.610 0.070 "d
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.680 0.075 i
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.755 0.025 #A+
2000_12_05-17_00_00-Noticias-7.spkr000 1 14.780 0.060 interword-pause
Figure 3.14: PCTM file with monophones and marked with syllable boundary and stress.
<TranscriptSegment>
<TranscriptGUID>2</TranscriptGUID>
<AudioType start="1400" end="1484" conf="1.000000">Music</AudioType>
<Time start="1400" end="1484" reasons="" sns_conf="1.000000"/>
<Speaker id="2000" id_conf="1.000000" name="Mulher" gender="F" gender_conf="1.000000"
known="F"/>
<SpeakerLanguage native="T">PT</SpeakerLanguage>
<TranscriptWordList phones="10" ph_duration="51" ph_avg="5.1">
<Word start="1427" end="1444" conf="0.999766" focus="F3" cap="Muito" pos="R." name="muito"
phseq="_mu~j~#tu+" pmax="271.9" pmin="220.8" pavg="258.3" pmed="263.7" emax="59.7"
emin="35.0" eavg="47.8" emed="46.5" pslope="-1.8" eslope="0.4">
<syl stress="y" start="1427" dur="11.5" pmax="271.9" pmin="251.2" pavg="263.7" pmed="266.0"
emax="59.7" emin="35.0" eavg="49.0" emed="50.2" pslope="2.0" eslope="2.2">
<ph name="m" start="1427" dur="4" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="39.2"
emin="35.0" eavg="37.1" emed="37.0" pslope="0.0" eslope="1.6"/>
<ph name="u~" start="1431" dur="4" pmax="266.0" pmin="251.2" pavg="258.9" pmed="259.1"
emax="58.8" emin="45.1" eavg="53.8" emed="55.5" pslope="2.5" eslope="4.7"/>
<ph name="j~" start="1435" dur="3.5" pmax="271.9" pmin="268.0" pavg="270.1" pmed="270.3"
emax="59.7" emin="47.7" eavg="56.2" emed="58.7" pslope="-0.8" eslope="-3.6"/>
</syl>
<syl start="1438.5" dur="6.5" pmax="220.8" pmin="220.8" pavg="220.8" pmed="220.8" emax="47.9"
emin="41.6" eavg="45.7" emed="45.8" pslope="0.0" eslope="-0.2">
<ph name="t" start="1438.5" dur="3.5" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="47.7"
emin="41.6" eavg="45.5" emed="46.5" pslope="0.0" eslope="-1.7"/>
<ph name="u" start="1442" dur="3" pmax="220.8" pmin="220.8" pavg="220.8" pmed="220.8"
emax="47.9" emin="44.9" eavg="46.0" emed="45.2" pslope="0.0" eslope="0.2"/>
</syl>
</Word>
<Word start="1445" end="1460" conf="0.994520" focus="F3" pos="A." name="bom" phseq="_bo~+"
pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1" emax="60.8" emin="41.2" eavg="51.8"
emed="52.8" pslope="-1.9" eslope="0.2">
<syl stress="y" start="1445" dur="16" pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1"
emax="60.8" emin="41.2" eavg="51.8" emed="52.8" pslope="-1.9" eslope="0.2">
<ph name="b" start="1445" dur="4" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="49.2"
emin="41.2" eavg="44.0" emed="42.8" pslope="0.0" eslope="2.7"/>
<ph name="o~" start="1449" dur="12" pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1"
emax="60.8" emin="46.1" eavg="54.4" emed="56.2" pslope="-1.9" eslope="-1.3"/>
</syl>
</Word>
<Word start="1461" end="1477" conf="0.996707" focus="F3" punct="." pos="Nc" name="dia"
phseq="_di#6+" pmax="231.6" pmin="222.4" pavg="227.2" pmed="226.4" emax="61.6" emin="37.9"
eavg="50.7" emed="52.9" pslope="0.8" eslope="1.3">
<syl stress="y" start="1461" dur="14.5" pmax="229.3" pmin="222.4" pavg="225.5" pmed="225.0"
emax="57.3" emin="37.9" eavg="48.8" emed="48.6" pslope="0.7" eslope="1.3">
<ph name="d" start="1461" dur="7" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="46.4"
emin="37.9" eavg="43.2" emed="44.0" pslope="0.0" eslope="-0.5"/>
<ph name="i" start="1468" dur="7.5" pmax="229.3" pmin="222.4" pavg="225.5" pmed="225.0"
emax="57.3" emin="50.8" eavg="54.4" emed="54.1" pslope="0.7" eslope="1.0"/>
</syl>
<syl start="1475.5" dur="2.5" pmax="231.6" pmin="231.1" pavg="231.3" pmed="231.3" emax="61.6"
emin="59.3" eavg="60.5" emed="60.5" pslope="0.3" eslope="2.3">
<ph name="6" start="1475.5" dur="2.5" pmax="231.6" pmin="231.1" pavg="231.3" pmed="231.3"
emax="61.6" emin="59.3" eavg="60.5" emed="60.5" pslope="0.3" eslope="2.3"/>
</syl>
</Word>
</TranscriptWordList>
</TranscriptSegment>
Figure 3.15: Excerpt of one of the final XML files, containing prosodic information.
By the time prosodic cues began being used in this
study, it was found that the automatic phone segmentation was not being performed completely
well, so prosodic/acoustic cues were explored to improve a baseline phone segmentation
module (Moniz et al., 2010). An analysis of the baseline results revealed problems in word
boundary detection. A solution to these problems was put in place, using post-processing rules based on
prosodic features (pitch, energy and duration).
A limited subset of the ALERT-SR corpus, containing about 1h of speech, was transcribed
at the word boundary level⁸, in order to allow for the evaluation of the efficacy of the post-processing rules. With this sample one could evaluate the robustness of the speech segmentation
with several speakers, in prepared non-scripted and spontaneous speech settings, and with
different strategies regarding speech segmentation and speech rate.
The recognizer was used in forced alignment mode on this reduced 1h test set, which was manually transcribed at the word boundary level. As explained above, this
revealed several problems, namely in the boundaries of silent pauses and in their frequent
misdetection.
The post-processing rules achieved better results in terms of inter-word pause detection,
durations of previously detected silent pauses, and also durations of phones at initial and final
sentence-like unit positions. Moreover, these experiments showed that the improvements had
an impact both on acoustic models and on punctuation (Moniz et al., 2010). This work was
the first step towards more challenging problems, namely combining prosodic and lexical features for the identification of sentence-like units. It was also a decisive step towards the
goal of adding the identification of interrogatives to the punctuation module.
3.5.1  Post-processing rules
The post-processing rules were applied off-line, using both pitch and energy information.
In terms of pitch, the only information used was the presence or absence of pitch. The
energy information was also extracted off-line for each audio file. Speech and non-speech portions of the audio data were automatically segmented at the frame level with a bi-Gaussian
model of the log-energy distribution. That is, for each audio file, a one-dimensional, energy-based Gaussian mixture model with two components is trained. The Gaussian component with the
lowest mean is expected to correspond to silence or background noise, and the one with
the highest mean to speech. Then, frames of the audio file having a higher likelihood under the speech component are labeled as speech, and those more likely generated
by the non-speech component are labeled as silence.
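A minimal sketch of this bi-Gaussian speech/non-speech segmentation is given below, assuming per-frame log-energy values are already available; scikit-learn is used here only for illustration and is not the classifier used in this work.

# Minimal sketch of the bi-Gaussian speech/non-speech segmentation described
# above, assuming per-frame log-energy values are already available (e.g. from
# the energy extraction sketch earlier); scikit-learn is used here only for
# illustration, it is not the classifier used in the thesis.
import numpy as np
from sklearn.mixture import GaussianMixture

def speech_nonspeech_labels(log_energy):
    """Label each frame as speech (True) or silence (False)."""
    x = np.asarray(log_energy, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    speech_component = int(np.argmax(gmm.means_.ravel()))  # highest mean = speech
    return gmm.predict(x) == speech_component

# Example: low-energy frames (background) mixed with high-energy frames (speech).
frames = np.concatenate([np.random.normal(20, 2, 300),   # silence / noise
                         np.random.normal(55, 4, 700)])  # speech
labels = speech_nonspeech_labels(frames)
print(labels.sum(), "frames labeled as speech out of", len(frames))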
The integration of extra information was implemented as a post-processing stage with four rules⁹.
⁸ Work performed by Helena Moniz, an expert linguist.
[Figure 3.16 – plot of the percentage of correct word boundaries against the boundary threshold (ms), before and after the post-processing stage.]
Figure 3.16: Improvement in terms of correct word boundaries, after post-processing.
The four rules are the following (a sketch of how they could be applied is given after the list):
1. if the word starts with a plosive sound, the duration of the preceding pause is unchanged
(typically around 50 to 60 ms for European Portuguese);
2. if the word starts or ends with a fricative, the energy-based segmentation is used;
3. if the word starts with a liquid sound, both energy and pitch are used;
4. otherwise, word boundaries are delimited by pitch.
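The sketch below shows how such rules could be applied, assuming that the phone classes and the candidate boundaries from the original, energy-based and pitch-based segmentations are available; the phone class sets and the way rule 3 combines energy and pitch are assumptions, not the thesis implementation.

# Illustrative sketch of the rule set above (not the thesis implementation):
# choose which boundary evidence to trust for a word, given the phone class it
# starts/ends with. The phone classes and candidate boundaries are assumptions.
PLOSIVES   = {"p", "t", "k", "b", "d", "g"}
FRICATIVES = {"f", "v", "s", "z", "S", "Z"}
LIQUIDS    = {"l", "L", "r", "R"}

def choose_boundary(first_phone, last_phone, original, energy_based, pitch_based):
    """Return the (start, end) boundary pair selected by the post-processing rules."""
    if first_phone in PLOSIVES:
        return original                      # rule 1: keep the preceding pause as is
    if first_phone in FRICATIVES or last_phone in FRICATIVES:
        return energy_based                  # rule 2: trust the energy segmentation
    if first_phone in LIQUIDS:
        start = max(energy_based[0], pitch_based[0])   # rule 3: combine energy and pitch
        end = min(energy_based[1], pitch_based[1])
        return (start, end)
    return pitch_based                       # rule 4: delimit by pitch

print(choose_boundary("b", "o~", original=(14.45, 14.49),
                      energy_based=(14.44, 14.50), pitch_based=(14.46, 14.49)))
# (14.45, 14.49) -- word starts with a plosive, keep the original boundaries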
With these rules, more adequate word boundaries than those obtained with the previous segmentation
methods were expected, without imposing thresholds on silent pause durations, which
Campione and Véronis (2002) identified as misleading cues that do not account for differences between speakers, speech rates or speech genres.
3.5.2  Results
By comparing the word boundaries before and after the post-processing
stage on the limited 1h test set, it was found that 9.3% of the constituent-initial
phones and 10.1% of the constituent-final phones had their boundaries modified. Regarding the inter-word pauses, 62.5% of them were modified and 10.9% more were added.
⁹ Proposed by Helena Moniz.
Figure 3.17: Phone segmentation before (top) and after (bottom) post-processing.
Figure 3.16 illustrates the improvement in terms of correct boundaries when different boundary thresholds are used. The graph shows that most of the improvements are concentrated in
an interval corresponding to 5-60 ms. The manual reference has 443.82 seconds of inter-word
pauses; the modified version correctly identified 67.71 more seconds of silence than the original one, but there are still 14.81 seconds of silence that were not detected.
Figure 3.17 shows an example of silent pause detection corresponding to a comma. The
original sentence is:
o Infarmed analisa cerca de quinhentos [medicamentos], os que levantam mais dúvidas
quanto à sua eficácia / Infarmed analyses about five hundred [drugs], those that raise more
doubts about their effectiveness.
Word-initial and word-final phones are marked with “L-” and “+R”, respectively, whereas frequent
phone transition units are marked with “=”. The two automatic transcripts correspond to the
results obtained before (misdetection) and after post-processing.
3.5.3  Impact on acoustic models training
A new acoustic model was retrained using the phone boundaries modified by the above-mentioned rules. Using this second model, the WER decreased from 22.0% to
21.5%.
The number of correct phone boundaries has also been compared for a given threshold in
the results produced by these two acoustic models, and Figure 3.18 shows the corresponding
results. The graph shows that the phone boundaries produced by the second acoustic model
are closer to the manual reference.
[Figure 3.18 – plot of the percentage of correct word boundaries against the boundary threshold (ms), comparing the original and the retrained acoustic models.]
Figure 3.18: Improvement in terms of correct word boundaries, after retraining.
3.6  Summary
This chapter described the most relevant data used in the experiments performed in the scope of this thesis. Most of the experiments were conducted over broadcast
news data, but the relatively small size of the speech corpora soon demanded complementary
data, especially for the capitalization task.
The current version of the Portuguese BN corpus has recently been completely revised by
an expert linguist, thereby removing many inconsistencies, especially in terms of punctuation
marks, but also in terms of capitalization. That was particularly important given that the
previous version of this corpus was manually transcribed by different annotators, who did not
follow consistent criteria in terms of punctuation marks. The annotation agreement, in
terms of punctuation marks, was calculated using the two versions of the corpus.
Besides the Portuguese speech corpus, other languages were processed, namely Spanish and English. The English BN corpus combines five different English BN corpora subsets.
Each subset was produced in a different time period, built for a different purpose,
encoded with different annotation specification criteria, and is available in a different format. Combining these heterogeneous corpora demanded a normalization strategy, specifically
adapted for each corpus.
Written corpora contain information that is especially important for capitalization, given
that they provide information concerning the context where capitalized words appear.
All written corpora were normalized in order to be closer to speech transcripts; this task
demanded improving the existing normalization tools. Each corpus required a specially designed (or at least adapted) tool for dealing with specific phenomena. The Portuguese corpora
were collected from the web, and the English corpus is available from the LDC.
The automatic transcripts for all speech corpora were produced by the L2F recognition system. The reference punctuation and capitalization for the automatic transcripts were provided
by means of an alignment between the manual and the automatic transcripts, which is not a
trivial task, mainly because of the recognition errors.
The speech reference data was recently upgraded in order to accommodate additional
prosodic information. This chapter described the prosodic feature extraction process and the
creation of the corresponding new data source. The final
content is available as an XML file, containing not only pitch and energy, extracted directly
from the speech signal, but also phone information, syllable boundaries and syllable stress.
The prosodic features (pitch, energy and duration) were then used to adjust the word boundaries
automatically identified by the speech recognition system, using post-processing rules.
4  Capitalization Recovery
Many information sources, like newspaper articles, books, and most web pages, contain
proper capitalization. The capitalization task consists of rewriting each word of an input text
with its proper case information, given its context. Besides improving the readability of texts,
capitalization provides important semantic clues for further text processing tasks. Different
practical applications benefit from automatic capitalization as a preprocessing step: many computer applications, such as word processors and e-mail clients, perform automatic capitalization along with spelling and grammar corrections; and, when dealing with speech recognition
output, automatic capitalization provides relevant information for automatic content
extraction, NER, and MT.
This chapter focuses on the automatic recovery of capitalization information, both in written
newspaper corpora and in spoken transcripts. This study considers that the capitalization of
the first word of each sentence is performed in a separate processing stage (after punctuation,
for instance), since its correct orthographical form depends on its position in the sentence.
The results described here do not consider the first word of each sentence for evaluation. However,
results may be influenced when taking such words into account (Kim and Woodland, 2004).
This chapter is structured as follows: Section 4.1 analyses the capitalization
task on the basis of written corpora. Section 4.2 reports on early work comparing different capitalization approaches and choosing the best approach for processing speech transcripts. Section
4.3 goes deeper, studying the effects of language dynamics on the capitalization task. Section 4.4 describes the efforts to understand how capitalization models are best
updated. Section 4.5 presents the most recent work on capitalization. Section 4.6 reports on the
work of porting the capitalization task to other languages. Finally, Section 4.7 summarizes the
content of this chapter.
4.1  Capitalization Analysis Based on Written Corpora
The orthographical form of a given word can be classified as: lowercase (e.g., verbs, functional words, common nouns), first-letter-capitalized (e.g., proper nouns), uppercase (e.g.,
acronyms), and mixed-case (special words, such as McGyver or LaTeX). Many words assume a fixed capitalization form, unless they appear at the beginning of a sentence or in a title written in uppercase.
[Figure 4.1 – pie chart showing the proportion of each capitalization class (lowercase, first-upper, all-upper, mixed-case, ambiguous) in the PUBnews corpus.]
Figure 4.1: The different capitalization classes and their distribution in the PUBnews corpus.
Many other words may assume different capitalization forms, depending on the
context in which they are used (e.g., the English words bank/Bank or miles/Miles).
This section describes the frequency of each capitalization class in a corpus, and calculates the portion of words with ambiguous/unambiguous capitalization. The reported results
were obtained using the PUBnews newspaper corpus, described in Section 3.2.1. The training portion of the corpus contains about 700K different word forms, but most of them occur rarely, as predicted by Zipf's Law. In fact, about 46% of these word forms occur only once
(hapax legomena), and many of them consist of spelling errors (e.g., abandomos instead of abandonos); verbs with enclitic pronouns (e.g., abater-se-á); unusual strings of words connected
by hyphens for some discursive reason (e.g., a-comida-que-veio-do-mar/food-that-came-from-the-sea); words written in uppercase for emphasis, which would otherwise be written in lowercase
(e.g., ABANDONADO); foreign words (e.g., Abkykalykov); and other less common words (e.g.,
abismaticamente, abarcavam). Some of these words are easy to capitalize; for example, the
verbs with enclitic pronouns are always written in lowercase. There are 27 different enclitic pronouns that can be used with the same verbal form, which gives rise to a very large number of word forms, but their identification is straightforward, at least in written corpora.
4.1.1  Capitalization Ambiguity
The capitalization ambiguity was analyzed by considering the unique case-folded words occurring in the training portion of the corpus, and then finding all their possible graphical forms.
For this study, a word is considered unambiguous with respect to its capitalization if it occurs
in the corpus with the same orthographical form at least 99% of the time. This assumption
works well for high-frequency words, but gets less accurate for lower-frequency words. For
that reason, only case-folded words occurring at least 5 times were considered, which also reduces
the influence of spelling errors and unusual words connected by hyphens.
[Figure 4.2 – plot of the number of case-folded words and of ambiguously capitalized words per word frequency interval.]
Figure 4.2: Number of words by frequency interval in the PUBnews corpus.
Figure 4.1 illustrates the proportion of each capitalization class in the corpus. The graph shows that most
words in this corpus are written in lowercase, but still a large percentage is capitalized, and
first-letter-capitalized (first-upper) is the most common capitalization form. The number of
mixed-case words is quite similar to the number of uppercase words, and together they constitute the smallest subsets. The capitalization problem concerns the portion of words that are ambiguous
with respect to their capitalization. The figure shows that only a small portion of the words (about
9.6%) falls into this category, but they are still representative.
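The ambiguity criterion described above can be sketched as follows; the counting and thresholds follow the text (a dominant form covering at least 99% of the occurrences, and a minimum of 5 occurrences), while the whitespace tokenization is a simplifying assumption.

# Minimal sketch of the ambiguity analysis described above: a case-folded word
# is considered unambiguous if one of its graphical forms covers at least 99%
# of its occurrences, and only words seen at least 5 times are considered.
# The counting and thresholds follow the text; the tokenization is an assumption.
from collections import Counter, defaultdict

def ambiguous_words(tokens, min_count=5, dominance=0.99):
    """Return the set of case-folded words with ambiguous capitalization."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[token.lower()][token] += 1
    ambiguous = set()
    for word, counts in forms.items():
        total = sum(counts.values())
        if total >= min_count and max(counts.values()) / total < dominance:
            ambiguous.add(word)
    return ambiguous

text = "o banco fechou . o Banco de Portugal fechou o banco central . o Banco emprestou ao banco"
print(ambiguous_words(text.split(), min_count=5))   # {'banco'}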
The relation between word frequency and capitalization ambiguity is an important
issue, especially for the less frequent words. Such a relation provides important knowledge concerning the ability to predict the capitalization ambiguity of rare and unseen words. Figure
4.2 shows the number of different case-folded words and of words with ambiguous capitalization, by frequency interval. In order to achieve more meaningful results, the analysis is
performed using frequency intervals of the form Fi = [2^i, 2^(i+1) − 1], where the size of each interval Fi grows exponentially, corresponding to intervals of size 2^i. For example, the interval [16, 31]
contains about 30K different case-folded words, 2,140 of which are ambiguous in terms of capitalization. As expected, the figure shows (upper curve) that the number of case-folded words
with a given frequency decreases exponentially when moving from lower to higher word frequency intervals, thereby confirming the well-known Zipf's law. The figure also shows (lower
curve) that the number of words with capitalization ambiguity does not follow the same tendency. On the contrary, their number grows from interval F3 to F7 (frequencies between 8
and 255), despite the first intervals containing far more different case-folded words.
,-./01234/456274.
8274.569:;5-<=9>?2?.5@-A9:-39B-C2D
!"#$"%&'(")*+),*#-.
%#"
%!"
$#"
$!"
#"
!"
$
%
&
'
#
(
)
*
+
$!
$$
$%
$&
$'
$#
$(
$)
$*
/*#-)+#"01"%$2)3%&"#4'5
Figure 4.3: Distribution of the ambiguous words by word frequency interval.
This corpus contains about 189K different words after case-folding, considering only words occurring at least 5 times. The total number of words that are ambiguous in terms of capitalization is about 9.6%, which corresponds to about 18K words. Figure 4.2 has shown that lower frequency intervals contain many more words than higher frequency intervals, while the capitalization ambiguity distribution does not have the same shape. This is better illustrated in Figure 4.3, which shows the word distribution per interval, considering the whole corpus. The figure reveals that most of the ambiguous words are neither uncommon nor highly frequent, while it confirms that most of the words have low frequency. Finally, it is interesting to calculate the proportion between ambiguous and unambiguous words, in terms of capitalization, for a given frequency interval. Figure 4.4 illustrates this relation, revealing that words with a frequency around 2^14 (16K) are the most likely to be ambiguous in terms of capitalization.
One important conclusion coming from these results is that a suitable capitalization model
must consider the context where each word appears, and a simple capitalization lexicon is not
sufficient to deal with the capitalization ambiguity of many frequent words occurring in written
corpora. Nevertheless, a capitalization lexicon containing the most frequent capitalization form
of each word may be of use for dealing with out-of-vocabulary (OOV) words when retraining
is not possible.
4.2 Early Work Comparing Different Methods

An important preparatory step performed in the scope of this thesis consisted of assessing which methods would apply better to the capitalization problem, considering the different
Figure 4.4: Proportion of ambiguous words by word frequency interval.
types of corpora. In addition to the maximum entropy modeling approach, described in Section 2.3, we have explored two additional approaches, namely: (1) an HMM-based tagger, as
implemented by the disambig tool from the SRILM toolkit (Stolcke, 2002); (2) a transducer,
built from a previously created language model (LM). The two additional approaches are generative (joint), while the maximum entropy approach is discriminative (conditional).
Discriminative approaches model the posterior probabilities directly, and the parameter
values are inferred from the set of labelled training data. On the other hand, generative approaches model the joint distribution p(k, X) of events and labels, which can be done, for instance, by learning the class prior probabilities p(k) and the class-conditional densities p(X|k) separately, and, finally, by using Bayes' theorem to calculate the posterior probabilities p(k|X). Generative models are very good at modeling context, and they can handle missing
data or partially labelled data. Nevertheless, all other things being equal, it would be expected
that discriminative methods would have better predictive performance, since they are trained
to predict the class label rather than the joint distribution of input vectors and targets (Ulusoy
and Bishop, 2005). The following subsections provide details on the generative methods and
present results comparing the three methods.
4.2.1 Description of the Generative Methods
The two generative approaches start with an n-gram language model (LM), created from
the training corpus. The HMM-based approach uses the language model directly, while the
WFST-based approach requires creating a transducer from it. Our experiments use unigram,
bigram, and trigram language models, created using backoff estimates, as implemented by the
ngram-count tool of the SRILM toolkit, without n-gram discounts.
Figure 4.5: Using the HMM-based tagger.
Despite the fact that the capitalization experiments here described rely solely on word information, a strong disadvantage of using generative methods is the difficulty of benefiting
from a variety of features, such as part-of-speech tags or acoustic/prosodic features.
4.2.1.1 HMM-based Tagger
The HMM-based tagger, implemented by the disambig tool, uses a hidden-event n-gram LM (Stolcke and Shriberg, 1996), and can be used to perform capitalization directly from the LM. Figure 4.5 illustrates the process, where each cloud represents a process and ellipses represent data. Map represents a file that contains all possible orthographical forms of the words in the vocabulary. The idea consists of translating a stream of tokens from a vocabulary L (lowercase words) to a corresponding stream of tokens from a vocabulary C (capitalized words), according to a 1-to-many mapping. Ambiguities in the mapping are resolved by finding the C sequence with the highest probability given the L sequence. This probability is computed from the LM (see the disambig manual for more information).
This implementation of the HMM-based tagger can use different algorithms for decoding. However, the results presented here use the Viterbi decoding algorithm, where the output is the sequence with the highest joint posterior probability. This is a straightforward method, producing fast results, and it is often used by the scientific community for this task. For example, it has been the baseline suggested in the IWSLT workshop competitions (http://www.slt.atr.jp/IWSLT2006/downloads/case+punc_tool_using_SRILM.instructions.txt).
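As an illustration of the underlying idea (not the actual SRILM implementation), the sketch below decodes a lowercase token stream into its most likely capitalized form with a Viterbi search over the 1-to-many map, scored by a bigram model over capitalized tokens; the map and the bigram probabilities are toy assumptions.

```python
def viterbi_capitalize(lowercase_words, cap_map, bigram_logprob):
    """Return the most likely sequence of capitalized forms.

    lowercase_words -- input token sequence in lowercase
    cap_map         -- dict: lowercase word -> list of possible graphical forms
    bigram_logprob  -- function (prev_form, form) -> log P(form | prev_form)
    """
    # Each beam maps a candidate form to (best log-probability, backpointer).
    beams = [{"<s>": (0.0, None)}]
    for word in lowercase_words:
        forms = cap_map.get(word, [word])          # unknown words are kept lowercase
        beam = {}
        for form in forms:
            beam[form] = max(
                ((score + bigram_logprob(prev, form), prev)
                 for prev, (score, _) in beams[-1].items()),
                key=lambda x: x[0],
            )
        beams.append(beam)
    # Backtrace from the best final hypothesis.
    form, (score, prev) = max(beams[-1].items(), key=lambda kv: kv[1][0])
    output = [form]
    for beam in reversed(beams[1:-1]):
        output.append(prev)
        prev = beam[prev][1]
    return list(reversed(output))

# Toy example (hypothetical map and probabilities):
cap_map = {"ana": ["ana", "Ana", "ANA"], "canta": ["canta"]}
pairs = {("<s>", "Ana"): -0.5, ("Ana", "canta"): -0.3}
logp = lambda prev, cur: pairs.get((prev, cur), -5.0)
print(viterbi_capitalize(["ana", "canta"], cap_map, logp))   # ['Ana', 'canta']
```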
4.2.1.2 Finite state transducers
Figure 4.6: Using a WFST to perform capitalization.

The capitalization based on Weighted Finite State Transducers (WFST) is illustrated in Figure 4.6. This approach makes use of the LM previously built for the HMM-based tagger, which
is converted into an automaton (FSA), corresponding to a WFST having the input equal to the
output. The capitalization transducer T is created from this last WFST by converting every
word in the input to its lowercase representation. Notice that the input of the transducer T
uses a lowercase vocabulary while the output includes all orthographical forms. In order to capitalize a given input sentence, it must first be converted into an FSA (S) and then composed with the transducer T. The resulting transducer contains all possible sequences of capitalized
words, given the input lowercase sequence. The bestpath() operation over this composition
returns the most probable sequence of capitalized words.
From a more theoretical point of view, the capitalization process consists of calculating the
best sequence of capitalized tokens c ∈ C* for the lowercase sequence l ∈ L*, as expressed in
equation 4.1.
ĉ = argmax_{c∈C*} P(c|l)                                  (4.1)
using Bayes’ rule:
P(c|l) = P(l|c) P(c) / P(l) = P(l, c) / P(l)               (4.2)
Assuming that P(l) is a constant, the capitalization process consists of maximizing P(l|c) P(c), or equivalently P(l, c), as expressed by equation 4.3.
ĉ = argmax_{c∈C*} P(l, c)                                  (4.3)
In terms of transducers, the prior P(c) can be computed from the FSA built from the LM, and P(l|c) is computed from the FSA built from the sentence. The composition S∘T contains all possible capitalization sequences c for the input sequence l, and P(l, c) can be computed from all the paths associated with sequence c. The Viterbi approximation is used; therefore, the bestpath() operation over the composition returns the c sequence that maximizes P(l, c).
4.2.2 Comparative Results
The three methods provide different ways of performing automatic capitalization. However, while the generative methods require a predefined vocabulary, this is not the case of the
discriminative method. In order to compare all three methods, the training data was restricted
to a vocabulary. The vocabulary contains almost all the words that can be found in the speech
transcripts. Notice that the latest versions of the ASR use a dynamic vocabulary. As a closed
vocabulary is used, all words outside the vocabulary were marked “unknown”. The punctuation marks were also removed from the corpus, bringing the written newspaper corpus closer
to speech transcripts. The out-of-vocabulary (OOV) words include proper nouns, domain-specific words, and unusual verbal forms, but their capitalized form is usually fixed. Hence,
apart from verbs, most of them can be handled with domain-specific and periodically updated
lexica. These experiments used only the word identification information, sometimes combined
as bigrams and trigrams.
The capitalization of mixed-case words is simple to accomplish using the generative methods, but it considerably increases the complexity of the discriminative model. For that reason, these experiments considered only three ways of writing a word: lowercase, uppercase, and first-capitalized, not covering the mixed-case words. The experiments were conducted both on written newspaper corpora and on spoken transcripts, making it possible to analyze the impact of the different methodologies on these two different data sets. The first set of experiments was performed on written newspaper corpora, using the PUBnews corpus both for training and evaluation, allowing an upper bound for capitalization to be established. Written newspaper corpora and spoken transcripts were then combined in order to provide richer training sets and to reduce the problem of having small quantities of spoken data for training. The spoken data evaluation subset corresponds to merging Eval, JEval and RTP07, but only manual transcripts were used because, by the time these experiments were performed, the automatic speech transcripts did not include reference case information. Results achieved using only the most common orthographical form were also included in the experiments, as this is a popular baseline for similar work (Lita et al., 2003; Chelba and Acero, 2004; Kim and Woodland, 2004; Agbago et al., 2005).
LM options     unigrams   bigrams   trigrams
LM size (MB)   3.2        198       504

Table 4.1: Different LM sizes when dealing with the PUBnews corpus.
              HMM-based tagger                        WFST
LM options    Precision   Recall   F       SER       Precision   Recall   F       SER
unigrams      89.6%       61.5%    72.9%   45.2%     85.0%       64.4%    73.3%   46.3%
bigrams       92.0%       72.0%    80.8%   33.8%     89.5%       73.1%    80.5%   34.8%
trigrams      93.3%       74.5%    82.8%   30.4%     91.7%       75.3%    82.7%   30.9%

Table 4.2: Capitalization results of the generative methods over the PUBnews corpus.
4.2.2.1 The generative approaches
An LM created from a large written newspaper corpus may include spelling errors and rare words which, when combined into bigrams and trigrams, increase the size of the LM without much gain. Thus, all bigrams and trigrams occurring fewer than 4 times were removed from the LMs built from the PUBnews training data. This yields a significant reduction in LM size
without much impact on the results. Table 4.1 shows the size of each LM, after this restriction,
depending on the building options.
Table 4.2 shows the results achieved by training and testing on the written newspaper corpus, where F corresponds to the F-measure. The left side of the table shows results produced by the HMM-based tagger, while the right side shows equivalent results produced using the WFST approach (transducers), for the same training and testing data. Similar results were expected from both methods, since the transducers were built from exactly the same LM; nevertheless, the HMM-based tagger achieves a slightly better performance. As expected, results improve as the LM order increases: the best results were achieved using trigram models, although the largest difference occurs when moving from unigrams to bigrams. We have performed other experiments using 4-grams, but results do not improve significantly. That is consistent with the work of Gravano et al. (2009), which concludes that increasing the n-gram order does not improve capitalization results. Although no spelling errors exist in the ASR output, recognition errors and disfluencies are quite frequent, especially in spontaneous speech. For this reason, results on a written newspaper corpus should be taken as an upper bound for the capitalization of
automatic speech transcripts.
The spoken data is insufficient for training, so both PUBnews and the ALERT-SR BN training data were combined in order to provide a richer LM. The final LM is a linear interpolation
between: LM1 - built from PUBnews training data; and LM2 - built from the ALERT-SR training data, where the interpolation parameter lambda was 0.737121 when considering trigrams
(perplexity = 174.5) and 0.690977 when considering bigrams (perplexity = 267.7). These lambda values, calculated using the compute-best-mix tool (included in the SRILM toolkit), minimize the perplexity of the interpolated model on the BN corpus development subset, which was not previously used for training. Perplexity is a common way of evaluating language models, which are probability distributions over texts. Given a probability model q, one may evaluate q by asking how well it predicts a separate test sample x1, x2, ..., xN. The perplexity of the model q is defined as 2^(−(1/N) ∑_{i=1}^{N} log2 q(xi)). Better models tend to have a lower perplexity, which means they are less surprised by the test sample (Jelinek et al., 1977; Brown et al., 1992).

              HMM-based tagger                        WFST
LM options    Precision   Recall   F       SER       Precision   Recall   F       SER
unigrams      85.9%       72.5%    78.6%   39.3%     85.9%       72.7%    78.8%   39.2%
bigrams       82.5%       81.9%    82.2%   35.1%     82.6%       82.0%    82.3%   34.8%
trigrams      81.8%       83.4%    82.6%   34.8%     81.9%       83.3%    82.6%   34.7%

Table 4.3: Capitalization results of the generative methods over the ALERT-SR corpus.
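The interpolation and the perplexity computation described above can be sketched as follows, using toy unigram models in place of the actual SRILM backoff LMs; choosing the lambda that minimizes the held-out perplexity mirrors what the compute-best-mix tool does for the real models.

```python
import math

def interpolated_prob(word, lm1, lm2, lam, floor=1e-7):
    """P(word) = lam * P_LM1(word) + (1 - lam) * P_LM2(word)."""
    return lam * lm1.get(word, floor) + (1.0 - lam) * lm2.get(word, floor)

def perplexity(sample, lm1, lm2, lam):
    """Perplexity of the interpolated model over a held-out word sequence."""
    log_sum = sum(math.log2(interpolated_prob(w, lm1, lm2, lam)) for w in sample)
    return 2.0 ** (-log_sum / len(sample))

# Toy unigram "LMs" (hypothetical probabilities) and a held-out sample:
lm_news = {"emprego": 0.02, "governo": 0.03, "bom": 0.01}
lm_bn   = {"emprego": 0.01, "governo": 0.01, "bom": 0.05}
heldout = ["governo", "emprego", "bom", "governo"]

# Pick the lambda that minimizes perplexity on the development data.
best = min((perplexity(heldout, lm_news, lm_bn, lam), lam)
           for lam in [i / 100 for i in range(1, 100)])
print("best lambda = %.2f, perplexity = %.2f" % (best[1], best[0]))
```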
Table 4.3 shows results for capitalization of speech transcripts, where the left side of the
table shows results achieved by the HMM-based tagger and the right side shows equivalent results achieved using a transducer. Results reveal a decrease of performance when moving from
written newspaper corpora to speech transcripts. The precision values decrease significantly,
while the recall increases because of the vocabulary restrictions of the spoken transcriptions.
The best results are produced with trigrams, but they are very similar to the results obtained with bigrams, given the flexible linguistic structure of spoken texts as opposed to written texts. Since the written newspaper corpora have properties different from speech transcripts,
the availability of more spoken training data would certainly improve these results.
In short, the two generative methods produce similar results. Nevertheless, the current implementation of the WFST method implies loading, composing and searching a large non-deterministic transducer for each sentence, making it the computationally most expensive of the methods proposed here. This process was accelerated in later experiments by considering blocks of n sentences (e.g., 1K sentences) and applying the capitalization transducer to the whole block. The automaton created from the block contains a single path that
goes from the initial to the final state, covering the whole text sequence. The time required for
the composition operation is only a few seconds, and it is performed once for each group of
n sentences. Using this strategy, the time required for this method became similar to the time
taken by the HMM tagger.
4.2.2.2 The discriminative approach
Experiments concerning the discriminative approach, presented in this subsection, use the following unigram and bigram features for a given word: wi, <wi−1, wi>, and <wi, wi+1>. Additional experiments using trigrams were also performed initially, but no improvements were achieved. The memory limitations mentioned in Section 2.3 make it difficult to use all the written newspaper corpora for training. Therefore, the following experiments use the two different strategies described in Section 2.3.2: i) use all the training data, by extracting n-gram counts and then producing features for each corresponding n-gram; ii) perform successive retraining over all the training data, using blocks of fixed size (2.5 million words). All the words occurring at least four times in the PUBnews corpus were used for training. Table 4.4 shows the corresponding results, revealing similar performances in terms of SER for both strategies. As for the BN speech transcripts, the ALERT-SR training data was used together with the PUBnews training data in order to create the ME models. For written corpora, the first strategy achieves a better recall, while the second one achieves a better precision, but the results are quite similar. Results concerning the speech transcripts reveal a lower precision but a better recall when compared to written corpora. The features used in these experiments are more expressive than simple bigrams and less expressive than trigrams. This statement is supported by the achieved results, given that the performance is better than using bigrams with the generative approaches, and worse than using trigrams with the generative approaches. While the generative approaches are better suited for capitalizing written newspaper corpora, the discriminative approach produces better results for the BN transcripts, corresponding to the best results seen so far. The second strategy learns the most common capitalization combinations appearing in the corpus, making it suitable for the flexible linguistic structure found in speech transcripts, especially in spontaneous speech.

                     n-gram counts                          successive retraining
Training approach    Prec.    Recall   F       SER          Prec.    Recall   F       SER
PUBnews corpus       92.0%    73.8%    81.9%   32.4%        93.2%    72.5%    81.6%   32.6%
ALERT-SR corpus      87.4%    80.5%    83.8%   31.0%        87.2%    80.5%    83.7%   31.2%

Table 4.4: ME-based capitalization results using limited vocabulary.

                     n-gram counts                          successive retraining
Training approach    Prec.    Recall   F       SER          Prec.    Recall   F       SER
PUBnews corpus       93.3%    82.2%    87.4%   23.3%        93.2%    83.2%    87.9%   22.5%
ALERT-SR corpus      87.3%    85.4%    86.3%   26.8%        86.5%    85.9%    86.2%   27.2%

Table 4.5: ME-based capitalization results using unlimited vocabulary.
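As an illustration of the features described at the beginning of this subsection, the sketch below extracts, for each reference token, the unigram and bigram features (wi, <wi−1, wi> and <wi, wi+1>) together with its capitalization class. The feature naming scheme and the three-class mapping are simplifying assumptions; the actual experiments use the MegaM training format.

```python
def capitalization_class(token):
    """Map a token to one of the three classes used in these experiments."""
    if token.isupper() and len(token) > 1:
        return "UPPER"
    if token[:1].isupper():
        return "FIRST_CAP"
    return "LOWER"

def extract_features(reference_tokens):
    """Yield (feature list, class) pairs for a discriminative capitalization model."""
    lower = [t.lower() for t in reference_tokens]
    for i, token in enumerate(reference_tokens):
        feats = ["w=" + lower[i]]                                        # current word (wi)
        if i > 0:
            feats.append("prev_bi=" + lower[i - 1] + "_" + lower[i])     # <wi-1, wi>
        if i + 1 < len(lower):
            feats.append("next_bi=" + lower[i] + "_" + lower[i + 1])     # <wi, wi+1>
        yield feats, capitalization_class(token)

for feats, label in extract_features(["os", "centros", "de", "emprego", "em", "Portugal"]):
    print(label, feats)
```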
4.2.3 Results using Unlimited Vocabulary
Data from the previous subsection was restricted to a limited vocabulary, in order to allow comparing all three methods. However, such a restriction is only imposed by the generative approaches, and is not required by the discriminative approach. Therefore, another set of experiments was performed without any vocabulary restrictions, using only the discriminative approach. All the words occurring less than four times in the PUBnews training corpus were removed, minimizing the effects of misspelled words and reducing memory limitations. Table 4.5 shows the achieved performance, where the capitalization model applied to the BN transcripts was also retrained with BN speech transcripts (ALERT-SR corpus) before being applied to the BN data. Results reveal an expected increase in performance, especially in terms of recall, when compared with the results from Table 4.4. Differences are significant, especially for the written corpus (about 10% absolute). In fact, this corpus contains a very large number of different words, previously considered as OOV words. The speech data contains far fewer OOV words, but the differences (about 4% absolute) are still quite significant.
4.3 Impact of Language Dynamics
In order to better understand how we should train our capitalization models, we started by analyzing the newspaper corpus to establish a relation between vocabulary usage and the timeline. The major goal of these experiments is to assess the importance of training data collected in periods near the testing period for capitalizing written corpora, as well as manual and automatic transcripts. The first set of experiments concerns the capitalization of written corpora. The capitalization of Broadcast News (BN) orthographic transcripts is addressed afterwards. The major source of capitalization information is always provided by written newspaper corpora, and the evaluation was conducted on three different subsets of speech transcripts, collected from different time periods.
Experiments described in this section were performed using the discriminative modeling approach, described in Section 2.3, following the retraining strategy. Like the experiments from the previous section, the experiments in this section used only features comprising word identification, sometimes combined as bigrams: wi (current word); <wi−1, wi>, <wi, wi+1> (bigrams). Only words occurring more than once were included for training, and only three ways of writing a word were explored: lowercase, uppercase, and first-capitalized.
4.3.1 Data Analysis
In order to verify the relation between vocabulary usage and the timeline, we started by analysing the PUBnews newspaper corpus. Each subset of about 2.5 million words from the newspaper corpus contains about 86K unique words, of which only about 50K occur more than once. In order to assess the relation between word usage and the language period, a set of vocabularies was created, with the 30K most frequent words appearing in each training set (roughly corresponding to a frequency above 3). Then, the first and last corpus subsets were checked against each one of the vocabularies. The name of each vocabulary is the same as that of the corresponding corpus subset, and corresponds to the month of the latest data in that subset. Figure 4.7 shows the results, revealing that the number of OOV (out-of-vocabulary) words decreases as the time gap between the training and testing periods gets smaller.
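A minimal sketch of this analysis is shown below: it builds a vocabulary with the 30K most frequent case-folded words of each training subset and counts how many tokens of a test set are not covered by it. The subset and test variables are toy placeholders for the actual PUBnews partitions.

```python
from collections import Counter

def build_vocabulary(tokens, size=30000):
    """Return the `size` most frequent case-folded words of a training subset."""
    counts = Counter(t.lower() for t in tokens)
    return {w for w, _ in counts.most_common(size)}

def oov_count(test_tokens, vocabulary):
    """Number of test tokens not covered by the vocabulary."""
    return sum(1 for t in test_tokens if t.lower() not in vocabulary)

# Placeholder data: (period label, token list) training subsets and a test set.
subsets = [("1999-01", ["exemplo"] * 10), ("2004-11", ["exemplo", "novo"] * 10)]
test = ["exemplo", "novo", "palavra"]

for period, tokens in subsets:
    vocab = build_vocabulary(tokens)
    print(period, oov_count(test, vocab))
```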
Figure 4.7: Number of OOVs considering written corpora.
Figure 4.8: Proportion of OOVs considering speech transcripts.
The same conclusion also applies to speech transcripts, but the visualization is not so obvious. Using the same 57 previously calculated vocabularies, we have counted the number of
words appearing in each testing set of the speech corpus that were not covered by each one of
the vocabularies. Figure 4.8 shows the corresponding results, where for each one of the testing
periods the closest training periods are marked with a circle. The graph shows that, for each
evaluation set, vocabularies built with nearby data have a better coverage of the data.
These results suggest that lexical information changes whenever the language period
changes. But, how does this affect the capitalization task? The rest of this section addresses
this question, by presenting capitalization experiments with different time conditions.
         2000-12 test set                         2004-12 test set
Train    Precision   Recall   F       SER        Precision   Recall   F       SER
1999     93.3%       79.0%    85.6%   26.1%      92.4%       76.3%    83.6%   29.6%
2000     93.5%       80.1%    86.3%   25.0%      92.4%       76.8%    83.9%   29.1%
2001     93.7%       80.0%    86.3%   25.0%      92.9%       76.4%    83.8%   29.1%
2002     93.2%       79.2%    85.6%   26.1%      92.5%       78.2%    84.8%   27.7%
2003     93.1%       77.7%    84.7%   27.5%      93.0%       78.2%    85.0%   27.3%
2004     92.5%       78.0%    84.6%   27.8%      92.5%       79.7%    85.6%   26.4%

Table 4.6: Using the first 8 subsets of each year for training.
Figure 4.9: Performance for different training periods.
4.3.2 Capitalization of Written Corpora
The following experiments use the PUBnews newspaper corpus, where the first 57 subsets were used for training and the last subset was used for evaluation. The punctuation marks were
removed from the original text, and only events occurring more than once were included for
training.
4.3.2.1 Separated training using one year of training data
In order to assess how time affects the capitalization performance, the first experiments
consist of producing six isolated capitalization models, one for each year of training data. For
each year, the first 8 subsets were used for training and the last one was used for evaluation.
Table 4.6 shows the corresponding capitalization results for the first and last testing subsets, revealing that performance is affected by the time lapse between the training and testing periods.
The best results were always produced with training data from periods near the testing data, even if this is not so obvious for the first test set. The performance results for each training period are also illustrated in Figure 4.9. A similar behavior was observed for the other four testing subsets, corresponding to the last subset of each year, but those results are not presented here for the sake of simplicity. Results reveal
a degradation of performance when the training data is from a time period after the evaluation data.
Checkpoint   LM #lines      Precision   Recall   F       SER
1999-12      1.27 Million   92.4%       77.0%    84.0%   29.0%
2000-12      1.86 Million   92.5%       79.3%    85.4%   26.6%
2001-12      2.36 Million   93.0%       79.9%    86.0%   25.7%
2002-12      2.78 Million   93.2%       80.8%    86.6%   24.7%
2003-12      3.10 Million   92.9%       82.2%    87.2%   23.6%
2004-08      3.36 Million   93.2%       83.2%    87.9%   22.5%

Table 4.7: Retraining from Jan. 1999 to Sep. 2004.
Figure 4.10: Capitalization of written corpora, using forward and backwards training.
Results presented in the last row concerning the 2004-12 evaluation set are about 3.9%
worse (in terms of SER) than those presented in Section 4.2.3 for the same test set and using the
same training conditions. This is mostly due to a low coverage of the training data, revealing
that 20-million-word training sets do not provide sufficient data for the capitalization task.
4.3.2.2 Forward and Backward Retraining
Previous results used one year of training data, iteratively retraining the previously calculated capitalization models with the new data. The following results use the same retraining strategy, but consider the whole training corpus. Table 4.7 shows the results achieved with this approach for the PUBnews test set (Oct. to Dec. 2004), revealing higher performance as more training data becomes available.
The following experiment shows that, besides the amount of data, the training order is also important. In fact, the previous results could suggest that the increase in performance comes solely from the increasing number of training events. For that reason, another experiment was performed, using the same training and testing data, but retraining backwards, similarly to
what has been performed by Mota and Grishman (2009) for NER. Corresponding results are
illustrated in Figure 4.10, revealing that the backward training results are worse than the forward training results, and that the backward training results do not always improve; rather, they stabilize after a certain amount of data. Although both training experiments use all the training data, in the case of forward training the time gap between the training and testing data gets smaller with each iteration, while in backward training it grows. Similar conclusions were reached by Mota and Grishman (2009) for their NER tagger based on co-training, where the gain was higher when older seeds and contemporary unlabeled data were used, instead of contemporary seeds and older unlabeled data. Figure 4.10 also attests that a retraining-based strategy is suitable for using large amounts of data and for dealing with language adaptation.
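The difference between the two training orders can be sketched as follows; the retrain function below is a hypothetical stand-in for the ME retraining step of Section 2.3, and only the ordering of the corpus subsets differs between the forward and backward variants.

```python
def retrain_in_order(base_model, subsets, retrain, backwards=False):
    """Retrain a model over chronologically ordered corpus subsets.

    subsets   -- list of (period, examples), oldest first
    backwards -- if True, start from the most recent subset and move back in time
    retrain   -- hypothetical function (model, examples) -> updated model
    """
    ordered = list(reversed(subsets)) if backwards else list(subsets)
    model, checkpoints = base_model, []
    for period, examples in ordered:
        model = retrain(model, examples)      # warm start from the previous weights
        checkpoints.append((period, model))   # evaluate each checkpoint on the test set
    return checkpoints
```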
4.3.3 Capitalization of Speech Transcripts
The following experiments apply the previous capitalization models, learned from written
corpora, directly to the ALERT-SR evaluation data. The impact of speech recognition errors on the capitalization of speech data is assessed by performing the same experiments both on manual and on automatic transcripts.
4.3.3.1 Manual transcripts
In order to assess the relation between the data period and the capitalization performance
of the speech data, the forward trained capitalization models, used previously for written corpora (see Figure 4.10), were also applied to each one of the SR corpus evaluation subsets. Figure
4.11 shows the corresponding performance variations for each speech data evaluation set, in terms of SER.

Figure 4.11: Automatic capitalization of speech transcripts, using forward retraining.

Training   Eval                          JEval                         RTP07
data       Prec.    Recall   SER         Prec.    Recall   SER         Prec.    Recall   SER
1999       83.2%    80.0%    35.8%       86.3%    83.8%    29.4%       92.5%    80.3%    26.1%
2000       83.4%    79.6%    35.9%       86.3%    84.4%    28.9%       92.4%    80.9%    25.7%
2001       84.3%    80.2%    34.5%       86.7%    86.7%    26.4%       93.1%    79.6%    26.2%
2002       84.1%    79.5%    35.5%       86.2%    86.1%    27.5%       93.2%    80.6%    25.1%
2003       82.9%    78.9%    37.3%       85.9%    85.4%    28.4%       91.5%    81.9%    25.6%
2004       83.9%    78.9%    36.1%       86.5%    84.7%    28.3%       92.4%    82.2%    24.4%
All        83.3%    80.7%    35.2%       85.7%    88.4%    26.2%       92.2%    84.5%    22.4%

Table 4.8: Evaluating with manual transcripts.

The figure suggests that, for each test set, the performance grows until the corresponding time period is reached, but it does not significantly improve after that period. The
continuously growing performance for the PUBnews and RTP07 test sets is related to their time period being later than the training data. The graph also shows that the performance behaves similarly for written corpora and for speech transcripts. However, the performance evolution for the written corpora is smoother and steeper.
Another relevant question concerns the amount of training data required for achieving the
best results. Is it necessary to use all the training data? Besides using all corpora for training,
some experiments were also conducted using only the first 8 corpora subsets of every year
for training (about 20 million words). Another set of experiments, either applying the written corpora LMs directly to the transcript data or retraining them with transcripts before applying them, revealed a better performance for the latter, due to the closer properties between the training and the testing sets. Hence, we have retrained the ME models calculated from written corpora with manual transcript training data, thus achieving a 2% to 5% performance improvement. Table 4.8 shows the corresponding results, in terms of SER, Precision and Recall, where bold marks the best results for each corpus subset. The first 6 rows correspond to the initial training with the first 8 corpora subsets of each year, while the last row corresponds to using all the training data. The table shows that when all data is used, a better recall is achieved whereas the precision slightly decreases. In general, we may conclude that, for capitalizing manual transcripts, large amounts of training data are not necessary if recent data is available. Results also show that the RTP07 test subset consistently presents the best performance, in contrast to the Eval subset. Nonetheless, the worse performance for the Eval and JEval sets is also due to the unusual topics covered in the news at that time (the US presidential elections and the War on Terrorism). Notice that a number of unusual foreign names were brought into focus by the media organizations at that time.
4.3.3.2 Automatic transcripts
Table 4.9 shows the results of capitalizing automatic transcripts with the LMs also used for
the results of Table 4.8. As shown in the table, overall results are about 20% worse in terms of SER, which is mostly due to the recognition WER (Word Error Rate).

Training   Eval                          JEval                         RTP07
data       Prec.    Recall   SER         Prec.    Recall   SER         Prec.    Recall   SER
1999       71.9%    72.9%    55.2%       74.6%    76.4%    49.4%       79.4%    72.7%    46.0%
2000       72.1%    72.7%    55.1%       74.8%    77.2%    48.5%       79.5%    73.0%    45.5%
2001       72.8%    73.4%    53.7%       74.9%    78.0%    47.8%       79.7%    72.0%    46.1%
2002       72.9%    73.0%    54.0%       74.7%    77.9%    48.1%       80.4%    72.9%    44.7%
2003       71.8%    72.4%    55.8%       74.2%    77.1%    49.3%       79.2%    73.7%    45.5%
2004       72.5%    72.8%    54.6%       74.8%    76.5%    49.0%       80.3%    73.7%    44.3%
All        71.6%    73.6%    55.2%       73.7%    79.5%    48.4%       79.2%    75.2%    44.3%

Table 4.9: Retraining with manual and evaluating with automatic transcripts.

Figure 4.12: Comparing the capitalization results of manual and automatic transcripts.

Besides, some errors may
also be due to unsolved alignment problems. In fact, more accurate results would be achieved
if a manually corrected capitalization alignment could be used. These results also suggest a
strong relation between the performance and the training period. The distribution of values is
similar to the previous results concerning manual transcripts. Figure 4.12 illustrates the relation
between training and testing periods, for both manual and automatic transcripts. A number of additional experiments were also conducted, for example by retraining each written corpora capitalization model with automatic transcripts before applying it to automatic transcripts, but only small differences were observed, following the same tendency.
4.3.4 Conclusions
We have studied the impact of language variation over time and the way it affects the capitalization task. The results reveal a strong relation between the capitalization performance and the elapsed time between the training and testing data periods. The reported experiments suggest that the capitalization performance decays over time, as an effect of language variation, supporting the idea that different capitalization models should be used for different time periods. This has also been previously addressed by related work on NER, with similar conclusions (Collins and Singer, 1999; Mota and Grishman, 2008, 2009). The adopted approach, based on maximum entropy models, takes the language changes into consideration, providing a clean framework for learning from new data, while slowly discarding unused data.
4.4 Capitalization Model Adaptation
The application of automatic speech recognition (ASR) to the closed-captioning of Broadcast News (BN) has motivated the adaptation of the lexical and language models of the recognizer on a daily basis, with text material retrieved from online newspapers (Martins et al., 2007b,a). The vocabulary and language model adaptation approaches use 3 corpora as training data: the manual transcriptions of the BN speech training data, a large newspaper text database with 741M words, and a relatively small adaptation set consisting of the last 7 days of online text. The
selection of the 100k dynamic vocabulary is POS-based. Relative to the first ASR version, which
used a fixed vocabulary of 57k words, the dynamic version achieves a relative reduction of 65%
in OOV (out-of-vocabulary) word rate and of 5.7% in WER (word error rate). Roughly half of
this improvement is due to the increased size of the vocabulary, as shown by the WER results
obtained with a baseline version using a static vocabulary of 100k words.
These improvements have an obvious impact on the quality of the automatically produced
subtitles, which include online punctuation and capitalization. An offline topic segmentation
and indexation module (Amaral and Trancoso, 2008) splits the BN show into stories and assigns one or more topics to each story from a closed set of topics. An extractive summarization
module (Ribeiro and Matos, 2007) also assigns a summary to each story. These post-ASR modules were originally trained with the material available up to a certain date, in no way taking advantage of the online newspapers that are being collected daily. One of the goals of this study is to try to use this data to train better capitalization models.
This section analyses the capitalization performance either using a static capitalization
model (CM) or using dynamic capitalization models retrained over time with training data
from nearby time periods. The capitalization procedure uses the adopted ME-based approach,
and the features are based on word identification: wi (current word); and <wi−1, wi>, <wi, wi+1>
(bigrams including the previous and next words). Like before, only three ways of writing a
word are considered: lowercase, first-capitalized, and uppercase.
4.4.1 Baseline results
The capitalization model currently in use for BN closed-captioning, denoted as BaseCM, provides the baseline; it was trained with the content of the PUBnews corpus. The capitalization model adaptation uses the LMnews corpus, which consists of online text collected daily from the web, as described in Section 3.2.1.
Evaluation set        % Precision   % Recall   % F    % SER
Manual transcript     86.0          87.6       86.8   26.6
S_ASR                 70.5          78.5       74.3   54.0
D_ASR                 72.2          80.8       76.3   50.1

Table 4.10: Baseline capitalization results produced using BaseCM.

Approach       Model period   Man      S_ASR    D_ASR
baseline       2008-05-20     26.6 %   54.0 %   50.1 %
LMnews only    2008-05-20     26.5 %   53.3 %   49.5 %
adapt-base     2008-05-20     26.0 %   54.4 %   50.2 %
adapt-iter     2008-05-20     25.0 %   53.8 %   49.6 %
adapt-iter     daily model    25.0 %   53.6 %   49.8 %

Table 4.11: Capitalization SER achieved for all different approaches.
By the time these experiments were conducted,
the corpus contained about 30M words. The evaluation was performed over the RTP08 test
set, which consists of 5 BN shows. The corpus contains about 40k words and was collected during June and July 2008, with an 8-day time span between each BN show. Besides the manual orthographic transcript, two automatic transcripts were also available, sharing the same preprocessing segmentation: S_ASR – produced using a static LM and a static 100k word vocabulary; and D_ASR – produced using a dynamic LM and vocabulary, built specifically for the corresponding day, with the recognition system existing at that time.
The baseline results, achieved using the capitalization model currently in use for daily subtitling (BaseCM), are shown in Table 4.10. The capitalization performance decreases when dealing with automatic transcripts. Even so, a better performance is achieved for the D_ASR transcript, where both the LM and the vocabulary are computed daily, and a lower WER is achieved.
4.4.2 Adaptation results
The adaptation and retraining experiments performed in the scope of this work use LMnews corpus subsets of 2M words each, and the previously described retraining method. Each subset is referred to by the day corresponding to the latest data in that subset. Accordingly, the capitalization model that results from retraining with a given corpus subset is also referred to by the day corresponding to the latest data in that subset.
Three adaptation approaches were tested: i) using only the LMnews corpus for training;
ii) adapting the BaseCM to a target period, by retraining with the latest data from that period;
and iii) iteratively retraining BaseCM with all the available corpora subsets. While the first
approach assumes that using only the most recent data (LMnews) is sufficient for training, the
other two approaches use this data to retrain the baseline CM, assuming that former data also
provides important capitalization information. The second approach assumes that BaseCM already contains most of the capitalization information and that simply retraining with data from a target period is sufficient. The last approach assumes that all corpora periods provide important capitalization information and contribute to a better final model. Table 4.11 shows the final capitalization results for each approach. Concerning the manual transcript, all the proposed approaches yield better results than the baseline, and the best result is produced using the third approach (lines 3 and 4), which combines the BaseCM with the LMnews information. Concerning the automatic transcripts, despite achieving only small improvements, the third approach also proves to be the best, especially for the D_ASR transcripts, currently in use. Results also show that the LMnews information alone is sufficient to beat the baseline, revealing the importance of training data from periods closer to the testing data. The table also shows that results are not further improved by using daily CMs, which correspond to retraining the 2008-05-20 capitalization model with the latest 2M words prior to the testing data (5 daily models were used), suggesting that a periodic retraining is suitable for this task.
Figure 4.13: Manual transcription results, using all approaches.

Figure 4.13 illustrates the results achieved for the manual transcription, using different capitalization models and all the different approaches. All the approaches depict clear trend lines. However, the capitalization models produced by the third approach are more stable, achieving the best results after a certain period of time.
4.4.3 Conclusions
Of the three different approaches for capitalization proposed and evaluated here, the most promising one consists of iteratively retraining a baseline model with newly available data, using small corpora subsets. When dealing with the manual transcription, the performance improves by about 1.6%. Results reveal that producing capitalization models on a daily basis does not lead to a significant improvement. Therefore, the adaptation of capitalization models on a periodic basis is the best choice. The small improvements gained in terms of capitalization suggest that dynamically updated models may play a small role, but the updating does not need to be done daily, a fact that also agrees with our intuition. One possible way of choosing the updating interval could be to compare the frequency of emerging words.
4.5 Recent Work on Capitalization
The experiments in this thesis have been conducted over several years, with data and software under development. For that reason, comparing results obtained in different time periods is a difficult task. Nowadays, the speech recognition system performance has improved, a number of third-party tools have been corrected or improved, more data has become available, and some of the old data has been revised. For example, we now use a revised version of the ALERT-SR speech corpus, where a large number of inconsistencies were corrected, as described in Section 3.1.1.1. Finally, the significant increase in computational memory resources during these years makes it possible to perform more complex experiments, using more data and a larger number of retraining iterations.
These facts led to a new set of experiments, which should support more accurate conclusions, given the improved conditions described above. The following subsections present the most recent results achieved using a ME-based approach, and compare this approach with the most recent results achieved using HMMs and Conditional Random Fields (CRFs). Besides the automatic transcripts, we now also perform experiments with force-aligned transcriptions, produced by our speech recognition system. The most recent experiments on capitalization consider all four capitalization classes for a given word: lowercase, first-capitalized, uppercase and mixed-case.
4.5.1 Capitalization Results using a ME-based Approach
These experiments use the most recent version of the MegaM tool for training the models.
A number of additional options were introduced and a number of bugs have been corrected
since the first version of the tool that had been used. The PUBnews corpus has again been used
for training the capitalization models, where the original texts were re-normalized and all the
punctuation marks removed. The normalization tool has been recently revised and improved,
which means that the content is not exactly the same as in the early experiments. The evaluation
data includes only the Eval and JEval portions of the ALERT-SR corpus, since the other two
evaluation subsets had still not been completely revised by the time these experiments started.
                      Written corpora model only            After retraining with transcripts
Evaluation data       Prec.    Recall   F       SER          Prec.    Recall   F       SER
Written corpora       95.1%    85.3%    89.9%   18.8%        -        -        -       -
Manual transcripts    94.8%    88.0%    91.3%   16.5%        95.4%    88.6%    91.9%   15.4%
ASR transcripts       82.7%    81.7%    82.2%   34.9%        83.3%    82.2%    82.7%   33.9%

Table 4.12: Recent ME-based capitalization results for Portuguese.
The retraining approach described in Section 2.3 was followed, with subsets of two million
words each. Sometimes the algorithm implemented in the MegaM tool assumes that the optimization has converged before it actually has, so the optimization was now forced to repeat several times. Each epoch was retrained three times, using 200 iterations. For performance reasons, each capitalization model was limited to 5 million weights. The following features have been used for a given word w in position i of the corpus: wi, 2wi−1, 2wi, 3wi−2, 3wi−1, 3wi, where wi is the current word, wi+1 is the word that follows, and nwi±x is the n-gram of words that starts x positions after or before position i. Table 4.12 shows the corresponding results, where the left side of the table refers to the capitalization model built from the written corpora, while the right side refers to the model obtained after retraining with the ALERT-SR training data. These results correspond to substantial improvements over the results from Section 4.2.3. The reasons for this improvement include software bug corrections and an increased number of training iterations. Furthermore, results concerning written corpora reflect the improved normalization, and results concerning transcripts reflect the revised version of the reference data. The difference between the manual and automatic transcripts expresses the speech recognition WER, which is about 18.5% for the evaluation speech data in use. Results achieved after retraining the written corpora model with the speech training data are about 1% better, and for that reason one can conclude that it is always important to include in the training data that is similar in style to the testing data, even if it is only a small portion.
One final experiment concerning the ME results consisted of combining the predictions of each of the 70 intermediate models, trained from each one of the PUBnews training subsets. Results were 19.6% SER for written corpora, 16.2% SER for manual transcripts, and 34.8% SER for automatic transcripts. Written corpora results were no better than the results achieved using only the last trained capitalization model, which suggests that this model includes most of the information from the previous capitalization models. Furthermore, as it was trained with data closer to the written corpora test set, it is more suitable for capitalizing that data. The gains over speech transcripts are only residual and can be explained by the fact that the transcript testing data period lies in the middle of the training data period.
Evaluation data       Precision   Recall   F-measure   SER
Written corpora       94.4        90.6     92.5        14.4
Manual transcripts    84.8        91.4     87.9        24.7
ASR transcripts       69.3        85.9     76.7        51.5

Table 4.13: Recent HMM-based capitalization results for Portuguese.
4.5.2 Capitalization Results using an HMM-based Approach
An HMM-based approach requires a limited vocabulary, but the disambig tool from the SRILM toolkit (Stolcke, 2002) facilitates the use of an unrestricted vocabulary, by automatically including new words in its internal vocabulary. The most recent experiments using the HMM-based tagger implemented by the disambig tool use the same data sets as in the previous subsection. Trigram language models have been used, which were created using backoff
estimates, as implemented by the ngram-count tool from the same toolkit, without n-gram discounts.
Table 4.13 shows results achieved using the same training and evaluation data previously
used in Table 4.12. As a first result, the ME approach achieves a better precision, while the
HMM-based approach achieves a better recall. Results indicate that the HMM-based approach
is better for written corpora, while the ME approach is significantly better for speech transcripts. Several reasons may explain this fact: i) the information expressivity is not the same in both methods: while the HMM-based approach uses all the context of a word, the features used in the ME-based approach may not express that complete context, e.g., the ME experiments described here do not use the information concerning the previous word (wi−1) as an isolated feature, while that information is available in the 3-gram LM used by the HMM-based approach; ii) the ME-based approach is not as influenced by the context as the HMM-based approach, which is quite important when dealing with speech units that may be, as stated in Section 1, flexible, elliptic, and even incomplete; iii) the restricted training conditions used for limiting
computational resources. Finally, the WER impact is bigger for the HMM-based approach, because different words may cause completely different search paths. This same conclusion was
also reached in Section 4.2.
4.5.3 Capitalization Results using Conditional Random Fields
An ME model classifies a single observation into one of a set of discrete classes. A maximum entropy Markov model (MEMM) (McCallum et al., 2000) is an augmentation of the basic ME classifier so that it can be applied to an entire sequence, assigning a class to each element in the sequence, just as is done with an HMM. However, while an HMM is a generative model that optimizes the likelihood P(W|T) and combines it with the prior P(T) to calculate the posterior probability according to Bayes' rule, an MEMM computes the posterior P(T|W) directly (Jurafsky and Martin, 2009).
Training      without the output bigram                with the output bigram
data          Precision   Recall   F       SER         Precision   Recall   F       SER
Year 1999     86.9%       86.5%    86.7%   24.8%       92.3%       82.1%    86.9%   24.1%
Year 2000     88.1%       87.1%    87.6%   23.1%       92.7%       82.8%    87.5%   23.1%
Year 2001     88.7%       87.0%    87.9%   22.5%       93.1%       83.5%    88.0%   22.1%
Year 2002     88.6%       87.6%    88.1%   22.3%       93.1%       83.8%    88.2%   21.9%
Year 2003     88.6%       87.9%    88.2%   22.1%       93.4%       84.2%    88.5%   21.2%
Year 2004     88.5%       88.7%    88.6%   21.6%       93.5%       85.0%    89.0%   20.4%
ALL           95.2%       86.1%    90.4%   17.8%       94.1%       83.7%    88.6%   21.0%

Table 4.14: ME and CRF capitalization results for the PUBnews test set.
Training      without the output bigram                with the output bigram
data          Precision   Recall   F       SER         Precision   Recall   F       SER
Year 1999     86.6%       88.8%    87.7%   24.2%       92.8%       85.2%    88.8%   21.1%
Year 2000     87.3%       90.6%    88.9%   21.9%       93.1%       87.3%    90.1%   18.9%
Year 2001     88.8%       90.8%    89.8%   20.1%       93.8%       89.0%    91.3%   16.7%
Year 2002     88.2%       90.4%    89.2%   21.2%       93.8%       87.9%    90.7%   17.7%
Year 2003     88.4%       90.2%    89.3%   21.0%       94.1%       87.0%    90.4%   18.1%
Year 2004     87.2%       89.8%    88.5%   22.7%       93.7%       86.4%    89.9%   19.1%
ALL           94.6%       89.3%    91.9%   15.5%       94.5%       87.7%    91.0%   17.1%

Table 4.15: ME and CRF capitalization results for the force-aligned transcripts test set.
The conditional random field (CRF) (Lafferty et al., 2001), being also a discriminative sequence model, augments the MEMM. One advantage of CRF models is that they support rich features, like ME, while also accounting for label dependency, like HMMs. The other advantage is that they perform global normalization, eliminating the label bias problem.
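The effect of the output bigram can be illustrated with the following sketch, which computes the unnormalized score of a label sequence as the sum of per-token (ME-style) scores plus label-transition scores; with an empty transition table, the model degenerates into independent per-token classification. This is only a didactic sketch under toy assumptions: normalization, training, and Viterbi decoding, all handled internally by CRF++, are omitted.

```python
def sequence_score(labels, node_scores, transition):
    """Unnormalized score of a label sequence.

    node_scores -- list (one per position) of dicts: label -> feature score,
                   i.e., the per-token ME part of the model
    transition  -- dict: (previous label, label) -> weight, i.e., the output
                   bigram; with an empty dict this reduces to independent
                   per-token ME scoring
    """
    score = sum(node_scores[i][lab] for i, lab in enumerate(labels))
    score += sum(transition.get((labels[i - 1], labels[i]), 0.0)
                 for i in range(1, len(labels)))
    return score

# Toy example: two tokens, classes LOWER / FIRST_CAP (hypothetical scores).
nodes = [{"LOWER": 1.0, "FIRST_CAP": 0.9}, {"LOWER": 0.2, "FIRST_CAP": 0.3}]
trans = {("FIRST_CAP", "FIRST_CAP"): 0.5}   # capitalized words tend to occur in runs
print(sequence_score(["FIRST_CAP", "FIRST_CAP"], nodes, trans))  # 0.9 + 0.3 + 0.5 = 1.7
print(sequence_score(["LOWER", "LOWER"], nodes, trans))          # 1.0 + 0.2 = 1.2
```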
Training      without the output bigram                with the output bigram
data          Precision   Recall   F       SER         Precision   Recall   F       SER
Year 1999     77.4%       84.1%    80.6%   39.9%       81.6%       82.8%    82.2%   35.5%
Year 2000     78.0%       85.8%    81.7%   37.8%       82.2%       84.8%    83.5%   33.3%
Year 2001     79.1%       86.3%    82.6%   36.1%       82.1%       85.5%    83.8%   32.8%
Year 2002     78.7%       85.8%    82.1%   37.0%       82.3%       84.7%    83.5%   33.1%
Year 2003     79.5%       85.8%    82.5%   35.9%       83.1%       84.6%    83.8%   32.4%
Year 2004     78.3%       85.1%    81.6%   38.1%       82.9%       83.7%    83.3%   33.2%
ALL           78.3%       85.1%    81.6%   38.1%       78.3%       85.1%    81.6%   38.1%

Table 4.16: ME and CRF capitalization results for the automatic speech transcripts test set.

A number of experiments have been performed in order to assess the interest of sequence modeling for capitalization. The CRF++ tool (http://crfpp.sourceforge.net/), an open-source implementation that performs training based on L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno), has been used for training the capitalization models. Considering the memory resources available at the time these experiments were conducted (3 machines with 24GB of memory), it was not possible to use the entire training data at the same time. Therefore, six different
experiments have been conducted, each one using one year of training data. The data and the set of features are the same as previously used in this section. All experiments were performed with and without the output bigram (results without the output bigram correspond to using ME only). This way, the importance of using CRFs can be measured. Tables 4.14, 4.15 and 4.16 show results for written corpora, force-aligned transcripts, and automatic transcripts, respectively. The last row (ALL) shows the results of combining the predictions of all the previous models, similarly to what was reported in Section 4.5.1. The best results are consistently achieved when the output bigram is enabled, due to the significant increase in precision. The best recall values are still achieved without the output bigram. The conclusion is that, given the rich feature set, the label dependency helps in all scenarios, supporting the idea that the capitalization of a word tends to be connected with the capitalization of the surrounding words.
Once again, it is interesting to notice a tendency towards better results when the training data period is closer to the testing data period, which supports the conclusion also reached in Section 4.3. This is particularly clear when dealing with written corpora and force-aligned transcripts.
4.5.4 Analysis of Feature Contribution
Consider that the number of feature weights in a capitalization model must be limited,
for example, for performance reasons or limitations on computational resources; which, then,
are the features that should be pruned? This issue may be of importance for a module that
performs capitalization on-the-fly. One possible answer consists of sorting the capitalization
model by the most discriminant features and then selecting the first k feature weights. We
have tested this process on the ME capitalization model built from the PUBnews training data
and used in Section 4.5.1, which contains 5M feature weights and has a size of 592 Mbytes.
The capitalization model was sorted by the standard deviation of the weights of each feature,
putting the most discriminant features at the top. The capitalization model was then pruned
with different sizes, and the proportion of each feature in each resulting capitalization model
was calculated. Figure 4.14 shows the obtained results, where nwi corresponds to the n-gram starting at position i relative to the current word; for example, 2wi−1 corresponds to the bigram <wi−1, wi>. As expected, the figure shows that the most discriminant capitalization feature for a given word is the word itself, followed by the bigram covering the previous word (2wi−1). The graph also shows that features involving previous words are more discriminant, and therefore more important, than features involving the following words.
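A minimal sketch of this pruning step is shown below; the weight-matrix representation (one weight per label for each feature) and the toy feature names are illustrative assumptions, not the actual model format.

import numpy as np

def prune_by_std(weights: dict, k: int) -> dict:
    # weights maps a feature name to its vector of per-label weights;
    # features whose weights vary most across labels are the most discriminant.
    ranked = sorted(weights.items(), key=lambda item: item[1].std(), reverse=True)
    return dict(ranked[:k])

# Toy model: three features, four capitalization labels each.
model = {
    "w_i=lisboa":       np.array([-2.1, 3.0, -0.4, -0.5]),
    "w_i=de":           np.array([0.2, -0.1, 0.0, -0.1]),
    "2w_i-1=em lisboa": np.array([-1.0, 1.5, -0.2, -0.3]),
}
print(list(prune_by_std(model, k=2)))   # keeps the two most discriminant features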
4.5.5 Error Analysis and General Problems
The capitalization approaches described so far are language independent, but the capitalization performance can be further improved by considering language specific problems and
by performing choices based on natural language knowledge. This subsection analyses and
discusses a number of problems posed by our pure machine learning-based strategy for the
Portuguese language.
Machine learning methods depend on the training data, which is sometimes from a specific domain, and may not contain specific phenomena found in the evaluation data. For example, large written newspaper corpora provide useful capitalization information for building a
generic capitalization model; even so, frequent words found in speech transcripts rarely appear
in the newspaper data. For example, the verbal form “(tu) queres/ (you) want” is rarely found
in written newspaper data, while it is frequent in dialogs or in spontaneous speech. In this
specific example, as verbs are always written lowercase, they do not pose significant problems,
since we considered that, when no information concerning a word exists, the word is kept
lowercase. Verbs with enclitic pronouns are easy to detect and are always written lowercase;
for that reason, future experiments will consider an additional feature for identifying such
words.
Our evaluation considers an absolute true form for each word in the original reference data.
4. By the time these experiments were conducted, 3 machines with 24GB of memory were available.
However, differences between the original reference and the automatic capitalization are not
always capitalization errors. For example, whereas the original reference contains “rádio Renascença”, it could contain “Rádio Renascença” instead, which is most of the time preferred by
the capitalizer. A number of errors are still produced by the capitalizer. For example, movie titles like “A Esperança Está Onde Menos Se Espera” are frequently badly capitalized, given that
they rarely appear in the training corpus and most of their words are usually written lowercase.
Unusual organization names are also frequently badly capitalized. These conclusions
are similar to the qualitative analysis reported by Lita et al. (2003).
Media organizations sometimes bring into focus names of rarely mentioned people and
places, which are then frequently used for a given time period. Because such proper nouns
must be capitalized, this may constitute a problem if the capitalization model is not updated
on a periodic basis. Concerning this subject, we have applied a number of regular expressions
for detecting letter sequences that cannot occur in Portuguese (Meinedo et al., 2010). Then
we have conducted experiments where all the detected words, supposedly “foreign” words,
were capitalized. The capitalization performance has increased with this process. Moreover,
this post-processing strategy makes it possible to capitalize words that never occurred before in the
training corpus.
4.6 Extension to other Languages
This section describes a number of capitalization experiments that have also been performed for other languages. Many experiments were performed both with Spanish and English data, but this section will focus on the English language, thus avoiding reporting similar conclusions several times. Nevertheless, whenever important, specific results on the Spanish data will also be mentioned.
The English BN corpus combines different English BN corpora subsets, as described in
Section 3.1.3. The written corpus corresponds to the LDC corpus LDC1998T30, described in
Section 3.2.3. For these experiments, however, only the NYT (New York Times) portion of the
corpus was used. The data has been collected from January 1997 to April 1998 and contains
about 213 million words, after cleaning the corpus and removing problematic text (unknown
characters, etc.). About 211 million words were used for training, 574 thousand for development, and 1.2 million for evaluation. The original texts were normalized and all punctuation
marks were removed, making them close to speech transcriptions. For the experiments here
described, only data previous to the evaluation data period was used for training.
Figure 4.15: Vocabulary coverage on written newspaper corpora.
4.6.1 Analysis of the language variations over time
We started by analyzing the newspaper corpus to establish a relation between vocabulary usage and the time-line, as we previously did for Portuguese. The English corpus was split into several subsets, each containing about 8 million words. Each subset, containing about 88K unique words, was named after the month corresponding to the earliest data in that subset. In order to assess the relation between word usage and language period, several vocabularies have again been created with the 50K most frequent words appearing in each training subset (roughly corresponding to a frequency greater than two). Then, the coverage of each
one of these vocabularies was checked against one of the subsets. The chosen subset contains
data from August 1997, and is located in the middle of the corpus time span. Figure 4.15 shows
the results, where each vocabulary was named with its corresponding corpora subset. The best
coverage is, as expected, achieved with the vocabulary built from the testing subset. The more
important result, however, is that the number of OOVs (Out of Vocabulary Words) decreases
as the time gap between the vocabulary and the testing period gets smaller.
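The coverage analysis can be summarised by the following sketch, assuming simple whitespace tokenisation and in-memory token lists (the real experiments work over subsets of roughly 8 million words).

from collections import Counter

def build_vocabulary(tokens, size=50_000):
    # Keep the `size` most frequent word types of a training subset.
    return {w for w, _ in Counter(tokens).most_common(size)}

def oov_rate(test_tokens, vocabulary):
    # Proportion of test tokens not covered by the vocabulary.
    return sum(1 for w in test_tokens if w not in vocabulary) / len(test_tokens)

train_tokens = "the president visited lisbon yesterday the president said".split()
test_tokens = "the president visited porto today".split()
vocab = build_vocabulary(train_tokens)
print(f"OOV rate: {oov_rate(test_tokens, vocab):.1%}")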
The previous experiment was also performed on manual and automatic speech transcripts,
by selecting a piece of speech data from the BN corpus. Most of the English BN corpora from
Table 3.5 do not have a reference to their corresponding collection time period, especially the
evaluation subsets. Therefore, the coverage of each one of the previous 23 vocabularies was
tested against a subset from the LDC1998T28 corpus, corresponding to January 1998. Again,
the number of words appearing in the test set of the speech corpus that were not covered by
each one of the vocabularies was counted. Figure 4.16 shows the corresponding results. The
graph shows that the vocabulary coverage is better for vocabularies built from data collected
near the testing data period. This same relation between the vocabulary usage and the time-line has been previously established in Section 4.3. These results confirm that different lexical
information is used in different language periods. This subject is further addressed in the
Figure 4.16: Vocabulary coverage for Broadcast News speech transcripts.
next section, where several capitalization experiments, with data collected from different time
periods, are presented to show how this affects the capitalization task.
It would be interesting to complete this work by comparing the same period of time in both Portuguese and English, to measure the impact of new named entities, e.g., at the beginning of the Iraq war or during U.S. presidential elections. That would depict the timeline effects
of the same event on both languages. Unfortunately, we do not have data suitable to perform
such experiments.
4.6.2 Results
As in the previous sections, the capitalization experiments described here consider the four ways of writing a word: lowercase, first-capitalized, uppercase, and mixed-case. The capitalization models were trained with the newspaper corpora. The original texts were normalized
and all the punctuation marks removed, making them closer to speech transcriptions. The retraining approach described in Section 2.3 was followed, with subsets of two million words
each. Each epoch was retrained three times, using 200 iterations. For performance reasons,
each capitalization model was limited to 5 million weights. The following features were used
for a given word w in the position i of the corpus: wi , 2wi−1 , 2wi , 3wi−2 , 3wi−1 , where wi is
the current word, wi+1 is the word that follows and nwi± x is the n-gram of words that starts x
positions after or before position i.
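A small sketch of these n-gram features is given below; the dictionary representation and the feature names are illustrative assumptions, not the thesis code.

def capitalization_features(words, i):
    # n-gram features for the word at position i: w_i, 2w_i-1, 2w_i, 3w_i-2, 3w_i-1.
    def ngram(start, n):
        return " ".join(words[start:start + n]) if 0 <= start and start + n <= len(words) else None

    feats = {
        "w_i":    ngram(i, 1),       # the word itself
        "2w_i-1": ngram(i - 1, 2),   # bigram ending at the current word
        "2w_i":   ngram(i, 2),       # bigram starting at the current word
        "3w_i-2": ngram(i - 2, 3),   # trigram ending at the current word
        "3w_i-1": ngram(i - 1, 3),   # trigram centred on the current word
    }
    return {name: value for name, value in feats.items() if value is not None}

print(capitalization_features("os centros de emprego em portugal".split(), 5))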
In order to assess the impact on the capitalization task of the language variations in time,
two different strategies were used for training, based on the data period. The first capitalization
models were trained by starting with the oldest data available and by retraining each epoch
with more recent data. The second capitalization models were trained backwards, using the
newest data first and retraining each epoch with data older than the one used in the previous
Figure 4.17: Forward and Backwards training results over written corpora.
epoch. Each capitalization model was applied to the newspaper corpora evaluation subset,
and results are shown in Figure 4.17. While the models trained with the forward strategy
consistently increase the performance on the evaluation set, the performance of the models
produced with the backwards strategy does not increase after a certain number of epochs and
it even decreases. Although both experiments use the same training data, the best result is
achieved with the last model created using the forward strategy, because the latest training
data time period was closest to the evaluation time period. The small performance difference between the forward and backwards strategies is related to the relatively small period of time, less than one and a half years of data, covered by the English written corpus. During such a small period of time, the vocabulary changes are quite limited. Notice, however, that both
results were achieved using exactly the same data, justifying the preference for the forward
strategy.
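The two training orders can be sketched as follows, using scikit-learn's SGD-based logistic regression with partial_fit as a stand-in for the ME retraining approach of Section 2.3; the data layout and feature names are assumptions for illustration only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

def retrain(subsets, forward=True):
    # subsets: list of (feature_dicts, labels) pairs ordered from oldest to newest;
    # each subset plays the role of one retraining epoch.
    ordered = subsets if forward else list(reversed(subsets))
    vec = DictVectorizer().fit([x for X, _ in subsets for x in X])  # shared feature space
    classes = sorted({label for _, y in subsets for label in y})
    clf = SGDClassifier(loss="log_loss")        # logistic-regression (ME-like) model
    for X, y in ordered:
        clf.partial_fit(vec.transform(X), y, classes=classes)
    return vec, clf

# Toy "yearly" subsets of (features, capitalization label) pairs.
year1 = ([{"w_i": "lisboa"}, {"w_i": "de"}], ["first-cap", "lower"])
year2 = ([{"w_i": "porto"}, {"w_i": "em"}], ["first-cap", "lower"])
vec, forward_model = retrain([year1, year2], forward=True)
vec, backward_model = retrain([year1, year2], forward=False)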
Each one of the capitalization models previously created using the forward strategy has also been used for restoring the capitalization of BN speech transcripts. The evaluation was conducted over data collected during January 1998, extracted from the LDC1998T28
corpus, and corresponding to about 100k words. Figure 4.18 shows the results for manual and
automatic transcripts, again revealing that the best models are precisely the ones closer to the
evaluation data period. In fact, the best model is the one built from data of the same period,
despite the training and evaluation data being from different sources. As expected, manual
transcripts achieve the best performance. The performance differences between manual and
automatic transcripts reflect the WER impact. Another important conclusion arising from the
two previous charts is that the amount of data is an important performance factor. In fact, our
results show that the performance increases consistently as more data is provided.
In order to compare this method with other methods, we have also performed the
Figure 4.18: Forward training results over spoken transcripts.
Evaluation data              Approach   Precision   Recall   F-measure   SER
Written corpora              ME         96.2        81.6     88.3        20.8
                             HMM        94.9        88.5     91.6        15.3
Manual transcripts           ME         94.3        82.4     88.0        22.2
                             HMM        91.9        84.9     88.2        22.2
ASR transcripts              ME         83.9        73.1     78.1        40.4
                             HMM        77.8        75.3     76.5        45.5
Table 4.17: Comparing two approaches for capitalizing the English Language.
capitalization using an HMM-based tagger, as implemented by the disambig tool from the SRILM
toolkit (Stolcke, 2002). This generative approach makes use of trigram language models, created using backoff estimates, as implemented by the ngram-count tool from the same toolkit,
without n-gram discounts.
Table 4.17 shows results achieved with both methods, using the same training and evaluation sets for English. The table shows that the HMM-based approach produces better results
for written corpora, while the ME approach works better with the speech transcripts, confirming the observations made in Section 4.5 for the Portuguese language. Two main reasons explain the better values of the HMM-based approach for the written corpora: i) the restricted training conditions used to limit computational resources; ii) the information expressivity not being the same in both methods: while the HMM-based approach uses all the context
of a word, the features used in the ME approach may not express that complete context. For
example, ME experiments here described do not use the information concerning the previous
word (wi−1 ) as an isolated feature, while that information is available in the 3-gram LM used
by the HMM-based approach. The impact of the recognition errors is bigger when the HMM-
based approach is used, because different words may cause completely different search paths.
Finally, it is interesting to notice that the ME approach achieves the best precision values, while the HMM-based approach achieves a better recall.
Comparing these results with those presented in Tables 4.12 and 4.13, one can verify as
well that the capitalization task performs better for the Portuguese language, given the corpora
sets in use. A possible explanation for the bigger difference between the methods in the Portuguese speech data may be related with the proportion of spontaneous/prepared speech in
both corpora. We know that Portuguese transcripts contain a high percentage of spontaneous
speech (35%), much higher than our data for the Spanish BN (11%), but, unfortunately, this
information is not available for the English data. Nevertheless, at this point we do not have a
conclusive answer for this difference.
These results are difficult to compare to other related work, mainly because of the different
evaluation sets, but also because of the different evaluation metrics and applied criteria. For example, sometimes it is not clear whether the evaluation takes into consideration the first word
of each sentence. However, these results are consistent with the work reported by Gravano
et al. (2009), which achieves 88.5% F-measure (89% prec., 88% recall) on written corpora (Wall
Street Journal), and 83% F-measure (83% prec., 83% recall) on manual transcripts.
4.7 Summary
This chapter described the set of experiments performed in the scope of the automatic capitalization task, for both written corpora and speech transcripts. Section 4.2 compared the ME-based approach with two generative approaches, concluding that the generative methods produce better results for written corpora, while the ME approach works better with the speech
transcripts. The impact of the recognition errors is stronger when generative approaches are
used, because different words may cause completely different search paths.
Section 4.3 analysed the impact of the language variations in the capitalization task. Maximum entropy models proved to be suitable for performing the capitalization task, especially when dealing with language dynamics. This approach provides a clean framework for learning with
new data, while slowly discarding unused data. It also enables the combination of different
data sources and exploration of different features. The analysis has been performed with BN
data, automatically produced by a speech recognition system. In fact, subtitling of BN has led to using a baseline vocabulary of 100K words combined with a daily modification of the vocabulary (Martins et al., 2007b) and a re-estimation of the language model. This dynamic
vocabulary proved to be an interesting scenario for these experiments. In terms of language
variation, results suggest that different capitalization models should be used for different time
periods. Capitalization results for broadcast news transcriptions have been presented. The
performance evolution was analyzed for three test subsets taken from different time periods.
Capitalization results of manual and automatic transcriptions have been presented, revealing
the impact of the recognition errors on this task. For both types of transcription, the capitalization results show evidence that the performance is affected by the temporal distance between
training and testing sets.
Section 4.4 described the work on updating the capitalization module. Three different
approaches were proposed and evaluated, and the most promising approach consists of iteratively retraining a baseline model with the new available data, using small corpora subsets.
When dealing with manual transcripts the performance grows about 1.6%. Results reveal that
producing capitalization models on a daily basis does not lead to a significant improvement.
Therefore, the adaptation of capitalization models on a periodic basis is the best choice. The
small improvements gained in terms of capitalization lead us to believe that dynamically updated models may play a small role, but the updating does not need to be done daily, a fact that is also in line with intuition. Moreover, the update interval may be chosen dynamically,
according to the frequency of new words appearing in the current data.
Section 4.5 presented the most recent experiments on automatic capitalization, with more
accurate results. Results achieved using the ME-based approach are compared with the most
recent results achieved using HMMs and CRFs. The HMM-based approach turned out to be
better for written corpora, while the ME approach was significantly better for speech transcripts. The WER impact was also bigger for the HMM-based approach, supporting the conclusions previously achieved in Section 4.2. Besides the automatic transcripts, experiments with
force aligned transcriptions were also included. Experiments have shown that CRF (using the
output bigram) outperforms ME (without the output bigram), due to a significant increase in
the precision. The best results, achieved when the output bigram is enabled, support the idea that the capitalization of a word tends to be connected with that of the surrounding words.
Section 4.6 reported experiments on extending this work to other languages. The effect of
language variation over time is again studied for the English and Spanish data, confirming that
the interval between the training and testing periods is relevant for the automatic capitalization
performance. The capitalization task performs better for the Portuguese language, given the
corpora sets in use. The bigger difference between the methods in the Portuguese speech data
may be related with the proportion of spontaneous/prepared speech in both corpora.
5 Punctuation Recovery
This chapter addresses the punctuation task, covering the three most frequent punctuation
marks: full stop, comma, and question mark. Detecting full-stops and commas depends mostly
on a local context, usually two or three words, and corresponds to detecting sentence boundaries. The idea consists of jointly detecting the sentence boundaries and predicting the type of
punctuation mark for each sentence boundary. On the other hand, in Portuguese as in other languages, most interrogative sentences, especially the wh-questions, depend on words that are used at the beginning and/or at the end of the sentence (e.g., quem disse o quê?/who said what?), which means that the sentence boundaries must be previously known. Notice, however, that this is not true for all languages: for example, Chinese interrogatives are marked with a special character at the end of the sentence, which means that, in this case, question marks can be treated just like full-stops and commas. In any case, two separate sub-tasks are distinguished here: the first, using local contexts, for detecting full-stops and commas; and the second for detecting
question marks, using the whole sentence properties as features.
As previously stated in Section 4.5, in the context of the capitalization task, the work described here has been conducted over several years, depending on data and software under development. Therefore, comparing results obtained in different periods is most of the time a difficult task. This motivated a set of new experiments, most of them repeating old ones, which should reflect more accurate conclusions, given the improved conditions in terms of data, software and computational resources. As a result, all the important experiments concerning punctuation were performed again under the most recent conditions, making it now possible to compare all the results. For that reason, all results presented in this chapter reflect the most recent conditions, which consider not only the automatic transcripts, but also force aligned transcripts, produced by the speech recognition system.
Most of the experiments are performed using the maximum entropy-based approach, described in Section 2.3, and also used to perform the capitalization task. This approach is a good
framework for combining the large number of features that can be computed from the speech
transcript and/or from the speech signal itself.
Although the relationship between time effects and punctuation conventions may be considered
interesting, the time effect analysis has been conducted exclusively for the capitalization task,
since named entities are more prone to be influenced by short-time effects than punctuation
Language     Corpus     Tokens   Full-stop   Comma   Q-mark   Excl-mark
Portuguese   PUBnews    150M     3.2%        6.3%    0.11%    0.03%
             Europarl   30M      3.3%        6.8%    0.13%    0.04%
English      WSJ        42M      4.2%        4.7%    0.04%    0.01%
             Europarl   29M      3.7%        4.7%    0.12%    0.03%
Table 5.1: Frequency of each punctuation mark in written newspaper corpora. Wall Street Journal (WSJ) results extracted from Beeferman et al. (1998).
conventions. This has to do with several reasons. Firstly, time effects in punctuation usually
take into account texts from several decades (or even centuries), instead of short periods of
time, like the ones reported in our data. For instance, in 1838, Alexandre Herculano, a famous
Portuguese writer1 , described punctuation conventions used in his time that are considered
ungrammatical in contemporary Portuguese (e.g., a long subject is separated from the predicate by a comma; a restrictive relative clause is separated from the antecedent by a comma) (Duarte, 2000). Secondly, changes in the conventional usages of punctuation marks, reported in
recent years, are mainly associated with semicolon usage – a punctuation mark with residual
frequencies across corpora (BN 0.2%; newspapers 0.7%; and university lectures 0.1%). Thirdly,
punctuation is diverse across corpora from the same period of time. However, that was not a
major issue here, since only BN are being analyzed.
This chapter is structured as follows: Section 5.1 starts by analysing the occurrence of the
different punctuation marks, considering written corpora and speech transcripts, and different
languages. Following an historical perspective, Section 5.2 describes initial sentence segmentation experiments, which accounted only for the two most frequent punctuation marks: full
stop and comma. Section 5.3 reports recent experiments on extending the initial punctuation
model to accommodate prosodic features and to also detect question marks. Section 5.4 reports
the most recent bilingual experiments performed with Portuguese and English, and compares
the punctuation performance in these two languages. Section 5.5 presents some conclusions
concerning the adopted approach and the obtained results.
5.1 Punctuation Analysis
In order to better understand the usage of each punctuation mark, their occurrence was
counted in written newspaper corpora, using PUBnews, the Europarl (Koehn, 2005) corpus –
a multilingual parallel corpus covering 11 languages and extracted from the proceedings of
the European Parliament, and published statistics from WSJ (Wall Street Journal). Results are
shown in Table 5.1, revealing that comma is the most common punctuation mark in written
corpora of both languages, and is even more frequent for Portuguese. The full-stop frequency
1. Alexandre Herculano, Opúsculo V, edição crítica de [critical edition by] J. Custódio and J. M. Garcia. Lisboa, Presença, 1986.
Figure 5.1: Punctuation marks frequency in Europarl.
Broadcast News Transcript    Tokens   Full-stop   Comma   Q-mark   Excl-mark
LDC98T28 (Hub4 English)      854k     5.1%        3.5%    0.29%    0.00%
LDC98T29 (Hub4 Spanish)      350k     4.0%        5.1%    0.14%    0.00%
TVE (Spanish)                221K     4.0%        5.8%    0.15%    0.00%
ALERT-SR (Portuguese)        920k     4.6%        6.8%    0.24%    0.01%
Table 5.2: Frequency of each punctuation mark in broadcast news speech transcriptions.
is lower for Portuguese, suggesting that the Portuguese written language contains longer sentences when compared to English. The question mark turned out to be the third most frequent
punctuation mark, but its frequency is highly dependent on the corpus domain.
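The frequencies in Tables 5.1 and 5.2 amount to simple counting; a minimal sketch follows, assuming the corpus is tokenized with punctuation marks as separate tokens and that frequencies are computed relative to the total number of tokens.

from collections import Counter

MARKS = {".": "full-stop", ",": "comma", "?": "q-mark", "!": "excl-mark"}

def punctuation_frequencies(tokens):
    counts = Counter(t for t in tokens if t in MARKS)
    return {name: counts[mark] / len(tokens) for mark, name in MARKS.items()}

sample = "os centros de emprego , em portugal , continuaram a registar menos inscritos .".split()
print(punctuation_frequencies(sample))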
This study has been extended to all the 11 languages covered by the Europarl corpus. Figure 5.1 presents the corresponding results, revealing that comma is the most frequent punctuation mark for most languages, and achieves one of the highest frequency scores for Portuguese
(6.75%). It also confirms that, from all languages, the Portuguese language contains the lowest
percentage of full stops (3.30% vs. 3.56% for English). All other punctuation marks have shown
lower and similar frequencies for all languages.
The previous study has also been extended to BN transcriptions. Table 5.2 shows the corresponding results, where the revised version of the ALERT-SR corpus, described in Section 3.1.1.1, was used. The most frequent punctuation mark for Portuguese and Spanish is also the comma; however, this is not the case for English, where the full-stop is the most frequent. The Portuguese BN transcripts present the highest frequency of commas, in
concordance with the written corpora. The full-stop frequency is approximately the same for
English and Portuguese BN transcriptions, and about 1% lower for the Spanish language. It is
interesting to observe that the comma is the most frequent punctuation mark in the Portuguese
Figure 5.2: Punctuation marks frequency in the ALERT-SR corpus (old version).
corpora, while the full-stop is the most frequent punctuation mark in English. This is consistent
with the widespread notion that sentences are longer in written Portuguese. The frequency of
other punctuation marks on BN corpora is very low.
Previous analyses confirm that spoken text sentences, corresponding to utterances or SUs, are much shorter than written text sentences, especially for the Portuguese language. Intra-sentence punctuation marks also occur more often in spoken texts, especially in Portuguese.
Despite the clear difference in the usage of punctuation marks in the different languages, it is also important to stress the role of the annotation criteria. Concerning the Portuguese
SR corpus, the first version of this corpus was annotated in different time periods by different
people, with possibly different criteria. It has recently been revised, as explained in Section
3.1.1.1, and differences in terms of punctuation are significant. Figure 5.2 shows the punctuation statistics for the first version of the corpus. While the full-stop has a similar frequency in
each subset, the comma usage differs from the first to the last subsets. This observation suggests that a manual verification of this data would possibly provide more accurate evaluation
results. Figure 5.3 presents the punctuation statistics for the revised version of the same corpus.
5.2 Early Work using Lexical and Acoustic Features
The punctuation task benefits from lexical and acoustic information found in speech transcripts but unavailable in written corpora. Features such as pause duration and pitch contour may be used together with linguistic information in order to provide clues for punctuation
insertion. Experiments described in this section correspond to the initial experiments on this
subject, and use only spoken data for training.
Figure 5.3: Punctuation marks frequency in the ALERT-SR corpus (revised version).
5.2.1 Features
These experiments use real-valued features for expressing information such as word identification, morphological class, pauses, speaker gender and speaker clusters, sometimes combined as bigrams or trigrams. The following features are used for a given word w in position i
of the corpus:
Word: Captures word identification.
Used features: wi , wi+1 , 2wi−2 , 2wi−1 , 2wi , 2wi+1 , 3wi−2 , 3wi−1 , where wi is the current
word (the word preceding the event), wi+1 is the word that follows, and nwi± x is the
n-gram of words that starts x positions after or before the position i.
POS tag: Captures part-of-speech information.
Used features: pi , pi+1 , 2pi−2 , 2pi−1 , 2pi , 2pi+1 , 3pi−2 , 3pi−1 , where pi is the part-of-speech
of the word at position i, and npi± x is the n-gram of part-of-speech of words that starts x
positions after or before the position i.
Speaker change: Captures whenever a new speaker cluster begins.
Used feature: SpeakerChgi+1 , true if wi+1 starts a different speaker cluster.
Gender change: Captures speaker gender changes.
Used feature: GenderChgi+1 , true if speaker gender changes before wi+1 .
Time: Captures time difference between words.
Used feature: TimeGapi+1 , the time lapse between the word wi and the word wi+1 .
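This feature set can be sketched as follows; the token representation (a dict per recognized word with its word form, POS tag, start/end times in milliseconds, speaker cluster and speaker gender) is an assumption for illustration, not the actual implementation.

def punctuation_features(tokens, i):
    # Features for the boundary event after the word at position i.
    words = [t["word"] for t in tokens]
    tags = [t["pos"] for t in tokens]

    def ngram(seq, start, n):
        return " ".join(seq[start:start + n]) if 0 <= start and start + n <= len(seq) else None

    feats = {
        "w_i": ngram(words, i, 1),        "w_i+1": ngram(words, i + 1, 1),
        "2w_i-2": ngram(words, i - 2, 2), "2w_i-1": ngram(words, i - 1, 2),
        "2w_i": ngram(words, i, 2),       "2w_i+1": ngram(words, i + 1, 2),
        "3w_i-2": ngram(words, i - 2, 3), "3w_i-1": ngram(words, i - 1, 3),
        "p_i": ngram(tags, i, 1),         "p_i+1": ngram(tags, i + 1, 1),
        # ...the same bigram/trigram patterns are built over the POS tags as well.
    }
    if i + 1 < len(tokens):
        cur, nxt = tokens[i], tokens[i + 1]
        feats["SpeakerChg_i+1"] = nxt["speaker"] != cur["speaker"]
        feats["GenderChg_i+1"] = nxt["gender"] != cur["gender"]
        feats["TimeGap_i+1"] = nxt["start"] - cur["end"]   # milliseconds; binarized by Eq. (5.1)
    return {name: value for name, value in feats.items() if value is not None}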
Figure 5.4: Converting time gap values into binary features using intervals.
The first two features, involving word information and part-of-speech information, are lexical
features. The word information features were selected according to the performance achieved
by a number of initial experiments for sentence boundary detection. These features are similar
to features reported by Cuendet et al. (2007) and Liu et al. (2006), with the following differences: Cuendet et al. (2007) also use wi−1 , but do not use 2wi−2 , 2wi+1 , 3wi−2 ; Liu et al. (2006)
use 3wi, but do not report the usage of 2wi−2, 2wi+1, 3wi−1. According to Stolcke and Shriberg (1996), part-of-speech information helps to improve the sentence boundary detection performance. The remaining features are not exactly acoustic, but, lacking a better description for this
heterogeneous set, these features will be henceforth designated as acoustic. All but TimeGap are
binary features by nature. The TimeGap is a continuous value that measures the amount of time
between the end of a word and the start of the following word, and must be binarized. To make it binary, TimeGap values have been associated with logarithmic intervals, according to the formula:
I(v) = \mathrm{int}\left(e^{\,\mathrm{int}(\ln(v)\cdot s)/s}\right) + 1 \qquad (5.1)
where v is the time gap value (in milliseconds), int is a function that returns the integer part of a number, and s is a smoothing value. Experiments reported here use s = 5, which turned out to be the best value, given the performance achieved in experiments with different smoothing values. Time gaps smaller than 20ms were not used, and time gaps with durations greater than one second were represented by the feature TimeGap:big. Figure 5.4 illustrates
Background focus                Cor     Ins     Del     Precision   Recall   F       SER
planned, clean       F0         2411    2463    447     49.5%       84.4%    62.4%   101.8%
planned, noise       F40        4441    5073    916     46.7%       82.9%    59.7%   111.8%
all planned          F0, F40    6852    7536    1363    47.6%       83.4%    60.6%   108.3%
spontaneous, clean   F1         855     2435    416     26.0%       67.3%    37.5%   224.3%
spontaneous, noise   F41        2053    5134    832     28.6%       71.2%    40.8%   206.8%
all spontaneous      F1, F41    2908    7569    1248    27.8%       70.0%    39.7%   212.2%
All                             10794   16386   2920    39.7%       78.7%    52.8%   140.8%
Table 5.3: Recovering sentence boundaries over the ASR output, using the APP segmentation.
the defined intervals, using 5 as the smoothing value.
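A minimal sketch of this binarization, following Equation (5.1) with s = 5:

import math

def timegap_feature(v_ms, s=5):
    # Map a time gap (in milliseconds) to a logarithmically spaced interval, Eq. (5.1).
    if v_ms < 20:
        return None                 # gaps below 20ms are not used
    if v_ms > 1000:
        return "TimeGap:big"        # gaps above one second share a single feature
    return "TimeGap:%d" % (int(math.exp(int(math.log(v_ms) * s) / s)) + 1)

print([timegap_feature(v) for v in (15, 30, 120, 450, 1500)])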
The confidence score of each word, given by the ASR module, is used for both Word and POS
unigram-based features. The confidence score for speaker and gender changes is provided by the APP module and is also used. For all other features, a score of 1.0 is used.
5.2.2 Sentence Boundary Detection
The output of a speech recognition system consists of a stream of text, sometimes grouped
into segments purely based on the acoustic properties of the signal. Detecting sentence boundaries over such data is a way of enriching such transcripts with metadata, which serves as a starting point for creating data more adequate for further human and machine processing.
The L2F broadcast news processing system, described in Section 1.2, benefits from correct sentence segmentation for correctly performing some of its tasks, such as subtitling, topic indexation and summarization. The initial experiments carried out in the scope of this thesis
aimed at providing the correct sentence boundaries to the original transcript.
The initial system used the APP (Audio Pre-Processing) segmentation as the only clue for
marking the sentence boundaries. Table 5.3 shows the system performance, considering that
each sentence boundary corresponds to one of the following reference punctuation marks: “.”,
“:”, “;”, “!”, “?” and “...” (all punctuation marks except the comma are being used). The most recent revision of the Portuguese speech corpus (Section 3.1.1.1) has been used. The numbers of Correct (Cor), Inserted (Ins), and Deleted (Del) slots are shown together with the standard evaluation measures, to better clarify the magnitude of the errors. The planned speech results are much better than the spontaneous speech results, but no significant difference occurs from clean to noisy speech. These results are good in terms of recall, but the low precision causes an overall SER above 100%. They provide the baseline for the following experiments, which aim at
automatically detecting the sentence boundaries.
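For reference, the slot-based measures used throughout this chapter can be computed as in the sketch below; the handling of substitutions, which only appear in the multiclass experiments of Section 5.2.4, is an assumption about the exact formula.

def slot_metrics(cor, ins, dele, sub=0):
    # Precision, recall, F-measure and Slot Error Rate from slot counts.
    ref = cor + dele + sub              # slots in the reference
    hyp = cor + ins + sub               # slots in the hypothesis
    precision = cor / hyp
    recall = cor / ref
    f = 2 * precision * recall / (precision + recall)
    ser = (ins + dele + sub) / ref      # SER can exceed 100%
    return precision, recall, f, ser

# Reproduces the "planned, clean" row of Table 5.3: 49.5%, 84.4%, 62.4%, 101.8%.
print(["%.1f%%" % (100 * m) for m in slot_metrics(cor=2411, ins=2463, dele=447)])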
The ME modeling approach, described in Section 2.3, and also used for the capitalization
task, has been adopted for this task as well. However, this is a binary problem, much easier
to perform. The optimization for the following experiments was performed using 10,000
                      Training data: All data              Training data: Planned speech only
Background focus      Prec     Rec      F        SER       Prec     Rec      F        SER
F0                    89.2%    73.4%    80.6%    35.4%     86.2%    77.4%    81.6%    35.0%
F40                   86.3%    71.2%    78.0%    40.1%     83.3%    77.0%    80.1%    38.4%
F0, F40               87.3%    72.0%    78.9%    38.5%     84.3%    77.2%    80.6%    37.2%
F1                    74.0%    69.4%    71.6%    55.0%     67.6%    70.5%    69.0%    63.4%
F41                   79.6%    69.7%    74.3%    48.1%     72.6%    70.4%    71.5%    56.1%
F1, F41               77.8%    69.6%    73.5%    50.2%     71.0%    70.4%    70.7%    58.3%
All                   84.2%    70.8%    76.9%    42.5%     79.8%    74.4%    77.0%    44.4%
Table 5.4: Recovering sentence boundaries in the force aligned data.
                      Training data: All data              Training data: Planned speech only
Background focus      Prec     Rec      F        SER       Prec     Rec      F        SER
F0                    83.2%    64.2%    72.5%    48.7%     80.8%    69.1%    74.5%    47.3%
F40                   77.8%    61.0%    68.4%    56.4%     75.1%    65.6%    70.0%    56.2%
F0, F40               79.7%    62.1%    69.8%    53.7%     77.0%    66.8%    71.5%    53.1%
F1                    59.9%    46.3%    52.3%    84.7%     52.9%    48.9%    50.9%    94.6%
F41                   64.3%    47.9%    54.9%    78.7%     58.4%    51.2%    54.6%    85.3%
F1, F41               62.9%    47.4%    54.1%    80.5%     56.7%    50.5%    53.4%    88.1%
All                   74.4%    57.4%    64.8%    62.3%     70.3%    61.3%    65.5%    64.7%
Table 5.5: Recovering sentence boundaries directly in the ASR output.
iterations. Table 5.4 presents the results achieved for the force aligned transcripts, combining all
features previously described in this section. Results on the left side of the table were obtained from models built from all data, while results on the right side were produced by models trained only with planned speech. Results confirm the expectation that sentence boundaries
are easier to detect in planned speech. While the difference between planned speech and spontaneous speech is significant, results for speech with and without noise are quite similar. Better
results could be expected by reducing the frequency of some phenomena, such as disfluencies, from the training data, but results from the right side of the table do not support this
assumption, since the overall performance decreased. The decreased performance is due to the
reduced training material, corresponding to about 56% of all available training material, and also because removing the spontaneous part of the training corpus caused some spontaneous speech phenomena not to be captured.
Force aligned transcripts do not contain recognition errors; for that reason, Table 5.4 provides the upper-bound estimate for a real system. The second set of experiments is performed
directly on the automatic speech recognition (ASR) output. Table 5.5 shows the corresponding
results, where the training data consists of automatic speech transcriptions. A number of additional experiments, not presented here, revealed that models trained with automatic speech
transcripts are better suited to the ASR output than the corresponding models trained with
force aligned transcripts, despite containing recognition errors. That is due to the fact that
Figure 5.5: Impact of each feature type in the SU detection performance.
training and testing data share the same conditions. The worse results produced by models trained only with planned speech confirm that it is better to use all training data, even if it contains an increased number of recognition errors.
The impact of the recognition errors can be calculated by comparing Tables 5.4 and 5.5.
When all the data is used for training the overall SER impact is 19.8% absolute (62.3%-42.5%),
reflecting the WER (Word Error Rate) of about 19.5% in the evaluation data, for the current
speech recognition system version (Sep. 2010). Results also show a bigger impact for spontaneous speech, where the number of recognition errors is much higher.
5.2.2.1 Feature Contribution Analysis
Previous results were produced by combining all features described previously in this section. The following experiments try to assess the contribution of each individual feature and also of groups of features, allowing us to distinguish the most interesting features and, possibly, to discard features that do not contribute to better results. Figure 5.5 presents the first results, concerning the influence of the lexical and acoustic features, by background focus condition. The best results were consistently produced by the combination of lexical and acoustic features. Nevertheless, results show that lexical features have a smaller impact than acoustic features on the final performance. These conclusions are similar to those reported by Batista et al. (2008b) concerning our previous work, despite a different corpus revision and different versions of most of the tools, including the speech recognizer, being used at that time.
The contribution of each one of the five features, presented in Section 5.2.1, was also studied, and is illustrated in Figure 5.6. The figure shows results when using all but a given feature,
Figure 5.6: Impact of each individual feature in the SU detection performance.
where: no word means results achieved without word information related features; no TimeGap
means results excluding time intervals between words; no POS means results achieved without
part-of-speech related features; no SpeakerChg means results excluding speaker change information; and no GenderChg means results excluding speaker gender change information. Again,
results reveal that the combination of all features consistently produces the best results both for
manual and automatic transcripts. Overall, the biggest contribution comes from the TimeGap
information, except for the spontaneous speech, where the contribution of each feature is not
so clear. TimeGap information becomes more important when dealing with planned speech, suggesting that pauses between words are relevant for sentence boundary detection of planned speech, especially in the presence of recognition errors, where the acoustic information is the most important. The most important conclusion arising from this study is that
all the proposed features contribute to a better sentence boundary detection performance.
5.2.3 Segmentation into Chunk Units, Delimited by Punctuation Marks
The previous subsection considered that a sentence boundary corresponds to a punctuation mark, excluding commas. Most of the literature considers commas as intra-sentence punctuation marks and does not use them for sentence boundary detection. However, after observing the speech recognition output, it was verified that most APP segments correspond to commas. Moreover, when addressing the sentence boundary detection problem, it is not always clear whether or not a comma is used to delimit a SU boundary. For example, Liu et al. (2006) are not clear about this subject; however, the proportion of reported SUs (8%) resembles the proportion of full-stops and commas calculated for English (8.6%) in Table 5.2, which suggests the usage of the comma for delimiting the SUs.
Background focus                Cor     Ins     Del      Precision   Recall   F       SER
planned, clean       F0         3364    1507    2068     69.1%       61.9%    65.3%   65.8%
planned, noise       F40        6760    2735    4710     71.2%       58.9%    64.5%   64.9%
all planned          F0, F40    10124   4242    6778     70.5%       59.9%    64.8%   65.2%
spontaneous, clean   F1         1770    1505    2861     54.0%       38.2%    44.8%   94.3%
spontaneous, noise   F41        3848    3317    5576     53.7%       40.8%    46.4%   94.4%
all spontaneous      F1, F41    5618    4822    8437     53.8%       40.0%    45.9%   94.3%
All                             17276   9831    16847    63.7%       50.6%    56.4%   78.2%
Table 5.6: Recovering chunks over the ASR output, using only the APP segmentation.
                      Training data: All data              Training data: Planned speech only
Background focus      Prec     Rec      F        SER       Prec     Rec      F        SER
F0                    84.3%    74.9%    79.3%    39.0%     83.8%    74.8%    79.1%    39.6%
F40                   86.2%    72.4%    78.7%    39.2%     86.2%    72.5%    78.8%    39.1%
F0, F40               85.6%    73.2%    78.9%    39.2%     85.4%    73.3%    78.9%    39.2%
F1                    79.5%    64.7%    71.4%    52.0%     77.3%    60.3%    67.7%    57.4%
F41                   81.2%    65.1%    72.3%    50.0%     78.8%    59.8%    68.0%    56.3%
F1, F41               80.6%    65.0%    72.0%    50.6%     78.3%    60.0%    67.9%    56.7%
All                   83.5%    69.3%    75.8%    44.3%     82.5%    67.0%    74.0%    47.1%
Table 5.7: Recovering chunk units in the force aligned data.
This section aims at identifying units, also referred to as chunks, that are delimited by any punctuation mark, including commas. Those units may also be useful for tasks like summarization, machine translation, NER, etc. Table 5.6 shows the system performance, considering all the punctuation marks in the reference, namely: “.”, “:”, “;”, “!”, “?”, “...”, “,”, and “-”.
Comparing these results with the results from Table 5.3, one can see that the APP segmentation pinpoints the position of a punctuation mark, but most of the time does not correspond to a sentence boundary. Precision has increased considerably while recall has decreased. That means
that most of the commas are still not identified, but most APP segments correspond to commas.
The challenge proposed in this subsection is very similar to what has been performed in
the previous subsection. The same ME modeling approach is used, as well as the same number
of optimization iterations (10K). The upper-bound estimate is again provided using the force
aligned transcripts. Table 5.7 presents the results combining all features. Similarly to the previous subsection, results on the left side of the table were obtained from models built from all data, while results on the right side were produced by models trained only with planned speech.
Again, planned speech achieves the best performance, the noise impact is insignificant, and
using more data achieves the best results. The corresponding results for the ASR output are shown in Table 5.8, where the training data also consists of automatic speech transcriptions.
The impact of the recognition errors is about 11.7% SER (absolute) and can be calculated by
comparing the two tables. Results again show a bigger difference for spontaneous speech,
where the number of recognition errors is higher.
                      Training data: All data              Training data: Planned speech only
Background focus      Prec     Rec      F        SER       Prec     Rec      F        SER
F0                    81.7%    70.1%    75.5%    45.6%     81.7%    70.2%    75.5%    45.5%
F40                   82.0%    67.1%    73.8%    47.6%     82.4%    66.8%    73.8%    47.5%
F0, F40               81.9%    68.1%    74.4%    47.0%     82.2%    67.9%    74.4%    46.8%
F1                    73.6%    52.9%    61.6%    66.0%     73.0%    49.8%    59.2%    68.7%
F41                   73.0%    52.6%    61.1%    66.9%     72.4%    49.2%    58.6%    69.6%
F1, F41               73.2%    52.7%    61.3%    66.6%     72.6%    49.4%    58.8%    69.3%
All                   78.3%    61.0%    68.5%    56.0%     78.2%    59.3%    67.5%    57.2%
Table 5.8: Recovering chunk units directly in the ASR output.
Despite the corpora being different, and results not being directly comparable, the achieved
results are similar to the state-of-the-art results reported by Liu et al. (2006), concerning sentence boundary detection for English broadcast news. The authors evaluate different modeling approaches, alone and in combination, reporting about 47% to 52% error rate (NIST SU error rate) for manual transcriptions, and 57% to 61% for automatic transcriptions. The experiments reported here achieved about 44.3% SER (2.7% difference) for the manual transcripts and 56% (1% difference) for the automatic transcripts. The authors report about a 9% difference between manual and automatic transcriptions. These experiments reveal a difference of 11.7%, which can be explained by the large percentage of spontaneous speech in the corpus (34%) and by the WER differences. Whereas the WER of the speech recognition system used by Liu et al. (2006) is 11.7%, the WER of our recognition system is above
15.1%. The authors also study the effect of the recognition errors in the SU performance and
they conclude that, for example, a WER increase of 3.3% leads to a 3.2% worse performance for
SU detection.
5.2.3.1 Feature Contribution Analysis
Figure 5.7 shows the results concerning the influence of the lexical and acoustic features,
by background focus condition. Again, the best results were consistently produced by the
combination of lexical and acoustic features. But, in contrast with results achieved in the previous subsection, lexical features have now more impact than acoustic features on the final performance, specially when considering spontaneous speech. The only exception concerns the
planned speech portion of the automatic transcripts, where the recognition errors may have
lowered the performance of the lexical features. Notice that the previous results presented for
sentence boundary detection have shown an opposite result, where the acoustic features turned
out to be the most important.
The contribution of each one of the five features was also studied, and is illustrated in Figure 5.8. The figure shows results when using all but a given feature. Again, results reveal that
the combination of all features consistently produces the best results both for manual and
Figure 5.7: Impact of each feature type in the chunk detection performance.
Figure 5.8: Impact of each individual feature in the chunk detection performance.
Symbol             Replacement
. : ; ! ? ...      full-stop
, -                comma
Table 5.9: Punctuation mark replacements.
automatic transcripts. Overall, the biggest contribution comes from the word information, followed
by the TimeGap information. Again, TimeGap information, corresponding to pauses between
words, is more important for punctuation detection of planned speech. In spontaneous speech,
pauses are likely to be associated with disfluency phenomena, because people tend to construct the sentences while thinking (Clark and Fox Tree, 2002). The importance of pauses is not surprising, taking into account that punctuation was originally used for marking breaths, and that such a function is expected to remain part of its basic usage (Kowal and O’Connell, 2008).
The POS information is the third most important feature, despite the part-of-speech tagger
not having been specially trained for dealing with speech transcripts. Information concerning
speaker change and gender change have shown little impact on the results. One possible explanation is that they have a lower occurrence in the corpus, and therefore a lower impact on the
results. The second possible explanation is that they provide some redundant information, for
example, whenever the gender changes the speaker also changes. A third explanation is that,
most often, speaker changes are accompanied by time gaps, because otherwise the current APP
module does not detect them. Again, all the proposed features have been shown to contribute to a better punctuation performance.
The next section uses the same feature set to achieve results for punctuation mark recovery,
distinguishing between two different sentence boundary markers: full-stop and comma.
5.2.4 Recovering full-stop and comma Simultaneously
The following experiments distinguish between the two most frequent punctuation marks: full-stops and commas, which depend on local features. As only full-stops and commas are being
considered, all the other punctuation marks have been converted into one of these, in accordance with the replacements described in Table 5.9. This task uses the same approach previously used for sentence segmentation, as well as the same feature set. The most significant
difference arises from the fact that instead of facing a binary classification problem, one now
faces a multiclass problem. As mentioned in Section 2.2, comma is one of the most frequent and
unpredictable punctuation marks. Its use is highly dependent on the corpus and there is weak
human agreement on a given annotation. Therefore, a lower performance is expected for this
punctuation mark.
The following experiments use the ME modeling approach, described in Section 2.3. For
multiclass problems, however, the implemented algorithm sometimes considers that the
            full-stop                          comma                              all punctuation
Focus       Prec   Rec    F      SER           Prec   Rec    F      SER           Prec   Rec    F      SER
planned     82.5   77.6   80.0   38.9          64.2   39.0   48.5   82.7          75.1   57.9   65.4   51.6
spontan.    69.9   71.8   70.8   59.2          70.9   39.9   51.1   76.5          70.4   49.5   58.1   62.5
all         78.4   75.3   76.8   45.4          67.5   39.4   49.7   79.6          73.3   54.0   62.2   56.5
Table 5.10: Recovering full-stop and comma in force aligned transcripts.
            full-stop                          comma                              all punctuation
Focus       Prec   Rec    F      SER           Prec   Rec    F      SER           Prec   Rec    F      SER
planned     73.8   70.7   72.2   54.4          61.1   30.0   40.3   89.1          69.3   49.8   58.0   60.7
spontan.    53.4   53.8   53.6   93.1          64.5   27.9   39.0   87.4          59.0   35.6   44.4   78.7
all         67.2   65.3   66.2   66.7          62.8   28.9   39.6   88.2          65.4   43.5   52.2   68.8
Table 5.11: Recovering full-stop and comma in automatic transcripts.
optimization has converged before it actually has, so the optimization has been forced to be repeated several times. Each epoch was retrained 25 times, using 100 iterations.
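A minimal sketch of this multiclass formulation is given below, using scikit-learn's multinomial logistic regression as a stand-in for the ME classifier actually used; the toy events and feature values are assumptions for illustration only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy boundary events (feature dicts as in Section 5.2.1) and their labels.
X = [
    {"w_i": "inscritos", "TimeGap_i+1": 30, "SpeakerChg_i+1": True},
    {"w_i": "emprego", "TimeGap_i+1": 8, "SpeakerChg_i+1": False},
    {"w_i": "portugal", "TimeGap_i+1": 3, "SpeakerChg_i+1": False},
]
y = ["full-stop", "comma", "none"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=100))
model.fit(X, y)
print(model.predict([{"w_i": "disse", "TimeGap_i+1": 25, "SpeakerChg_i+1": True}]))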
The first set of experiments consisted of recovering the punctuation over force aligned transcripts, which provide the upper-bound performance for speech transcripts. Table 5.10 presents
the corresponding results, where each value is a percentage. Results concerning each punctuation mark are presented individually, together with the overall results, which also consider
the number of substitutions between the two punctuation marks. From the results, it is clear
that the full-stop detection is easier than comma detection. However, it is surprising that, while
recovering the full-stop is easier to perform in planned speech, comma is easier to recover in
spontaneous speech. The performance of detecting full-stops in planned speech is about 20%
better than the performance in spontaneous speech, which is significant. The same tendency is not observed for comma detection, where differences are much smaller and the
best results were in fact achieved for spontaneous speech.
The overall results presented in the last four columns of Table 5.10, as well as the values from the first four columns of Table 5.7, concern the performance of detecting all punctuation
marks. However, Table 5.10 also considers the number of substitutions, i.e., mistakes confusing full-stops and commas. In terms of SER performance, results from Table 5.10 are about 9.4%
worse, which reflects not only the Substitutions impact, but also passing from a binary to a
multiclass problem. By considering the number of Substitutions (which corresponds to about
9% of the Correct slots) as correct slots, the performance increases about 9%, but the final SER performance is still about 0.4% worse than the performance from Table 5.7, suggesting that the optimization method converges better when dealing with binary problems. This is also in accordance with the work reported by Matusov et al. (2006), which states that it is much easier to predict
segment boundaries than to predict whether a specific punctuation mark has to be inserted or
not at a given word position in the transcript.
Table 5.11 shows the performance on recovering punctuation marks over automatic transcripts, which is the ultimate challenge of this study. These results concerning automatic transcripts support all the conclusions that have already been presented for the manual transcripts.
The performance of detecting full-stops in planned speech is about 39% better than the performance in spontaneous speech, which is even more significant than the results achieved for
manual transcripts (Table 5.10). The comma detection performance follows the same tendency
observed for manual transcripts, and it does not vary significantly from planned to spontaneous speech. The overall impact of the recognition errors, achieved by comparing manual
to automatic transcripts, is about 12.3% absolute, but it is much greater for full-stop detection,
where such a difference increases to about 21.3%. The difference between planned and spontaneous speech increases from 20% in manual transcripts to 39% in automatic transcripts, which
corroborates that recognition errors have a strong impact on full-stop detection. The comma
detection performance is only about 8.6% worse for automatic transcripts, which is relatively
marginal.
A number of additional experiments have shown that training with all data is better than
training with planned speech only, and that no significant performance differences exist between noisy and clean speech. The same conclusions were achieved for sentence boundary
detection, in Section 5.2.3, and for that reason results are not presented here.
Our results are consistent with other related work. For example, Christensen et al. (2001)
use statistical models of prosodic features for recovering punctuation marks over the Hub-4
broadcast news corpora. Results vary from 41% to 79% SER for full-stop detection, and above
81% for comma detection.
The evaluation presented here, however, may not reflect the real achievements of this work,
and would benefit from a human evaluation (Beeferman et al., 1998). This is also supported by
the analysis on the user annotation agreement, reported in Section 3.1.1.1.
5.2.4.1 Feature Contribution for Punctuation Recovery
Similarly to the analysis on the feature impact, previously performed for SU detection,
the following experiments assess the impact of each feature on the performance of recovering punctuation marks. The first results, concerning the influence of the lexical and acoustic
features are presented in Figure 5.9. Once again, results indicate that the combination of all features leads to significantly better results. Considering the overall results on both punctuation marks, lexical features have more impact than acoustic features, both for manual and automatic transcripts. Nonetheless, separate results for full-stop detection reveal that acoustic features have a stronger impact on recovering this punctuation mark, which is especially noticeable on automatic transcripts.
The impact of each individual feature was also analysed and the corresponding results are presented in Figure 5.10.
[Figure 5.9: Impact of lexical and acoustic features in the punctuation detection. Slot Error Rate for all / full-stop / comma, on manual and automatic transcripts, comparing all features, lexical only, and acoustic only.]
[Figure 5.10: Impact of each individual feature in the punctuation detection performance. Slot Error Rate for all / full-stop / comma, on manual and automatic transcripts, comparing all features with configurations leaving out the word, TimeGap, POS, SpeakerChg, and GenderChg features.]
(!!"#
'!"#
&!"#
%!"#
=>??*9@A2#
BACC4#
$!"#
12345367809(#
:3;<367809(#
()*$!#
$(*$%#
$+*$,#
-!*-&#
-)*%%#
%+*+%#
++*&&#
&)*'(#
'$*,,#
(!!*($(#
($$*(%'#
(%,*('(#
('$*$$(#
$$$*$)!#
$)(*--!#
--(*%!-#
%!%*%,$#
%,-*&!(#
&!$*)-+#
)-&*',)#
','*(!,&#
./0#
!"#
;A2>;B@#
!"#$%&'(&)*(+,$(-+,$.(&/-012/(3$&+0.$1(
Figure 5.11: Relation between the acoustic features and each punctuation mark.
Following the approach used in Figure 5.8 for SU detection, the performance was evaluated when using all but a given feature. Unlike the conclusions previously reached for SU detection, the best results are not achieved by combining all features: in fact, marginal improvements can be obtained by removing the GenderChg feature.
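The leave-one-feature-out procedure behind Figure 5.10 can be sketched as follows; this is a hypothetical outline assuming generic train and evaluate functions, not the actual experimental code.

```python
# Hypothetical sketch of the leave-one-feature-out ablation: train with all
# feature families, then retrain leaving one family out at a time and compare
# the resulting Slot Error Rate (lower is better).
FEATURE_FAMILIES = ["word", "POS", "TimeGap", "SpeakerChg", "GenderChg"]

def ablation(train, evaluate, train_data, test_data):
    """train(features, data) -> model; evaluate(model, data) -> SER in percent."""
    results = {"all": evaluate(train(FEATURE_FAMILIES, train_data), test_data)}
    for left_out in FEATURE_FAMILIES:
        kept = [f for f in FEATURE_FAMILIES if f != left_out]
        results["no " + left_out] = evaluate(train(kept, train_data), test_data)
    return results
```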
The classification probabilities, given by the punctuation model created from the automatic
transcripts, were analysed for each acoustic feature. The corresponding results are illustrated
in Figure 5.11, where the first columns correspond to the binarized TimeGap intervals and the
last two columns correspond to the features SpeakerChg and GenderChg. This figure reveals
the probability of choosing each punctuation mark, given the feature weights provided by the
punctuation model. Results show that each one of these features is likely to be associated with
full-stops. However, for small time gaps, there are increased chances of having a comma or not
having a punctuation mark at all. This also happens for time gaps with about 1 second and
whenever the APP module indicates a speaker or gender change. That helps explain the better performance obtained without the GenderChg feature, even if the gain is only marginal. An important overall conclusion arising from this study is that the largest contribution to detecting full-stops and commas comes from the word information, followed by the TimeGap information. The latter becomes more important for full-stop detection, suggesting that pauses between words are indeed relevant clues for punctuation recovery. Information concerning speaker change and gender change has shown little impact on the results, perhaps because some of that information is also encoded by pause durations.
To conclude this study we have analysed the proportion of pauses that correspond to a
punctuation mark, and the proportion of each punctuation mark that corresponds to pauses.
Considering pauses of at least 20ms and the training and development sets from the speech
data, we have verified that about 40% of all pauses correspond to full-stops, and another 24%
correspond to commas. About 88% of the full-stops correspond to a pause, but only about 38%
of the commas correspond to pauses. From all punctuated locations where a pause did not
occur, about 87% correspond to commas. Kowal and O’Connell (2008) report similar results for
German (95%) and for American English (82%), and use them to support the idea that pauses
are not “the oral equivalent of commas” and that commas do not “signal” pauses. Our results
for Portuguese also substantiate that statement.
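These proportions are simple conditional frequencies over aligned word transitions; the short sketch below, using made-up counts, shows how they can be computed (the data structure and field names are illustrative).

```python
# Hypothetical sketch: each word transition records whether a pause of at
# least 20 ms occurs and which punctuation mark (if any) the reference
# transcript places there.
transitions = [
    {"pause": True,  "punct": "full-stop"},
    {"pause": True,  "punct": None},
    {"pause": False, "punct": "comma"},
    {"pause": True,  "punct": "comma"},
]

pauses = [t for t in transitions if t["pause"]]
full_stops = [t for t in transitions if t["punct"] == "full-stop"]
commas = [t for t in transitions if t["punct"] == "comma"]

# Share of pauses that coincide with a full-stop, and share of each mark
# that coincides with a pause (cf. the percentages reported in the text).
pct_pauses_fullstop = 100.0 * sum(t["punct"] == "full-stop" for t in pauses) / len(pauses)
pct_fullstops_with_pause = 100.0 * sum(t["pause"] for t in full_stops) / len(full_stops)
pct_commas_with_pause = 100.0 * sum(t["pause"] for t in commas) / len(commas)
print(pct_pauses_fullstop, pct_fullstops_with_pause, pct_commas_with_pause)
```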
5.3 Extended Punctuation Module
The performance results presented and analysed in the previous section, concerning full-stop and comma detection, correspond to the first implemented version of the European Portuguese punctuation module, which explored a limited set of features, mostly lexical and acoustic, while targeting low latency and language independence. The aim of the following
experiments is to improve the performance of the punctuation module, first by exploring additional features, namely prosodic; then, by exploring the use of punctuation information that can
be found in large newspaper corpora; finally, by weighting the impact of lexical and prosodic
features on the baseline system when encompassing interrogatives. This study constituted the
first step towards the goal of adding the identification of interrogatives to the punctuation module. It was a joint effort, only possible because it involved a number of researchers from the L2 F laboratory with different backgrounds.
5.3.1 Improving Full Stop and Comma Detection
This section addresses two ways of improving the initial results for full-stop and comma: i)
adding prosodic features, besides the existing lexical, time-based and speaker-based features;
ii) making use of punctuation information that can be found in large written corpora.
5.3.1.1 Introducing Prosodic Information
The first strategy for improving the initial results consisted of adding prosodic features,
besides the existing lexical, time-based and speaker-based features. We do know that there
is no one-to-one mapping between prosody and punctuation (Viana et al., 2003; Kowal and O’Connell, 2008). Silent pauses, for instance, cannot be directly transformed into punctuation
marks for different reasons, e.g., prosodic constraints regarding the weight of a constituent,
speech rate, style, different pragmatic functions, such as emphasis, emotion, on-line planning.
However, prosodic information can be used to improve punctuation detection. For example, Kim and Woodland (2001) conclude that the F-measure can be improved by 19% relative.
Type of Info   | Added features |          full-stop          |            comma            |             All
               |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Baseline       |                | 78.4   75.3   76.8   45.4   | 67.5   39.4   49.7   79.6   | 73.3   54.0   62.2   56.5
Word Based     | f0             | 81.5   77.2   79.3   40.3   | 68.1   42.0   51.9   77.7   | 75.0   56.3   64.3   54.2
               | E              | 78.4   77.5   77.9   43.9   | 67.9   40.1   50.4   78.8   | 73.5   55.3   63.1   55.6
               | f0 + E         | 81.2   78.5   79.8   39.7   | 68.0   44.1   53.5   76.7   | 74.7   58.0   65.3   53.3
Syllables &    | f0             | 79.5   77.2   78.3   42.7   | 68.5   41.7   51.9   77.4   | 74.2   56.1   63.9   54.4
Phones         | E              | 78.3   76.8   77.5   44.5   | 68.3   40.3   50.7   78.4   | 73.6   55.1   63.0   55.5
               | D              | 78.2   76.8   77.5   44.6   | 68.4   41.7   51.9   77.5   | 73.6   56.0   63.6   54.9
               | f0 + E + D     | 78.4   78.2   78.3   43.3   | 69.1   40.8   51.3   77.4   | 74.1   56.0   63.8   54.5
All Combined   |                | 79.8   79.9   79.8   40.4   | 69.6   43.6   53.6   75.5   | 74.9   58.4   65.6   52.7

Table 5.12: Punctuation results over manual transcripts, combining prosodic features.
The feature extraction stage involved several steps, as described in Section 3.4. The first
step consisted of extracting the pitch and the energy from the speech signal. Durations of
phones, words, and interword-pauses have been extracted from the recognizer output. By
combining the pitch values with the phone boundaries, micro-intonation and octave jump effects have been removed from the pitch track. Another important step consisted of marking the
syllable boundaries as well as the syllable stress, as described in Section 3.4.3. Finally, the maximum, minimum, median and slope values for pitch and energy have been calculated in each
word, syllable, and phone. Duration was also calculated for each one of the previous units.
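As a rough illustration of this last step, the sketch below derives the per-unit statistics (maximum, minimum, median and slope of pitch and energy, plus duration) from frame-level tracks; it assumes a 10 ms frame shift and simple array inputs, and is not the actual feature extraction code used in this work.

```python
import numpy as np

FRAME_SHIFT = 0.010  # assumed 10 ms analysis step

def unit_prosodic_stats(pitch_track, energy_track, start_s, end_s):
    """Max/min/median/slope of pitch and energy, plus duration, for one unit
    (word, syllable, or phone) delimited by start/end times in seconds."""
    lo, hi = int(start_s / FRAME_SHIFT), int(end_s / FRAME_SHIFT)
    stats = {"duration": end_s - start_s}
    for name, track in (("f0", pitch_track), ("energy", energy_track)):
        seg = np.asarray(track[lo:hi], dtype=float)
        seg = seg[seg > 0]  # keep voiced / non-zero frames only
        if seg.size < 2:
            continue
        slope = np.polyfit(np.arange(seg.size) * FRAME_SHIFT, seg, 1)[0]
        stats.update({name + "_max": seg.max(), name + "_min": seg.min(),
                      name + "_median": float(np.median(seg)),
                      name + "_slope": float(slope)})
    return stats
```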
As previously mentioned, these experiments aim at analyzing the weight and contribution
of each prosodic feature per se and the impact of their combination. Underlying the feature extraction process is linguistic evidence that pitch contour, boundary tones, energy slopes, and pauses are crucial to delimit sentence-like units across languages. The first experiment aimed at testing whether the features would perform better on different units of analysis: phones, syllables and/or words. The linguistic findings for EP (Viana, 1987; Mata, 1999; Frota, 2000) suggest that the stressed and post-stressed syllables would be relevant units of analysis for automatically identifying punctuation marks. When considering the word as the window of analysis, we are also accounting for the information in the pre-stressed syllables.
Features are calculated for each word transition, with or without a pause, using the same
analysis window as Shriberg et al. (2009). The following features have been used: f0 and energy slopes in the previous and following words, with or without a silent pause; f0 and energy differences between these units; and also the duration of the last syllable and the last phone. With this set of features, the aim is to capture nuclear and boundary tones, amplitude, pitch reset, and final lengthening.
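Given per-word statistics such as the ones above, the transition features can then be assembled roughly as follows; this is a hypothetical sketch of the feature vector around one word boundary, with illustrative key names rather than the exact feature set used in the system.

```python
def transition_features(prev_word, next_word, pause_duration):
    """Prosodic features for one word transition: f0/energy slopes on both
    sides, f0/energy differences across the boundary, pause duration, and
    final-lengthening cues (duration of the last syllable and last phone of
    the previous word). prev_word / next_word are dicts with hypothetical
    keys such as 'f0_slope', 'f0_median', 'last_syllable_duration'."""
    return {
        "pause_duration": pause_duration,
        "f0_slope_before": prev_word.get("f0_slope", 0.0),
        "f0_slope_after": next_word.get("f0_slope", 0.0),
        "f0_diff": next_word.get("f0_median", 0.0) - prev_word.get("f0_median", 0.0),
        "energy_slope_before": prev_word.get("energy_slope", 0.0),
        "energy_slope_after": next_word.get("energy_slope", 0.0),
        "energy_diff": next_word.get("energy_median", 0.0) - prev_word.get("energy_median", 0.0),
        "last_syllable_duration": prev_word.get("last_syllable_duration", 0.0),
        "last_phone_duration": prev_word.get("last_phone_duration", 0.0),
    }
```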
Tables 5.12 and 5.13 show the performance results for full stop and comma recovery, where
each prosodic parameter was analyzed separately. The baseline corresponds to a punctuation
model created using only lexical and acoustic information, and is represented in the first row
of each table. These results represent significant gains relative to the previous results for
Type of Info   | Added features |          full-stop          |            comma            |             All
               |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Baseline       |                | 67.2   65.3   66.2   66.7   | 62.8   28.9   39.6   88.2   | 65.4   43.5   52.2   68.8
Word Based     | f0             | 71.3   67.3   69.2   59.8   | 63.8   31.6   42.2   86.4   | 68.0   45.9   54.8   66.0
               | E              | 66.8   65.7   66.2   67.0   | 62.6   27.8   38.5   88.8   | 65.1   43.1   51.8   69.2
               | f0 + E         | 71.2   67.5   69.3   59.8   | 62.0   34.8   44.6   86.5   | 66.9   47.9   55.8   65.8
Syllables &    | f0             | 67.6   67.3   67.5   64.9   | 64.1   27.5   38.5   87.9   | 66.2   43.5   52.5   68.3
Phones         | E              | 67.2   65.5   66.4   66.5   | 62.1   30.0   40.5   88.3   | 65.0   44.3   52.7   68.6
               | D              | 68.2   64.3   66.2   65.7   | 61.3   31.0   41.2   88.5   | 65.1   44.4   52.8   68.5
               | f0 + E + D     | 67.0   68.4   67.7   65.3   | 63.6   29.3   40.1   87.5   | 65.6   45.0   53.4   67.8
All Combined   |                | 70.9   68.3   69.6   59.8   | 62.6   33.0   43.2   86.7   | 67.2   47.2   55.4   65.8

Table 5.13: Punctuation performance over automatic transcripts, combining prosodic features.
both types of transcripts, and both punctuation marks, ranging from 3% to 7% SER (absolute).
The best results are mainly related to the full stop, where the pitch value of words turned out to be the most relevant prosodic feature. This model was further improved by adding the energy value of words. The syllable- and phone-based features did not constitute a substantial improvement. Moreover, combining words and syllables achieved results similar to using only word-based features. The duration parameter is of interest in EP, since three particular strategies are used at the end of an intonational phrase: epenthetic vowel, elongated segmental material, or elision of post-stressed segmental material. To the best of our knowledge, no quantifiable measures have been reported for the Portuguese language and little has been said about these strategies so far. Thus, not surprisingly, the durational parameter did not add a substantial improvement to our model, although it did contribute to a slightly better result in
the spontaneous speech data. In this specific set of data, there is a tendency to elongate the last
phone or the last syllable of the word in a potential location for a punctuation mark, making
duration an informative cue for this specific context.
Results in this study partially agree with the ones reported in Shriberg et al. (2009), regarding the contribution of each prosodic parameter and also the set of discriminative features
used, where the most important feature turned out to be the f0 slope in the words and between word transitions. These features are language independent; still, language-specific properties in the data are related to different durational patterns at the end of an intonational unit and also to different pitch slopes that may be associated with discourse functions beyond
sentence-form types.
5.3.1.2 Retraining from a Written Corpora Model
Another idea for improving the baseline punctuation results consisted of making use of
the punctuation information that can be found in large written corpora. For that purpose,
first a punctuation model has been trained with the written corpus, and then a new model was
trained with the training transcriptions, bootstrapping from the initial training with newspaper
Type of Info       | Added features |          full-stop          |            comma            |             All
                   |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Lexical + Acoustic |                | 78.3   77.4   77.9   44.0   | 68.1   51.8   58.8   72.4   | 72.9   62.2   67.1   50.6
Word Based         | f0             | 79.6   80.6   80.1   40.0   | 68.4   53.9   60.3   71.0   | 73.6   64.7   68.9   49.0
                   | E              | 76.0   81.3   78.5   44.4   | 70.3   49.4   58.0   71.5   | 73.2   62.3   67.3   50.3
                   | f0 + E         | 81.6   79.4   80.5   38.5   | 68.2   54.6   60.6   70.9   | 74.3   64.7   69.1   48.7
Syllables &        | f0             | 80.1   78.6   79.3   40.9   | 69.2   52.0   59.4   71.1   | 74.4   62.8   68.1   49.4
Phones             | E              | 78.9   78.9   78.9   42.3   | 69.5   51.0   58.8   71.3   | 74.0   62.3   67.7   49.7
                   | D              | 77.2   79.1   78.1   44.3   | 67.8   53.0   59.5   72.2   | 72.2   63.6   67.6   50.5
                   | f0 + E + D     | 78.8   79.8   79.3   41.6   | 69.4   52.1   59.5   70.9   | 73.9   63.4   68.2   49.3
All Combined       |                | 78.6   81.6   80.0   40.7   | 68.2   54.6   60.6   70.8   | 73.1   65.5   69.1   48.9

Table 5.14: Results for manual transcripts, bootstrapping from a written corpora model.
Type of Info       | Added features |          full-stop          |            comma            |             All
                   |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Lexical + Acoustic |                | 67.9   66.7   67.3   64.8   | 62.5   38.1   47.4   84.7   | 65.3   49.6   56.4   65.1
Word Based         | f0             | 71.7   68.1   69.9   58.8   | 62.3   41.5   49.8   83.6   | 66.9   52.2   58.6   63.1
                   | E              | 67.5   68.1   67.8   64.7   | 63.3   37.1   46.7   84.4   | 65.5   49.5   56.4   65.0
                   | f0 + E         | 72.1   68.3   70.1   58.2   | 63.5   39.7   48.8   83.2   | 67.8   51.2   58.3   62.8
Syllables &        | f0             | 69.1   68.3   68.7   62.3   | 62.2   39.1   48.0   84.6   | 65.7   50.9   57.4   64.3
Phones             | E              | 68.4   66.4   67.4   64.3   | 61.1   39.4   47.9   85.7   | 64.8   50.3   56.6   65.4
                   | D              | 68.5   66.2   67.3   64.3   | 62.2   38.9   47.8   84.8   | 65.4   49.8   56.6   64.9
                   | f0 + E + D     | 68.4   69.3   68.8   62.8   | 63.5   37.6   47.3   84.0   | 66.1   50.4   57.2   64.1
All Combined       |                | 71.9   68.6   70.2   58.2   | 61.3   41.9   49.8   84.5   | 66.4   52.7   58.7   63.2

Table 5.15: Results for automatic transcripts, bootstrapping from a written corpora model.
text. The first results were encouraging, which motivated testing this strategy in combination with each prosodic feature. The results obtained in this way are presented in Tables 5.14 and
5.15. They correspond to the best results achieved so far. Combining all features still achieves
better results than the baseline, but the best results are obtained by combining lexical, acoustic
and word-based prosodic features, putting aside the syllable-based and phone-based features,
whose impact is only marginal.
In addition to the above bootstrapping method for improving the transcripts model, an alternative was also tested. The idea consisted of using the predictions of the written corpora model as a complement to the transcripts data. Three different features (COMMA, FULLSTOP, SPACE) were appended to the feature vector of each event in the transcripts data, with the
corresponding probabilities, provided by the written corpora model. Models trained with the
improved data achieve better performances than using solely information coming from the
transcripts. Nevertheless, in general, this method is still worse than the first method tested,
based on bootstrapping.
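A minimal sketch of this second strategy is given below; it assumes a generic written-corpora model exposing class probabilities (the predict_proba interface and feature names are hypothetical) and only illustrates how the three probabilities are appended to each event's feature vector.

```python
# Hypothetical sketch: append the written-corpora model's class probabilities
# (COMMA, FULLSTOP, SPACE) to the feature vector of each event in the
# transcripts data before training the final punctuation model.
CLASSES = ("COMMA", "FULLSTOP", "SPACE")

def augment_with_written_model(events, written_model):
    """events: list of (feature_dict, label) pairs;
    written_model.predict_proba(features) is assumed to return a dict
    mapping each class name to a probability."""
    augmented = []
    for features, label in events:
        probs = written_model.predict_proba(features)
        new_features = dict(features)
        for cls in CLASSES:
            new_features["written_p_" + cls] = probs.get(cls, 0.0)
        augmented.append((new_features, label))
    return augmented
```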
5.3.2 Extension to Question Marks
This section concerns the automatic detection of question marks, which corresponds to detecting which sentences are interrogatives. This is an extension to the punctuation module,
which was initially designed to deal only with full stop and comma. Detecting full-stops and
commas depends mostly on a local context, usually two or three words, and corresponds to
detecting sentence boundaries. However, most interrogatives, especially the wh-questions, hinge
on words that are used in the beginning and/or at the end of the sentence, implying that sentence boundaries must be previously known. Experiments here reported are based on a manual
sentence segmentation and identify which sentences are interrogative. The same acoustic and
prosodic features used for full stop and comma were also applied to question mark. However,
lexical features were extracted from the whole sentence, and each event corresponds to a sentence instead of a word. The same ME-based approach will be followed, but now reducing the
classification to a binary problem.
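As a reminder of the underlying decision rule, a standard maximum-entropy formulation for this binary setting can be written as below; the notation is generic and only illustrates the binary reduction, not necessarily the exact parameterisation used in the experiments.

\[
P(q \mid s) \;=\; \frac{\exp\big(\sum_k \lambda_k f_k(s, q)\big)}
{\exp\big(\sum_k \lambda_k f_k(s, q)\big) + \exp\big(\sum_k \lambda_k f_k(s, \bar{q})\big)}
\]

Here q denotes the interrogative class, \(\bar{q}\) the non-interrogative class, \(f_k\) the sentence-level features, and \(\lambda_k\) the learned weights; a sentence is then marked with a question mark whenever \(P(q \mid s)\) exceeds 0.5 (or a tuned threshold).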
Results concerning the ALERT-SR corpus consider only the Eval and JEval evaluation subsets, because the remaining evaluation sets were not completely revised by the time these experiments started. In order to evaluate if the features would be dependent on the nature of the
corpus, besides the ALERT-SR corpus, a corpus collected within the Lectra project (Trancoso
et al., 2006, 2008) has also been analysed. The corpus, henceforth denoted as the LECTRA corpus, is aimed at transcribing university lectures for e-learning/e-inclusion applications, namely, making the transcribed lectures available for hearing-impaired students, and it offers a different
perspective on the question marks recovery. The corpus has a total of 75h, corresponding to 7
different courses, of which only 27h were orthographically transcribed (totaling 155k words)
so far.
In a previous study (Moniz et al., 2011) different corpora have been analysed in order to
see if the weight of the features was dependent on the nature of the corpus and on the most
characteristic types of interrogatives in each. The study concluded that the percentage of interrogatives was, in fact, highly dependent on the nature of the corpus. For the university lectures
corpus, interrogatives represent 20.4% of all the punctuation marks, and similar values (22.0%)
have also been found in a map-task corpus (Trancoso et al., 1998); in both corpora, the proportion is about ten times higher than in broadcast news (2.1%). This difference is related not only to the percentage of interrogatives across different corpora, but also to their subtypes. In broadcast news, interrogatives are almost exclusively found in interviews and in transitions from anchormen to reporters. In broadcast news, yes/no questions account for 47.0% of all interrogatives, wh-questions for 40.4%, while tag and alternative questions account for only 10.0% and 2.6%, respectively. These percentages compare well with the ones for newspapers, but not with the ones of the other corpora analysed. The highest percentage of tag questions is found in the university lecture corpus (40.4%), which can be interpreted by the teacher's need to confirm whether the students are understanding what is being said; and the highest percentage of yes/no questions occurs in
Evaluation corpus                 | correct | wrong | missed | Prec | Rec  | F    | SER
PUBnews                           | 1100    | 236   | 1740   | 82.3 | 38.7 | 52.7 | 69.6
ALERT-SR - Manual transcripts     | 128     | 25    | 287    | 83.7 | 30.8 | 45.1 | 75.2
ALERT-SR - Automatic transcripts  | 84      | 27    | 305    | 75.7 | 21.6 | 33.6 | 85.3
LECTRA - Manual transcripts       | 157     | 31    | 221    | 83.5 | 41.5 | 55.5 | 66.7

Table 5.16: Recovering question marks using a written corpora model.
the map-task corpus (73.6%), related mostly to the description of a map made by a giver and the need to ask whether the follower is understanding the instructions.
The baseline experiments were performed using only lexical information. The following
features were used for a given sentence: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, start_x, x_end, len, where wi is a word in the sentence, wi+1 is the word that follows, and nwi±x is the n-gram of words that starts x positions after or before position i. The start_x and x_end features were used for identifying word n-grams occurring either at the beginning or at the end of the sentence, and len corresponds to the number of words in the sentence. A discriminative model has been created using the PUBnews newspaper corpora, described in Section 3.2.1. Table 5.16
shows the results of applying the model directly to different evaluation sets, where correct is
the number of correctly identified interrogatives, wrong corresponds to false acceptances or
insertions, and missed corresponds to the missing slots or deletions. The table values reveal a
precision around 80%, but a small recall. The main conclusion is that the recall values using this
limited set of features are correlated with the identification of a specific type of interrogative,
wh-questions. Recall values are comparable to the ones of the wh-question distribution across
corpora. As for yes/no and tag questions, they are only residually identified.
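As an illustration, the sentence-level lexical features just described could be extracted roughly as follows; the helper below is a hypothetical sketch, and the exact feature naming in the real system may differ.

```python
def question_lexical_features(words):
    """Lexical features for one sentence: n-grams (n = 1..3) indexed by their
    start position, n-grams anchored at the sentence start (start_x) and end
    (x_end), and the sentence length (len). Hypothetical sketch only."""
    feats = {"len": len(words)}
    for i in range(len(words)):
        for n in (1, 2, 3):                        # n-gram order
            if i + n <= len(words):
                gram = " ".join(words[i:i + n])
                feats["%dw_%d" % (n, i)] = gram    # n-gram starting at position i
                if i == 0:
                    feats["start_" + gram] = 1     # n-gram at the sentence start
                if i + n == len(words):
                    feats[gram + "_end"] = 1       # n-gram at the sentence end
    return feats

# Example: features for a short (hypothetical) interrogative sentence.
print(question_lexical_features(["acha", "que", "sim"]))
```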
The subsequent experiments aimed at analyzing the weight and contribution of different
feature classes and the impact of their combination. Features were calculated for each sentence
transition, with or without a pause, using the same analysis scope as Shriberg et al. (2009)
(last word, last stressed syllable, and last voiced phone from the current boundary, and the first word and first voiced phone from the following boundary). The following set of features has been used: f0 and energy slopes in the words before and after a silent pause, f0 and energy differences between these units, and also the duration of the last syllable and the last phone. With
this set of features, we aimed at capturing nuclear and boundary tones, amplitude, pitch reset,
and final lengthening. This set of prosodic features already proved useful for the detection of
the full stop and comma, in the ALERT-SR corpus, outperforming results achieved using lexical
and acoustic features only.
Combining the previously created model from written corpora with transcripts data was
also a major issue. Different approaches have been tested, and the best approach consisted of
using the large written corpora model to perform automatic classification on the transcripts
training data and then use the assigned probabilities as features for training a new model from
the transcripts training data. Only two features were added, with the corresponding probabil-
Evaluation             | Transcripts info           | correct | wrong | missed | Prec | Rec  | F    | SER
ALERT-SR               | lexical                    | 143     | 24    | 272    | 85.6 | 34.5 | 49.1 | 71.3
manual transcripts     | lexical + acoustic         | 144     | 27    | 271    | 84.2 | 34.7 | 49.1 | 71.8
                       | lexical + acoustic + WB    | 147     | 28    | 268    | 84.0 | 35.4 | 49.8 | 71.3
                       | lexical + acoustic + SYB   | 148     | 29    | 267    | 83.6 | 35.7 | 50.0 | 71.3
                       | all features               | 146     | 31    | 269    | 82.5 | 35.2 | 49.3 | 72.3
ALERT-SR               | lexical                    | 74      | 27    | 315    | 73.3 | 19.0 | 30.2 | 87.9
automatic transcripts  | lexical + acoustic         | 75      | 25    | 314    | 75.0 | 19.3 | 30.7 | 87.1
                       | lexical + acoustic + WB    | 76      | 22    | 313    | 77.6 | 19.5 | 31.2 | 86.1
                       | lexical + acoustic + SYB   | 71      | 26    | 318    | 73.2 | 18.3 | 29.2 | 88.4
                       | all features               | 75      | 23    | 314    | 76.5 | 19.3 | 30.8 | 86.6
LECTRA                 | lexical                    | 267     | 51    | 111    | 84.0 | 70.6 | 76.7 | 42.9
manual transcripts     | lexical + acoustic         | 268     | 54    | 110    | 83.2 | 70.9 | 76.6 | 43.4
                       | lexical + acoustic + WB    | 276     | 52    | 102    | 84.1 | 73.0 | 78.2 | 40.7
                       | lexical + acoustic + SYB   | 266     | 52    | 112    | 83.6 | 70.4 | 76.4 | 43.4
                       | all features               | 274     | 53    | 104    | 83.8 | 72.5 | 77.7 | 41.5

Table 5.17: Performance results recovering the question mark in different corpora.
ities, provided by the initial model. The performance of the resultant models is better than: i)
using only the information coming from the transcripts; ii) using the bootstrapping method applied in the previous sections. The latter holds because the problem is now an easier, binary one, and because the reduced number of question marks found in the BN corpora causes the bootstrapping method to converge too fast, losing most of the information given by the initial model.
The results of recovering question marks over the LECTRA and the ALERT-SR corpus are
presented in Table 5.17, where different combinations of features were added to a standard model, which uses lexical features only. For each corpus, the first row was achieved using only lexical features, the second also uses acoustic features, and the last three lines combine lexical, acoustic and prosodic information, either using word-based (WB) prosodic features, syllable and
phone-based (SYB) prosodic features, or all the prosodic features combined. Combining the
written corpora model with lexical information coming from the speech transcripts seems to
be significantly important for manual transcripts, where the performance increases about 3.9%
for ALERT-SR and 21.2% for the LECTRA corpus, when comparing with results from Table
5.16. However, the impact is negative for automatic transcripts, where recognition errors cause
changes in the lexical features. Moreover, acoustic information seems to be useful for automatic transcripts, but its impact is negative for the manual transcripts. The combination of word-based (WB) prosodic features seems to lead to the best results, but syllable and phone-based (SYB) prosodic features have not shown a positive contribution.
Based on language dependency effects (fewer lexical cues in EP than in other languages,
such as English) and also on the statistics presented, one can say that, ideally, around 40.0% of
all interrogatives in broadcast news would be mainly identified by lexical cues – corresponding to wh-questions – while the remaining ones would imply the use of prosodic features to be
correctly identified. Results pointed in this direction. A recent study focusing on the detection of question marks in meeting transcriptions (Boakye et al., 2009) analysed the relevance of various features in this task, showing that the lexico-syntactic features are the most useful. As stated by Moniz et al. (2011), when training only with lexical features, wh-questions are identified to a significant extent, whereas tag questions and yes/no questions are quite residual, an exception being, in the latter case, the bigram acha que/do you think. There are still wh-questions
not accounted for, mainly due to very complex structures hard to disambiguate automatically.
When training with all the features, yes/no and tag questions are better identified. It was also
verified that prosodic features increase the identification of interrogatives in BN spontaneous
speech.
These results are encouraging, but still far from the ones obtained for full stop and comma.
Nevertheless, other related papers also show a lower performance in the detection of question
marks. For example, Gravano et al. (2009) report about 47% precision and 24% recall for English
BN, using lexical features only, but training with a very large written corpus.
The results in this study partially agree with the ones reported in Shriberg et al. (2009), regarding the contribution of each prosodic parameter, and also the set of discriminative features
used, where the most important feature turned out to be the f0 slope in the last word of the
current boundary and between word transitions (last word of the current boundary and the
starting word of the following boundary).
It was also verified that prosodic features increase the identification of interrogatives in
Portuguese BN spontaneous speech, e.g., yes/no questions with a request to complete a sentence
(e.g., recta das?/lines of?), tag questions (such as não é?/isn’t it?), and alternative questions as
well (contava com esta decisão ou não?/Did you expect this decision or not?). Even when all the information is combined, we still have questions that are not well identified, due to the following
aspects:
i) a considerable amount of questions is made in the transition between newsreader and
reporter with noisy background (such as war scenarios);
ii) frequent elliptic questions with reduced contexts, e.g., eu?/me? or José?;
iii) sequences with disfluencies, e.g., <é é é> como é que se consegue?, contrasted with a similar
question without disfluencies that was identified: Como é que conseguem isso?/how do you
manage that?;
iv) sequences starting with the copulative conjunction e/and or the adversative conjunction
mas/but, which usually do not occur at the start of sentences;
v) false insertions of question marks in sequences with subordinated questions, which are
not marked with a question mark;
Type of Transcripts |        full-stop         |          comma           |            all
                    | Prec  Rec   F     SER    | Prec  Rec   F     SER    | Prec  Rec   F     SER
Manual              | 79.2  70.8  74.7  47.8   | 66.2  16.1  25.9  92.1   | 76.7  45.1  56.8  60.5
Automatic           | 71.1  64.6  67.7  61.7   | 65.1  16.1  25.8  92.6   | 69.9  41.7  52.3  68.1

Table 5.18: Punctuation results for English BN transcripts.
vi) sequences with more than one consecutive question, randomly chosen, e.g., ... nascem
duas perguntas: quem? e porquê?/ ...two questions arise: who? and why?;
vii) sequences integrating parenthetical comments or vocatives, e.g., Foi acidente mesmo ou
atentado, Noé?/Was it an accident or an attack, Noé?.
5.4 Extension to Other Languages
This section extends some of the previously described experiments to other languages,
particularly to English. The three most frequent and important punctuation marks are considered: full-stop, comma, and question marks. However, similarly to what has been done for
Portuguese, the detection of question marks is separated from the other two, since detecting full-stops and commas depends mostly on a local context, while most interrogative sentences, especially wh-questions, depend on information found at the beginning and at the end of the sentence (global context). Detecting full-stops and commas corresponds to detecting sentence boundaries, which in our case corresponds to distinguishing between two types of sentence boundaries. On the
other hand, detecting interrogative sentences uses properties of the whole sentence as features,
given the sentence boundaries.
An analysis of Tables 3.1 and 3.5 reveals that the English training data is almost twice the size of the Portuguese data. Nonetheless, the English corpora are heterogeneous, comprising five different corpora. Therefore, better performances may not necessarily be achieved.
5.4.1 Recovering Full-stop and Comma
The baseline performance for the English data was achieved using the feature set described
in Section 5.2.1: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, pi, pi+1, 2pi−2, 2pi−1, 2pi, 2pi+1, 3pi−2, 3pi−1, GenderChg1, SpeakerChg1, and TimeGap1. Table 5.18 shows the punctuation
results achieved for the English data, using these baseline features. The manual transcript
results were achieved using force aligned data, produced using the L2 F speech recognition
system. These results are quite similar to the results achieved for Portuguese using the same
set of features, presented in Tables 5.10 and 5.11. As concluded for the Portuguese data, the overall performance is also affected by the comma detection performance, which is significantly lower in terms of SER. Precision is consistently better than recall, confirming that
the system usually prefers avoiding mistakes over adding incorrect slots. The WER impact, in
terms of SER, is about 12% for Portuguese and about 8% for English (absolute values).
The recent study reported by Gravano et al. (2009) considers different text-based n-gram
language models, ranging from 58 million to 55 billion training words, trained from Internet
news articles. For the task of recovering punctuation over broadcast news data, the smaller LM
achieved an F-score of 37% for comma and 46% for full stop, while the bigger LM achieved 52%
(14% absolute increase) for the comma and 63% for the full stop (17% absolute increase). The
significant performance increase suggests that our results, which use less than one million words of speech transcripts, can be much improved by using larger training sets. Their best F-score concerning the full stop (62%) is lower than the results presented here for English, but the authors do not make use of any acoustic information, as has been done here.
The following subsections analyse two ways of improving the baseline punctuation results: the first adding prosodic features, besides the existing lexical and acoustic features, and the second making use of the punctuation information that can be found in large written corpora.
5.4.1.1 Introducing Prosodic Information
The first strategy for improving the baseline results consisted of adding prosodic features,
besides the existing lexical and speaker dependent features. An important issue for providing
prosodic information consisted of marking the syllable boundaries as well as the syllable stress.
The tsylb2 tool (Fisher, 1996), an automatic phonologically-based syllabification algorithm, has been
used for this purpose. Similarly to what has been done for Portuguese, the maximum, minimum,
median, and slope values for pitch and energy were calculated in each word, syllable, and phone.
Duration was also calculated for each one of the previous units. Features were calculated for
each word transition, with or without a pause, using: the last word, last stressed syllable and
last voiced phone from the current word, and the first word, and first voiced phone from the
following word. The following set of features has been used: f0 and energy slopes in the words before and after a silent pause, f0 and energy differences between these units, and also the duration
of the last syllable and the last phone.
Tables 5.19 and 5.20 show the results achieved for manual and automatic transcripts, respectively, outlining the contribution of each prosodic feature per se. Despite being less significant than the corresponding results for Portuguese (Section 5.3.1.1), these results exhibit gains relative to the previous results, for both types of transcripts and for both punctuation marks, and they are again mainly related to the full stop. The word-based features turned out to be the most reliable ones, whereas syllable-based features achieved only small gains relative to
previous results. The best results were always achieved either combining all the features or
using the word-based features alone. These results, together with results from Section 5.3.1.1,
partially agree with the ones reported in Shriberg et al. (2009), regarding the contribution of
Type of Info   | Added features |          full-stop          |            comma            |             All
               |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Word Based     | f0             | 78.1   75.2   76.6   45.9   | 66.6   14.9   24.4   92.6   | 76.1   46.9   58.1   59.1
               | E              | 78.3   73.5   75.8   46.8   | 69.4   16.6   26.7   90.7   | 76.7   46.8   58.1   59.0
               | f0 + E         | 79.4   74.4   76.9   44.8   | 67.5   15.3   25.0   92.0   | 77.3   46.7   58.3   58.9
Syllables &    | f0             | 78.1   74.0   76.0   46.8   | 69.2   13.3   22.3   92.6   | 76.7   45.5   57.1   60.1
Phones         | E              | 78.3   72.6   75.3   47.5   | 68.4   15.6   25.4   91.6   | 76.6   45.9   57.4   60.0
               | D              | 78.1   72.4   75.1   47.9   | 69.0   14.6   24.2   91.9   | 76.6   45.3   56.9   60.5
               | f0 + E + D     | 78.3   74.0   76.1   46.5   | 68.8   13.7   22.9   92.5   | 76.8   45.8   57.3   59.9
All Combined   |                | 78.4   74.0   76.1   46.4   | 67.2   13.7   22.7   93.0   | 76.6   45.7   57.3   60.1

Table 5.19: Punctuation results for English BN manual transcripts, adding prosody.
Type of Info   | Added features |          full-stop          |            comma            |             All
               |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Word Based     | f0             | 72.9   66.8   69.7   58.0   | 66.0   17.6   27.7   91.5   | 71.5   43.6   54.2   65.7
               | E              | 70.9   60.6   65.3   64.3   | 65.1    7.8   14.0   96.4   | 70.3   35.7   47.4   71.6
               | f0 + E         | 72.7   65.8   69.1   58.9   | 62.6   15.3   24.6   93.9   | 70.8   42.0   52.7   67.1
Syllables &    | f0             | 72.5   61.0   66.3   62.1   | 64.4    9.7   16.9   95.7   | 71.4   36.9   48.6   70.3
Phones         | E              | 71.0   55.7   62.4   67.1   | 63.7    4.5    8.4   98.1   | 70.5   31.6   43.6   74.9
               | D              | 70.9   58.5   64.1   65.5   | 59.5    7.8   13.8   97.5   | 69.5   34.6   46.2   73.0
               | f0 + E + D     | 71.2   62.0   66.3   63.1   | 63.4    7.5   13.5   96.8   | 70.4   36.3   47.9   71.1
All Combined   |                | 74.5   66.1   70.0   56.6   | 65.0   17.9   28.1   91.7   | 72.4   43.4   54.2   65.3

Table 5.20: Punctuation results for English BN automatic transcripts, adding prosody.
each prosodic parameter, and also the set of used features, where the most important feature
turned out to be the f0 slope in the words and between word transitions.
Individual results concerning the 1998 Hub-4 evaluation data (LDC2006S86) corpus were
calculated, in order to establish a parallel with other related work. Christensen et al. (2001) and
Kim and Woodland (2001) report results on the LDC2006S86 corpus, making use of lexical and
prosodic features for recovering full stop, comma and question mark. The first paper describes a set of experiments using statistical finite-state models; when all punctuation marks are combined, the best performance reported is 89% SER. The paper also
reports results on individual punctuation marks, achieving from 41% to 79% SER for the full
stop, and from 81% to 110% SER for the comma, but these results are insufficient for drawing
further conclusions. The results for automatic transcripts presented in this study are similar to
the ones reported by Kim and Woodland (2001). However, results cannot be directly compared,
because i) the paper uses only a portion of the LDC2006S86 data for evaluation; ii) the paper
results take into account the question mark detection; and iii) the WER is different.
5.4.1.2 Retraining from a Written Corpora Model
Similarly to experiments performed for Portuguese, another way of improving the current
punctuation results consisted of using punctuation information extracted from written cor-
Type of Info       | Added features |          full-stop          |            comma            |             All
                   |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Lexical + Acoustic |                | 77.1   74.8   75.9   47.5   | 64.2   22.6   33.5   90.0   | 73.9   50.3   59.9   57.5
Word Based         | f0             | 79.8   75.3   77.5   43.7   | 64.2   25.0   36.0   88.9   | 75.7   51.7   61.4   55.7
                   | E              | 78.1   74.7   76.4   46.3   | 64.6   23.0   33.9   89.6   | 74.8   50.4   60.2   57.1
                   | f0 + E         | 78.8   76.5   77.6   44.1   | 64.2   24.2   35.1   89.3   | 75.1   52.0   61.4   55.8
Syllables &        | f0             | 77.2   75.3   76.2   46.9   | 63.3   20.8   31.3   91.2   | 74.0   49.7   59.5   57.8
Phones             | E              | 79.2   71.4   75.1   47.4   | 61.3   26.9   37.4   90.1   | 73.8   50.5   60.0   57.6
                   | D              | 78.3   73.1   75.6   47.2   | 64.9   21.8   32.7   90.0   | 75.1   49.1   59.3   58.1
                   | f0 + E + D     | 77.7   75.4   76.5   46.2   | 64.5   23.4   34.3   89.5   | 74.4   51.0   60.5   56.7
All Combined       |                | 79.3   75.7   77.5   44.0   | 63.0   24.8   35.6   89.8   | 74.9   51.9   61.3   55.8

Table 5.21: Punctuation for force aligned transcripts, bootstrapping from a written corpora model.
Type of Info       | Added features |          full-stop          |            comma            |             All
                   |                | Prec   Rec    F      SER    | Prec   Rec    F      SER    | Prec   Rec    F      SER
Lexical + Acoustic |                | 69.7   61.2   65.2   65.4   | 56.5   16.4   25.4   96.2   | 66.7   40.1   50.1   70.5
Word Based         | f0             | 71.4   63.4   67.2   61.9   | 54.4   20.2   29.5   96.8   | 66.8   43.1   52.4   68.3
                   | E              | 69.5   62.8   66.0   64.8   | 58.9   15.6   24.7   95.3   | 67.3   40.6   50.6   69.7
                   | f0 + E         | 71.3   63.5   67.1   62.1   | 59.2   15.8   25.0   95.1   | 68.7   41.0   51.4   68.4
Syllables &        | f0             | 71.4   62.0   66.4   62.8   | 58.7   15.5   24.5   95.4   | 68.7   40.1   50.6   69.3
Phones             | E              | 71.0   61.1   65.7   63.8   | 59.1   15.4   24.5   95.2   | 68.5   39.6   50.2   69.7
                   | D              | 69.6   60.1   64.5   66.1   | 57.4   16.0   25.0   95.9   | 66.9   39.4   49.6   70.8
                   | f0 + E + D     | 71.9   62.5   66.9   62.0   | 57.7   15.7   24.6   95.8   | 68.8   40.5   50.9   69.0
All Combined       |                | 71.3   64.0   67.4   61.8   | 56.4   16.6   25.7   96.2   | 67.9   41.7   51.6   68.6

Table 5.22: Punctuation for automatic transcripts, bootstrapping from a written corpora model.
pora. For that purpose, we have firstly trained a punctuation model using written corpora,
and then trained a new punctuation model with transcripts, bootstrapping from the written
corpora model. These experiments use the NYT (New York Times) portion of the LDC corpus
LDC1998T30 (North American News Text Supplement), described in Section 3.2.3. The original
texts were normalized and all the punctuation marks removed, making them close to speech
transcripts. Models suitable for speech transcripts were then trained using transcripts training
data, bootstrapping from the initial written corpora model.
Tables 5.21 and 5.22 present the obtained results, which can be directly compared with the results from Table 5.18. From the comparison, regular trends are found: i) bootstrapping does
not promote better results for automatic transcripts; ii) the performance for the force aligned
data is improved; iii) comma detection improves in both conditions. These findings, together
with the conclusions derived in Section 5.3.1.2, support two basic ideas: results are better for
Portuguese, probably because the English data is quite heterogeneous and has a higher WER;
the most significant gains concerning comma derive from the fact that this specific punctuation
mark depends more on lexical features (e.g., ..., por exemplo/for instance, ...), similar to observations from Favre et al. (2009).
Evaluation corpus     | correct | wrong | missed | Prec | Rec  | F    | SER
NYT evaluation        | 993     | 81    | 668    | 92.5 | 59.8 | 72.6 | 45.1
Manual transcripts    | 145     | 30    | 119    | 82.9 | 54.9 | 66.1 | 56.4
Automatic transcripts | 101     | 51    | 151    | 66.4 | 40.1 | 50.0 | 80.2

Table 5.23: Recovering question marks using a written corpora based model.
5.4.2 Detection of Question Marks
The following experiments concern the automatic detection of question marks. This corresponds to an extension of the punctuation module which was initially designed to deal only
with full stops and commas. The following experiments follow the same ME-based approach,
but this time each event corresponds to an entire sentence, previously marked using the reference data. In other words, this task tries to assess which sentences are interrogative. The
remainder of this section will first assess the performance of the module, where only lexical
information was used, learned from a large corpus of written data; and then it will study the
impact of introducing prosodic features, analysing the individual contribution of each prosodic
feature.
Initial experiments used only lexical information. A discriminative model has been created for English written corpora, using the NYT (New York Times) portion of the LDC corpus
LDC1998T30, described in Section 3.2.3. The features previously used in Section 5.3.2 were also
applied to each sentence: wi , wi+1 , 2wi−2 , 2wi−1 , 2wi , 2wi+1 , 3wi−2 , 3wi−1 , start_x, x_end, len.
Table 5.23 shows the results of applying the written corpora model to different evaluation sets
directly, where correct, wrong and missed correspond to the number of correct sentences, insertions and deletions, respectively. Results concerning the written corpora evaluation set (NYT)
are about 24% better than the corresponding results achieved for Portuguese newspaper data.
As expected, question marks are easier to detect for written English, since this language has
more lexical cues, mainly quite frequent n-grams related to “auxiliary verb plus subject inversion” (e.g., do you?, can you?, have you?). The difference of about 24% SER is mostly related to the high number of deletions (non-identified sentences) for Portuguese. This is due to
the fact that yes/no questions, corresponding to almost 50% of all the questions in the corpus,
are mainly disambiguated from a declarative sentence using prosodic information. Concerning the force aligned transcripts, results are again better for English. The difference between
force aligned and automatic transcripts is bigger in English (21.4%) than in Portuguese (16.6%),
reflecting the impact of the recognition errors in this task. Although n-grams related to “auxiliary verb plus subject inversion” are relevant features for correctly identifying question marks in English, the auxiliary verbs (e.g., do, can, have) are often misrecognized, particularly in spontaneous speech, which explains that larger impact.
When using only this limited set of features, the recall percentages are correlated with specific types of questions, namely, wh-questions for both languages; and yes/no questions almost
Evaluation    | Transcripts info           | correct | wrong | missed | Prec | Rec  | F    | SER
manual        | lexical                    | 155     | 22    | 109    | 87.6 | 58.7 | 70.3 | 49.6
transcripts   | lexical + acoustic         | 151     | 21    | 113    | 87.8 | 57.2 | 69.3 | 50.8
              | lexical + acoustic + WB    | 152     | 21    | 112    | 87.9 | 57.6 | 69.6 | 50.4
              | lexical + acoustic + SYB   | 151     | 19    | 113    | 88.8 | 57.2 | 69.6 | 50.0
              | all features               | 149     | 19    | 115    | 88.7 | 56.4 | 69.0 | 50.8
automatic     | lexical                    | 100     | 27    | 152    | 78.7 | 39.7 | 52.8 | 71.0
transcripts   | lexical + acoustic         | 103     | 26    | 149    | 79.8 | 40.9 | 54.1 | 69.4
              | lexical + acoustic + WB    | 100     | 31    | 152    | 76.3 | 39.7 | 52.2 | 72.6
              | lexical + acoustic + SYB   | 100     | 27    | 152    | 78.7 | 39.7 | 52.8 | 71.0
              | all features               | 102     | 33    | 150    | 75.6 | 40.5 | 52.7 | 72.6

Table 5.24: Recovering the question mark, adding acoustic and prosodic features.
exclusively for English. Due to language-specific properties, namely, “auxiliary verb plus subject inversion”, the recall percentages for English are always higher than for Portuguese. Not
surprisingly then, the bigram “do you”, for instance, is fairly well associated with a yes/no
question. For Portuguese, the recall percentage of the aligned data is comparable to the ones
of the wh-questions for BN and newspapers, although there is still a small percentage of this
type of interrogative not accounted for, mainly due to very complex structures hard to disambiguate automatically. As for tag and alternative questions in both languages they are not easily
identified with lexical features only.2
The subsequent experiments aimed at analyzing the weight and contribution of different
feature classes and the impact of their combination. The model previously created from written corpora was combined with transcripts data, using the approach also applied for Portuguese. This consisted of using the written corpora model to perform automatic classification on the training data and then using the assigned class as a feature for training a new model from the transcripts training data. Results for recovering question marks over the English data are presented in Table 5.24, where different combinations of features were added to a standard model that uses lexical features only. For each corpus, the first row was achieved using only lexical features, the second also uses acoustic features, and the last three lines combine lexical, acoustic and
prosodic information, either using word-based (WB) prosodic features, syllable and phone-based
(SYB) prosodic features, or all the prosodic features combined. There is an effective gain for the
aligned English data, but results are not very significant, due to the relatively small number
of question marks found in the corpora. Combining the written corpora model with lexical
information coming from the speech transcripts seems to be significantly important both for the
manual and automatic transcripts, contrasting with the Portuguese results where the impact
was negative for the automatic transcripts. On the other hand, acoustic information seems
to be useful for the automatic transcripts, but its impact is negative for the manual transcripts,
similarly to conclusions achieved for Portuguese. The combination of prosodic features has not
2 Exception made for tag questions in the university lectures corpus used in the previous section, which has a high
percentage of this type of interrogatives in both train and test sets.
shown a positive contribution for the English language, contrasting with the results achieved
for Portuguese. The impact of the recognition errors is about 16% (absolute) for Portuguese
Broadcast News and 21% for English, where the overall WER is higher.
As stated by Vaissière (1983), these prosodic features are language-independent. Language-specific properties in the data are related to the fact that word-based features are more useful for the Portuguese corpus, while syllable-based ones give the best results for the English data. This result may be interpreted in terms of language-specific syllabic properties, i.e., English allows for more segmental material in the whole syllabic skeleton; thus, for Portuguese, the word-based features provide more context. Moreover, different durational patterns were found at the end of an intonational unit (e.g., in European Portuguese post-tonic syllables are quite often truncated), as well as different pitch slopes that may be associated with discourse functions beyond sentence-form types.
5.5 Summary
This chapter reported experiments concerning the automatic recovery of punctuation marks. Section 5.1 reported an exploratory work analysing the occurrence of the different punctuation marks in different languages. Such analysis, considering both written corpora and speech transcripts, contributes to a better understanding of the usage of each punctuation mark across languages.
Section 5.2 described initial experiments, using lexical and acoustic features, for basic sentence boundary detection, and for discriminating the two most frequent punctuation marks:
full stop and comma. Independent results were achieved for manual and automatic transcripts,
which made it possible to assess the impact of the speech recognition errors on this task. Independent results were also achieved for spontaneous and planned speech. The contribution of each one of the features was analysed separately, making it possible to measure the influence of each feature on the automatic punctuation. The results achieved provided the baseline for further punctuation experiments.
Section 5.3 described the efforts to improve the punctuation module towards a better detection of the basic punctuation marks, full stop and comma, and to deal with the question mark. Two ways
of improving the initial results were addressed: i) adding prosodic features, besides the existing
lexical, time-based and speaker-based features; ii) making use of punctuation information that
can be found in large written corpora. Reported experiments were performed both on manual
transcripts and directly over the automatic speech recognition output, using lexical, acoustic
and prosodic features. Results showed that combining all the features leads to the best performance. Results also made it possible to identify the most relevant prosodic features for this task, those related to pitch being the most significant per se; however, the best results
were obtained when combining pitch and energy. The full stop detection consistently achieved
the best performance, followed by the comma, and finally by the question mark. The study of
the latter, however, is still in an early stage and results can be further improved either by using
larger training data or by extending the analysis of pitch slopes with discourse functionalities
beyond sentence-form types.
Section 5.4 reported experiments using English data, and compared the performance of
the punctuation recovery module when dealing with Portuguese and English. Results suggested that question marks are easier to detect for the English language. When using only lexical
and acoustic features, recall values are correlated with specific types of questions, namely, wh-questions for both languages, and yes/no questions almost exclusively for English. Due to
language-specific properties, namely, auxiliary verb plus subject inversion, the recall for English is always higher than for Portuguese broadcast news. Not surprisingly then, the bigram
“do you”, for instance, is fairly well associated with yes/no questions. For Portuguese, the recall of the aligned data is comparable to the one of the wh-questions for broadcast news and
newspapers, although there is still a small percentage of this type of interrogatives that is not
accounted for, mainly due to very complex structures hard to disambiguate automatically. As
for tag and alternative questions in both languages, they are not easily identified with lexical
features only.
6 Conclusions and Future Directions
This chapter overviews the work reported in this thesis, presents the main conclusions,
enumerates the main contributions, and describes a number of possible directions for further
extending this research.
6.1 Overview
The quality of a speech recognition system should not be measured using only a single
performance metric, like the WER, as other important factors that can improve human legibility and contribute to further automatic processing should be considered as well. For that reason, rich transcription-related topics have been gaining increasing attention from the scientific community in recent years. This study addressed two metadata annotation tasks
that take part in the production of rich transcripts: recovering punctuation marks and capitalization information on speech transcripts. Information concerning punctuation marks and
capitalization is critical for the legibility of speech transcripts and it is also important to other
downstream tasks that are usually also applied to written corpora, like named entity recognition, information extraction, extractive summarization, parsing, and machine translation.
The most relevant data used in the scope of this work was described in Chapter 3. Most of
the data has changed during this thesis time span, sometimes due to data revisions, but also due
to corrections in the corpora related tools. For example, the current version of the Portuguese
BN corpus has been completely revised recently by an expert linguist. This was particularly
important given that the previous version of this corpus was manually transcribed by different
annotators, who did not follow consistent criteria, especially in terms of punctuation marks. The normalization-related tools have also been subject to several improvements throughout this thesis. For that reason, in order to compare different experimental configurations, a number of
experiments were repeated several times in different time periods.
Besides the Portuguese speech corpus, the study has been ported to other languages,
namely Spanish and English. The English BN corpus combines five different corpora subsets.
Each corpus subset has been produced in a different time period, built for a different purpose,
encoded with different annotation criteria, and was available in a different format. Combin-
ing these heterogeneous corpora demanded a normalization strategy specifically adapted for
each corpus, which combined a number of different existing tools with new tools developed
specifically to deal with this problem. The automatic transcripts for all the speech corpora
were produced by the L2 F recognition system. The reference punctuation and capitalization
for the automatic transcripts were provided by means of alignments between the manual and
the automatic transcripts. This is not a trivial task because of the recognition errors. All the
information is encoded in XML format. It contains information coming from the speech recognition system, as well as other reference information coming from the manual transcripts. The
word boundaries that have been previously identified automatically by the speech recognition
system were adjusted by means of post-processing rules and prosodic features (pitch, energy
and duration). The final content is also available as an XML file and contains not only pitch
and energy, extracted directly from the speech signal, but also information concerning phones,
syllable boundaries and syllable stress.
Most of the experiments described in this study aim at processing broadcast news data,
but other information sources, like written newspaper corpora, have been used to complement
the relatively small size of the speech corpora. Written corpora contain information that is especially important for capitalization. In fact, these corpora provide information concerning the context where the capitalized words appear. The Portuguese corpora used in these experiments consist of online editions of Portuguese newspapers, collected from the web (at L2 F). Some of the data has been collected during this thesis time span, allowing experiments to be performed with the most recent data. The English written corpus is the North American News Text Supplement, available from the LDC. All the written corpora were normalized in order to
be closer to speech transcripts, and therefore to be used for training speech-like models. That
demanded substantial efforts in creating new tools, or adjusting and improving the existing
normalization tools, given that each corpus requires a specially designed (or at least adapted)
tool for dealing with specific phenomena.
The work on capitalization recovery, on both written corpora and speech transcripts, was presented in Chapter 4. As part of the early work, two generative methods were compared with the ME approach. Results suggest that generative methods produce better results for written corpora, while the maximum entropy approach works better with speech transcripts, also suggesting that the impact of the recognition errors is stronger for the generative approaches. The next step in this study consisted of analysing the impact of language variation in the capitalization task. This was partly motivated by the daily BN subtitling task, which led the L2F speech group to use a baseline vocabulary combined with a daily modification of the vocabulary (Martins et al., 2007b) and a re-estimation of the language model. This dynamic language modeling provided an interesting scenario for our capitalization experiments. Maximum entropy models proved to be suitable for the capitalization task, especially when dealing with language dynamics. This approach provided a clean framework for learning with new data, while slowly discarding unused data. It also enabled the combination of different data sources and the exploration of different features. In terms of language variation, results suggested that different capitalization models should be used for different time periods.
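A minimal sketch of a maximum entropy capitalizer in this spirit, assuming scikit-learn's LogisticRegression as the ME learner and a much simplified feature set (this is not the exact configuration or feature inventory used in the thesis):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def features(words, i):
        # Local lexical context of the (lower-cased) word at position i.
        return {"w": words[i],
                "prev": words[i - 1] if i > 0 else "<s>",
                "next": words[i + 1] if i + 1 < len(words) else "</s>",
                "first": i == 0}

    def train_capitalizer(sentences):
        # sentences: lists of correctly capitalized words from written corpora.
        X, y = [], []
        for sent in sentences:
            lower = [w.lower() for w in sent]
            for i, w in enumerate(sent):
                X.append(features(lower, i))
                y.append(int(w[:1].isupper()))      # binary label: capitalize or not
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000)     # logistic regression = maximum entropy model
        clf.fit(vec.fit_transform(X), y)
        return vec, clf

    def capitalize(vec, clf, words):
        lower = [w.lower() for w in words]
        X = vec.transform([features(lower, i) for i in range(len(lower))])
        return [w.capitalize() if p else w for w, p in zip(lower, clf.predict(X))]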
Most of the experiments compared the capitalization performance on written corpora and on speech transcripts. Individual results concerning manual and automatic transcriptions were also presented, revealing the impact of the recognition errors on this task. For both types of transcription, results show evidence that the performance is affected by the temporal distance between training and testing sets. Such conclusions led to the proposal and evaluation, in this study, of three different approaches for updating the capitalization module. The most promising approach consisted of iteratively retraining a baseline model with the newly available data, using small corpus subsets, increasing the performance by about 1.6% when dealing with manual transcripts. Results reveal that producing capitalization models on a daily basis did not lead to a significant improvement. Therefore, adapting the capitalization models on a periodic basis was the best choice. The small improvements gained in terms of capitalization suggest that dynamically updated models may play a small role, but the updating does not need to be done daily, a fact that also matches our intuition.
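A minimal sketch of such a periodic adaptation scheme, assuming a recent scikit-learn and using SGDClassifier with a logistic loss and partial_fit as a stand-in for the retraining procedure described above; the feature hashing and the batching interface are assumptions made for illustration.

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    hasher = FeatureHasher(n_features=2 ** 18)      # maps feature dicts to a fixed space
    model = SGDClassifier(loss="log_loss")          # logistic (maximum entropy) model

    def adapt(feature_dicts, labels, first_batch=False):
        # Called periodically (e.g. once per week) with a small subset of new data,
        # starting from the current model instead of retraining from scratch.
        X = hasher.transform(feature_dicts)
        if first_batch:
            model.partial_fit(X, labels, classes=[0, 1])
        else:
            model.partial_fit(X, labels)
        return model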
A number of recent experiments on automatic capitalization, reflecting the most recent
training and testing conditions, with more accurate results, were also presented in Chapter 4.
The ME-based approach was again compared with HMMs and also with CRFs, using the most
recent data and increasing the number of training iterations. Besides the automatic transcripts, experiments with force-aligned transcriptions were also included. Later experiments confirmed that the HMM-based method was suitable for dealing with written corpora, but ME and CRFs achieved a better performance when applied to speech data. The most recent experiments extending this work to other languages were also reported. The effect of language variation over time was again studied for the English and Spanish data, confirming that the interval between the training and testing periods is relevant for the capitalization performance.
Chapter 5 reported experiments concerning the automatic recovery of punctuation marks. As part of the early work, an exploratory analysis of the occurrence of the different punctuation marks in different languages was performed. Such an analysis, considering both written corpora and speech transcripts, contributed to a better understanding of the usage of each punctuation mark across languages. Results show that Portuguese broadcast news transcripts have a higher number of commas when compared with English and Spanish. The BN data contains a greater number of sentences and more intra-sentence punctuation marks when compared to newspaper written corpora, confirming that speech sentences are shorter.
Initial experiments concerning punctuation recovery were performed using lexical and acoustic features, firstly for basic sentence boundary detection, and then for discriminating the two most frequent punctuation marks: full stop and comma. The initial results were improved by adding prosodic features, besides the existing lexical, time-based and speaker-based features, and by making use of punctuation information that can be found in large written corpora. Independent results were achieved for manual and automatic transcripts, making it possible to assess the impact of the speech recognition errors on this task. Independent results were also achieved for spontaneous and planned speech. The contribution of each feature was analysed separately, making it possible to measure its influence on the automatic punctuation recovery. The punctuation module was then extended to obtain a better detection of the basic punctuation marks, full stop and comma, and also to deal with the question mark. Reported experiments were performed both on manual transcripts and directly over the automatic speech recognition output, using lexical, acoustic and prosodic features. Results pointed out that combining all the features usually leads to the best performance.
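The following sketch illustrates how lexical, time-based, and prosodic evidence can be combined into a single feature vector per word boundary before being fed to the same maximum entropy classifier; the feature names and the input arrays are simplified assumptions rather than the actual feature set used in these experiments.

    def boundary_features(words, pauses, pitch_slopes, i):
        # Features describing the boundary after word i.
        return {"word": words[i],                                         # lexical
                "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
                "pause_bin": min(int(pauses[i] * 10), 10),                # time-based (pause in seconds)
                "rising_pitch": pitch_slopes[i] > 0}                      # prosodic

    # Each boundary is then labelled with one of {"", ",", "."} by the classifier.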
6.2 Main Conclusions
This study addresses the tasks of recovering capitalization and punctuation marks from spoken transcripts produced by ASR systems. These two practical RT tasks were performed using the same discriminative approach, based on maximum entropy, which is suitable for combining different data sources and features for characterizing the data and for on-the-fly integration, something of great importance for tasks such as online subtitling, characterized by strict latency requirements. Reported experiments were conducted over both Portuguese and English BN data, making it possible to compare the performance for the two languages. Experiments used both force-aligned and automatic transcripts, making it possible to measure the impact of the recognition errors.
Capitalized words and named entities are intrinsically related, and are influenced by time variation effects. For that reason, the so-called language dynamics have been analysed for the capitalization task. The ME modeling approach provides a clean framework for learning with new data, while slowly discarding unused data, making it interesting for addressing problems that comprise language variation in time. Language adaptation results clearly indicate, for both languages, that the capitalization performance is affected by the temporal distance between the training and testing data. Hence, we propose that different capitalization models should be used for different time periods. Capitalization experiments were also performed with an HMM-based tagger, a common approach for this type of problem. While the HMM-based approach captured the structure of written corpora better, the ME-based approach proved to be better suited for speech transcripts, which include portions of spontaneous speech, characterized by a more flexible linguistic structure when compared to written corpora, and also proved more robust to ASR errors.
As regards the punctuation task, this thesis covers the three most frequent punctuation marks: full stop, comma, and question mark. Detecting full stops and commas is performed first, and corresponds to segmenting the speech recognizer output stream. Question marks are detected afterwards, making use of the previously identified segmentation boundaries. Rather than comparing with other approaches, the reported punctuation experiments focused on the use of additional information sources and of the diverse linguistic structures that can be found in the speech data. Two different scenarios were explored for improving the baseline results for full stop and comma. The first made use of the punctuation information that can be found in large written corpora. The second consisted of introducing prosodic features, besides the initial lexical, time-based and speaker-based features. The first scenario yielded improved results for all force-aligned data, and for all the Portuguese data. The comma detection improved significantly, especially for Portuguese aligned data (7.8%). These findings support two basic ideas: results are better for Portuguese, because our English data is quite heterogeneous and has a higher WER; and the most significant gains concerning the comma derive from the fact that it depends more on lexical features. The second scenario provided even better results, for both languages and both punctuation marks, with improvements ranging from 3% to 8% (absolute). The best results were again achieved for Portuguese, but this time they are mainly related to the full stop. We have concluded that, in both languages, the linguistic structure related to punctuation marks is captured in different ways for the distinct marks: commas are identified mostly by lexical features, while full stops depend more on prosodic ones. The most significant gains come from combining all the available features. As for question marks, there is a gain for the recognized Portuguese and for the aligned English data, but the differences are not significant, due to the relatively small number of question marks in the corpora.
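The two-stage design described above can be sketched as follows; is_interrogative is a hypothetical callable wrapping the maximum entropy classifier applied to the lexical and prosodic features of a whole segment, and is not part of the actual implementation's interface.

    def add_question_marks(words, labels, is_interrogative):
        # First stage (already done): labels[i] in {"", ",", "."} marks the boundary
        # after words[i]. Second stage: re-label interrogative sentence-like units,
        # reusing the previously detected segmentation boundaries.
        labels = list(labels)
        start = 0
        for i, lab in enumerate(labels):
            if lab == ".":
                if is_interrogative(words[start:i + 1]):
                    labels[i] = "?"
                start = i + 1
        return labels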
6.3 Contributions
This study proposes a methodology for enriching speech transcriptions that has been successfully applied to recovering punctuation marks and capitalization on broadcast news corpora. A prototype module, based on the proposed methodology and incorporating the two rich transcription tasks, has been integrated into the L2F broadcast news processing chain. A broad set of lexical, acoustic, and prosodic information has been successfully combined to enhance the results of the punctuation module. Additional sources of information, in particular written corpora, were used, especially for capitalization, thus minimizing the effects of having small-sized speech corpora. Finally, the most relevant experiments have been ported to other languages. Hence, the goals initially proposed for this study were met. The following items outline the most important contributions achieved in the scope of this work:
• A shared approach for punctuation and capitalization recovery: the proposed approach proved to be suitable for dealing with speech data, makes it possible to combine different levels of information, and can be used for on-the-fly processing.
• To the best of our knowledge, the described experiments provide the first punctuation and capitalization results for Portuguese. Most results concern broadcast news corpora, which was the major application domain of this study, but a number of experiments were also conducted on other corpora, including written newspaper corpora.
• Concerning the capitalization task, different methods have been compared, and the reported experiments suggest that generative methods are more appropriate for written corpora, while discriminative methods are more suitable for speech data. To the best of our knowledge, this is the first study reaching such a conclusion.
• The impact of language variation over time has been analysed for the capitalization task, in line with other related work. Reported experiments suggest that the capitalization performance decays over time, as an effect of language variation, which is also in agreement with other reported work on NER. A number of different adaptation strategies have been analysed, suggesting that the capitalization models must be updated on a periodic basis.
• Concerning the punctuation task, independent results were achieved for manual and automatic transcripts, making it possible to assess the impact of the speech recognition errors. Independent results were also achieved for spontaneous and planned speech. Results pointed out that combining all the features leads to the best performance, but the contribution of each feature was analysed separately, making it possible to measure its influence on the automatic punctuation task. The linguistic structure related to punctuation marks is captured in different ways: commas are identified mostly by lexical features, while full stops depend more on prosodic features. The relatively small number of question marks in the corpora prevented us from drawing significant conclusions for this punctuation mark. Full stop detection achieved the best performance, followed by the comma, and finally by the question mark.
• Most of these experiments were ported to other languages, in particular to English, making it possible to compare properties of the different languages, and also to confirm a number of conclusions first reached for Portuguese. Despite not being able to directly compare the achieved results with the related work, the performance achieved here for both punctuation and capitalization may be considered similar to state-of-the-art results reported in the literature.
• An on-the-fly module for punctuation and capitalization recovery has been developed. This module is an important asset in the in-house automatic subtitling system, and it has been included in the fully automatic subtitling module for broadcast news, deployed at the national television broadcaster since March 2008, in the scope of the national TECNOVOZ1 project. The two modules also provide important information for an automatic multimedia content dissemination system. Results of the offline processing of each BN show are also published daily on the web2.
1 http://www.tecnovoz.com.pt/
2 https://tecnovoz.l2f.inesc-id.pt/demos/asr/legendagem/
• An improved version of the existing normalization tool for Portuguese. The existing normalization tool was thoroughly revised to correctly address phenomena such as date and time expressions, ordinals, numbers, abbreviations, money amounts, and a number of other expressions found in real text.
• The creation and integration of different tools for English corpus normalization. A number of tools have been created and integrated into pipeline processing chains, specially adapted to deal with each of the five broadcast news corpus subsets, as well as with the English newspaper data.
The work performed in the scope of this thesis has been disseminated by means of a journal publication, a book chapter3, and a number of other publications in international conferences and workshops. The following publications focus on the capitalization task as an isolated task: Batista et al. (2007b) compares different approaches for the capitalization of Portuguese broadcast news; Batista et al. (2008e) studies the impact of language dynamics on written corpora; this study was then extended to the broadcast news corpora and described in Batista et al. (2008d) and Batista et al. (2008c); the impact of dynamic model adaptation beyond speech recognition is reported in Batista et al. (2008a). The following publications focus on the punctuation recovery task as an isolated task: Batista et al. (2007a) describes the initial approach to a punctuation module together with the first performance results achieved for sentence boundary detection; recent experiments porting this work to other languages have been reported together with capitalization results for the same corpora (Batista et al., 2009b,a); the most recent papers, Batista et al. (2010) and Moniz et al. (2010, 2011), focus on the use of prosody. Papers involving both punctuation and capitalization include the following: Batista et al. (2008b) presents the experiments concerning both punctuation and capitalization of broadcast news data; Batista et al. (2009b,a) report the most recent experiments comparing the performance in different languages; and Batista et al. (2011) describes the most recent work, reporting bilingual experiments comparing the Portuguese and English languages.
6.4 Future Work
The contributions of this thesis correspond to the first steps in enriching the speech recognition output, and much work remains to be done. The following items pinpoint a number of
possible directions for the future:
• Adapt the current part-of-speech tagger to deal with speech transcripts. Despite the part-of-speech tagger not having been specifically trained for speech transcripts, POS information is still an important feature according to the results presented in Section 5.2.3.1, where it was shown to be the third most important feature;
3 Springer book, containing extended versions of the best selected papers from a workshop.
• Perform these tasks over lattices and/or confusion networks, thus enriching information
suitable for tasks such as speech translation;
• Analyse the impact of the punctuation and capitalization tasks on machine translation, machine summarization, and question answering, and contribute to the quality of each of these tasks, as supported by a number of studies found in the literature (Matusov et al., 2006; Ostendorf et al., 2008).
• Use unlabeled data for improving the existing models, by means of a semi-supervised
training method, such as co-training (Blum and Mitchell, 1998);
• In terms of capitalization, an interesting future direction would be the fusion of the generative and the discriminative approaches, since they perform better for written corpora
and speech transcripts, respectively;
• Additional features, widely used in named entity extraction, can also be used for capitalization restoration. Features such as word prefixes and suffixes can be easily extracted and may contribute to the capitalization performance.
• In terms of punctuation, there are many interesting research directions, particularly concerning prosodic features (for instance, using pseudo-syllable information directly derived from the audio data). Extending this study on interrogatives to other domains, besides BN, will make it possible to better model different types of interrogatives that are not well represented in this corpus;
• Further experiments must be performed in order to assess to what extent our prosodic features are language-dependent or language-independent. Extending this study to other languages, besides Portuguese and English, will certainly provide challenging scenarios in the future;
• Interrogatives are still at an early stage and can be further improved, both by using larger training data and by extending the analysis of pitch slopes with discourse functionalities beyond sentence-form types;
• Consider alternative evaluation strategies for both tasks, keeping human performance in mind. According to Kowal and O’Connell (2008), each transcriber can, deliberately or involuntarily, delete, add, substitute, and/or relocate utterances or parts of utterances in a transcript. However, these decisions are not always a matter of error, leading Kowal and O’Connell (2008) to consider them as changes rather than errors. Similarly, one can consider that the classification proposed by an automatic system may sometimes be an alternative way of producing the transcript. Therefore, a binary decision comprising only the reference and the automatic classification may not reflect the real achievement;
• Port these modules to other varieties of Portuguese (spoken in South America and Africa).
Bibliography
Abad, A. and Neto, J. (2008). Incorporating acoustical modelling of phone transitions in an
hybrid ANN/HMM speech recognizer. In Proc. of the 9th Annual Conference of the International
Speech Communication Association (Interspeech 2008), Brisbane, Australia.
Agbago, A., Kuhn, R., and Foster, G. (2005). Truecasing for the portage system. In Proc. of the International Conference on Recent Advances in Natural Language Processing (RANLP’05), Borovets,
Bulgaria.
Amaral, R., Meinedo, H., Caseiro, D., Trancoso, I., and Neto, J. P. (2007). A prototype system
for selective dissemination of broadcast news in European Portuguese. EURASIP Journal on
Advances in Signal Processing, 2007(37507).
Amaral, R. and Trancoso, I. (2008). Topic segmentation and indexation in a media watch system. In Proc. of the 9th Annual Conference of the International Speech Communication Association
(Interspeech 2008), Brisbane, Australia. ISCA.
Baldwin, T. and Joseph, M. P. (2009). Restoring punctuation and casing in english text. In
Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence, AI ’09,
pages 547–556, Berlin, Heidelberg. Springer-Verlag.
Barras, C., Geoffrois, E., Wu, Z., and Liberman, M. (2001). Transcriber: Development and use of
a tool for assisting speech corpora production. Speech Communication, 33(1-2):5 – 22. Speech
Annotation and Corpus Tools.
Batista, F., Amaral, R., Trancoso, I., and Mamede, N. (2008a). Impact of dynamic model adaptation beyond speech recognition. In Proc. of the IEEE Workshop on Spoken Language Technology
(SLT 2008), Goa, India.
Batista, F., Caseiro, D., Mamede, N., and Trancoso, I. (2008b). Recovering capitalization and
punctuation marks for automatic speech recognition: Case study for Portuguese broadcast
news. Speech Communication, 50(10):847–862.
Batista, F., Caseiro, D., Mamede, N. J., and Trancoso, I. (2007a). Recovering punctuation marks
for automatic speech recognition. In Proc. of the 8th Annual Conference of the International Speech
Communication Association (Interspeech 2007), pages 2153 – 2156, Antwerp, Belgium.
Batista, F., Mamede, N., Caseiro, D., and Trancoso, I. (2007b). A lightweight on-the-fly capitalization system for automatic speech recognition. In Proc. of the International Conference on
Recent Advances in Natural Language Processing (RANLP’07).
Batista, F., Mamede, N., and Trancoso, I. (2008c). The impact of language dynamics on the
capitalization of broadcast news. In Proc. of the 9th Annual Conference of the International Speech
Communication Association (Interspeech 2008).
Batista, F., Mamede, N., and Trancoso, I. (2008d). Language dynamics and capitalization using
maximum entropy. In Proc. of 46th Annual Meeting of the Association of Computational Linguistics: Human Language Technologies (ACL-08): HLT, Short Papers, pages 1–4. ACL.
Batista, F., Mamede, N. J., and Trancoso, I. (2008e). Temporal issues and recognition errors on
the capitalization of speech transcriptions. Lecture Notes in Artificial Intelligence, 5246:45–52.
Batista, F., Moniz, H., Trancoso, I., and Mamede, N. J. (2011). Bilingual experiments on automatic recovery of capitalization and punctuation of automatic speech transcripts. IEEE
Transactions on Audio, Speech and Language Processing, Special Issue on New Frontiers in Rich
Transcription. (to be accepted).
Batista, F., Moniz, H., Trancoso, I., Meinedo, H., Mata, A. I., and Mamede, N. J. (2010). Extending the punctuation module for European Portuguese. In Proc. of the 11th Annual Conference
of the International Speech Communication Association (Interspeech 2010), Makuhari, Japan.
Batista, F., Trancoso, I., and Mamede, N. J. (2009a). Automatic recovery of punctuation marks
and capitalization information for iberian languages. In I Joint SIG-IL/Microsoft Workshop on
Speech and Language Technologies for Iberian Languages, pages 99–102, Porto Salvo, Portugal.
Batista, F., Trancoso, I., and Mamede, N. J. (2009b). Comparing automatic rich transcription for
Portuguese, Spanish and English broadcast news. In Proc. of the Automatic Speech Recognition
and Understanding Workshop (ASRU 2009), Merano, Italy.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
Beeferman, D., Berger, A., and Lafferty, J. (1998). Cyberpunc: a lightweight punctuation annotation system for speech. In Proc. of the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP ’98), pages 689–692.
Berger, A. L., Pietra, S. A. D., and Pietra, V. J. D. (1996). A maximum entropy approach to
natural language processing. Computational Linguistics, 22(1):39–71.
Bikel, D. M., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: a high-performance
learning name-finder. In Proceedings of the fifth conference on Applied natural language processing,
pages 194–201, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Blaauw, E. (1995). On the Perceptual Classification of Spontaneous and Read Speech. Research
Institute for Language and Speech.
Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In
Proceedings of the eleventh annual conference on Computational learning theory, COLT’ 98, pages
92–100, New York, NY, USA. ACM.
Boakye, K., Favre, B., and Hakkani-Tür, D. (2009). Any Questions? Automatic Question Detection in Meetings. In Proc. of the Automatic Speech Recognition and Understanding Workshop
(ASRU 2009), Merano, Italy.
Brown, E. and Coden, A. (2002). Capitalization recovery for text. Information Retrieval Techniques
for Speech Applications, pages 11–22.
Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., and Lai, J. C. (1992). An estimate of
an upper bound for the entropy of english. Computational Linguistics, 18(1):31–40.
Campione, E. and Véronis, J. (2002). A large-scale multilingual study of silent pause duration.
In Speech prosody.
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22:249–254.
Cattoni, R., Bertoldi, N., and Federico, M. (2007). Punctuating confusion networks for speech
translation. In Proc. of the 8th Annual Conference of the International Speech Communication Association (Interspeech 2007), pages 2453–2456.
Chelba, C. and Acero, A. (2004). Adaptation of maximum entropy capitalizer: Little data
can help a lot. In Proc. of the Conference on Empirical Methods in Natural Language Processing
(EMNLP ’04).
Chen, C. J. (1999). Speech recognition with automatic punctuation. In Proc. of EUROSPEECH’99, pages 447–450.
Chen, S. F., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., and Zweig, G. (2006).
Advances in speech transcription at IBM under the DARPA EARS program. IEEE Transactions
on Audio, Speech & Language Processing, 14(5):1596–1608.
Christensen, H., Gotoh, Y., and Renals, S. (2001). Punctuation annotation using statistical
prosody models. In Proc. of the ISCA Workshop on Prosody in Speech Recognition and Understanding, pages 35–40.
Clark, H. and Fox Tree, J. (2002). Using uh and um in spontaneous speaking. Cognition, (84).
Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In Proc.
of the Joint SIGDAT Conference on EMNLP.
Cuendet, S., Hakkani-Tur, D., Shriberg, E., Fung, J., and Favre, B. (2007). Cross-genre feature
comparisons for spoken sentence segmentation. International Journal of Semantic Computing,
1(3):335–346.
Daumé III, H. (2004). Notes on CG and LM-BFGS optimization of logistic regression.
http://hal3.name/megam/.
Duarte, I. (2000). Língua Portuguesa, Instrumentos de Análise. Universidade Aberta, Lisboa.
Favre, B., Hakkani-Tür, D., Petrov, S., and Klein, D. (2008). Efficient Sentence Segmentation
Using Syntactic Features. In Spoken Language Technologies (SLT), Goa, India.
Favre, B., Hakkani-Tur, D., and Shriberg, E. (2009). Syntactically-informed Models for Comma
Prediction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), Taipei, Taiwan.
Fisher, B. (1996). The tsylb2 program. National Institute of Standards and Technology Speech.
Frota, S. (2000). Prosody and Focus in European Portuguese. Phonological Phrasing and Intonation.
Garland Publishing, New York.
Furui, S. (2005). 50 years of progress in speech and speaker recognition. In Proc. SPECOM 2005,
pages 1 – 9, Patras, Greece.
Furui, S. and Kawahara, T. (2008). Springer Handbook of Speech Processing, chapter 32 - Transcription and Distillation of Spontaneous Speech. Springer Berlin Heidelberg.
Gotoh, Y. and Renals, S. (2000). Sentence boundary detection in broadcast speech transcripts.
In Proc. of the ISCA Workshop: Automatic Speech Recognition: Challenges for the new Millennium
ASR-2000, pages 228–235.
Gravano, A., Jansche, M., and Bacchiani, M. (2009). Restoring punctuation and capitalization
in transcribed speech. In Proc. of the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP ’09), Taipei, Taiwan.
Gravier, G., Bonastre, J.-F., Geoffrois, E., Galliano, S., Tait, K. M., and Choukri, K. (2004). The
ESTER evaluation campaign for the rich transcription of french broadcast news. In Proc.
LREC 2004.
Harper, M., Dorr, B., Hale, J., Roark, B., Shafran, I., Lease, M., Liu, Y., Snover, M., Yung, L.,
Krasnyanskaya, A., and Stewart, R. (2005). Parsing and spoken structural event detection. In
2005 Johns Hopkins Summer Workshop Final Report.
Heeman, P. and Allen, J. (1999). Speech repairs, intonational phrases and discourse markers:
Modeling speakers’ utterances in spoken dialogue. Computational Linguistics, 25:527–571.
Huang, J. and Zweig, G. (2002). Maximum entropy model for punctuation annotation from
speech. In Proc. of the 7th International Conference on Spoken Language Processing (INTERSPEECH 2002), pages 917 – 920.
Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. (1977). Perplexity – a measure of the
difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62:S63. Supplement 1.
Jones, D., Gibson, E., Shen, W., Granoien, N., Herzog, M., Reynolds, D., and Weinstein, C.
(2005a). Measuring human readability of machine generated text: three case studies in
speech recognition and machine translation. In Proc. of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP ’05), volume 5, pages v/1009–v/1012.
Jones, D., Shen, W., Shriberg, E., Stolcke, A., Kamm, T., and Reynolds, D. (2005b). Two experiments comparing reading with listening for human processing of conversational telephone
speech. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology
(Interspeech 2005), Lisbon, Portugal.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR.
Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR,
second edition.
Kahn, J. G., Ostendorf, M., and Chelba, C. (2004). Parsing conversational speech using enhanced segmentation. In Proc. of HLT/NAACL.
Khare, A. (2006). Joint learning for named entity recognition and capitalization generation.
Master’s thesis, University of Edinburgh.
Kim, J., Schwarm, S. E., and Ostendorf, M. (2004). Detecting structural metadata with decision
trees and transformation-based learning. In Proc. HLT-NAACL, pages 137–144.
Kim, J. and Woodland, P. C. (2001). The use of prosody in a combined system for punctuation
generation and speech recognition. In Proc. of Eurospeech, pages 2757–2760.
Kim, J.-H. and Woodland, P. C. (2003). A combined punctuation generation and speech
recognition system and its performance enhancement using prosody. Speech Communication,
41(4):563 – 577.
Kim, J.-H. and Woodland, P. C. (2004). Automatic capitalisation generation for speech input.
Computer Speech & Language, 18(1):67–90.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit
2005.
Kowal, S. and O’Connell, D. C. (2008). Communicating with One Another: Toward a Psychology
of Spontaneous Spoken Discourse. Cognition and Language: A Series in Psycholinguistics.
Springer New York.
Kubala, F., Schwartz, R., Stone, R., and Weischedel, R. (1998). Named entity extraction from
speech. In Proceedings of DARPA Broadcast News Transcription and Understanding Workshop,
pages 287–292.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals.
Soviet Physics Doklady, 6:707–710. (English translation).
Lita, L. V., Ittycheriah, A., Roukos, S., and Kambhatla, N. (2003). tRuEcasIng. In Proc. of the 41st
annual meeting on ACL, pages 152–159, USA. ACL.
Liu, Y. and Shriberg, E. (2007). Comparing evaluation metrics for sentence boundary detection.
In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP
’07), Honolulu, Hawaii.
Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., and Harper, M. (2006). Enriching
speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE
Transactions on Audio, Speech and Language Processing, 14(5):1526–1540.
Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., Peskin, B., and Harper, M. (2004).
The ICSI-SRI-UW metadata extraction system. In Proc. of INTERSPEECH 2004 - ICSLP - 8th
International Conference on Spoken Language Processing, pages 577–580, Jeju, Korea.
Liu, Y., Shriberg, E., Stolcke, A., Peskin, B., Ang, J., Hillard, D., Ostendorf, M., Tomalin, M.,
Woodland, P., and Harper, M. (2005). Structural metadata research in the EARS program. In
Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP
’05), Philadelphia, USA.
Lu, W. and Ng, H. T. (2010). Better punctuation prediction with dynamic conditional random
fields. In Proceedings of EMNLP10 (The 2010 Conference on Empirical Methods in Natural Language Processing), MIT, Massachusetts.
Makhoul, J., Baron, A., Bulyko, I., Nguyen, L., Ramshaw, L., Stallard, D., Schwartz, R., and
Xiang, B. (2005). The effects of speech recognition and punctuation on information extraction. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology
(Interspeech 2005), pages 57–60, Lisbon, Portugal.
Makhoul, J., Kubala, F., Schwartz, R., and Weischedel, R. (1999). Performance measures for
information extraction. In Proc. of the DARPA Broadcast News Workshop, Herndon, VA.
Manning, C., Prabhakar, R., and Hinrich, S. (2008). Introduction to Information Retrieval. Cambridge University Press, 1 edition.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. (1997). The DET
Curve in Assessment of Detection Task Performance. In Proc. Eurospeech ’97, pages 1895–
1898, Rhodes, Greece.
Martinez, R., da Silva Neto, J. P., and Caseiro, D. A. (2008). Statistical machine translation of
broadcast news from Spanish to Portuguese. In PROPOR 2008 - 8th International Conference
on Computational Processing of the Portuguese Language. Springer.
Martins, C., Teixeira, A., and Neto, J. (2007a). Vocabulary selection for a broadcast news transcription system using a morpho-syntactic approach. In Proc. of the 8th Annual Conference of the
International Speech Communication Association (Interspeech 2007).
Martins, C., Teixeira, A., and Neto, J. P. (2007b). Dynamic language modeling for a daily broadcast news transcription system. In Proc. of the Automatic Speech Recognition and Understanding
Workshop (ASRU 2007).
Mata, A. I. (1999). Para o Estudo da Entoação em Fala Espontânea e Preparada no Português Europeu.
PhD thesis, University of Lisbon.
Mateus, M. H., Brito, A., Duarte, I., Faria, I. H., Frota, S., Matos, G., Oliveira, F., Vigário, M.,
and Villalva, A. (2003). Gramática da Língua Portuguesa. Caminho, Lisbon, Portugal.
Matusov, E., Mauser, A., and Ney, H. (2006). Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language
Translation, pages 158–165, Kyoto, Japan.
McCallum, A., Freitag, D., and Pereira, F. C. N. (2000). Maximum entropy markov models
for information extraction and segmentation. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 591–598, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Medeiros, J. C. (1995). Processamento morfológico e correcção ortográfica do português. Master’s thesis, IST/ UTL, Portugal.
Meinedo, H., Abad, A., Pellegrini, T., Neto, J., and Trancoso, I. (2010). The L2F broadcast news
speech recognition system. In Proc. of the VI Jornadas en Tecnología del Habla and II Iberian
SLTech Workshop (FALA 2010).
Meinedo, H., Caseiro, D., Neto, J. P., and Trancoso, I. (2003). AUDIMUS.media: A broadcast
news speech recognition system for the European Portuguese language. In PROPOR’2003,
volume 2721 of LNCS, pages 9–17. Springer.
Meinedo, H. and Neto, J. P. (2003). Audio segmentation, classification and clustering in a broadcast news task. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’03), Hong Kong, China.
Meinedo, H., Viveiros, M., and Neto, J. (2008). Evaluation of a live broadcast news subtitling
system for Portuguese. In Proc. of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008), Brisbane, Australia.
Mikheev, A. (1999). A knowledge-free method for capitalized word disambiguation. In Proc. of
the 37th annual meeting of the ACL, pages 159–166, Morristown, NJ, USA. ACL.
Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3):289–318.
Moniz, H. (2006). Contributo para a caracterização dos mecanismos de (dis)fluência no Português Europeu. Master’s thesis, University of Lisbon.
Moniz, H., Batista, F., Meinedo, H., Abad, A., Trancoso, I., Mata, A. I., and Mamede, N. (2010).
Prosodically-based automatic segmentation and punctuation. In Proc. of 5th International Conference on Speech Prosody, Chicago, Illinois.
Moniz, H., Batista, F., Trancoso, I., and Mata, A. I. (2011). Toward Autonomous, Adaptive, and
Context-Aware Multimodal Interfaces: Theoretical and Practical Issues, volume 6456 of Lecture
Notes in Computer Science, chapter Analysis of interrogatives in different domains, pages 136–
148. Springer Berlin / Heidelberg, Caserta, Italy, 1st edition.
Mota, C. and Grishman, R. (2008). Is this NE tagger getting old? In ELRA, editor, Proc. of the
LREC’08.
Mota, C. and Grishman, R. (2009). Updating a name tagger using contemporary unlabeled
data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 353–356, Suntec,
Singapore. Association for Computational Linguistics.
Mrozinsk, J., Whittaker, E. W., Chatain, P., and Furui, S. (2006). Automatic sentence segmentation of speech for automatic summarization. In Proc. of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP ’06).
Neto, J., Meinedo, H., Viveiros, M., Cassaca, R., Martins, C., and Caseiro, D. (2008). Broadcast
news subtitling system in Portuguese. In Proc. of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP ’08), pages 1561–1564.
Neto, J. P., Meinedo, H., Amaral, R., and Trancoso, I. (2003). A system for selective dissemination of multimedia information. In Proc. of the ISCA MSDR 2003.
Niu, C., Li, W., Ding, J., and Srihari, R. K. (2004). Orthographic case restoration using supervised learning without manual annotation. International Journal on Artificial Intelligence Tools, 13, part 1:141–156.
Ostendorf, M., Favre, B., Grishman, R., Hakkani-Tür, D., Harper, M., Hillard, D., Hirschberg,
J., Ji, H., Kahn, J. G., Liu, Y., Maskey, S., Matusov, E., Ney, H., Rosenberg, A., Shriberg, E.,
Wang, W., and Wooters, C. (2008). Speech segmentation and spoken document processing.
IEEE Signal Processing Magazine, 25(3):59–69.
Ostendorf, M. and Hillard, D. (2004). Scoring structural MDE: Towards more meaningful error
rates. In Proc. of EARS RT-04 Workshop.
Ostendorf, M., Shriberg, E., and Stolcke, A. (2005). Human language technology: Opportunities
and challenges. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’05), Philadelphia.
Palmer, D. D. and Hearst, M. A. (1994). Adaptive sentence boundary disambiguation. In Proc.
of the fourth conference on Applied natural language processing, pages 78–83, San Francisco, CA,
USA. Morgan Kaufmann Publishers Inc.
Palmer, D. D. and Hearst, M. A. (1997). Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241–267.
Reynar, J. C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence
boundaries. In Proc. of the fifth conference on Applied natural language processing, pages 16–19,
San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Reynolds, D. and Torres-Carrasquillo, P. (2005). Approaches and applications of audio diarization. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP ’05), volume 5, pages 953–956.
Ribeiro, R., Mamede, N. J., and Trancoso, I. (2004). Language Technology for Portuguese: shallow
processing tools and resources, chapter Morpho-syntactic Tagging: a Case Study of Linguistic
Resources Reuse, pages 31–32. Edições Colibri, Lisbon.
Ribeiro, R. and Matos, D. (2007). Extractive summarization of broadcast news: Comparing
strategies for European Portuguese. In Text, Speech and Dialogue, 10th International Conference,
TSD 2007, volume 4629 of Lecture Notes in Computer Science, ISBN 978-3-540-74627-0, pages
115–122. Springer.
Ribeiro, R. and Matos, D. (2008). Mixed-source multi-document speech-to-text summarization.
In MMIES-2: Multi-source, Multilingual Information Extraction and Summarization (COLING
2008), The 22nd International Conference on Computational Linguistics, pages 33–40. Coling
2008 Organizing Committee.
Ribeiro, R., Oliveira, L., and Trancoso, I. (2003). Using Morphossyntactic Information in TTS
Systems: comparing strategies for European Portuguese. In Computational Processing of the
Portuguese Language: 6th International Workshop, PROPOR 2003, pages 26–27. Springer.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the
International Conference on New Methods in Language Processing, Manchester, United Kingdom.
Shieber, S. M. and Tao, X. (2003). Comma restoration using constituency information. In
NAACL ’03: Proc. of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 142–148, Morristown, NJ, USA.
Association for Computational Linguistics.
Shriberg, E. (2005). Spontaneous speech: How people really talk, and why engineers should
care. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology
(Interspeech 2005), pages 1781 – 1784, Lisbon, Portugal.
Shriberg, E., Favre, B., Fung, J., Hakkani-Tur, D., and Cuendet, S. (2009). Prosodic similarities
of dialog act boundaries across speaking styles. Linguistic Patterns in Spontaneous Speech Language and Linguistics Monograph Series, 25:213–239.
Shriberg, E., Stolcke, A., Hakkani-Tür, D., and Tür, G. (2000). Prosody-based automatic segmentation of speech into sentences and topics. Speech Communications, 32(1-2):127–154.
Sjölander, K. and Beskow, J. (2000). Wavesurfer-an open source speech tool. In Sixth International Conference on Spoken Language Processing, pages 464–467.
Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., and Granström, B. (1998). Webbased educational tools for speech technology. In Proc. of ICSLP98, 5th Intl Conference on
Spoken Language Processing, pages 3217–3220, Sydney, Australia.
Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G., and Zweig, G. (2005). The IBM
2004 conversational telephony system for rich transcription. In Proc. of the IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), pages 205 – 208.
Stevenson, M. and Gaizauskas, R. (2000). Experiments on sentence boundary detection. In
Proc. of the sixth conference on Applied natural language processing, pages 84–89, San Francisco,
CA, USA. Morgan Kaufmann Publishers Inc.
Stolcke, A. (2002). SRILM - An extensible language modeling toolkit. In Proc. of the fourth
International Conference on Spoken Language Processing (ICSLP ’02), volume 2, pages 901–904,
Denver, CO.
Stolcke, A. and Shriberg, E. (1996). Automatic linguistic segmentation of conversational speech.
In Proc. of the fourth International Conference on Spoken Language Processing (ICSLP ’96), volume 2, pages 1005–1008, Philadelphia, PA.
Stolcke, A., Shriberg, E., Bates, R., Ostendorf, M., Hakkani, D., Plauche, M., Tur, G., and Lu,
Y. (1998). Automatic detection of sentence boundaries and disfluencies based on recognized
words. In Proc. of the fourth International Conference on Spoken Language Processing (ICSLP ’98),
volume 5, pages 2247–2250, Sydney.
Strassel, S. (2004). Simple Metadata Annotation Specification V6.2. Linguistic Data Consortium.
Strassel, S., Miller, D., Walker, K., and Cieri, C. (2003). Shared resources for robust speech-totext technology. In Eurospeech 2003.
Stüker, S., Fügen, C., Hsiao, R., Ikbal, S., Kraft, F., Paulik, M., Raab, M., Tam, Y.-C., and Wölfel,
M. (2006). The ISL TC-STAR spring 2006 ASR evaluation systems. In Proc. of the TC-STAR
Workshop on Speech-to-Speech Translation, Barcelona, Spain.
Trancoso, I., do Céu Viana, M., Duarte, I., and Matos, G. (1998). Corpus de diálogo CORAL. In
PROPOR’98, Porto Alegre, Brasil.
Trancoso, I., Martins, R., Moniz, H., Mata, A. I., and Viana, M. C. (2008). The Lectra corpus classroom lecture transcriptions in European Portuguese. In LREC 2008 - Language Resources
and Evaluation Conference, Marrakesh, Morocco.
Trancoso, I., Nunes, R., Neves, L., do Céu Viana Ribeiro, M., Moniz, H., Caseiro, D., and
da Silva, A. I. M. (2006). Recognition of classroom lectures in European Portuguese. In Proc.
of the 9th International Conference on Spoken Language Processing (Interspeech 2006 – ICSLP).
Ulusoy, I. and Bishop, C. M. (2005). Generative versus discriminative methods for object recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05) - Volume 2, pages 258–265, Washington, DC, USA.
IEEE Computer Society.
Vaissière, J. (1983). Language-independent prosodic features. In Cutler, A. and Ladd, R., editors,
Prosody: models and measurements, pages 55–66. Berlin: Springer.
Viana, M. C. (1987). Para a Síntese da Entoação do Português. PhD thesis, University of Lisbon.
Viana, M. C., Oliveira, L. C., and Mata, A. I. (2003). Prosodic phrasing: Machine and human
evaluation. International Journal of Speech Technology, 6(1):83–94.
Wang, D. and Narayanan, S. S. (2004). A multi-pass linear fold algorithm for sentence boundary
detection using prosodic cues. In Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP ’04), volume 1, pages 525–528.
Wang, W., Knight, K., and Marcu, D. (2006). Capitalizing machine translation. In HLT-NAACL,
pages 1–8. ACL.
Wichmann, S. (2008). The emerging field of language dynamics. Language and Linguistics Compass, 2(3):442–455.
Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent
restoration in Spanish and French. In Proc. of the 2nd Annual Meeting of the Association for
Computational Linguistics (ACL ’94), pages 88–95.
Zechner, K. (2002). Automatic summarization of open-domain multiparty dialogues in diverse
genres. Computational Linguistics, 28(4):447–485.
Zimmermann, M., Hakkani-Tur, D., Fung, J., Mirghafori, N., Gottlieb, L., Shriberg, E., and
Liu, Y. (2006). The ICSI+ multi-lingual sentence segmentation system. In Proc. of the 9th
International Conference on Spoken Language Processing (Interspeech 2006 – ICSLP), pages 117–
120, Pittsburgh.
Nomenclature
This chapter presents some of the terminology used in this dissertation, including references to the information sources and to other parts of the document where each subject is treated in more detail.

APP Audio Pre-Processing or Audio Segmentation.

ASR Automatic Speech Recognition.

BFGS Broyden-Fletcher-Goldfarb-Shanno, a quasi-Newton method for solving nonlinear optimization problems.

BN Broadcast News.

capitalization Consists of rewriting each word of an input text with its proper case information given its context.

CART Classification and Regression Tree.

CRF Conditional Random Field.

CTS Conversational Telephone Speech.

DTD Data Type Definition.

EARS Effective, Affordable, Reusable Speech-to-Text.

Edit distance See Levenshtein distance.

EP European Portuguese.

GMM Gaussian Mixture Model.

HMM Hidden Markov Model. A hidden Markov model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be considered as the simplest dynamic Bayesian network.

L-BFGS Limited-memory BFGS. The L-BFGS algorithm is a member of the broad family of quasi-Newton optimization methods; it uses a limited-memory variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.

Language Dynamics Every day, new words are introduced and the usage of others decays with time, despite the fact that most of the words and constructions of a human language are kept in use for many years or never change. Language dynamics correspond to these language variations in time.

LDC Linguistic Data Consortium.

Levenshtein distance The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. The term edit distance is often used to refer specifically to Levenshtein distance.

LM Language Model.

MDE Metadata Extraction.

MEMM Maximum Entropy Markov Model.

NIST National Institute of Standards and Technology.

NLP Natural Language Processing.

NER Named Entity Recognition.

RT Rich Transcription.

SER Slot Error Rate.

SNOR Speech Normalized Orthographical Representation, a standard speech recognition system output format. It does not contain any punctuation marks, all numbers are spelled as words, and all information is represented in single case, which means that no capitalization information is provided. In other words, a SNOR-normalized transcription consists of text strings made up of ASCII characters and has the following constraints: (1) whitespace separates words for languages that use words; (2) the text is case-insensitive (usually in all upper case); (3) no punctuation is included except apostrophes for contractions; (4) previously hyphenated words are divided into their constituent parts separated by whitespace.

STT Speech To Text.

SU Sentence-like Unit.

SVM Support Vector Machines.

truecasing (See capitalization.) Consists of rewriting each word of an input text with its proper case information given its context.

TTS Text To Speech.

WER Word Error Rate.

WSJ Wall Street Journal.

XML Extensible Markup Language, a general-purpose specification for creating custom markup languages, derived from SGML. XML is playing an important role in the exchange of a wide variety of data on the Web and elsewhere.

Zipf's law An empirical law stating that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. A simple corollary is that while only a few words are used very often, many or most are used rarely.
Appendix A. Portuguese Text Normalization
Normalization is an important stage in the process of using written corpora for building speech models. The normalization tool used in this process was thoroughly revised and has been applied to all Portuguese written corpora. This annex shows a number of expressions, most of them extracted from real text, that were considered when revising the original normalization tool. Most of them did not produce the desired output in the original version.
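As an illustration of the kind of rule the revised tool implements, the following sketch expands simple hour/minute/second patterns (such as those listed in Section A.1) into Portuguese words; it covers only one of the patterns, and the function names are hypothetical.

    import re

    UNITS = ["zero", "um", "dois", "três", "quatro", "cinco", "seis", "sete", "oito", "nove",
             "dez", "onze", "doze", "treze", "catorze", "quinze", "dezasseis", "dezassete",
             "dezoito", "dezanove"]

    def small_number(n):
        # Spell out 0-59 in Portuguese (enough for hours, minutes and seconds).
        if n < 20:
            return UNITS[n]
        tens = {20: "vinte", 30: "trinta", 40: "quarenta", 50: "cinquenta"}
        return tens[n - n % 10] + ("" if n % 10 == 0 else " e " + UNITS[n % 10])

    def expand_time(text):
        # Rewrite expressions like "10h20m10s" or "1h23m13s" as words.
        def repl(m):
            h, mi, s = (int(g) for g in m.groups())
            hours = "uma hora" if h == 1 else small_number(h) + " horas"
            minutes = "um minuto" if mi == 1 else small_number(mi) + " minutos"
            seconds = "um segundo" if s == 1 else small_number(s) + " segundos"
            return hours + " " + minutes + " e " + seconds
        return re.sub(r"\b(\d{1,2})h(\d{1,2})m(\d{1,2})s?\b", repl, text)

    # expand_time("1h23m13s")  ->  "uma hora vinte e três minutos e treze segundos"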
A.1 Date and time

Text → Normalization Result
1h23m13s → uma hora vinte e três minutos e treze segundos
2h23m23,45s → duas horas vinte e três minutos vinte e três vírgula quarenta e cinco segundos
23:59 | 23h59 → vinte e três horas e cinquenta e nove minutos
23:59:00 → vinte e três horas cinquenta e nove minutos e zero segundos
10/07/2010 | 10.07.2010 | 10-07-2010 → dez de Julho de dois mil e dez
23:45:23,2342 → vinte e três horas quarenta e cinco minutos vinte e três vírgula dois três quatro dois segundos
10s | 10seg → dez segundos
10h20m10 | 10h20m10s → dez horas vinte minutos e dez segundos
10h20m10,12s | 10h20m10,12 → dez horas vinte minutos dez vírgula doze segundos
10.21:31,32s → dez horas vinte e um minutos trinta e um vírgula trinta e dois segundos
10m5,1s → dez minutos cinco vírgula um segundos
10m1s | 10m1seg → dez minutos e um segundo
10h | 10H → dez horas
10h-12h → dez horas - doze horas
A.2 Ordinals

Text → Normalization Result
a 3ª vez → a terceira vez
23ª operação → vigésima terceira operação
o 22º peso → o vigésimo segundo peso
1º lugar → primeiro lugar
foi a 1ª → foi a primeira
14551ª → quarta milésima quingentésima quinquagésima primeira
123º → centésimo vigésimo terceiro

A.3 Numbers

Text → Normalization Result
11900 → onze mil e novecentos
11,12 → onze vírgula doze
0,234 → zero vírgula duzentos e trinta e quatro
965042221 → nove seis cinco zero quatro dois dois dois um
213100203 → dois um três um zero zero dois zero três
17km/h → dezassete quilómetros por hora
17 km/h → dezassete quilómetros por hora
17km/h. → dezassete quilómetros por hora .
123.342.122 → cento e vinte e três milhões trezentos e quarenta e dois mil cento e vinte e dois
são 100$00 → são cem escudos
são 1000$00 → são mil escudos
são 100,00$ ou 200$53 → são cem vírgula zero escudos ou duzentos vírgula cinquenta e três escudos
100,0€ → cem euros
1000.000€ → um milhão e zero euros
nos 123Hz de potência → nos cento e vinte e três hertz
(+351) 213133030 → ( mais trezentos e cinquenta e um ) dois um três um três três zero três zero
123,3433 → cento e vinte e três vírgula três mil quatrocentos e trinta e três
A.4 Optional Expressions

Text → Normalization Result
angl.-sax. → anglo-saxónico
mal.-jav. → malaio-javanês
dir. can. → direito canónico
dir. civ. → direito civil
dir. com. → direito comercial
dir. ecles. → direito eclesiástico
dir. rom. → direito romano
hist. nat. → história natural
hist. rel. → história religiosa
v. s. f. f. → volte se faz favor
deriv. regr. → derivação regressiva
m. q. → mesmo que
artº → artigo
art. → artigo
nrº → número
n. → número
drª → doutora
dra → doutora
drº → doutor
srª → senhora
srº → senhor
arqt. → arquitecto
arqtª → arquitecta
engº → engenheiro
v. ex.ª → v. excelência
exmsª → excelentíssimas
exª → excelência
exº → exemplo
dtº → direito
esqº → esquerdo
antº → antónio
stº → santo
pp. → página
prof. → professor
op. cit. → trabalho citado
A.5 Money

Text → Normalization Result
1.100.000$00 → um milhão e cem mil escudos
123.123.123$00 → cento e vinte e três milhões cento e vinte e três mil cento e vinte e três escudos
100$00 → cem escudos
100,00 → cem
100,23€ → cem vírgula vinte e três euros
£100,23 → cem vírgula vinte e três libras
100,23£ → cem vírgula vinte e três libras
1002,22£ → mil e dois vírgula vinte e dois libras
£1002,22 → mil e duas vírgula vinte e duas libras

A.6 Abbreviations

Text → Normalization Result
em Jan. fiz isto → em Janeiro fiz isto
em jan. fiz isto → em Janeiro fiz isto
o log. de um nº é → o logaritmo de um número é
o prof. → o professor
o dr. → o doutor
o Dr. Pedro → o Doutor Pedro
a Dra. Mª da silva → a Doutora Maria da silva
em km/h foram feitos 100 → em quilómetros por hora foram feitos cem
o sp. club → o Sporting club
a 100 km/h. nem pensar → a cem quilómetros por hora . nem pensar
foram 10 l/h → foram dez litros por hora
a profª disse ao prof. → a professora disse ao professor
V.Ex.ª disse que sim? → Vossa Excelência disse que sim ?
A.7 Other

Text → Normalization Result
13 kms / 13kms → treze quilómetros
1km → um quilómetros
1234.212.232km/h → um dois três quatro dois um dois dois três dois quilómetros por hora
angl.-sax. → anglo-saxónico
o Mañolo → o Mañolo
terceiro lugar (29.02 minutos), por uma → terceiro lugar ( vinte e nove ponto zero dois minutos ) , por uma
entram em vigor em 01 de janeiro de 2010 → entram em vigor em um de janeiro de dois mil e dez
passada, para os 432.000, assinalando → passada , para os quatrocentos e trinta e dois mil , assinalando
caíram 22.000 na semana passada fixando-se nos 432.000. → caíram vinte e dois mil na semana passada fixando-se nos quatrocentos e trinta e dois mil .
10 a 15 Km/h. → dez a quinze quilómetros por hora .
pág.20 / pág. 20 → página vinte
121.123.123$00km → cento e vinte e um milhões cento e vinte e três mil cento e vinte e três escudos quilómetros
1231222$00/km → um milhão duzentos e trinta e um mil duzentos e vinte e dois escudos por quilómetro
pág.150 / pag.150 → página cento e cinquenta
Engºs → Engenheiros
o V.de e a Cª L.da → o Visconde e a Companhia Limitada
122342122111212121 → doze milhões duzentos e trinta e quatro mil duzentos e doze dois um um um dois um dois um dois um
12-16 → doze dezasseis
5-7 5.7 5/7 → cinco a sete cinco ponto sete cinco barra sete