
Business cases for automatic metadata extraction
Petr Vítek
Czech Television
Aleš Pražák, Jindřich Matoušek, Zdeněk Krňoul, Pavel Ircing
SpeechTech, s.r.o., Pilsen, Czech Republic
University of West Bohemia, Pilsen, Czech Republic
Research projects
• Research project ELJABR – Elimination of the Language Barriers Faced
by the Handicapped Viewers of Czech Television
• in cooperation with the University of West Bohemia in Pilsen, Faculty of
Applied Sciences, Department of Cybernetics
• funded by the Ministry of Education, the Technology Agency of the Czech Republic, and Czech Television
• 2006 – 2011 - ELJABR
• 2011 – 2016 - ELJABR II
• next phase in preparation 2016 – ???
Research areas
• automatic live subtitling
• automatic “clean audio” track creation using TTS technology
• assisted metadata assignment
• signing avatar on TV screen
History (?) of live subtitling
• rapid typing using standard QWERTY keyboard
• up to 100 words per minute
• typists need to take turns after a relatively short period
• Velotype (Veyboard)
• “chord” keyboard - several keys pressed simultaneously,
producing syllables rather than letters
• up to 200 words per minute
• requires intensive training of typists
• keyboards must be custom-made
for each new language - very expensive
History (?) of live subtitling
• Stenotype
• again chord keyboard, simultaneous pressing
of multiple keys produces phonetic transcript
- special software then translates it to orthographic form
• up to 300 words per minute
• extremely long training of typists
- up to 5(!) years
• typists tire out quickly
Automatic live subtitling
• requirements
  • clear professional speech
  • non-overlapping speakers
  • clear acoustic background
  • limited language model domain
[diagram: TV acoustic track → ASR → subtitles]
• cheap (no operating staff)
• in-house low-latency automatic speech recognition (ASR) framework
• possible utilization for TV environment
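The subtitle-composition end of such a low-latency pipeline can be sketched as follows. This is a minimal illustration only: the 37-character line limit (a common teletext constraint) and the word-by-word feed are assumptions, and the actual in-house ASR framework is not public.

```python
# Minimal sketch of composing subtitle lines from a streaming recognizer's
# word output. The 37-character limit and the word-by-word interface are
# illustrative assumptions, not the real framework's API.

from dataclasses import dataclass, field

MAX_CHARS_PER_LINE = 37


@dataclass
class SubtitleComposer:
    """Groups recognized words into subtitle lines of bounded length."""
    lines: list = field(default_factory=list)
    _current: str = ""

    def feed(self, word: str) -> None:
        candidate = (self._current + " " + word).strip()
        if len(candidate) > MAX_CHARS_PER_LINE and self._current:
            self.lines.append(self._current)  # emit the completed line
            self._current = word
        else:
            self._current = candidate

    def flush(self) -> None:
        if self._current:
            self.lines.append(self._current)
            self._current = ""
```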
Automatic subtitling of parliament meetings
• speaker-independent LVCSR – semi-supervised training
• vocabulary updating – names of representatives and current affairs terms
• automatic addition
of punctuation symbols
• problems with speech disorders
• accuracy over 90 %
• since 2010: over 2000 subtitled hours
Live subtitling through re-speaking
• associated development of speech recognition and live subtitling
• very close interaction between re-speaker and LVCSR system – one-man job
• speaker-dependent LVCSR
• automatic speaker adaptation
• semi-supervised training
• four-phase gradual “self-training” system with supervision
• training plan from 2 to 3 months (100 training hours at minimum)
• recognition accuracy, syntactic correctness, semantic correctness
• we have 8 skilled re-speakers
• accuracy over 98 %
Re-speaker tasks
• listens to the original dialogues and dictates to the speech recognition
system
• rectifies or simplifies the speech, if necessary (rephrasing, condensing)
• instantly checks and corrects the resulting subtitles
• inserts punctuation symbols by keyboard
• adds new words to the system vocabulary on the fly during subtitling
• indicates speaker changes – speaker colouring
• handles one to two hours of live subtitling
Distributed subtitling platform
Live subtitling of non-sport TV programmes
• different vocabularies for each domain
(politics, entertainment, newscast)
• automatic vocabulary updating (3 times per day) based on internet
news
• political and economic debates, elections, charity programmes,
award ceremonies, Dancing with the Stars, …
• in 2014: over 500 subtitled hours
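The automatic vocabulary updating from internet news can be sketched as a frequency-filtered scan for out-of-vocabulary word forms. Representing the vocabulary as a plain set of words and the frequency threshold are assumptions; the real system also has to generate pronunciations and refresh the language model.

```python
# Sketch of updating a recognizer vocabulary from crawled news text.
# A plain set of word forms stands in for the real vocabulary; the
# frequency threshold keeps one-off typos out of the additions.

import re
from collections import Counter


def update_vocabulary(vocab: set, news_text: str, min_count: int = 2) -> set:
    """Return vocab extended with frequent out-of-vocabulary words."""
    words = re.findall(r"\w+", news_text.lower())
    counts = Counter(w for w in words if w not in vocab)
    return vocab | {w for w, c in counts.items() if c >= min_count}
```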
Live subtitling of sport TV programmes
• special vocabularies for different sport events
• language model adaptation for each programme
• names of athletes, teams and venues in all grammatical cases
• football and ice-hockey leagues, Ice-hockey World Championship,
FIFA World Cup, Wimbledon, Olympic Games, …
• in 2014: over 750 subtitled hours
Supplementary subtitling services
• automatic daily updating of speech recognizer vocabulary
• language model adaptation for sport programmes
• utilization of programme scripts
• automatic re-timing of subtitles for programme rerun
• assisted offline corrections of live subtitles
• automatic subtitle composition with cut scenes
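The re-timing of subtitles for a rerun could, under simplifying assumptions, work from a word-level alignment of the rerun audio: each cue is re-anchored at the time its first word is spoken. The data shapes below are hypothetical, not the project's actual interfaces.

```python
# Sketch of re-timing subtitle cues for a programme rerun, assuming a
# word-level alignment of the rerun audio is available as (word, seconds)
# pairs. Each cue is re-anchored at the time its first word occurs;
# unmatched cues keep their old timestamp.

def retime(subtitles, alignment):
    """subtitles: list of (text, old_start); alignment: list of (word, time)."""
    retimed = []
    cursor = 0  # alignment is consumed monotonically
    for text, old_start in subtitles:
        first = text.split()[0].lower().strip(".,!?")
        new_start = old_start
        for i in range(cursor, len(alignment)):
            word, t = alignment[i]
            if word.lower() == first:
                new_start = t
                cursor = i + 1
                break
        retimed.append((text, new_start))
    return retimed
```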
Individual solutions
• communication with broadcaster
(Czech Television)
• servers, protocols, control software
• variable subtitle reproduction on screen
• 1-line, multiple line
• pop-on, rolling, word-by-word
• in-house technologies
independent from language
Automatic “clean audio” track creation
• aimed at hearing-impaired viewers of Czech TV
• people who cannot follow the complex audio track of modern TV broadcasting
• seniors, people with minor auditory impairments, …
• input requirements
  • single-voice track + subtitles, or
  • multi-voice track + subtitles + subtitle-to-character assignment
[diagram: subtitles → Text-to-Speech (TTS) → clean audio]
• output
• clean speech-only audio track (no background noise, no music)
• in-house real-time text-to-speech system “subtitles-to-speech”
Clean audio track creation and distribution plan
New software for preparation of multi-voice subtitle files
• support for subtitle-to-character assignment
• automatic TTS-based too-long subtitles detection
• multi-platform web-based framework
• XML-based subtitle format
(support for export to ESUB-XF
and EBU-TT formats)
• support for automatic
voice-to-character assignment
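The TTS-based detection of too-long subtitles can be approximated without synthesizing any audio: estimate the spoken duration from an assumed speaking rate and compare it with the cue's display window. The 15 characters/second rate is an illustrative assumption, not a measured property of the in-house TTS.

```python
# Sketch of "too long" subtitle detection. The real check would measure
# the synthesized audio; here duration is estimated from an assumed
# average speaking rate in characters per second.

CHARS_PER_SECOND = 15.0  # illustrative assumption


def too_long(cues, rate=CHARS_PER_SECOND):
    """cues: list of (text, start_s, end_s); returns cues whose estimated
    spoken duration exceeds the on-screen display window."""
    return [
        (text, start, end)
        for text, start, end in cues
        if len(text) / rate > (end - start)
    ]
```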
Case study
• tested on the Czech series “Hraběnky” (“Countesses”)
• very complex audio track, full of background effects and music
• very “live”, dynamic dialogues with emotions
• clean audio track created during digitization of the series
• subtitle-to-character assignment added manually
• voice-to-character assignment added automatically
• four synthetic voices (two male, two female) used
• almost 8.5 hours of speech (56k words) synthesised from Czech TV subtitles
• clean audio track ready to be broadcast
Assisted metadata assignment
• automatic topic detection methods are used to suggest predefined
keywords for individual contributions of the main bulletin (news)
• the final decision about the actual assignments is left to archivists
• need for standardized topic list/hierarchy
• EBUContentGenre
• designed for categorizing whole programmes; finer granularity is required
• SubjectCode defined by International Press Telecommunication Council (IPTC)
– see http://cv.iptc.org/newscodes/subjectcode/
• any other suggestions?
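A minimal form of such keyword suggestion could score each predefined topic by cue-word overlap with the contribution's transcript. The cue-word dictionaries below are hypothetical, and the real system uses proper topic-detection methods; the ranking only suggests, the archivist decides.

```python
# Sketch of suggesting predefined topics for one news contribution.
# Each topic carries a hypothetical bag of cue words; topics are ranked
# by cue-word counts in the transcript.

import re
from collections import Counter


def suggest_topics(transcript: str, topic_cues: dict, top_n: int = 3):
    """topic_cues: {topic_label: set of cue words}; returns ranked labels."""
    words = Counter(re.findall(r"\w+", transcript.lower()))
    scores = {t: sum(words[w] for w in cues) for t, cues in topic_cues.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [t for t in ranked if scores[t] > 0][:top_n]
```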
Signing avatar on TV screen
• project deals with control protocol for signing avatar on TV screen
• assumes translation to sign language (SL) by a human instead of machine translation, which is still not accurate enough
• broadcast using DVB data service (e.g. teletext) or as hybrid broadcast service
(e.g. HbbTV)
• two concepts of data broadcasting are considered:

Concept | Data broadcasting          | Active SL interpreting | Translation from text to SL notation | Transmission [kbps]
1       | Text + SL translation data | NO                     | YES                                  | 1-2
2       | Motion capture data        | YES                    | NO                                   | 250-500
Text + SL translation data
• Translation to SL by a human using an extended notation system
• e.g. HamNoSys or SignWriting notation system extended with additional SL translation data
• translation of a broadcast using a proprietary tool and a human operator with knowledge of SL
• Broadcasting only “text” information
• Conversion of the notation to the signing avatar takes place in the TV receiver
• puts higher computational demands on the TV receiver
• conversion algorithm is known
• however, lower acceptance by the deaf due to the unnaturalness of the signing avatar
[diagram: text (subtitles) → translation to SL by human operator → 3D avatar animation in the TV receiver]
• Experiments with a state-of-the-art mocap system:
• Vicon (body motion) + Vicon Cara (facial expression) + CyberGlove (hand shapes) are available @UWB
• achieves the best possible accuracy (naturalness & intelligibility)
• high potential to use the data for training of the system
• For running service we propose:
• capturing system with two RGBD cameras, face camera and data gloves
• conversion to skeleton data also on the side of the TV broadcaster
• broadcasting the skeleton data directly to TV
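As a back-of-the-envelope check of the transmission figures: with purely hypothetical encoding parameters (100 markers, 3 coordinates each, 32 bits per value, 30 fps) a raw skeleton stream lands inside the 250-500 kbps range quoted in the table.

```python
# Back-of-the-envelope bitrate estimate for broadcasting skeleton data.
# All parameters are hypothetical illustration, not the project's actual
# encoding.

def skeleton_bitrate_kbps(markers=100, coords=3, bits_per_value=32, fps=30):
    """Raw (uncompressed) skeleton stream bitrate in kbps."""
    return markers * coords * bits_per_value * fps / 1000.0
```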
[diagram of the proposed capture setup: (1) control unit, (2,3) RGBD sensors, (4) face camera, (6,7) data gloves, (13) set-top box]