Interactive Systems Technical Design

Seminar work: Audio / Speech
Ville-Mikko Rautio
Timo Salminen
Vesa Hyvönen
ISTD 2003, Audio / Speech
Introduction
• Hearing is one of the basic senses humans use to gather
information about their surroundings. Using audio and speech
as an alternative input and output method can therefore
contribute a lot to the user experience in interactive systems
and make it more natural.
Motivation
• When building interactive systems, the user
interface should behave according
to the expectations users have formed
from their experience of the real world.
• User interfaces today are mainly
based on keyboard and screen, and
feedback from the system is given
almost exclusively in visual form.
A much better user experience can be
achieved in computer-based systems by
also offering information through
other basic senses such as hearing,
taste, touch and smell.
Implementation
• Basically two components: Audio playback and
speech/audio recognition.
• Design issues:
• Audio can be speech / non-speech
• For whom are you designing?
• Different users – different abilities
• Blind, elderly and disabled people
• Human diversity – physical, perceptual, cultural and
intellectual differences
• Mobile computing
• Limited input, limited output, slow processor, small memory, limited battery
life, slow network connection
• Communication protocol
• Speech recognition causes major problems
• Accuracy
• Usage in critical systems?
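The accuracy concern above is commonly quantified as word error rate (WER): the word-level edit distance between what was said and what was recognised, divided by the number of words said. The sketch below is an illustration only and is not taken from the seminar material; the function name and the sample utterances are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra recognised word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one word dropped and one word misrecognised.
print(word_error_rate("shut down the lights", "shut the light"))
```

A WER of 0 means every word was recognised correctly. Which error rate is acceptable depends on the application, which is why usage in critical systems is an open question.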
Applications
• MIT Media Lab –
Nomadic Radio: Wearable Audio Computing
• A client-server based messaging infrastructure
• utilizes spatialized audio, speech synthesis and recognition
• delivers hourly news broadcasts, voice mail, email, calendar reminders,
weather forecasts and stock reports
• HP Labs – SpeechBot
• a search engine that uses speech recognition to index audio & video
content hosted on and played from other websites
• http://speechbot.research.compaq.com/
Nomadic Radio Network Architecture (figure)
Strengths / Advantages
• Data input possible
without keyboard.
• Mobile devices
• Excellent for hands-busy /
eyes-busy situations.
Strengths / Advantages
• People with visual or other disabilities
• Natural way for humans to interface with the environment
• Increase the bandwidth of communication
• Devices with limited screen – need for additional output
method
• Technology available now
Limitations / Weaknesses
• Input is error prone, especially in noisy environments
• Vocabulary size in recognition: the set of objects and things that can be controlled is limited
• Communication protocol needed
• “Computer! Shut down the lights!”
• Can lead to an unnatural experience
• How to tell the user what the communication protocol is like:
• Explicit – tell exactly what to say (“Welcome to library, say “XXX” to ...”)
• Implicit – open-ended, potential for errors (“Welcome to library, what would you like to
do….”).
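The difference between the explicit and implicit styles above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from any of the systems mentioned: the command words, their descriptions and all function names are hypothetical, and the recogniser is assumed to have already turned speech into text.

```python
import re

# Hypothetical command vocabulary for the library example.
COMMANDS = {
    "search": "search the catalogue",
    "renew": "renew a loan",
    "hours": "hear opening hours",
}


def explicit_prompt(commands):
    """Explicit style: the prompt tells the user exactly what to say."""
    options = ", ".join(
        f'say "{word}" to {action}' for word, action in commands.items()
    )
    return f"Welcome to library, {options}."


def interpret_explicit(utterance, commands):
    """Accept only an utterance that is exactly one command word."""
    word = utterance.strip().lower()
    return word if word in commands else None


def interpret_implicit(utterance, commands):
    """Open-ended style: spot command words anywhere in free speech.

    Returns None when no command or more than one command is heard,
    which is where the potential for errors comes from.
    """
    words = re.findall(r"[a-z']+", utterance.lower())
    matches = [command for command in commands if command in words]
    return matches[0] if len(matches) == 1 else None


print(explicit_prompt(COMMANDS))
print(interpret_explicit("Renew", COMMANDS))
print(interpret_implicit("I would like to renew my loan, please", COMMANDS))
print(interpret_implicit("search or renew", COMMANDS))
```

The explicit interpreter is easy to get right but forces the user into a fixed phrase. The implicit one feels more natural but has to handle ambiguous or unmatched utterances, for example by asking the user again.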
Limitations / Weaknesses
• Speech output sounds unnatural
• Asymmetrical
• speech input is faster than typing
• speech output is slower than reading
• Feedback & latency
• User needs to know if recognition was successful
• Is the system processing data or waiting for input?
• Time taken to recognise an utterance
• Pauses
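One way to address the feedback questions above is to give each system state its own audio cue, so the user can always tell whether the system is listening, still working, or has failed to recognise the utterance. The sketch below is an assumption-laden illustration: the state names, the cue texts and the one-second threshold before a progress cue is played are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    WAITING_FOR_INPUT = auto()
    PROCESSING = auto()
    RECOGNIZED = auto()
    NOT_RECOGNIZED = auto()


# Hypothetical cues; a real system would play sounds or synthesized speech.
CUES = {
    State.WAITING_FOR_INPUT: "listening tone",
    State.PROCESSING: "working tone",
    State.RECOGNIZED: "confirmation tone",
    State.NOT_RECOGNIZED: "Sorry, I did not understand.",
}


def feedback(state, seconds_in_state=0.0, progress_cue_after=1.0):
    """Return the cue to play for a state, or None to stay silent.

    While processing, the cue is withheld until the assumed threshold
    has passed, so that short recognition delays do not add noise.
    """
    if state is State.PROCESSING and seconds_in_state < progress_cue_after:
        return None
    return CUES[state]


print(feedback(State.WAITING_FOR_INPUT))
print(feedback(State.PROCESSING, seconds_in_state=0.2))
print(feedback(State.PROCESSING, seconds_in_state=1.5))
print(feedback(State.NOT_RECOGNIZED))
```

Keeping the cue table separate from the timing rule makes it easy to tune the threshold against the actual time the recogniser takes per utterance.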
Selected Industrial Players
• IBM
• Conversational Biometrics
• Combines multiple verification sources such as voice biometrics
with spoken knowledge.
• Embedded ViaVoice
• IBM speech technology for mobile devices
• Command and control (C&C)
• Text-to-Speech (TTS)
• Sony
• SDR-4X
• Prototype of entertainment robot using multi-modal human
interaction technology
• Individual person detection by the tone of voice
• Continuous speech recognition and unknown vocabulary
acquisition
• Speech synthesis and singing voice production
SDR-4X (figure)
Selected International Research Groups and Projects
• The MBROLA Project
• Develops a speech engine that synthesizes written text in many
different languages
• Speech Engine core freely available!
• http://tcts.fpms.ac.be/synthesis/mbrola.html
• Stanford University – Interactive Workspaces
• Goal is to create an interactive space where people can work
collaboratively using natural gestures
• http://iwork.stanford.edu/
• Speech Interface Group, MIT Media Laboratory
• Major player, numerous projects
• Example: Nomadic Radio: Wearable Audio Computing
• http://web.media.mit.edu/~nitin/NomadicRadio/
Selected International Research Groups and Projects
• MIT, Project Oxygen
• Pervasive, human-centered computing
• Integrated software system that will reside in the public
domain
• Speech and vision provide the main modes of
interaction in Oxygen.
• Multilingual systems support dialog among
participants speaking different languages.
• The SpeechBuilder utility supports development of
spoken interfaces.
• http://oxygen.lcs.mit.edu/Overview.html
Selected Finnish Research Groups and Projects
• VTT,
Interactive Intelligent Electronics (IIE)
• User interface technologies for future home environments, The
Smart-Its Project, Beyond the GUI, …
• http://www.vtt.fi/ele/projects/iie/index.htm
• Helsinki University of Technology,
Neural Network Research Centre
• Adaptive Natural Language Processing
• http://www.cis.hut.fi/projects/natlang/
• University of Tampere,
Speech-based and Pervasive Interaction Group
• USIX-Interact, Dumas, Mobile User Interfaces, …
• http://www.cs.uta.fi/research/hci/spi/
Companies and Research Groups in Oulu
• MediaTeam Oulu,
Language and Audio Technology
• CBIR – Content Based Information Retrieval
• Filling of the Semantic Gap in the Retrieval of Audio
and Video Recordings
• Multiparametric prosodic analysis of phonetic and
phonological correlates of emotions
• Vikings
• http://www.mediateam.oulu.fi/research/lat/?lang=en
Future Developments
• Multimodality
• Multilingual, natural
speech interaction
• Emotional state
• Biometrics