Audio, Speech and Music
CS 445/656 Computer & New Media
Topics for Monday & Wednesday
• General Audio
• Speech
• Music
• Music management support
General Audio
• Mapping audio cues to events
– Recognizing sounds related to particular
events (e.g. gunshot, falling, scream)
• Mapping events to audio cues
– Audio debugger to speed up stepping through
code
• Spatialized audio
– Provides additional geographic/navigational
channel
Background Audio in Games
• Immersion
– Most successful computer games have one
important element in common: the ability to draw
players in
– Sense of being “in a game”, where thoughts,
attention and goals are all focused on the game
• Background Audio
– All the sound in a game, including music and sound effects
– Communicates aspects of the narrative, conveys
emotion, and enriches the experience
Background Audio in Games
• How to measure audio immersion?
– Immersion questionnaire
– Psychological instruments
– Behavior during gameplay
– Functional Magnetic Resonance Imaging
(fMRI)
Lair of Beowulf
• The user should be able to navigate a
world conveyed mostly through sound, with a
number of caves and a certain theme
DigiWall
• Computer game interface in the form of a
climbing wall
– https://www.youtube.com/watch?v=mPkp8ziM34M
• In both games, audio is used in ways that
create a sense of presence
– Communicate instructions, cues, clues,
feedback and results from the game
– Use sound to blur the borders between the
virtual reality and the physical reality of the player
Ambience & Sound Effects
• Ambient sounds can be strong carriers of
emotion and mood
– Beowulf, air softly flowing through game world
– DigiWall, used to set basic mood and
encourage physical activity
• Sound effects for cues and clues
– Natural sounds to warn, draw attention, and indicate direction
– https://www.youtube.com/watch?v=LgTTMsjK38
Spatialized Audio
• The projection and localization of sound
sources in physical or virtual space, or a
sound's movement through that space.
• Beamforming
– Timing for constructive interference to create
stronger signal at desired location
• Crosstalk Cancellation
– Destructive interference to remove
parts of signal at desired location
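The beamforming bullet above can be illustrated with a small delay-and-sum sketch. It is only a toy model under stated assumptions: the speaker layout, the target positions, the 1 kHz tone, and the helper names `steering_delays` and `amplitude_at` are all hypothetical, and distance attenuation is ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air

def steering_delays(speakers, target):
    """Delay (seconds) per speaker so that all wavefronts reach `target` together."""
    dists = [math.dist(s, target) for s in speakers]
    farthest = max(dists)
    # Closer speakers wait longer; the farthest speaker fires immediately.
    return [(farthest - d) / SPEED_OF_SOUND for d in dists]

def amplitude_at(point, speakers, delays, freq):
    """Peak amplitude of the summed unit sinusoids at `point` (attenuation ignored)."""
    re = im = 0.0
    for s, delay in zip(speakers, delays):
        arrival = delay + math.dist(s, point) / SPEED_OF_SOUND
        phase = 2 * math.pi * freq * arrival
        re += math.cos(phase)
        im += math.sin(phase)
    return math.hypot(re, im)

# Hypothetical line array: 8 speakers spaced 20 cm apart along the x axis.
speakers = [(i * 0.2, 0.0) for i in range(8)]
target = (0.7, 3.0)       # where the signal should add constructively
off_target = (3.0, 3.0)   # some other listening position

delays = steering_delays(speakers, target)
on = amplitude_at(target, speakers, delays, freq=1000.0)
off = amplitude_at(off_target, speakers, delays, freq=1000.0)
print(f"amplitude at target: {on:.2f}, off target: {off:.2f}")
```

At the target all eight contributions arrive in phase, so the summed amplitude equals the number of speakers. Crosstalk cancellation uses the opposite idea, choosing timing so that contributions cancel at a chosen location.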
Head-Related Transfer Function (HRTF)
• Describes the transformation of sound from
the free field to the ear
– Differences in timing and signal strength between the ears
determine how we identify the position of a sound
• The impulse response from the
source to the ear drum is called
the Head-Related Impulse Response (HRIR),
and its Fourier transform H(f)
is called the HRTF
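The relation between HRIR and HRTF can be sketched in a few lines. The HRIR values below are invented toy data (a pure delay plus a gain per ear), not measurements, and `dft` and `convolve` are illustrative helpers; a real system would use measured impulse responses.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform; fine for a short impulse response."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def convolve(signal, hrir):
    """Filter a mono signal with an impulse response (direct convolution)."""
    out = [0.0] * (len(signal) + len(hrir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(hrir):
            out[i + j] += s * h
    return out

# Hypothetical toy HRIRs for a source on the listener's left:
# the near (left) ear hears it immediately at full strength,
# the far (right) ear hears it 3 samples later and quieter.
hrir_left = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0]

# The HRTF H(f) is the Fourier transform of the HRIR.
hrtf_left = dft(hrir_left)
hrtf_right = dft(hrir_right)

# Rendering a mono signal for headphones: one convolution per ear.
mono = [1.0, 0.5, 0.25]
left_ear = convolve(mono, hrir_left)
right_ear = convolve(mono, hrir_right)
```

For this toy case the right-ear HRTF has magnitude 0.6 at every frequency and a phase that encodes the 3-sample delay, which are exactly the level and timing differences the slide mentions.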
Audio Signal Analysis
• Fast Fourier Transform (FFT) and Discrete Wavelet
Transform (DWT)
– Transforms commonly used on audio signals
– Allow for analysis of frequency features across time (e.g. power
contained in a frequency interval)
– FFTs use equal-sized windows, whereas wavelet windows can vary
with frequency. Both transform the view of the signal from time-based to
frequency-based.
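As a rough illustration of "power contained in a frequency interval", the sketch below implements a small radix-2 FFT and sums the squared magnitudes of the bins that fall inside a band. The sample rate, frame length, test tone and the helper name `band_power` are assumptions chosen for the example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def band_power(frame, sample_rate, f_lo, f_hi):
    """Sum of squared FFT magnitudes for bins with f_lo <= frequency < f_hi."""
    n = len(frame)
    spectrum = fft(frame)
    total = 0.0
    for k in range(n // 2 + 1):          # non-negative frequencies only
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            total += abs(spectrum[k]) ** 2
    return total

# Assumed test signal: one 256-sample frame of a 1 kHz tone at 8 kHz.
sample_rate = 8000
n = 256
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]

low = band_power(tone, sample_rate, 0, 500)
mid = band_power(tone, sample_rate, 750, 1250)
print(low, mid)
```

Because the tone falls exactly on a bin here, essentially all of the power lands in the 750 to 1250 Hz band. Repeating this over successive frames gives the "across time" view.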
Audio Signal Analysis
• Mel-frequency cepstral coefficients (MFCC)
– Based on FFTs
– Maps results into bands approximating human
auditory system
– Natural to use the mel-scale and log amplitude
since it relates to how we perceive sounds
– MFCCs are commonly used
as features in speech recognition systems, such
as the systems which can automatically
recognize numbers spoken into a telephone.
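The MFCC steps above (spectrum, mel-spaced bands, log amplitude) can be sketched as follows. This is a simplified illustration, not a reference implementation: it uses one commonly used mel formula, a naive DFT, triangular filters, and a final DCT-II, and it omits windowing and pre-emphasis. The filter and coefficient counts are arbitrary assumptions.

```python
import cmath
import math

def hz_to_mel(f):
    """One commonly used mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Squared DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=12, num_coeffs=6):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log of the energy in each mel band (small offset avoids log(0)).
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
             for row in bank]
    # DCT-II decorrelates the log band energies into cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_e))
            for c in range(num_coeffs)]

# Assumed input: one 128-sample frame of a 440 Hz tone at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(128)]
coeffs = mfcc(frame, sample_rate)
```

In practice the resulting coefficient vector, computed once per frame, is what gets fed to a recognizer as the feature.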
Echology
• An interactive soundscape
combining human
collaboration with aquarium
activity
• Encourage visitors to spend
more time with (and learn
more about) Beluga whales
– Motion of each layer controls
one channel of sound
• Spatialized sound based on
whale activity and human
interaction
http://www.vanaqua.org/learn/see-and-learn/live-cams/beluga-cam
Echology
• Uses spatial sound as its core expressive
component that participants interact with
– Octophonic spatial sound allows participants to
experience the movement of sound in a plane formed
above their heads
– 8 buttons represent reflection
points at the edge of the
8 loudspeakers
– The movement of Beluga
whales across a layer controls
amplitude and triggers sounds
Echology: Interaction
• 4 full circles represent location and amplitude of a
layer of sound.
– Each circle fades in and out as the level of activity of the
Belugas increases or decreases.
• 8 blue Pac-Man-shaped circles represent reflection points
and the current reflection angle of each speaker.
– By hitting a button, a participant changes the direction of
the reflection angle.
– In the default pattern, each points to its adjacent speaker
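The reflection-point interaction can be modelled as a small routing table, as in the sketch below. The slides do not say how a button press changes the angle, so the rule used here (rotate to the next speaker around the ring) is purely an assumption, as are the names `default_targets`, `press_button` and `trace_path`.

```python
NUM_SPEAKERS = 8

def default_targets():
    """Default pattern: each reflection point aims at its adjacent speaker
    (assumed here to be the next one around the ring)."""
    return [(i + 1) % NUM_SPEAKERS for i in range(NUM_SPEAKERS)]

def press_button(targets, button):
    """Hypothetical rule: a press rotates that reflection point to the next
    speaker around the ring, skipping the speaker it sits on."""
    new = list(targets)
    nxt = (new[button] + 1) % NUM_SPEAKERS
    if nxt == button:
        nxt = (nxt + 1) % NUM_SPEAKERS
    new[button] = nxt
    return new

def trace_path(targets, start, hops):
    """Speakers a sound visits when it is reflected `hops` times from `start`."""
    path = [start]
    for _ in range(hops):
        path.append(targets[path[-1]])
    return path

targets = default_targets()
print(trace_path(targets, 0, 8))    # sound circles the ring once

targets = press_button(targets, 2)  # a participant hits button 2
print(trace_path(targets, 0, 4))    # the path now skips speaker 3
```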
Echology Architecture
Speech
• Speaker segmentation
– Identify when a change in speaker occurs
– Useful for basic indexing or summarization of
speech content
• Speaker identification
– Identify who is speaking during a segment
– Enables search (and other features) based on
speaker
• Speech recognition
– Identify the content of speech
Speaker Segmentation
• Speaker Diarisation
– Partitioning an input audio stream into
homogeneous segments according to
speaker identity
– Bottom-up clustering
• Start by splitting the full audio content into many
small clusters, then progressively merge redundant
clusters until each remaining cluster corresponds to a
real speaker
– Top-down clustering
• Start with a single cluster and split it until the number
of clusters equals the number of speakers
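A minimal sketch of the bottom-up idea: start with one cluster per segment and keep merging the closest pair until no pair is close enough. Real diarisation systems typically compare statistical speaker models rather than raw vectors; the Euclidean centroid distance, the threshold, and the 2-D "features" below are simplifying assumptions.

```python
def agglomerative_cluster(segments, threshold):
    """segments: one feature vector per short audio segment.
    Returns clusters as lists of segment indices."""
    clusters = [[i] for i in range(len(segments))]   # one cluster per segment

    def centroid(cluster):
        dim = len(segments[0])
        return [sum(segments[i][d] for i in cluster) / len(cluster)
                for d in range(dim)]

    def distance(a, b):
        ca, cb = centroid(a), centroid(b)
        return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min((distance(a, b), i, j)
                         for i, a in enumerate(clusters)
                         for j, b in enumerate(clusters) if i < j)
        if dist > threshold:
            break                      # remaining clusters are distinct speakers
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical per-segment features: two speakers, five segments.
segments = [[1.0, 1.1], [5.0, 5.2], [1.2, 0.9], [5.1, 4.9], [0.9, 1.0]]
clusters = agglomerative_cluster(segments, threshold=1.0)
print(clusters)   # segments 0, 2, 4 group together; 1 and 3 group together
```

The top-down variant runs the same loop in reverse: begin with every segment in one cluster and split until a stopping criterion is met.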
Speaker Segmentation
• Open source speaker diarisation software
– ALIZE speaker diarization
– SpkDiarization
– Audioseg
– SHoUT
Speech Recognition
• Start by segmenting utterances and
characterizing phonemes
– Use gaps to segment
– Group segments into words
– Classifiers for limited vocabulary (HMMs)
• Using Viterbi decoding and Baum-Welch re-estimation
• Continuous speech
– Language models for disambiguation
– Speaker dependent or not
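To make the HMM step concrete, here is a small Viterbi decoder that finds the most likely hidden-state sequence for a sequence of observations. The two states, the quantized energy observations and all probabilities are invented for illustration; in a recognizer they would be learned, for example with Baum-Welch re-estimation.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for a discrete HMM."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model: silence vs. voiced frames, observed via energy level.
states = ["sil", "voiced"]
start_p = {"sil": 0.6, "voiced": 0.4}
trans_p = {"sil": {"sil": 0.7, "voiced": 0.3},
           "voiced": {"sil": 0.4, "voiced": 0.6}}
emit_p = {"sil": {"low": 0.5, "mid": 0.4, "high": 0.1},
          "voiced": {"low": 0.1, "mid": 0.3, "high": 0.6}}

path, prob = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
print(path, prob)   # ['sil', 'sil', 'voiced'] with probability of about 0.01512
```

For a limited vocabulary, one such model per word can be scored against the same observations and the best-scoring word chosen.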