Audio, Speech and Music
CS 445/656 Computer & New Media

Topics for Monday & Wednesday
• General Audio
• Speech
• Music
• Music management support

General Audio
• Mapping audio cues to events
  – Recognizing sounds related to particular events (e.g. gunshot, falling, scream)
• Mapping events to audio cues
  – Audio debugger to speed up stepping through code
• Spatialized audio
  – Provides an additional geographic/navigational channel

Background Audio in Games
• Immersion
  – Most successful computer games have one important element in common: the ability to draw players in
  – Sense of being "in a game", where thoughts, attention and goals are all focused on the game
• Background Audio
  – All the sound, including music and sound effects
  – Communicates aspects of the narrative, conveys emotion, and enriches the experience

Background Audio in Games
• How to measure audio immersion?
  – Immersion questionnaire
  – Psychological instruments
  – Behavior during gameplay
  – Functional Magnetic Resonance Imaging (fMRI)

Lair of Beowulf
• The user should be able to navigate a mostly sound-based world with a number of caves and a certain theme

DigiWall
• Computer game interface in the form of a climbing wall
  – https://www.youtube.com/watch?v=mPkp8ziM34M
• In both games, audio is used in ways that create a sense of presence
  – Communicate instructions, cues, clues, feedback and results from the game
  – Use sound to blur the borders between virtual reality and the physical reality of the player

Ambience & Sound Effects
• Ambient sounds can be strong carriers of emotion and mood
  – Beowulf: air softly flowing through the game world
  – DigiWall: used to set the basic mood and encourage physical activity
• Sound effects for cues and clues
  – Natural sounds to warn, draw attention, and indicate direction
  – https://www.youtube.com/watch?v=LgTTMsjK38

Spatialized Audio
• The projection and localization of sound sources in physical or virtual space, or a sound's movement through space
• Beamforming
  – Timing for constructive interference to create a stronger signal at the desired location
• Crosstalk Cancellation
  – Destructive interference to remove parts of the signal at the desired location

Head-Related Transfer Function (HRTF)
• Describes the transformation of sound from the free field to the ear
  – Differences in timing and signal strength determine how we identify the position of a sound
• The impulse response from the source to the eardrum is called the Head-Related Impulse Response (HRIR), and its Fourier transform H(f) is called the HRTF

Audio Signal Analysis
• Fast Fourier Transform (FFT) and Discrete Wavelet Transform (DWT)
  – Transforms commonly used on audio signals
  – Allow for analysis of frequency features across time (e.g. power contained in a frequency interval)
  – FFT-based analysis uses equal-sized windows, whereas wavelet windows can vary with frequency
  – Transform the view of the signal from the time domain to the frequency domain

Audio Signal Analysis
• Mel-frequency cepstral coefficients (MFCC)
  – Based on FFTs
  – Map results into bands approximating the human auditory system
  – Natural to use the mel scale and log amplitude, since they relate to how we perceive sounds
  – MFCCs are commonly used as features in speech recognition systems, such as systems that automatically recognize numbers spoken into a telephone
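
The "power contained in a frequency interval" idea from the Audio Signal Analysis slides can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it uses a naive DFT (an FFT computes the same values faster), the test signal and the names `dft`, `band_power` and `hz_to_mel` are hypothetical, and the mel formula shown is one common variant among several.

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) discrete Fourier transform; an FFT gives the same result faster."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def band_power(samples, sample_rate, f_lo, f_hi):
    """Sum of squared DFT magnitudes for bins whose frequency lies in [f_lo, f_hi]."""
    n = len(samples)
    spectrum = dft(samples)
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            total += abs(spectrum[k]) ** 2
    return total


def hz_to_mel(freq_hz):
    """One common mel-scale formula; other variants exist (assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Hypothetical test signal: a 100 Hz sine sampled at 800 Hz for 64 samples.
SAMPLE_RATE = 800
signal = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(64)]

low = band_power(signal, SAMPLE_RATE, 50, 150)    # band containing the tone
high = band_power(signal, SAMPLE_RATE, 200, 400)  # band without the tone
print(f"power 50-150 Hz: {low:.1f}, power 200-400 Hz: {high:.6f}")
print(f"1000 Hz is about {hz_to_mel(1000):.0f} mel")
```

Almost all of the energy lands in the band that contains the tone. An MFCC pipeline would go further by grouping such bin powers into mel-spaced bands, taking the log, and applying a discrete cosine transform.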
Echology
• An interactive soundscape combining human collaboration with aquarium activity
• Engages visitors to spend more time with (and learn more about) Beluga whales
  – Motion of each layer controls one channel of sound
• Spatialized sound based on whale activity and human interaction
  – http://www.vanaqua.org/learn/see-and-learn/live-cams/beluga-cam

Echology
• Uses spatial sound as the core expressive component that participants interact with
  – Octophonic spatial sound allows participants to experience the movement of sound in a plane formed above their heads
  – 8 buttons represent reflection points on the edge of the 8 loudspeakers
  – The movement of Beluga whales across a layer controls amplitude and triggers sounds

Echology: Interaction
• 4 full circles represent the location and amplitude of a layer of sound
  – Each circle fades in and out as the level of activity of the Belugas increases or decreases
• 8 blue Pac-Man circles represent reflection points and the current reflection angle of each speaker
  – By hitting a button, a participant changes the direction of the reflection angle
  – The default pattern has each speaker pointing to its adjacent speaker

Echology Architecture

Speech
• Speaker segmentation
  – Identify when a change in speaker occurs
  – Useful for basic indexing or summarization of speech content
• Speaker identification
  – Identify who is speaking during a segment
  – Enables search (and other features) based on speaker
• Speech recognition
  – Identify the content of speech

Speaker Segmentation
• Speaker Diarisation
  – Partitioning an input audio stream into homogeneous segments according to speaker identity
  – Bottom-up clustering
    • Start by splitting the full audio content into many small clusters, then progressively merge redundant clusters until each remaining cluster corresponds to a real speaker
  – Top-down clustering
    • Start with a single cluster and split it until the number of clusters equals the number of speakers

Speaker Segmentation
• Open source speaker diarisation software
  – ALIZE speaker diarization
  – SpkDiarization
  – Audioseg
  – SHoUT

Speech Recognition
• Start by segmenting utterances and characterizing phonemes
  – Use gaps to segment
  – Group segments into words
  – Classifiers for limited vocabulary (HMMs)
    • Using Viterbi decoding and Baum-Welch re-estimation
• Continuous speech
  – Language models for disambiguation
  – Speaker dependent or not
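
The HMM classifiers mentioned on the Speech Recognition slide rely on the Viterbi algorithm to find the most likely hidden state sequence for a series of observations. The sketch below shows that step on a toy model. The two states, the "quiet"/"loud" observations and all probabilities are invented for illustration and do not come from the slides; a real recognizer would use phoneme-level states, acoustic feature vectors, and log probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][obs], r) for r in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Hypothetical toy model: is each frame silence or speech?
STATES = ("silence", "speech")
START_P = {"silence": 0.6, "speech": 0.4}
TRANS_P = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
EMIT_P = {
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.2, "loud": 0.8},
}

prob, path = viterbi(["quiet", "loud", "loud"], STATES, START_P, TRANS_P, EMIT_P)
print(path, prob)
```

Baum-Welch re-estimation, the other technique named on the slide, would be the training step that learns tables like `TRANS_P` and `EMIT_P` from data; here they are simply fixed by hand.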