IMPLEMENTATION AND EVALUATION OF AUDITORY MODELS

Master Thesis
Electrical Engineering
Thesis no:
December 2015
IMPLEMENTATION AND EVALUATION OF AUDITORY
MODELS FOR HUMAN ECHOLOCATION
VIJAY KIRAN GIDLA
Department of Applied Signal Processing
Blekinge Institute of Technology
37179 Karlskrona
Sweden
This thesis is submitted to the Department of Applied Signal Processing at Blekinge Institute of
Technology in partial fulfillment of the requirements for the degree of Master of Science in Electrical
Engineering.
Contact Information
Author:
VIJAY KIRAN GIDLA
E-mail: [email protected]
University advisor:
Docent BO SCHENKMAN
Blekinge Institute of Technology
Department of Applied Signal Processing
Blekinge Institute of Technology
371 79 Karlskrona, Sweden
Internet: www.bth.se/ing
Phone: +46 455 385000
Abstract
Blind people use echoes to detect objects and to find their way, an ability known as human echolocation. Previous research has identified some of the conditions that favour the detection of an object, but many factors remain to be analyzed and quantified. Studies have also shown that blind people echolocate more efficiently than sighted people, with performance varying among individuals. This has motivated research in human echolocation to move in a new direction, towards a fuller understanding of the superior detection ability of the blind. Psychoacoustic experiments alone cannot determine whether the superior echo detection of blind listeners should be attributed to perceptual or physiological causes. Along with the perceptual results, it is vital to know how the sounds are processed in the auditory system. Hearing research has led to the development of several auditory models that combine physiological and psychological results with signal analysis methods. These models try to describe how the auditory system processes signals. Hence, to analyze how sounds are processed to support the superior detection of the blind, auditory models available in the literature were used in this thesis. The results suggest that repetition pitch is useful at shorter distances and is determined from the peaks in the temporal profile of the autocorrelation function computed on the neural activity pattern. The loudness attribute also provides information that listeners can use to echolocate at shorter distances. At longer distances, timbre aspects such as sharpness might be used by listeners to detect objects. It was also found that the repetition pitch, loudness and sharpness attributes in turn depend on the room acoustics and the type of stimuli used. These results show the fruitfulness of combining results from different disciplines through the mathematical framework given by signal analysis.
Keywords: Human echolocation, Psychoacoustics, Physiology, Signal analysis, Auditory models.
Acknowledgment
Firstly, I would like to express my sincere gratitude to my advisor, Docent Bo Schenkman, who supported me throughout my master thesis. I would not have been able to complete my thesis without his support, patience and motivation. His guidance helped me to think critically about the results of the analysis in this thesis, and his valuable comments during the writing helped me to organize my analysis into a good framework. I am truly grateful to have had such an advisor for my master thesis.
Besides my advisor, I would like to thank my examiner, Sven Johansson, who was patient and cooperative with the submission of my thesis. I would also like to thank Professor Brian C. J. Moore and Professor Jan Schnupp for allowing me to use the figures from their books in my thesis.
My sincere thanks also go to my senior Abel Gladstone Mangam, who recommended me to my advisor for research in human echolocation. I thank the staff at the BTH library and IT help desk, who were very supportive in providing me with the literature and software I needed for my thesis. Last but not least, I would like to thank my parents, who supported me throughout my thesis.
Contents

Abstract  i
Contents  iii
List of Figures  v
List of Tables  vii
Abbreviations  x

1 Introduction  1

2 Physiology and Perception  4
2.1 Physiology of hearing  4
2.1.1 Auditory periphery  4
2.1.2 Central auditory nervous system  7
2.2 Perception  8
2.2.1 Loudness  8
2.2.2 Pitch  9
2.2.3 Timbre  11

3 Room acoustics  12
3.1 Review of studies analyzing acoustic signals  12
3.2 Sound recordings  13
3.3 Signal analysis  14
3.3.1 Sound Pressure Level (SPL)  14
3.3.2 Autocorrelation Function (ACF)  15
3.3.3 Spectral Centroid (SC)  16

4 Auditory models  22
4.1 Description of the auditory image model  22
4.1.1 Pre Cochlear Processing (PCP)  22
4.1.2 Basilar Membrane Motion (BMM)  23
4.1.3 Neural Activity Pattern (NAP)  23
4.1.4 Strobe Temporal Integration (STI)  24
4.1.5 Autocorrelation Function (ACF)  25
4.2 Auditory analysis  25
4.2.1 Loudness analysis  25
4.2.2 Autocorrelation analysis for pitch perception  29
4.2.3 Sharpness analysis for timbre perception  41

5 Analysis of the perceptual results  43
5.1 Description of the non parametric modeling  43
5.2 Analysis  44
5.2.1 Distance  44
5.2.2 Loudness  46
5.2.3 Pitch  46
5.2.4 Sharpness  47

6 Discussion  49
6.1 Echolocation and loudness  49
6.2 Echolocation and pitch  50
6.3 Echolocation and sharpness  50
6.4 Echolocation and room acoustics  51
6.5 Echolocation and binaural information  51
6.6 Advantages or disadvantages of the auditory model approach to human echolocation  52
6.7 Theoretical implications of thesis  53

7 General Conclusion  54
7.1 Conclusions  54
7.2 Future work  55

Bibliography  56

Appendices  60

A Room acoustics  60
A.1 Calibration Constant  60
A.2 Sound Pressure Level  61
A.3 Spectral Centroid  65

B Auditory models  73
B.1 Loudness  73
B.2 Sharpness  76
B.3 Pitch strength using strobe temporal integration  79
List of Figures

2.1 Anatomy of the human ear.  4
2.2 Cochlea unrolled, in cross section.  5
2.3 Cross section of the cochlea, and the schematic view of the organ of Corti.  6
2.4 An illustration of the most important pathways and nuclei from the ear to the auditory cortex.  7
2.5 Basic structure of the models used for the calculation of loudness.  8
2.6 A simulation of the basilar membrane motion for a 200 Hz sinusoid.  9
2.7 A simulation of the basilar membrane motion for a 500 ms iterated ripple noise with gain = 1, delay = 10 ms and number of iterations = 2.  10
3.1 Sound recordings made in the anechoic, conference and lecture rooms.  13
3.2 The autocorrelation function of a 5 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm.  17
3.3 The autocorrelation function of a 500 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm.  17
3.4 The autocorrelation function of a 5 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm.  18
3.5 The autocorrelation function of a 500 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm.  18
3.6 The autocorrelation function of a 5 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm.  19
3.7 The autocorrelation function of a 500 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm.  19
3.8 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500 ms recording in the anechoic chamber (Experiment 1).  20
3.9 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500 ms recording in the conference room (Experiment 1).  21
3.10 The mean of the spectral centroid for the 10 versions as a function of time of the left ear 500 ms recording in the lecture room (Experiment 2).  21
4.1 The frequency response used to design the gm2002 filter of the PCP module in the AIM.  22
4.2 The NAP of a 200 Hz signal in the 1209 Hz frequency channel.  24
4.3 The Dual profile of a 5 ms signal recorded in the anechoic room (Experiment 1).  31
4.4 The Dual profile of a 5 ms signal recorded in the conference room (Experiment 1).  32
4.5 The Dual profile of a 5 ms signal recorded in the lecture room (Experiment 2).  33
4.6 The Dual profile of a 50 ms signal recorded in the anechoic room (Experiment 1).  34
4.7 The Dual profile of a 50 ms signal recorded in the conference room (Experiment 1).  35
4.8 The Dual profile of a 500 ms signal recorded in the anechoic room (Experiment 1).  36
4.9 The Dual profile of a 500 ms signal recorded in the conference room (Experiment 1).  37
4.10 The Dual profile of a 500 ms signal recorded in the lecture room (Experiment 2).  38
4.11 An example to illustrate the pitch strength measure computed using the pitch strength module of the AIM.  39
5.1 The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 5 ms recordings in the anechoic chamber. (b) For the 5 ms recordings in the conference room.  44
5.2 The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 50 ms recordings in the anechoic chamber. (b) For the 50 ms recordings in the conference room.  44
5.3 The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 500 ms recordings in the anechoic chamber. (b) For the 500 ms recordings in the conference room.  45
A.1 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5 ms recording in the anechoic chamber (Experiment 1).  65
A.2 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5 ms recording in the anechoic chamber (Experiment 1).  65
A.3 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5 ms recording in the conference room (Experiment 1).  66
A.4 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5 ms recording in the conference room (Experiment 1).  66
A.5 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5 ms recording in the lecture room (Experiment 2).  67
A.6 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5 ms recording in the lecture room (Experiment 2).  67
A.7 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50 ms recording in the anechoic chamber (Experiment 1).  68
A.8 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50 ms recording in the anechoic chamber (Experiment 1).  68
A.9 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50 ms recording in the conference room (Experiment 1).  69
A.10 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50 ms recording in the conference room (Experiment 1).  69
A.11 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500 ms recording in the anechoic chamber (Experiment 1).  70
A.12 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500 ms recording in the anechoic chamber (Experiment 1).  70
A.13 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500 ms recording in the conference room (Experiment 1).  71
A.14 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500 ms recording in the conference room (Experiment 1).  71
A.15 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500 ms recording in the lecture room (Experiment 2).  72
A.16 The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500 ms recording in the lecture room (Experiment 2).  72
B.1 The temporal profiles of the stabilised auditory image for a 500 ms signal recorded in the conference room (Experiment 1).  79
List of Tables

3.1 Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the anechoic and conference rooms of Experiment 1.  15
3.2 Mean of the sound pressure level (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the lecture room of Experiment 2.  15
4.1 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.  28
4.2 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.  28
4.3 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.  29
4.4 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.  40
4.5 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.  40
4.6 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.  40
4.7 Mean over the 10 versions of the mean of the median sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.  41
4.8 Mean over the 10 versions of the mean of the median sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.  42
4.9 Mean over the 10 versions of the mean of the median sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.  42
5.1 Detection thresholds of object distance (cm) for duration, room, and listener groups.  46
5.2 Threshold values of loudness (sones) for duration, room, and listener groups.  46
5.3 Threshold values of the pitch strength (autocorrelation index) for duration, room, and listener groups.  47
5.4 Threshold values of the mean of the mean of median sharpness (acums) for duration, room, and listener groups.  48
A.1 Calibrated levels with and without A-weighting.  60
A.2 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 5 ms duration signal.  61
A.3 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 5 ms duration signal.  61
A.4 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with the 5 ms duration signal.  61
A.5 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with the 5 ms duration signal.  61
A.6 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 5 ms duration signal.  62
A.7 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 5 ms duration signal.  62
A.8 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 50 ms duration signal.  62
A.9 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 50 ms duration signal.  62
A.10 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 50 ms duration signal.  63
A.11 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 50 ms duration signal.  63
A.12 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 500 ms duration signal.  63
A.13 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 500 ms duration signal.  63
A.14 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 500 ms duration signal.  64
A.15 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 500 ms duration signal.  64
A.16 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with the 500 ms duration signal.  64
A.17 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with the 500 ms duration signal.  64
B.1 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 5 ms duration signal.  73
B.2 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 50 ms duration signal.  73
B.3 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 500 ms duration signal.  73
B.4 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 5 ms duration signal.  74
B.5 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 50 ms duration signal.  74
B.6 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 500 ms duration signal.  74
B.7 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration signal.  74
B.8 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration, 32 clicks signal.  74
B.9 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration, 64 clicks signal.  75
B.10 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 500 ms duration signal.  75
B.11 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 5 ms duration signal.  76
B.12 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 50 ms duration signal.  76
B.13 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 500 ms duration signal.  76
B.14 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 5 ms duration signal.  77
B.15 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 50 ms duration signal.  77
B.16 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 500 ms duration signal.  77
B.17 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration signal.  78
B.18 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration, 32 clicks signal.  78
B.19 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration, 64 clicks signal.  78
B.20 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 500 ms duration signal.  78
Abbreviations

ACF      Autocorrelation Function
AIM-MAT  Auditory Image Model in Matlab
AIM      Auditory Image Model
BMM      Basilar Membrane Motion
CC       Calibrated Constant
ELC      Equal Loudness Contour
ERB      Equivalent Rectangular Bandwidth
FIR      Finite Impulse Response
GLM      Generalized Linear Models
H-C-L    Halfwave rectification, Compression, Lowpass filtering
H-L      Halfwave rectification, Lowpass filtering
HP-AF    High Pass Asymmetric Function
ILD      Interaural Level Difference
IRN      Iterated Rippled Noise
ITD      Interaural Time Difference
MAF      Minimum Audible Field
MAP      Minimum Audible Pressure
NAP      Neural Activity Pattern
PCP      Pre Cochlear Processing
PS       Pitch Strength
RMS      Root Mean Square
RP       Repetition Pitch
SAI      Stabilized Auditory Image
SF       Strobe Finding
SPL      Sound Pressure Level
STI      Strobe Temporal Integration
TI       Temporal Integration
autocorr Autocorrelation Module in the Auditory Image Model
cGC      Compressive Gamma Chirp
dcGC     Dynamic Compressive Gamma Chirp
fMRI     Functional Magnetic Resonance Imaging
gm2002   Glasberg and Moore 2002
pGC      Passive Gamma Chirp
sf2003   Strobe Finding 2003
ti2003   Temporal Integration 2003
Chapter 1
Introduction
Human echolocation, formerly known as “facial vision” or “obstacle sense”, is the ability of blind people to detect objects in their environment, audition being the sensory basis for this ability (Dallenbach and Supa, 1944; Dallenbach and Cotzin, 1950). A blind person may use his or her self-generated sounds, e.g. the voice, but it is also common to use sounds generated by mechanical means such as shoes, a cane, or a device like a clicker to detect an object (Schenkman and Nilsson, 2010). Different factors influence this ability, and researchers over the years have performed various experiments to understand it.
The discriminating power of this ability was studied first, and it was found that both blind and sighted listeners could detect and discriminate objects (Kellogg, 1962; Köhler, 1964; Rice, Feinstein, and Schusterman, 1965; as cited in Arias and Ramos, 1997). Later, the effect of various factors influencing the echolocation ability of the blind was studied by e.g. Schenkman (1985), who concluded that self-made vocalizations and clicks were the most effective echolocation signals and that an auditory analysis similar to the autocorrelation function (Bilsen and Ritsma, 1969; Yost, 1996) could represent the underlying psychophysical mechanism.
The influence of the precedence effect on human echolocation was investigated by Seki, Ifukube, and Tanaka (1994), who performed a localization task in the vertical plane and found that the blind were more resistant to the precedence effect, with performance accuracy decreasing as the distance to the (reflected) sound source decreased. Studies have also examined the influence of exploratory movements on echolocation; it was found that, for some distances, participants were somewhat more accurate when moving than when stationary (Miura et al, 2008). Later studies by Rowan et al (2013) and Wallmeier, Geßele, and Wiegrebe (2013) also showed that binaural information is useful for locating objects when echolocating.
Experiments have also been done to find the environmental conditions and the types of signals that favour echolocation. Schenkman and Nilsson (2010) analysed the effect of reverberation on the performance of the blind by using signals recorded in an anechoic chamber and a conference room. They found that the blind performed better at longer distances in the latter case. However, Kolarik et al (2014) note that the reverberation time in the study of Schenkman and Nilsson (2010) was rather low (T60 = 0.4 s), and it is possible that longer reverberation times would lead to impaired rather than improved performance. The effects of reverberation time on echolocation performance have yet to be quantified.
Regarding the types of signals favourable for echolocation, Rojas et al (2009, 2010) suggested that short sounds generated at the palate are the most effective, whereas Schenkman and Nilsson (2010) reported that longer duration signals are beneficial. To resolve this, Schenkman, Nilsson, and Grbic (2011) studied the influence of click trains and longer duration noise signals on echolocation performance. They found that detection of the object at 100 cm was best with both 32 clicks/500 ms and 500 ms noise, and at 150 cm with 32 clicks/500 ms rather than the 500 ms noise signal, contradicting the results of their previous experiment, which had favored the longer duration signals. Schenkman, Nilsson, and Grbic (2011) attributed the decrease in performance to differences in the experimental setup.
In order to clarify the cause of the decrease in performance, a physical analysis was made of the stimuli used in the experiments of Schenkman, Nilsson, and Grbic (2011); it is presented in the room acoustics chapter of this thesis. Although the analysis was made to explain the decrease in performance, it should be noted that the experiments of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) excluded exploratory movements, which are probably advantageous for the blind (Miura et al, 2008). Hence, further experiments that take all these factors into account are required before concluding which types of signals are favourable for echolocation.
Another aspect that has been the focus of recent research in human echolocation is the variability of echolocation ability among the blind and the sighted. Several studies have reported that blind participants have echolocation abilities superior to those of sighted participants (Dufour et al, 2005; Schenkman and Nilsson, 2010; Schenkman and Nilsson, 2011; Kolarik et al, 2013), with variability among individuals (Schenkman and Nilsson, 2010; Teng and Whitney, 2011; Teng et al, 2012). However, the results of the psychoacoustic experiments could not explain whether the high echolocation ability of the blind is due to extensive practice, brain plasticity, or both. In some cases, even the characteristics of the acoustic stimulus that determine the detection ability of the blind are not known.
To discover whether physiological differences are the cause of the high detection ability of the blind, several researchers have analyzed the brain activity of participants. Thaler, Arnott and Goodale (2011) conducted a study using functional magnetic resonance imaging (fMRI) in one early and one late blind participant and demonstrated that echolocation activates occipital rather than auditory cortical areas, with stronger activation in the early blind participant. A more recent study by the same authors (Thaler et al, 2014) suggests that the echo-motion response in blind experts may represent a reorganization rather than an exaggeration of the responses observed in sighted novices, and that this reorganization may involve the recruitment of visual cortical areas. However, the extent to which such recruitment contributes to the echolocation abilities of the blind remains unclear, and a combined study using neuroimaging techniques and psychoacoustic methods may give a clearer insight into the role of physiology in the high echolocation ability of the blind.
Although the combination of neuroimaging and psychoacoustic methods can be expected to give some insight into the high echolocating ability of the blind, these methods do not reveal the information in the acoustic stimulus that determines it (at least when that information is not known), nor how this information is represented in the human auditory system. A reasonable way to find the information necessary for the high echolocation ability of the blind is to perform a signal analysis of the acoustic stimulus. However, such an analysis does not show how the information is represented in the human auditory system. To solve this problem, one may use auditory models from the literature that try to mimic human hearing. Analyzing the acoustic stimulus using these models may give insight into the causes of the high echolocation ability of the blind.
It is vital to use both signal analysis and auditory models in order to understand the differences between listeners in human echolocation, since one needs to consider the transmission of the sound from the source to the internal representation of the listener. First, as the sound travels and is transformed by the room acoustics, one should understand which information is received at the human ear. This is where signal analysis comes into play, since we can analyze the characteristics of the sound that are transformed under various room conditions. The second step is to analyze how the characteristic of the sound that carries the information is represented in the auditory system. This is where the auditory models come into play: the information is transformed in a way similar to how the auditory system might process it. Therefore, by keeping track of the information from the outer ear to the central nervous system, one may understand the causes of the differences between participants. This is the research strategy of this thesis.
To model the auditory analysis performed by the human auditory system, the auditory image model of Patterson, Allerhand, and Giguere (1995), the loudness models of Glasberg and Moore (2002, 2007) and the sharpness model of Fastl and Zwicker (2007) were considered in this thesis. Matlab was chosen as the implementation environment. The auditory image model was implemented in Matlab by Bleeck, Ives, and Patterson (2004b), and the current version is known as AIM-MAT. The loudness and sharpness models were implemented in PsySound3 (Cabrera, Ferguson, and Schubert, 2007), a GUI-driven Matlab environment for the analysis of audio recordings. AIM-MAT and PsySound3 were downloaded from https://code.soundsoftware.ac.uk/projects/aimmat and http://www.psysound.org, respectively, and used in this thesis.
AIMS OF THE THESIS:
(1) To find out what information in the acoustic stimulus determines the high echolocation ability of the blind.
(2) To find out how this acoustic information might be represented in the human auditory system.
For this we use the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011), denoted Experiment 1 and Experiment 2, respectively.
OUTLINE OF THE THESIS:
The thesis is organized as follows. As the auditory models are built on research in physiology and perception, a review of the relevant parts of these subjects is first presented in Chapter 2. Chapter 3 presents the signal analysis performed on the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) to find the information used to detect the objects. Chapter 4 describes how the auditory models were designed and implemented; the analysis of the same recordings using the auditory models is also presented in this chapter. The results from the auditory models are compared with the perceptual results in Chapter 5. A discussion of the results is presented in Chapter 6, followed by the conclusions in Chapter 7.
Chapter 2
Physiology and Perception
A signal processing model of the human auditory system is designed on the basis of research in the physiology and psychology of hearing. Therefore, it is vital to give a background on the physiological and psychological aspects of hearing in order to understand how the models may explain human echolocation.
2.1 Physiology of hearing
The auditory system consists of the auditory periphery, which encodes the acoustic sound, and the central nervous system, which processes it. A brief description of how this is done is presented below.
2.1.1 Auditory periphery
The peripheral part of the auditory system consists of the ear, which transduces sound waves from the environment into neural responses and strengthens the perception of the sound. Figure 2.1 shows the structure of the human ear, which is further subdivided
into the outer, middle and inner ear.

Figure 2.1: Anatomy of the human ear. Figure adapted from Chittka L, Brockmann A [CC-BY-2.5 (http://creativecommons.org/licenses/by/2.5)], via Wikimedia Commons.

Initially, when the sound reaches the human ear, the
head, torso and pinna attenuate the sound in a frequency dependent manner, in which the sound pressure is decreased at high frequencies. After this attenuation, the sound travels through the auditory canal via the concha (the cavity which helps to funnel sound into the canal). Since the resonance frequency of the concha is close to 5 kHz and the resonance frequency of the external auditory canal is about 2.5 kHz, together they cause an increase in sound pressure level (SPL) of about 10 to 15 dB in the frequency range 1.5 kHz to 7 kHz. The tympanic membrane vibrates as a result of sound waves travelling in the external auditory canal, and the vibrations are passed along the ossicular chain (Yost, 2007).
The middle ear consists of the ossicular chain (malleus, incus and stapes), which provides an effective means of delivering sound to the inner ear, where the neural process of hearing begins. Due to the difference in surface area between the tympanic membrane and the stapes footplate, and also due to the lever action of the ossicles, the pressure level increases by 30 dB or more between the ear drum and the inner ear; the actual pressure transformation depends on the frequency of the stimulus (Yost 2007, pp 75-79). Thus the middle ear works a little like a thumbtack, collecting pressure over a large area on the blunt, thumb end, and concentrating it on the sharp end (Schnupp, Nelken, and King, 2011).
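As a rough worked example of this pressure gain, using commonly cited textbook values rather than figures from this thesis: with an effective tympanic membrane area of about 55 mm^2, a stapes footplate area of about 3.2 mm^2 and an ossicular lever ratio of about 1.3, the pressure gain is approximately (55/3.2) x 1.3 ≈ 22, i.e. 20 log10(22) ≈ 27 dB, of the same order as the 30 dB figure quoted above.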
The vibratory patterns representing the acoustic message reach the cochlea via the
stapes. Along the entire length of the cochlea runs a structure known as the basilar
membrane, which is narrow and stiff at the basal end of the cochlea (i.e. near the oval
and round windows), but wide and floppy at the far, apical end. The basilar membrane
subdivides the fluid-filled spaces inside the cochlea into upper compartments (the scala
vestibuli and scala media) and lower compartments (the scala tympani). Thus the cochlea
is equipped with two sources of mechanical resistance, one provided by the stiffness of
the basilar membrane, the other by the inertia of the cochlear fluids.
The stiffness gradient decreases as we move farther away from the oval window, but the inertial gradient increases. As the inertial resistance is frequency dependent, the path of overall lowest resistance depends on the frequency: it is long for low frequencies, which are less affected by inertia (path B in Figure 2.2), and increasingly short for high frequencies (path A in Figure 2.2). Hence, every time the stapes pushes against the oval window, low frequencies cause vibrations at the apex of the basilar membrane and high frequencies cause vibrations at the base. This property enables the cochlea to operate as a mechanical frequency analyser. However, it is to be noted that the cochlea does not have a sharp frequency resolution, and it is perhaps more useful to think of the cochlea as a set of mechanical filters (Schnupp, Nelken, and King 2011, pp 55-64).

Figure 2.2: Cochlea unrolled, in cross section. The grey shading represents the inertial gradient of the fluids and the stiffness gradient of the basilar membrane; note that the gradients run in opposite directions. Figure redrawn with permission from Schnupp, Nelken, and King (2011).
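This view of the cochlea as a set of mechanical filters can be illustrated with a small Matlab sketch. The sketch below uses a fourth-order gammatone filterbank, a static simplification of the dynamic gammachirp filters used by AIM-MAT; the sampling rate, centre frequencies and normalisation are illustrative assumptions, not values used elsewhere in this thesis.

    % Minimal sketch of the cochlea as a gammatone filterbank.
    fs = 16000;                          % sampling rate (Hz), assumed
    t  = (0:1/fs:0.05)';                 % 50 ms time axis
    fc = [200 500 1000 2000 4000];       % example centre frequencies (Hz)
    x  = sin(2*pi*200*t);                % input: a 200 Hz sinusoid
    bmm = zeros(length(t), length(fc));  % simulated basilar membrane motion
    for k = 1:length(fc)
        erb = 24.7*(4.37*fc(k)/1000 + 1);                   % ERB bandwidth (Hz)
        b   = 1.019*erb;                                    % gammatone bandwidth
        g   = t.^3 .* exp(-2*pi*b*t) .* cos(2*pi*fc(k)*t);  % impulse response
        g   = g / max(abs(g));                              % rough normalisation
        bmm(:,k) = conv(x, g, 'same');                      % channel output
    end
    % Low-frequency channels respond strongly to the 200 Hz tone and preserve
    % its temporal fine structure, as in Figure 2.6 below.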
Another important phenomenon that the basilar membrane exhibits is the travelling wave. However, Schnupp, Nelken, and King (2011) say that describing the travelling wave as a manifestation of the sound energy can be misleading, and suggest that it is probably more accurate to imagine the mechanical vibrations as travelling along the membrane only in the sense that they travel mostly through the fluid next to the membrane and then pass through the membrane as they come near the point of lowest resistance. The travelling wave may then be mostly a curious side effect of the fact that the mechanical filters created by each small piece of the basilar membrane, together with the associated cochlear fluid columns, all happen to be slightly out of phase with each other.
The mechanical vibrations of the basilar membrane are transduced into electrical potentials by the shearing of the stereocilia in the organ of Corti against the tectorial membrane (cf. Figure 2.3). This happens as follows. A structure on the wall of the scala media, named the stria vascularis, leaks K+ ions from the blood stream into the scala media; the stria vascularis also sets up an electrical voltage gradient across the basilar membrane. As the stereocilia in each bundle are not all of the same length, and as their tips are connected to each other by fine protein strands known as “tip links”, ion channels open in response to stretch (an increase in tension) on the tip links, allowing K+ ions to flow into the hair cells. The hair cells form glutamatergic, excitatory synaptic contacts with the spiral ganglion neurons at their lower end. These neurons form the long axons that travel through the auditory nerve and reach the cochlear nucleus (Schnupp, Nelken, and King, 2011).
Figure 2.3: Cross section of the cochlea, and a schematic view of the organ of Corti. Figure redrawn with permission from Schnupp, Nelken, and King (2011).
As can be seen in Figure 2.3, there are two types of hair cells: the inner hair cells, connected to type I fibers, and the outer hair cells, connected to type II fibers. Anatomically, type II fibers are unsuited to providing fast throughput of the encoded information (Schnupp, Nelken, and King, 2011). Hence, only the inner hair cells are considered to be the biological transducers. Although the outer hair cells do not provide any neural transduction, they exhibit motility, which causes the non-linear cochlear amplification. A detailed description of how this non-linear cochlear amplification can be modeled using signal processing techniques is presented in Chapter 4.
2.1.2 Central auditory nervous system
As discussed in the section above, the auditory periphery transduces the acoustic sound. However, hearing involves more than the neural coding of sound; the encoded sound must also be processed. This processing is done by the central auditory nervous system, which consists of the cochlear nucleus, the superior olivary complex, the inferior colliculus, the medial geniculate body and the auditory cortex, together with other structures; Figure 2.4 illustrates this.
Figure 2.4: An illustration of the most important pathways and nuclei from the ear to the auditory cortex. The nuclei illustrated are located in the brain stem. Figure redrawn with permission from Moore (2013).
There is evidence that many cells in the dorsal cochlear nucleus react in a manner that suggests a lateral inhibition network, which helps to sharpen the neural representation of the spectral information (Yost 2007, p 240). As the information from the left and right ears converges at the olivary nuclei, these are assumed to process the spatial perception of sound (Schnupp, Nelken, and King, 2011). The spectral and spatial information from the cochlear nucleus and the superior olivary complex is further processed and combined by the inferior colliculus. Finally, regions in the auditory cortex process the complex sound.
2.2 Perception
The physiological background remains one main inspiration for the auditory models, but they are also based on how the physical and perceptual attributes of the acoustic sound are encoded in the auditory system. Loudness, pitch and timbre are three subjective attributes of sound that are relevant for human echolocation. This section therefore discusses how these attributes are encoded in the auditory system.
2.2.1 Loudness
Loudness is the perceptual attribute of intensity and is defined as that attribute of
auditory sensation in terms of which sounds can be ordered on a scale extending from
quiet to loud (ASA, 1973).
There is no full understanding of the mechanisms underlying how loudness is perceived. The dynamic range of the auditory system is wide, and different mechanisms play a role in intensity discrimination. Psychophysical experiments suggest that neural firing rates, spread of excitation and phase locking all play a role in intensity perception, but the latter two may not always be essential. A difficulty with explanations based on neural firing rates is that, although single neurons in the auditory nerve can account for intensity discrimination, they do not explain why intensity discrimination is not better than observed, suggesting that discrimination is limited by the capacity of higher levels of the auditory system, which may also play a role in intensity discrimination (Moore, 2013).
Figure 2.5: Basic structure of the models used for the calculation of loudness: the stimulus is passed through a fixed filter for the transfer of the outer/middle ear, the spectrum is transformed to an excitation pattern, the excitation pattern is transformed to specific loudness, and the area under the specific loudness pattern is calculated. Figure redrawn from Moore (2013).
Several models (cf. Moore, 2013, pp 139-140) have been proposed to calculate the average loudness that would be perceived by a large group of listeners. Figure 2.5 shows the basic structure of a model used to calculate loudness. The model performs the outer and middle ear transformations and then calculates the excitation pattern. The excitation pattern is transformed into specific loudness, which involves a compressive non-linearity. The total area under the specific loudness pattern is assumed to be proportional to the overall loudness. Therefore, whatever the mechanism underlying the perception of loudness may be, the excitation pattern seems to be the essential information on which an auditory model of loudness should be based.
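The structure in Figure 2.5 can be illustrated with a crude, runnable Matlab sketch. This is not the Glasberg and Moore (2002) model used later in the thesis: the fixed filter, the 1-ERB-wide rectangular bands and the 0.2 exponent are simplified stand-ins chosen only to show the shape of the pipeline.

    % Crude sketch of the Figure 2.5 loudness pipeline (illustration only).
    fs = 16000;
    x  = randn(fs/2, 1);                       % 500 ms noise burst as input
    [b, a] = butter(2, 500/(fs/2), 'high');    % stand-in outer/middle ear filter
    y  = filter(b, a, x);
    X  = abs(fft(y)).^2;                       % power spectrum
    f  = (0:length(X)-1)' * fs / length(X);
    erbN = 21.4*log10(4.37*f/1000 + 1);        % frequency -> ERB-number scale
    E  = zeros(39, 1);                         % excitation per 1-ERB-wide band
    for k = 1:39
        E(k) = sum(X(erbN >= k-1 & erbN < k & f <= fs/2));
    end
    Np = E.^0.2;                               % specific loudness: compressive
                                               % non-linearity per band
    N  = sum(Np);                              % overall loudness ~ area under
                                               % the specific loudness pattern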
2.2.2 Pitch
Pitch is defined as “that attribute of auditory sensation in terms of which sounds may
be ordered on a musical scale” (ASA, 1960).
How pitch is encoded is still a matter of debate. One view is that, as the cochlea is assumed to perform a spectrum analysis, the acoustic vibrations are transformed into a spectrum, coded as a profile of discharge rate across the auditory nerve. An alternative view proposes that the role of the cochlea is to transduce the acoustic vibrations into temporal patterns of neural firing. These two views are known as the place and time hypotheses. Figure 2.6 shows a simulation of the basilar membrane motion for a 200 Hz sinusoid, generated using the dynamic gammachirp filterbank module available in AIM-MAT. It can be seen that both the frequency and the temporal patterns are preserved.
According to the place hypothesis, pitch is determined from the position of maximum excitation along the basilar membrane within the cochlea. This explains how pitch is perceived for pure tones at low levels, but it fails for pure tones at higher levels: owing to the non-linearity of the basilar membrane (described in the physiology section), the excitation peaks become broader at higher levels and tend to shift towards a lower frequency place. This should lead to a decrease in pitch; however, psychophysical experiments show that the pitch is stable. Another case where the place hypothesis fails is the pitch of stimuli whose fundamental is absent. According to the paradox of the missing fundamental, the pitch evoked by a pure tone remains the same if we add additional tones with frequencies that are integer multiples of that of the original pure tone (harmonics). It also does not change if we then remove the original pure tone (the fundamental) (De Cheveigné, 2010).

Figure 2.6: A simulation of the basilar membrane motion for a 200 Hz sinusoid. The figure was generated using the dynamic gammachirp filterbank module available in AIM-MAT. It can be seen that both the place and the temporal information are preserved.

Figure 2.7: A simulation of the basilar membrane motion for a 500 ms iterated ripple noise with gain = 1, delay = 10 ms and number of iterations = 2. The figure was generated using the dynamic gammachirp filterbank module available in AIM-MAT. It can be seen that there are no periodic repetitions to support the time hypothesis.
The time hypothesis, on the other hand, states that pitch is derived from the periodic pattern of the acoustic waveform, and thereby overcomes the problem of the missing fundamental. However, the main difficulty with the time hypothesis is that it is not easy to extract one pulse per period in a way that is reliable and fully general. Psychoacoustic studies also show that pitch exists for stimuli that are not periodic. An example of such a stimulus is iterated ripple noise (IRN), a stimulus that models some human echolocation signals (cf. Figure 2.7).
In order to overcome the limitations of the place and time hypotheses, two further theories were proposed: pattern matching (De Boer 1956, cited in De Cheveigné 2010) and a theory based on autocorrelation (Licklider 1951, cited in De Cheveigné 2010). De Boer (1956) described the concept of pattern matching in his thesis: the fundamental partial is the necessary correlate of pitch, but it may be absent if other parts of the pattern are present. In this way pattern matching supports the place hypothesis. Later, Goldstein (1973), Wightman (1973) and Terhardt (1974) described different models of pattern matching. One problem with the pattern matching theory is that it fails to account for the pitch of stimuli with no resolved harmonics.
The autocorrelation hypothesis assumes temporal processing in the auditory system. It states that, instead of peaks being detected at regular intervals, the periodic neural pattern is processed by coincidence-detector neurons that calculate the equivalent of an autocorrelation function (Licklider 1951, cited in De Cheveigné 2010). The spike trains are delayed within the brain by various time lags (using neural delay lines) and are combined or correlated with the original. When the lag is equal to the time delay between spikes, the correlation is high and the outputs of the coincidence detectors tuned to that lag are strong. The spike trains in each frequency channel are processed independently and the results combined into an aggregate pattern. However, De Cheveigné (2010) notes that the autocorrelation hypothesis works too well: it predicts that pitch should be equally salient for stimuli with resolved and unresolved partials, whereas this is not the case. An alternative to a theory based on an autocorrelation-like function is the strobe temporal integration (STI) of Patterson et al (1995). According to STI, the auditory image underlying the perception of pitch is obtained by triggered, quantised temporal integration instead of an autocorrelation-like function. STI works by finding strobe points in the neural activity pattern and integrating the pattern over a certain period.
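As a concrete illustration of the autocorrelation account, the Matlab sketch below computes channel-wise autocorrelations of a crudely simulated neural activity pattern and sums them into an aggregate (summary) pattern; the pitch estimate is the reciprocal of the dominant lag. The gammatone-plus-rectification front end and all parameter values are simplifying assumptions, not the AIM-MAT processing chain.

    % Minimal Licklider-style summary autocorrelation pitch sketch.
    fs  = 16000;
    t   = (0:1/fs:0.5)';
    x   = sin(2*pi*200*t) + 0.5*sin(2*pi*400*t) + 0.3*sin(2*pi*600*t);
    tg  = (0:1/fs:0.05)';                    % 50 ms impulse response axis
    fc  = 100 * 2.^(0:0.25:6);               % 25 channel centre frequencies (Hz)
    maxlag = round(0.02*fs);                 % consider lags up to 20 ms
    sacf = zeros(maxlag+1, 1);               % summary autocorrelation function
    for k = 1:length(fc)
        erb = 24.7*(4.37*fc(k)/1000 + 1);    % ERB bandwidth of the channel
        g   = tg.^3 .* exp(-2*pi*1.019*erb*tg) .* cos(2*pi*fc(k)*tg);
        nap = max(filter(g, 1, x), 0);       % half-wave rectified channel output,
                                             % a crude neural activity pattern
        r = xcorr(nap, maxlag);              % channel autocorrelation
        sacf = sacf + r(maxlag+1:end);       % aggregate across channels
    end
    minlag = round(0.002*fs);                % skip the broad peak near zero lag
    [~, i] = max(sacf(minlag+1:end));
    lag = minlag + i - 1;                    % sacf(j) holds lag j-1 samples
    pitch = fs/lag;                          % close to 200 Hz for this stimulus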
To summarize, there is no full understanding of how pitch is perceived. Whether temporal, spectral or multiple mechanisms determine pitch perception, the underlying information that the auditory system uses to detect pitch is the excitation pattern. Hence, the excitation pattern remains the crucial information that should be simulated in an auditory model of pitch perception.
2.2.3 Timbre
When the loudness and pitch of two sounds are similar, the subjective attribute used to distinguish them is timbre. Timbre has been defined as that attribute of auditory sensation which enables a listener to judge that two non-identical sounds, similarly presented and having the same loudness and pitch, are dissimilar (ANSI, 1994). One example is the difference between two musical instruments playing the same tone, e.g. a guitar and a piano.
Timbre is a multidimensional percept, and there is no single scale on which timbre can be ordered. One approach to quantifying timbre is to consider the overall distribution of the spectral energy. Plomp and his colleagues showed that the perceptual differences between sounds were closely related to the levels in 18 1/3-octave bands, thus relating timbre to the relative level produced by the sound in each critical band. Hence, for both speech and non-speech sounds, the timbre of steady tones is generally determined by their magnitude spectra, although the relative phases may play a small role (Plomp, as cited in Moore, 2013). For time-varying patterns, several factors influence the perception of timbre, including: (i) periodicity; (ii) variation in the envelope of the waveform; (iii) the spectrum changing over time; and (iv) what the preceding and following sounds are like.
The timbre information can be assessed using the auditory models from the levels in the spectral envelope and the variation of the temporal envelope. Another way to preserve the fine grain time interval information that is necessary for timbre perception is the strobe temporal integration of Patterson et al (1995).
Chapter 3
Room acoustics
Before analyzing how an acoustic sound might be represented in the auditory system
using auditory models, it is vital to study the physics and room acoustics of the sound
that determines human echolocation. Hence, this chapter initially reviews the studies
analyzing the acoustic signals.
3.1
Review of studies analyzing acoustic signals
As discussed in Chapter 2, the iterated ripple noise stimulus models some of the human echolocation signals. Initially, a brief review of the studies performed on this stimulus in the literature is presented. Thereafter, a review of the studies of other acoustic stimuli used for understanding human echolocation is presented.
Bassett and Eastmond (1964) examined the physical variations in the sound field close
to a reflecting wall. They used a loudspeaker which generated Gaussian noise, placed
at more than 5 m from a large horizontal reflecting panel, in an anechoic chamber. A
microphone was placed at a number of points between the loudspeaker and the panel, and an interference pattern was observed. Bassett and Eastmond reported a perceived pitch caused by the interference of the direct and reflected sound at different distances from the wall, the pitch value being equal to the inverse of the delay. In a similar way, Small Jr and McClellan (as cited in Bilsen, 1966) delayed identical pulses and found that the pitch perceived was equal to the inverse of the delay, naming it time separation pitch. Later, Bilsen and Ritsma (1969) stated that when a sound and a repetition of that sound are listened to, a subjective tone is perceived with a pitch corresponding to the reciprocal of the delay time; they termed the pitch perceived repetition pitch. Bilsen tried to explain the repetition pitch phenomenon using autocorrelation peaks or spectral peaks. Yost (1996) performed experiments using iterated ripple noise stimuli and concluded that autocorrelation is the underlying mechanism used by the listeners to detect the repetition pitch.
Regarding other acoustic stimuli used for understanding human echolocation, Rojas et al (2009, 2010) conducted a physical analysis of the acoustic characteristics of orally produced pulses and finger produced pulses, showing that the former were better for echolocation. Papadopoulos et al (2011) examined the acoustic signals used in the study of Dufour et al (2005) and stated that the information for obstacle discrimination was found in the frequency dependent interaural level differences (ILD), especially in the range from 5.5 to 6.5 kHz, rather than in the interaural time differences (ITD). Pelegrin Garcia, Roozen, and Glorieux (2013) performed a study using the boundary element method and found that frequencies above 2 kHz provide information for localization of the object, whereas the lower frequency range would be used for size determination. A similar analysis was performed by Rowan et al (2013) using a virtual auditory space technique, which came to
the same conclusion, viz. that performance was primarily based on information above 2 kHz. In view of the above studies, several analyses were performed for this thesis; they are presented in the remaining part of this chapter to identify the information necessary for the detection of the object.
3.2
Sound recordings
The sound recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) are used in our study. A brief description of how the recordings were made is given here. In Schenkman and Nilsson (2010), the binaural sound recordings were conducted in an ordinary conference room and in an anechoic chamber using an artificial manikin. The object was a reflecting, 1.5 mm thick aluminium disk with a diameter of 0.5 m. Recordings were conducted at 0.5, 1, 2, 3, 4, and 5 m distances between the microphones and the reflecting object. In addition, recordings were made with no obstacle in front of the artificial manikin.
Figure 3.1: Sound recordings made in Experiment 1: (a) anechoic room and (b) conference room, with the loudspeaker on the chest of the artificial manikin; and in Experiment 2: (c) lecture room, with the loudspeaker behind the artificial manikin. The pictures are reproduced with permission from Bo Schenkman.

The following durations of the noise signal were used: 500, 50, and 5 ms; the shortest corresponds perceptually to a click. The electrical signal was white noise. However, the emitted sound was not perfectly white, because of the non-flat frequency response of the loudspeaker and the recording system. The sounds were generated by a loudspeaker resting on the chest of the artificial manikin. The sound recording set-ups can be seen in Figures 3.1(a) and 3.1(b).
In Schenkman, Nilsson, and Grbic (2011), recordings were conducted in an ordinary lecture room, at 100 and 150 cm distances between the microphones and the reflecting object. The emitted sounds were either bursts of 5 ms each, varying in rate from 1 to 64 bursts per 500 ms, or a 500 ms white noise. These sounds were generated by a loudspeaker placed 1 m straight behind the center of the head of the artificial manikin. The sound recording set-up can be seen in Figure 3.1(c). From now on, the recordings of Schenkman and Nilsson (2010) and of Schenkman, Nilsson, and Grbic (2011) will be referred to as Experiment 1 and Experiment 2 respectively. A detailed description of the recordings can be found in Schenkman and Nilsson (2010) and in Schenkman, Nilsson, and Grbic (2011).
3.3
Signal analysis
To find out the information used for detecting an object, and to analyze how the acoustics of the room affect human echolocation, a number of different analyses were performed, namely: sound pressure level, autocorrelation, and spectral centroid. Before the analysis, the recordings were calibrated by calibration constants (CC) using equation 3.1. Based on the SPLs of 77, 79 and 79 dBA for the 500 ms recordings without the object at the ear of the artificial manikin in the anechoic, conference and lecture room of Experiment 1 and Experiment 2, the CCs were calculated to be 2.4663, 2.6283 and 3.5021 respectively.¹

$$ CC = 10^{\left(SPL - 20\log_{10}\left(\frac{rms(signal)}{20\times10^{-6}}\right)\right)/20} \qquad (3.1) $$
As the recordings were binaural, both the left and right ear recordings were analyzed. The recordings in Experiment 1 and Experiment 2 had 10 versions of each duration and distance. It should be noted that the recordings vary over the versions, causing the term rms(signal) in equation 3.1, and thereby the calibration constants, to vary with the versions. However, as the variation is very small, in this thesis only the 9th version of the first 500 ms recording without the object (NoObject rec1) in Experiment 1 and the 9th version of the 500 ms recording without the object in Experiment 2 were used to find the above calibration constants. Another reason to choose only the 9th version is that, although the other versions may not have the same CCs, they will be relatively calibrated with respect to the recording of version 9. For example, suppose the version 1 recording in the anechoic chamber had 67 dB SPL and version 9 had 66 dB SPL before calibration; the levels obtained by calibrating the recordings to 77 dB SPL using the CC of the 9th version would then be 78 dB SPL for version 1 and 77 dB SPL for version 9. In other words, the versions keep the same level differences after calibration.
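To make the procedure concrete, the sketch below shows the calibration in Matlab. It is a minimal illustration under stated assumptions (hypothetical file names, mono signals, and the A-weighting omitted, as noted in the footnote to equation 3.1), not the exact code used for the analysis.

% Minimal Matlab sketch of equations 3.1 and 3.2.
[x, fs] = audioread('NoObject_rec1_v9.wav');  % assumed reference recording (mono)
p0 = 20e-6;                                   % reference pressure, 20 micropascal
SPLref = 77;                                  % target level for this room (dB)
rmsx = sqrt(mean(x.^2));
CC = 10^((SPLref - 20*log10(rmsx/p0))/20);    % equation 3.1: calibration constant

[y, ~] = audioread('Object_100cm_v9.wav');    % assumed recording from the same session
SPL = 20*log10(CC*sqrt(mean(y.^2))/p0);       % equation 3.2: level after calibration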
3.3.1
Sound Pressure Level (SPL)
The detection of the objects may to a certain extent be based on an intensity difference. Hence, the SPL in dBA was calculated using equation 3.2, where rms(signal) is the root mean square amplitude of the signal analyzed. The results for the 500 ms recordings in Experiment 1 and Experiment 2 are tabulated in Tables 3.1 and 3.2. A detailed analysis of the SPL values for all 10 versions of the 5, 50 and 500 ms recordings is presented in Tables A.2 to A.17 in Appendix A.
$$ SPL = 20\log_{10}\left(\frac{CC \times rms(signal)}{20\times10^{-6}}\right) \qquad (3.2) $$

¹ The A-weighting was not included in equation 3.1. However, the difference was found to be less than 0.5 dB and hence was neglected. See section A.1 of the appendix for more details.
Recording        Anechoic chamber          Conference room
                 Left ear    Right ear     Left ear    Right ear
NoObject rec1    77.153      77.866        79.003      78.817
NoObject rec2    77.592      77.374        78.993      78.824
Object50cm       85.182      88.216        87.539      87.457
Object100cm      81.877      82.550        82.827      82.377
Object200cm      77.097      78.044        79.598      79.481
Object300cm      76.975      78.211        78.926      78.898
Object400cm      77.051      77.986        79.016      78.860
Object500cm      76.987      78.033        79.009      78.798

Table 3.1: Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the anechoic and conference room of Experiment 1.
Recording        Lecture room
                 Left ear    Right ear
NoObject         79.165      79.577
Object100cm      79.594      81.545
Object150cm      79.412      79.681

Table 3.2: Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the lecture room of Experiment 2.
The tabulated SPL values in Tables 3.1 and 3.2 show the effect of the room acoustics in the form of level differences, both between the ears and among the rooms. The level differences between the recording without the object and the recordings with the object at 100 and 150 cm were smaller in Experiment 2 than in Experiment 1. This may be due to the differences in the experimental set-up (cf Figure 3.1) and the acoustics of the room. However, the extent to which this information is used by the participants is not straightforward, as the loudness perceived by the human auditory system cannot be related directly to the SPL (Moore, 2013). This issue is further discussed in Chapter 4.
3.3.2
Autocorrelation Function (ACF)
Generally, intensity differences play a role in human echolocation. However, Schenkman and Nilsson (2011) showed that repetition pitch, rather than loudness, is the more important information used by the participants to detect the objects. As discussed in the pitch perception section of Chapter 2, pitch perception can often be explained using the peaks in the autocorrelation function; hence an autocorrelation analysis is performed in this section.
The repetition pitch for the recordings in Experiment 1 and Experiment 2 can be calculated theoretically using equation 3.3. The corresponding values for recordings with the object at 50, 100, 150, 200, 300, 400 and 500 cm would be approximately 344, 172, 114, 86, 57, 43 and 34.4 Hz (assuming a sound velocity of 344 m/s). As the theory based
on autocorrelation uses temporal information, the repetition pitch perceived at the above frequencies can be explained by peaks in the autocorrelation function (ACF) at the inverses of those frequencies, i.e. approximately 2.9, 5.8, 8.7, 11.6, 17.4, 23.2 and 29 ms respectively. Therefore, the autocorrelation analysis was performed using a 32 ms frame, which covers the required pitch periods. A 32 ms hop size was used to analyze the ACF for the subsequent time instants 64 ms, 96 ms, etc. In order to compare the peaks among all the recordings, the ACF was not normalized to the limits -1 to 1.
$$ RP = \frac{\text{speed of sound}}{2 \times \text{distance to the object}} \qquad (3.3) $$
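As an illustration, the following Matlab sketch computes the theoretical repetition pitch of equation 3.3, the corresponding ACF lags, and the framewise unnormalized ACF described above. The sampling rate and the variable y (a calibrated recording assumed to be in memory) are assumptions.

% Theoretical repetition pitch and expected ACF peak lags
c = 344;                          % speed of sound (m/s)
d = [0.5 1 1.5 2 3 4 5];          % object distances (m)
RP = c ./ (2*d);                  % repetition pitch (Hz), equation 3.3
lag_ms = 1000 ./ RP;              % expected ACF peak lags (ms)

% Unnormalized ACF of consecutive 32 ms frames (32 ms hop)
fs = 44100;                       % assumed sampling rate
N = round(0.032*fs);              % 32 ms frame length in samples
for k = 1:floor(length(y)/N)
    frame = y((k-1)*N + (1:N));
    acf = xcorr(frame);           % unnormalized, as in the analysis
    acf = acf(N:end);             % keep lags >= 0
    % peaks of acf near round(lag_ms*fs/1000) indicate repetition pitch
end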
In Experiment 1 the participants performed well with the longer duration signals (cf Schenkman and Nilsson, 2010). They assumed that the reason for the higher detection ability with the longer duration signals may be that, although a subject may miss the repetition pitch at the first repetition, they may perceive it in later repetitions. This can be visualized using the ACF in Figures 3.2 and 3.3, where for the 5 ms recording the peak was present only in the initial 32 ms frame, whereas for the 500 ms recording the peak was also present in frames with time instants greater than 32 ms. (Note that for each signal duration an additional 450 ms of silence was appended before presentation to the participants, and the ACF was analyzed in the same manner; hence the 5 ms signal had a total duration of 455 ms and the 500 ms signal a total duration of 950 ms.)
The assumption of Schenkman and Nilsson (2010) could explain the high echolocation ability of the participants for longer duration signals in Experiment 1. However, in Experiment 2 the performance decreased although the repetitions were present in the frames with time instants greater than 32 ms (cf Figures 3.6 and 3.7). Therefore, the conclusion that longer duration signals are always beneficial for human echolocation cannot be drawn from the available results. Comparing the peak heights at the pitch period for the recordings with the object at 100 cm, the 5 ms recording in the conference room has a greater peak height than the 5 ms recording in the lecture room (cf Figures 3.4 and 3.6). The 500 ms recording in the lecture room has a greater peak height than the 5 ms recording in the conference room (cf Figures 3.4 and 3.7), but its peak is not as distinct as that of the 500 ms recording in the conference room (cf Figures 3.5 and 3.7).
The reason for these differences in peak heights between the conference room and the lecture room may be the room acoustics. As the ACF depends on the spectrum of the signal, the acoustics of the room certainly influence the peaks in the ACF. The reverberation times T60 of the conference and lecture rooms were 0.4 and 0.6 seconds respectively, indicating that the acoustics of the room may influence the ACF and in turn the echolocation ability. How this peak information is represented in the auditory system is further discussed in Chapter 4.
3.3.3
Spectral Centroid (SC)
Detection of an object may also be based on the efficient use of the timbre information available in the stimuli. To test this hypothesis, one has to describe the attributes of the acoustic sound which contribute to timbre perception. One attribute that describes timbre perception is the spectral centroid (Peeters et al, 2011), which gives a time varying value characterizing the subjective center of the timbre of a sound. Therefore, a spectral centroid analysis of the recordings is presented in this section.
Figure 3.2: The autocorrelation function of a 5 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm. The sub plots show the ACF (index versus lag) for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants of the signal. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.
Figure 3.3: The autocorrelation function of a 500 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm. The sub plots show the ACF for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants of the signal.
Figure 3.4: The autocorrelation function of a 5 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm. The sub plots show the ACF for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.
Figure 3.5: The autocorrelation function of a 500 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm. The sub plots show the ACF for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants.
Figure 3.6: The autocorrelation function of a 5 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm. The sub plots show the ACF for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.
Figure 3.7: The autocorrelation function of a 500 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm. The sub plots show the ACF for 32 ms frames at the 32, 64, 96, 128, 160 and 192 ms time instants.
To compute the spectral centroid, the recordings were analyzed using a 32 ms frame with a 2 ms overlap. The spectral centroid of each frame was computed using equation 3.4 and, being computed per frame, is plotted as a function of time. The mean of the spectral centroid over the 10 versions at each condition for the 500 ms left ear recordings is plotted in Figures 3.8 to 3.10. A detailed analysis of all the recordings can be seen in section A.3 of the Appendix: Figures A.1 to A.14 show the spectral centroid for the left and right ear recordings in Experiment 1 and Figures A.5 to A.16 show the spectral centroid for the left and right ear recordings in Experiment 2.
$$ \mathrm{SpectralCentroid} = \frac{\sum \left(\mathrm{Frequency} \times |FFT(\mathrm{frame})|\right)}{\sum |FFT(\mathrm{frame})|} \qquad (3.4) $$
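A minimal Matlab sketch of this computation is given below; the sampling rate and the input variable y are assumptions, and FFT magnitudes are used, as is standard for the spectral centroid.

% Spectral centroid per 32 ms frame with 2 ms overlap (hop = 30 ms)
fs  = 44100;                              % assumed sampling rate
N   = round(0.032*fs);                    % frame length
hop = round(0.030*fs);                    % 32 ms frame, 2 ms overlap
f   = (0:N-1)'*fs/N;                      % FFT bin frequencies (Hz)
half = 1:floor(N/2);                      % keep positive frequencies
nFrames = floor((length(y)-N)/hop) + 1;
SC = zeros(nFrames,1);
for k = 1:nFrames
    frame = y((k-1)*hop + (1:N));
    X = abs(fft(frame));
    SC(k) = sum(f(half).*X(half)) / sum(X(half));   % equation 3.4
end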
In Experiment 1 the spectral centroid for all the recordings without the object was approximately below 5000 Hz, while for the recordings with the object at 50 and 100 cm it was approximately above 5000 Hz (e.g. cf Figure 3.8), which would provide some information to distinguish them from the recordings without the object. The recordings with the object at 200 to 500 cm did not differ much from the recording without the object. In Experiment 2 the spectral centroid was approximately 6000 Hz for all recordings (cf Figure 3.10), showing very small changes which may not be useful for detection. The analysis thus showed that there was variation in the spectral centroid in the recordings of Experiment 1 with the object at shorter distances (less than 200 cm), but for longer distances the difference in the spectral centroid was almost negligible.
Figure 3.8: The mean of the spectral centroid over the 10 versions, as a function of time, for the left ear 500 ms recordings in the anechoic chamber (Experiment 1). Panels: NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm (frequency in Hz versus time in seconds).
On the other hand, the spectral analysis performed by the auditory system is more complex than the FFT that was used here to compute the spectral centroid. It will be shown in Chapter 4 that the above conclusion is modified when the results of the auditory models are taken into account. It should also be noted that there are other attributes that describe timbre perception; the spectral centroid is considered in this thesis because it is believed to be an important feature of timbre.
Figure 3.9: The mean of the spectral centroid over the 10 versions, as a function of time, for the left ear 500 ms recordings in the conference room (Experiment 1). Panels: NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.
Figure 3.10: The mean of the spectral centroid over the 10 versions, as a function of time, for the left ear 500 ms recordings in the lecture room (Experiment 2). Panels: NoObject, Object at 100 cm and Object at 150 cm.
Chapter 4
Auditory models
4.1
Description of the auditory image model
The auditory image model (AIM) is a time-domain, functional model of the signal processing performed in the auditory pathway as the system converts a sound wave into the initial perception that we experience when presented with that sound. This representation is referred to as an auditory image, by analogy with the visual image of a scene that we experience in response to optical stimulation (Patterson et al, 1992; Patterson et al, 1995). As discussed in Chapter 2, in order to simulate the internal representation of an acoustic sound in the human auditory system, one should simulate the mechanisms of both the peripheral and the central auditory system. The AIM divides these mechanisms into different modules; how the modules are implemented using different signal processing strategies is described below.
4.1.1
Pre Cochlear Processing (PCP)
The outer middle ear transformation of the acoustic sound is simulated in AIM using a PCP module. The PCP module consists of four different FIR filters, designed for different applications: (i) minimum audible field (MAF), which is suitable for signals presented in the free field; (ii) minimum audible pressure (MAP), which is suitable for systems which produce a flat frequency response; (iii) the equal loudness contour (ELC); and (iv) Glasberg and Moore 2002 (gm2002). The ELC and gm2002 filters are almost the same and include the factors associated with the extra internal noise at low and high frequencies, but gm2002 uses the more recent data of Glasberg and Moore (2002).

Figure 4.1: The frequency response used to design the gm2002 filter of the PCP module in the AIM (relative transmission in dB versus frequency in Hz). The frequency response was obtained from the frontal field to cochlea correction data of Glasberg and Moore (2002).
The MAF, MAP and ELC filters are designed using the Parks-McClellan optimal equiripple FIR filter design algorithm, and the gm2002 filter is designed using a frequency sampling method. An example of the frequency response used to generate a PCP filter is shown in Figure 4.1. The transmission of the acoustic sound through the PCP filter can be modelled using equation 4.1, where Signal_input is the input to the AIM and Signal_pcp is the filtered output of the corresponding PCP filter.

$$ Signal_{pcp} = \mathrm{filter}(PCP_{filter},\, Signal_{input}) \qquad (4.1) $$
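As an illustration of the frequency sampling design, the sketch below builds a PCP-like FIR filter in Matlab and applies equation 4.1. The breakpoints are illustrative assumptions, not the Glasberg and Moore (2002) correction data, and the filter order is likewise assumed.

% Minimal sketch of a PCP filter designed by frequency sampling
fs  = 44100;
fHz = [0 20 100 1000 3000 8000 fs/2];       % frequency grid (Hz), assumed
gdB = [-40 -25 -10 0 5 -5 -40];             % relative transmission (dB), assumed
b   = fir2(4096, fHz/(fs/2), 10.^(gdB/20)); % FIR filter by frequency sampling
Signal_pcp = filter(b, 1, Signal_input);    % equation 4.1; Signal_input assumed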
4.1.2
Basilar Membrane Motion (BMM)
An important feature of the peripheral auditory system is the non-linear spectral response of the basilar membrane. This is implemented in the AIM using a dynamic compressive gammachirp filter bank, dcGC (Irino and Patterson, 2006). Two properties of the BMM are the asymmetry of the auditory filters and their compression in proportion to the level. These properties are modelled using a compressive gammachirp filter. The compressive gammachirp (cGC) filter is a generalized form of the gammatone filter, which was derived with operator techniques (Irino and Patterson, 1997). The development of both the gammatone and gammachirp filters is described in Patterson, Unoki, and Irino (2003, Appendix A). The cGC is simulated by cascading a passive gammachirp filter (pGC) with a high pass asymmetric function (HP-AF). The asymmetry is simulated by the pGC filter, and the output of the pGC is used to adjust the level dependency of the active part, i.e. the HP-AF.
There are also other options available for generating the BMM in AIM, namely the gammatone function and the pole zero filter cascade. However, the gammatone function does not depict the non-linearity of the basilar membrane. The default filterbank, dcGC, was therefore used to simulate the BMM in this thesis. The transformation of the BMM can be modelled using equations 4.2 and 4.3, where Signal_pGC(f_c) is the filtered output of the pGC filterbank, f_c is the centre frequency of the filter, ACF(f_c) is the high pass asymmetric compensation filter (not to be confused with the autocorrelation function) and Signal_cGC(f_c) is the final compressed output of the BMM stage. For a detailed description of the pGC and cGC the reader is referred to Irino and Patterson (2006).

$$ Signal_{pGC}(f_c) = \mathrm{filter}(pGC(f_c),\, Signal_{pcp}) \qquad (4.2) $$
$$ Signal_{cGC}(f_c) = \mathrm{filter}(ACF(f_c),\, Signal_{pGC}(f_c)) \qquad (4.3) $$

4.1.3
Neural Activity Pattern (NAP)
The basilar membrane motion is transduced into an electrical potential by the inner hair cells. As discussed in Chapter 2, the stretching of the tip links of the stereocilia lets the K+ ions flow through them only in one direction; therefore the NAP can be simulated using the signal processing concept of half-wave rectification. This is implemented in AIM as half-wave rectification followed by low-pass filtering. The low-pass filtering is done because phase locking is not possible at high frequencies.
There are three modules to generate the NAP: (i) half-wave rectification followed by compression and low-pass filtering (H-C-L); (ii) half-wave rectification followed by low-pass filtering (H-L); and (iii) a two dimensional adaptive threshold (the same as H-C-L, but with adaptation, which is more realistic). The choice of the NAP module depends on the choice of the BMM module. As the dcGC filter bank was used in this thesis, the compression of the basilar membrane is already simulated by it. Therefore, H-L was chosen to generate the NAP. This transformation can be modeled using equation 4.4, where abs(Signal_bmm(f_c)) is the half-wave rectified signal of the basilar membrane, f_c is the centre frequency of the filter, LPF is the low-pass filter and Signal_nap(f_c) is the modeled NAP.

$$ Signal_{nap}(f_c) = \mathrm{filter}(LPF,\, \mathrm{abs}(Signal_{bmm}(f_c))) \qquad (4.4) $$
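A minimal Matlab sketch of the H-L module for one frequency channel is given below. The low-pass cutoff and filter order are assumptions, chosen only to reflect the loss of phase locking at high frequencies.

% Half-wave rectification followed by low-pass filtering (one channel)
fs  = 44100;
hwr = max(Signal_bmm, 0);              % half-wave rectification
[bl, al] = butter(2, 1200/(fs/2));     % assumed second-order low-pass at 1.2 kHz
Signal_nap = filter(bl, al, hwr);      % equation 4.4, per channel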
4.1.4
Strobe Temporal Integration (STI)
The next stage in the AIM is the processing done by the central nervous system. Perceptual research suggests that fine grain temporal information is needed to preserve the timbre information. Auditory models that time average the NAP lose this fine grain information. To prevent this, AIM uses a mechanism known as STI, which is subdivided into two modules: (i) strobe finding (SF) and (ii) temporal integration (TI).
Strobe Finding (SF): AIM uses a sub module named sf2003 to find the strobes in the NAP. The sf2003 module uses an adaptive strobe threshold to issue a strobe, and the time of the strobe is that associated with the peak of the NAP pulse. After a strobe is issued, the threshold initially rises along a parabolic path and then returns to a linear decay, to avoid spurious strobes. The duration of the parabola is proportional to the centre frequency of the channel, and its height to the height of the strobe. After the parabolic section of the adaptive threshold, its level decreases linearly to zero in 30 ms. An additional feature of sf2003 is the inter channel interaction, i.e. a strobe in one channel reduces the threshold in the neighboring channels. An example of how the threshold varies and the strobes are calculated can be seen in Figure 4.2.
Figure 4.2: The NAP of a 200 Hz pure tone in the 253 Hz frequency channel (amplitude versus time). The green line shows the threshold variation and the red dots indicate the calculated strobes.
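The sketch below illustrates the idea of strobe finding on one NAP channel in Matlab. It is a deliberate simplification: only the linear decay of the adaptive threshold is modelled, and the parabolic section and the inter channel interaction of sf2003 are omitted.

% Simplified strobe finding on one NAP channel
fs = 44100;
thr = 0; slope = 0;
strobes = [];
for n = 1:length(nap)                 % nap: one NAP channel (assumed in memory)
    thr = max(thr - slope, 0);        % threshold decays linearly
    if nap(n) > thr
        strobes(end+1) = n;           %#ok<AGROW> strobe time (sample index)
        thr = nap(n);                 % threshold jumps to the strobe height
        slope = thr/(0.030*fs);       % and decays to zero in 30 ms
    end
end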
Temporal Integration (TI): The temporal integration is implemented in AIM in a module called the stabilized auditory image (SAI), which uses a sub module called ti2003. The ti2003 module changes the time dimension of the NAP into a time interval dimension. This works as follows: a temporal integration is initiated when a strobe is detected; if no further strobes are detected, the process continues for 35 ms and then stops. If strobes are detected within the 35 ms interval, each strobe initiates its own temporal integration process. To preserve the shape of the SAI relative to that of the NAP, ti2003 uses a weighting concept: the newest strobes are weighted highest (with the weights normalized so that they sum to 1), so that the older strobes contribute relatively less to the SAI.
4.1.5
Autocorrelation Function (ACF)
The AIM also offers an alternative module, named autocorr, to implement the processing done by the central nervous system. The module takes the NAP as input and computes the ACF in each centre frequency channel of the NAP, using a duration of 70 ms, a hop time of 10 ms and a maximum delay of 35 ms. By using the autocorr module one can implement the autocorrelation hypothesis of Licklider (1951) mentioned in Chapter 2.
In this way the AIM forms the internal representation of the acoustic sound. A detailed description of each module of AIM can be found at http://www.acousticscale.org/wiki/index.php/AIM2006_Documentation. The above mentioned modules were used to analyze the recordings in this thesis. All the processing modules of AIM are written in Matlab and the current version is referred to as AIM-MAT. It can be downloaded from https://code.soundsoftware.ac.uk/projects/aimmat. The autocorr module was only present in the 2003 version of AIM and can be downloaded from http://w3.pdn.cam.ac.uk/groups/cnbh/aimmanual/download/downloadframeset.htm. The Matlab code from AIM-MAT was used as the implementation of the AIM for the analysis of the recordings in this thesis.
4.2
Auditory analysis
4.2.1
Loudness analysis
In the room acoustics chapter, the sound pressure level analysis gave a general picture of how the amplitude of an acoustic sound may affect human echolocation ability. In this section a similar analysis is made using the loudness model of Glasberg and Moore (2002), as it takes account of human hearing: loudness depends not only on the frequency selectivity but also on the bandwidth and duration of the sound. The reason for choosing the model of Glasberg and Moore (2002) over AIM for the loudness analysis is clarified next.
The loudness model of Glasberg and Moore (2002) computes the frequency selectivity and compression of the basilar membrane in two stages, i.e. by computing the excitation pattern and then the specific loudness of the input signal. Physiologically, however, these are interlinked, and a time domain filter bank which simulates both the selectivity and the compression might be more appropriate. Although different time domain models of the level dependent auditory filters are available in AIM (e.g. dcGC), they do not give a sufficiently good fit to the equal loudness contours in ISO 2006 (Moore, 2014). This was the main reason for not choosing the AIM to model loudness in this thesis; instead, the model of Glasberg and Moore (2002) is used.
As discussed in the perception of loudness section in Chapter 2, a loudness model should consider the outer middle ear filtering, the non-linearity of the basilar membrane and the temporal integration of the auditory system. The loudness model of Glasberg and Moore (2002) estimates the loudness of steady and time varying sounds by accounting for these features of the human auditory system. Each stage of the model is described briefly below.
Outer middle ear transformation: The outer middle ear transformation was modeled using an FIR filter with 4097 coefficients, and the response at the inner ear can be represented using equation 4.5, where x and y_omt are the signals before and after the transformation and h is the impulse response of the filter.

$$ y_{omt} = \mathrm{filter}(h, x) \qquad (4.5) $$
Excitation pattern: The excitation pattern is defined as the magnitude of the output of each auditory filter, plotted as a function of the filter centre frequency. To compute the excitation pattern from the time domain signal, Glasberg and Moore (2002) used six FFTs in parallel, based on Hanning-windowed segments with durations of 2, 4, 8, 16, 32, and 64 ms, all aligned at their temporal centres. The windowed segments are zero padded, and all FFTs are based on 2048 sample points and updated at 1 ms intervals. Each FFT was used to calculate the spectral magnitudes in a specific frequency range; values outside the range were discarded.
The running spectrum was given as input to the auditory filters, and the outputs of the auditory filters were calculated at centre frequencies spaced at 0.25 equivalent rectangular bandwidth (ERB) intervals, taking into account the known variation of the auditory filter shape with centre frequency and level. The excitation pattern is then the output of the auditory filters as a function of centre frequency (Glasberg and Moore, 2002). This can be represented using equation 4.6, where W(f_c) is the frequency response of the auditory filter at centre frequency f_c, Y_omt is the power spectrum of y_omt calculated over a 1 ms interval using the six parallel FFTs mentioned above, and E(f_c) is the magnitude of the output of each auditory filter with centre frequency f_c.

$$ E(f_c) = Y_{omt} \cdot W(f_c) \qquad (4.6) $$
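A minimal Matlab sketch of the six parallel windowed FFTs is given below. The sampling rate and the analysis instant are assumptions, and the assignment of each FFT to its frequency range is omitted.

% Six parallel Hanning-windowed FFTs, zero padded to 2048 points and
% centred on the same 1 ms analysis instant
fs   = 32000;                          % assumed sampling rate
durs = [2 4 8 16 32 64]*1e-3;          % window durations (s)
n0   = round(0.5*fs);                  % assumed analysis instant (sample index)
S = zeros(2048, numel(durs));
for k = 1:numel(durs)
    N = round(durs(k)*fs);
    seg = yomt(n0 - floor(N/2) + (0:N-1));        % yomt: from equation 4.5
    S(:,k) = abs(fft(hann(N).*seg(:), 2048)).^2;  % zero padded power spectrum
end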
Specific loudness (SL): To model the non-linearity of the basilar membrane, the excitation pattern has to be converted to specific loudness. This was done by Glasberg and Moore (2002) using three conditions (cf equation 4.7).

$$ SL(f_c) = \begin{cases} C\left(\dfrac{2E(f_c)}{E(f_c)+T_Q(f_c)}\right)^{1.5}\left[(G\,E(f_c)+A)^{\alpha}-A^{\alpha}\right] & \text{if } E(f_c) \le T_Q(f_c) \\[4pt] C\left[(G\,E(f_c)+A)^{\alpha}-A^{\alpha}\right] & \text{if } T_Q(f_c) \le E(f_c) \le 10^{10} \\[4pt] C\left(\dfrac{E(f_c)}{1.04\times10^{6}}\right)^{0.5} & \text{if } E(f_c) \ge 10^{10} \end{cases} \qquad (4.7) $$
where T_Q(f_c) is the threshold of excitation, which is frequency dependent. G represents the low level gain in the cochlear amplifier relative to the gain at 500 Hz and above, and is also frequency dependent. The parameter A is used to bring the input-output function close to linear around the absolute threshold. α is a compressive exponent which varies between 0.27 and 0.2. C is a constant which scales the loudness to conform to the sone scale, where the loudness of a 1 kHz tone at 40 dB SPL is 1 sone; C is equal to 0.047.
Loudness depends not only on the intensity and bandwidth of the sound but also on other factors, especially its duration. The influence of duration on loudness was modeled by Glasberg and Moore (2002) using three concepts, namely instantaneous loudness, short term loudness and long term loudness. These depict the temporal integration of loudness in the auditory system and are described below.
Instantaneous loudness (IL): The area under the specific loudness pattern gives the instantaneous loudness. For binaural hearing, the specific loudness patterns at the two ears are summed and the area under the summed pattern gives the instantaneous loudness. It is to be noted that the instantaneous loudness is an intervening variable used in the calculation; it is not available for conscious perception.
Short Term Loudness (STL): The short term loudness was calculated by smoothing the instantaneous loudness using an attack constant α_a = 0.045 and a decay constant α_r = 0.02 (cf equation 4.8). The values of α_a and α_r were chosen so that the model gives reasonable predictions for the variation of loudness with duration and for amplitude modulated sounds (Moore, 2014).

$$ STL(n) = \begin{cases} \alpha_a\,IL(n) + (1-\alpha_a)\,STL(n-1) & \text{if } IL(n) \ge STL(n-1) \\ \alpha_r\,IL(n) + (1-\alpha_r)\,STL(n-1) & \text{if } IL(n) < STL(n-1) \end{cases} \qquad (4.8) $$
Long Term Loudness (LTL): The long term loudness was calculated by smoothing the short term loudness using an attack constant α_a1 = 0.01 and a decay constant α_r1 = 0.0005 (cf equation 4.9). The values of α_a1 and α_r1 were chosen so that the model gives reasonable predictions for the overall loudness of sounds that are amplitude modulated at low rates (Moore, 2014).

$$ LTL(n) = \begin{cases} \alpha_{a1}\,STL(n) + (1-\alpha_{a1})\,LTL(n-1) & \text{if } STL(n) \ge LTL(n-1) \\ \alpha_{r1}\,STL(n) + (1-\alpha_{r1})\,LTL(n-1) & \text{if } STL(n) < LTL(n-1) \end{cases} \qquad (4.9) $$
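The two recursions can be written compactly, as in the Matlab sketch below, assuming an instantaneous loudness vector IL sampled at 1 ms intervals.

% Attack/release smoothing of equations 4.8 and 4.9
aa  = 0.045; ar  = 0.02;       % STL attack and decay constants
aa1 = 0.01;  ar1 = 0.0005;     % LTL attack and decay constants
STL = zeros(size(IL)); LTL = zeros(size(IL));
for n = 2:length(IL)
    if IL(n) >= STL(n-1), a = aa; else, a = ar; end
    STL(n) = a*IL(n) + (1-a)*STL(n-1);           % equation 4.8
    if STL(n) >= LTL(n-1), a = aa1; else, a = ar1; end
    LTL(n) = a*STL(n) + (1-a)*LTL(n-1);          % equation 4.9
end
maxSTL = max(STL);             % loudness measure used for these brief stimuli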
Another important characteristic that affects the loudness of a sound is the influence of the intensity at the two ears. To model binaural loudness, several psychoacoustic results have been considered (for details see Moore, 2014). Early measurements of the level difference required for equal loudness (LDEL) of monaurally and diotically presented sounds gave approximately 10 dB. As the loudness of a sound doubles with every 10 dB rise in intensity, it was assumed in the loudness model of Glasberg and Moore (2002) that loudness sums across the ears. However, more recent results suggest that the LDEL is rather 5 to 6 dB. Moore and Glasberg described a model to account for these results using the concept of inhibition, i.e. that a strong input to one ear can inhibit the internal response evoked by a weaker input to the other ear (Moore, 2014).
Moore and Glasberg implemented the inhibition between the ears by using a gain function. Initially, the specific loudness pattern was smoothed using a Gaussian weighting function, and the relative values of the smoothed functions at the two ears were used to compute the gain functions of the ears. The gains were then applied to the specific loudness patterns of the two ears. The loudness for each ear was computed by summing the specific loudness over the centre frequencies, and the binaural loudness was obtained by summing the loudness values across the two ears (Moore, 2014). This procedure was used to compute the binaural loudness in this thesis.
The binaural loudness model of Moore and Glasberg is implemented in PsySound3, a GUI-driven Matlab environment for the analysis of audio recordings. The software can be downloaded from http://www.psysound.org. This Matlab code was used to calculate the loudness of our recordings.
Glasberg and Moore (2002) assumed that the loudness of a brief sound is determined by the maximum of the short term loudness, while the long term loudness may correspond to the memory of the loudness of an event that can last for several seconds. It is to be noted that for a time varying sound (e.g. an amplitude modulated tone) it is appropriate to consider the long term loudness as a function of time to calculate the time varying loudness. However, as the stimuli presented to the participants in this thesis were noise bursts and can be considered steady and brief, we follow the assumption of Glasberg and Moore (2002) and use the maximum of the short term loudness as a measure of the loudness of the recordings. The results of the maximum of the short term loudness for the recordings in Experiment 1 and Experiment 2 are tabulated in Tables B.1 to B.10 in Appendix B.
Recording       Anechoic    Conference    Lecture
NoObjectrec1    13.357      19.320        15.497
NoObjectrec2    13.296      19.376        -
Object50cm      20.674      26.707        -
Object100cm     20.194      24.377        17.160
Object150cm     -           -             16.179
Object200cm     14.404      21.537        -
Object300cm     13.347      19.651        -
Object400cm     13.379      19.975        -
Object500cm     13.420      19.529        -

Table 4.1: Mean of the maximum of the short term loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture room with the 5 ms duration signal. The blank cells (-) indicate that there were no recordings made at those distances.
Recording       Anechoic    Conference    Lecture
NoObjectrec1    40.090      44.999        -
NoObjectrec2    40.023      45.072        -
Object50cm      63.672      69.607        -
Object100cm     52.307      55.682        -
Object150cm     -           -             -
Object200cm     40.320      47.619        -
Object300cm     40.292      45.135        -
Object400cm     40.213      45.249        -
Object500cm     40.089      45.041        -

Table 4.2: Mean of the maximum of the short term loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture room with the 50 ms duration signal. The blank cells (-) indicate that there were no recordings made at those distances (no 50 ms recordings were made in the lecture room).
The means of the maximum of the STL over the 10 versions for the 5, 50 and 500 ms recordings in the different room conditions are presented in Tables 4.1, 4.2 and 4.3. From the tabulated data, the loudness difference between the recording without the object and that with the object at 100 cm was smaller in the lecture room than in the anechoic or conference room. This may be the reason for the low performance of the participants in the lecture room.
Recording       Anechoic    Conference    Lecture
NoObjectrec1    48.137      52.444        52.013
NoObjectrec2    48.082      52.487        -
Object50cm      76.143      78.659        -
Object100cm     62.159      63.574        54.712
Object150cm     -           -             52.466
Object200cm     48.353      54.580        -
Object300cm     48.377      52.387        -
Object400cm     48.187      52.569        -
Object500cm     48.131      52.502        -

Table 4.3: Mean of the maximum of the short term loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture room with the 500 ms duration signal. The blank cells (-) indicate that there were no recordings made at those distances.
Another comparison is that the loudness values follow the same trend as in the sound pressure level analysis of the room acoustics chapter (cf Tables 3.2 and 4.3). However, the values in Tables 4.1 to 4.3 are psychophysical: they depict not only the acoustics of the rooms but also take into account relevant aspects of human hearing. A detailed comparison of the loudness results with the performance of the participants is made in Chapter 5.
4.2.2
Auto correlation analysis for pitch perception
4.2.2.1
Dual profile
As discussed in the room acoustics chapter, one of the phenomena that echolocators use in detecting objects is the repetition pitch. The repetition pitch is generally perceived at a frequency equal to the inverse of the delay time between the sound and its reflection (Bilsen and Ritsma, 1969). In the experiments of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011), the objects were at 50, 100, 150, 200, 300, 400 and 500 cm. These distances correspond to delays of 2.9, 5.8, 8.7, 11.6, 17.4, 23.2 and 29 ms, and the frequencies of the pitch perceived for these delays would be 344, 172, 114, 86, 57, 43 and 34 Hz. However, it is to be noted that the actual delays may vary due to different factors, such as the recording set-up, the speed of sound, etc.
To test the presence of repetition pitch at these frequencies, and to examine how this information would be represented in the auditory system, the PCP, BMM and NAP modules described in section 4.1 were used to analyze the recordings. Most of the previous research explaining the repetition pitch perception of iterated rippled noise stimuli states that the peaks in the autocorrelation function are the basis for the repetition pitch perception (Yost, 1996; Patterson et al, 1996). Hence, instead of using the strobe finding and temporal integration modules, the autocorr module of the AIM was used as the final stage in this thesis to quantify the repetition pitch information.
The reader should note that not choosing the strobe temporal integration as the final stage in this thesis does not mean that it is not the way in which the pitch information is represented in the auditory system. As previous research on iterated rippled noise has quantified the repetition pitch perception using the autocorrelation theory, this thesis follows in its footsteps and quantifies the repetition pitch known to be useful for echolocation using the same principle of autocorrelation. Whether the strobe temporal integration is the way in which this repetition pitch is represented in the auditory system requires further analysis, which is left as future work. For the interested reader, an example figure of the results obtained using the strobe temporal integration module is presented in Appendix B.3.
After the ACF is generated by the autocorr module, a dual profile development module in the AIM sums up the ACF along both the temporal and the spectral dimensions. This is relevant to human hearing in depicting how the temporal and spectral information might be represented. An important feature of the dual profile module is that it plots both the temporal and the spectral sum on the frequency axis in a single plot. For this, the temporal profile and the spectral profile were scaled, and the inverse relation of time versus frequency (f = 1/t) was used to plot both time and frequency on a frequency scale. As these features of the dual profile module are useful for analyzing the repetition pitch, this module was used to analyze the temporal and spectral results.
The recordings with the object at 300 to 500 cm in Experiment 1, and the recordings with 2, 4, 8, 16, 32 and 64 bursts of 5 ms in Experiment 2, did not provide any additional information and were not analyzed. It is to be noted that the temporal profile (the blue line in the figures below) is calculated by summing the ACF output over 100 critical bands (50 Hz to 8000 Hz) at each time delay, and the spectral profile (the red line in the figures below) is calculated by summing the ACF output in each critical band over the 35 ms of time delays. Therefore, the temporal profile consists of 35 ms of delay samples and the spectral profile consists of 100 samples; a sketch of this computation is given below.
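The following Matlab sketch assumes the ACF is available as a 100-by-L matrix of channels by lags; the unit-maximum scaling is an assumption standing in for the scaling of the dual profile module.

% Dual profile sums over a channels-by-lags ACF matrix
fs  = 44100;
L   = size(ACF, 2);                    % lags 0 to 35 ms
lag = (1:L)/fs;                        % lag axis (s)
temporalProfile = sum(ACF, 1);         % sum over the 100 channels, per lag
spectralProfile = sum(ACF, 2);         % sum over the 35 ms of lags, per channel
fAxis = 1./lag;                        % temporal profile plotted at f = 1/t
temporalProfile = temporalProfile/max(temporalProfile);  % assumed scaling
spectralProfile = spectralProfile/max(spectralProfile);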
When the recordings were presented to the participants, they consisted of 5, 50 or 500 ms of signal plus an additional 450 ms of silence. Hence, all the analyses of the recordings in this thesis were done on the same principle, i.e. the whole signal was analyzed (e.g. the 5 ms recordings had 5 ms of signal plus 450 ms of silence). However, for the sake of presenting the figures, only the first 70 ms of the recordings is shown.
In the analysis of the 5 ms recordings, peaks were identified both in the temporal profile (blue line) and in the spectral profile (red line) (cf Figures 4.3 to 4.5). Note that the amplitude scale of the y axis differs between the sub figures of a particular figure. As the investigated attribute in this section is pitch, the sub figures should be compared with the No object sub figure of the same figure: a distinct peak in any other sub figure that is absent in the No object sub figure indicates the possibility of a pitch perception. There were small spectral differences, but these do not indicate any pitch information. In the temporal profile, peaks were identified approximately at the theoretical frequency of the repetition pitch: 86 Hz for the recordings with the object at 200 cm in the conference room (Experiment 1), and 172 and 114 Hz for the recordings with the object at 100 and 150 cm in the lecture room (Experiment 2) (cf Figures 4.4(d), 4.5(b) and 4.5(c)).
In the 50 ms and 500 ms recordings, distinct peaks that could explain a pitch perception were absent in the spectral profile (cf Figures 4.6 to 4.10). The temporal profiles of Figures 4.6 to 4.10 might have some peaks approximately around the theoretical frequencies of the repetition pitch, but these are not clearly visible due to the scaling of the figures. Therefore, from the dual profile analysis it can be concluded that the spectral profile (red line) does not provide any information for pitch perception. On the other hand, the conclusion that it is the temporal profile (blue line) that is necessary for detecting the objects based on repetition pitch is not certain from this analysis, as the peaks were not clearly visible. A further analysis that quantifies the peaks in the temporal profile is needed.
To determine whether it is the temporal information that is necessary for detecting the objects based on repetition pitch, the pitch strength development module of AIM, which measures the pitch perceived based on the peak strength, was used. This is further discussed in the next subsection, where it will be shown that the temporal profile has peaks at the theoretical frequencies of the repetition pitch, which explains the perception of the repetition pitch phenomenon.
Figure 4.3: The dual profile of a 5 ms signal recorded in the anechoic room (Experiment 1): (a) no object, (b) object at 50 cm, (c) object at 100 cm, (d) object at 200 cm. The blue line is the sum of the ACF along the spectral axis (temporal profile) and the red line is the sum of the ACF along the time delay axis (spectral profile) for a 70 ms time interval. The two profiles are scaled so that they can be compared, and the x axis of the temporal profile is converted to frequency using f = 1/t. The amplitude scale of the y axis differs between sub figures; as the investigated attribute is pitch, each sub figure should be compared with the No object sub figure: a distinct peak that is absent in the No object sub figure indicates the possibility of a pitch perception.
Figure 4.4: The dual profile of a 5 ms signal recorded in the conference room (Experiment 1): (a) no object, (b) object at 50 cm, (c) object at 100 cm, (d) object at 200 cm. In (d) a peak appears approximately at the theoretical frequency (86 Hz) of the repetition pitch. Axes, scaling and interpretation as in Figure 4.3.
0.7
spectral profile
temporal profile
0.5
0.4
0.3
0.2
Scaled autocorrelation index
0.6
0.1
0
50
100
200
400
800
1600
3200
6400
Frequency [Hz]
(a) No object
0.9
0.8
spectral profile
temporal profile
0.7
Peak approximately at the
theoretical frequency (172Hz)
of the repetition pitch
0.6
0.5
0.4
0.3
0.2
Scaled autocorrelation index
0.8
0.7
0.6
Peak approximately at the
theoretical frequency (115 Hz)
of the repetition pitch
0.5
0.4
0.3
0.2
0.1
0.1
0
50
100
200
400
800
1600
Frequency [Hz]
(b) Object at 100 cm
3200
6400
Scaled autocorrelation index
spectral profile
temporal profile
0
50
100
200
400
800
1600
3200
6400
Frequency [Hz]
(c) Object at 150 cm
Figure 4.5: The dual profile of a 5ms signal recorded in the lecture room (Experiment 2). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
[Four dual-profile plots: (a) No object; (b) Object at 50 cm; (c) Object at 100 cm; (d) Object at 200 cm. Axes: Frequency [Hz] versus scaled autocorrelation index.]
Figure 4.6: The dual profile of a 50ms signal recorded in the anechoic room (Experiment 1). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
[Four dual-profile plots: (a) No object; (b) Object at 50 cm; (c) Object at 100 cm; (d) Object at 200 cm. Axes: Frequency [Hz] versus scaled autocorrelation index.]
Figure 4.7: The dual profile of a 50ms signal recorded in the conference room (Experiment 1). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
[Four dual-profile plots: (a) No object; (b) Object at 50 cm; (c) Object at 100 cm; (d) Object at 200 cm. Axes: Frequency [Hz] versus scaled autocorrelation index.]
Figure 4.8: The dual profile of a 500ms signal recorded in the anechoic room (Experiment 1). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
[Four dual-profile plots: (a) No object; (b) Object at 50 cm; (c) Object at 100 cm; (d) Object at 200 cm. Axes: Frequency [Hz] versus scaled autocorrelation index.]
Figure 4.9: The dual profile of a 500ms signal recorded in the conference room (Experiment 1). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
[Three dual-profile plots: (a) No object; (b) Object at 100 cm; (c) Object at 150 cm. Axes: Frequency [Hz] versus scaled autocorrelation index.]
Figure 4.10: The dual profile of a 500ms signal recorded in the lecture room (Experiment 2). The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis for a 70ms time interval. The temporal and spectral profiles are scaled so that they can be compared with each other. The x axis of the temporal profile is converted to frequency using the inverse relationship f = 1/t. Note that the amplitude scale of the y axis differs between sub-figures. As the investigated attribute is pitch, each sub-figure should be compared with the No object sub-figure: a distinct peak that is absent in the No object sub-figure indicates a possible pitch perception.
4.2.2.2 Pitch strength:
As the peaks were randomly distributed in the temporal profile of the autocorrelation function computed using the dual profile module of AIM, it is not obvious which peak corresponds to a pitch. To solve this issue the auditory image model contains a pitch strength module, which calculates the pitch strength in order to determine whether a particular peak is valid or not. The pitch strength module first finds the local maxima and their corresponding local minima. The ratio of the peak height to the peak width of a peak (local maximum) is subtracted from the mean of the peak heights at the two adjacent local minima to obtain the pitch strength (PS) of that peak.
Two modifications were made to the pitch strength algorithm to improve its performance for the analysis in this thesis: (1) the low pass filtering was removed, as it smooths out the peaks, and (2) the pitch strength was measured using Equation 4.10. Figure 4.11 illustrates the present pitch strength algorithm. The peak with the greatest height has the greatest pitch strength, and its frequency would be the perceived frequency of the repetition pitch.
Pitch strength = Peak height - mean(peak heights at the two adjacent local minima).    (4.10)
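To make the modified measure concrete, a minimal MATLAB sketch is given below. It is not the aim-mat module itself: the vectors acf (the temporal profile of the ACF) and t (its time delay axis) are assumed inputs, and findpeaks requires the Signal Processing Toolbox.

    % Minimal sketch of the modified pitch strength measure (Equation 4.10),
    % assuming `acf` holds the temporal profile of the ACF and `t` its time
    % delay axis (hypothetical names; this is not the aim-mat module itself).
    [pks, locs]  = findpeaks(acf);       % local maxima of the temporal profile
    [mns, mlocs] = findpeaks(-acf);      % local minima (maxima of the negated signal)
    mns = -mns;
    ps = -inf(size(pks));
    for k = 1:numel(pks)
        iL = find(mlocs < locs(k), 1, 'last');    % adjacent minimum before the peak
        iR = find(mlocs > locs(k), 1, 'first');   % adjacent minimum after the peak
        if ~isempty(iL) && ~isempty(iR)           % skip peaks at the boundaries
            ps(k) = pks(k) - mean([mns(iL) mns(iR)]);   % Equation 4.10
        end
    end
    [psMax, idx] = max(ps);              % strongest peak ...
    fPitch = 1 / t(locs(idx));           % ... and its frequency, f = 1/t

The peak with the largest pitch strength would then be taken as the candidate repetition pitch.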
[Plot of the temporal profile of the ACF versus time delay (ms): a local maximum labeled 233 Hz with pitch strength 0.48, i.e. peak height minus the mean of the peak heights at the two adjacent local minima. Axis: autocorrelation index.]
Figure 4.11: An example illustrating the pitch strength measure computed using the pitch strength module of the AIM. The blue dot indicates the local maximum and the two red dots the corresponding local minima. The vertical pink line is the pitch strength calculated using Equation 4.10. The frequency in Hz was computed by inverting the time delay, f = 1/t.
The results of the calculated pitch strengths for the recordings of Experiment 1 and Experiment 2 are tabulated in Tables 4.4 to 4.6.
               Anechoic   Conference   Lecture
NoObjectrec1   0.37       0.78         0.19
NoObjectrec2   0.40       0.80         -
Object50cm     2.54       9.65         -
Object100cm    1.06       2.43         0.29
Object150cm    -          -            0.28
Object200cm    0.35       0.85         -

Table 4.4: Mean of the pitch strength (autocorrelation index) over 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5ms duration signal. Cells marked - indicate that no recordings were made at those distances.
               Anechoic   Conference   Lecture
NoObjectrec1   0.47       0.55         -
NoObjectrec2   0.44       0.52         -
Object50cm     4.17       6.59         -
Object100cm    1.67       2.22         -
Object150cm    -          -            -
Object200cm    0.42       0.54         -

Table 4.5: Mean of the pitch strength (autocorrelation index) over 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50ms duration signal. Cells marked - indicate that no recordings were made at those distances.
               Anechoic   Conference   Lecture
NoObjectrec1   0.71       0.84         1.30
NoObjectrec2   0.78       0.90         -
Object50cm     4.75       7.74         -
Object100cm    2.44       2.91         1.36
Object150cm    -          -            1.42
Object200cm    0.70       1.35         -

Table 4.6: Mean of the pitch strength (autocorrelation index) over 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500ms duration signal. Cells marked - indicate that no recordings were made at those distances.
It should be noted that peaks were also identified for the recordings without the object, which should not give rise to any pitch perception. This is because the pitch strength algorithm identifies all local maxima and minima, and hence calculates a pitch strength for every peak (local maximum), including random ones.
The unit for pitch strength in this analysis is the autocorrelation index, as the measure is computed on the autocorrelation function. The tabulated data show that for the 5ms and 50ms duration signals the pitch strength was greater than 1 at 50 and 100cm in the anechoic and conference rooms (cf. Tables 4.4 and 4.5). For the 500ms duration signal the strength was greater than 1 at 50 and 100cm in the anechoic room and at 50, 100 and 200cm in the conference room. Although the lecture room also had a pitch strength greater than 1 in this condition, the computed pitch strength was not consistent over a single frequency and lasted only for 4 to 8 time frames (the time frames covered 35ms in time delay, computed from a 70ms interval of the NAP signal, with a hop time of 10ms). This was not the case for the anechoic and conference rooms, which had high pitch strengths at a particular frequency that lasted for 14 to 18 such time frames.
The perceptual results in Experiment 1 and Experiment 2 show that the participants were able to detect the objects with a high percentage of correct responses at 50 and 100cm in the anechoic room and at 50, 100 and 200cm in the conference room (cf. Schenkman and Nilsson, 2010; Schenkman, Nilsson, and Grbic, 2011). As discussed in the above paragraph, the pitch strength was greater than 1 in these conditions. Assuming that pitch is the underlying information that the participants used to detect the objects at these distances, the comparison suggests that there might be a perceptual threshold of 1 (autocorrelation index) for pitch strength, and that a peak of that strength must persist for a certain number of time frames for the participants to perceive the repetition pitch. This persistence is determined by the acoustics of the room. A further comparison of the pitch strength results with the performance of the participants is made in Chapter 5.
4.2.3 Sharpness analysis for timbre perception
In the room acoustics chapter the spectral centroid was used as a measure of timbre perception. However, that spectral centroid was computed on the time varying Fourier transform. To take the properties of human hearing into account, Fastl and Zwicker (2007) computed the weighted centroid of the specific loudness rather than of the Fourier transform. This measure is known as sharpness and describes the extent to which a sound is perceived as varying from dull to sharp. The sharpness analysis of our recordings was made using code available from Psysound3. As sharpness varies over time, its median is used to depict the perceived sharpness. The means over the 10 versions of the median perceived sharpness in the anechoic, conference and lecture rooms for the 5, 50 and 500ms duration signals are tabulated in Tables 4.7 to 4.9. The results for all the recordings can be seen in Appendix B, Tables B.11 to B.20.
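In MATLAB the computation can be sketched as follows. The weighting g(z) below is one commonly cited approximation (after von Bismarck; a similar weighting appears in DIN 45692) and is not necessarily the exact Psysound3 weighting; the specific loudness vector Nspec is an assumed input.

    % Minimal sketch of sharpness as the weighted centroid of specific loudness
    % (after Fastl and Zwicker, 2007). `Nspec` is assumed to hold the specific
    % loudness N'(z) in sone/Bark in 0.1-Bark steps (hypothetical name); g(z)
    % is one published approximation, not necessarily the Psysound3 one.
    dz = 0.1;                       % step on the critical-band rate scale [Bark]
    z  = (dz:dz:24)';               % critical-band rate [Bark]
    g  = ones(size(z));             % g(z) = 1 below about 15.8 Bark
    hi = z > 15.8;
    g(hi) = 0.15*exp(0.42*(z(hi) - 15.8)) + 0.85;           % extra weight on high bands
    S = 0.11 * sum(Nspec .* g .* z * dz) / sum(Nspec * dz); % sharpness [acum]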
According to Pedrielli, Carletti, and Casazza (2008), their participants had a just noticeable difference for sharpness of 0.04 acum. The results in Tables 4.7 to 4.9 show that for Experiment 1 the difference in median sharpness, compared to the recordings without the object, was greater than 0.04 acum for the object at 50 and 100cm. For Experiment 2 the differences between the recordings with and without the object were smaller than in Experiment 1, but still greater than 0.04 acum. However, at shorter distances (less than 200cm) repetition pitch and loudness might be more relevant sources of information for the participants to echolocate than sharpness.
The recordings of Experiment 1 with objects at distances of 200cm, 300cm, 400cm and 500cm for the 5ms (anechoic, conference), 50ms (anechoic, conference) and 500ms (conference) signal durations had differences in median sharpness of less than 0.04 acum when compared to the recordings without the object.
               Anechoic   Conference   Lecture
NoObjectrec1   1.888      1.972        1.849
NoObjectrec2   1.900      1.983        -
Object50cm     2.052      2.032        -
Object100cm    2.138      2.032        1.778
Object150cm    -          -            1.834
Object200cm    1.921      2.003        -
Object300cm    1.906      2.009        -
Object400cm    1.891      1.982        -
Object500cm    1.889      1.986        -

Table 4.7: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 5ms duration signal.
               Anechoic   Conference   Lecture
NoObjectrec1   1.889      1.893        -
NoObjectrec2   1.901      1.894        -
Object50cm     2.068      1.964        -
Object100cm    2.141      1.950        -
Object150cm    -          -            -
Object200cm    1.912      1.936        -
Object300cm    1.904      1.914        -
Object400cm    1.874      1.917        -
Object500cm    1.881      1.888        -

Table 4.8: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 50ms duration signal.
               Anechoic   Conference   Lecture
NoObjectrec1   1.861      1.935        2.072
NoObjectrec2   1.882      1.938        -
Object50cm     2.116      2.095        -
Object100cm    2.119      2.043        2.200
Object150cm    -          -            2.110
Object200cm    1.892      1.967        -
Object300cm    1.858      1.950        -
Object400cm    1.831      1.949        -
Object500cm    1.835      1.941        -

Table 4.9: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 500ms duration signal.
For the 500ms signal duration in the anechoic room, the recordings with the object at 400cm and 500cm had a difference in sharpness greater than 0.04 acum when compared to the recording without the object (cf. Table B.13 in the appendix). This might be the information that the blind participants in Experiment 1 used to identify the object at distances of 400cm and beyond. A detailed analysis of these results together with the performance of the participants is made in Chapter 5.
Chapter 5
Analysis of the perceptual results
5.1 Description of the non-parametric modeling:
A psychometric function is used in psychoacoustics to relate the perceptual results to the physical parameters of the stimulus. Traditionally the psychometric function is estimated using parametric fitting, i.e. assuming a true function that can be described by a specific parametric model and then estimating the parameters of that model by maximizing the likelihood. However, in practice the correct parametric model underlying the psychometric function is unknown, and estimating the psychometric function based on such a model may lead to incorrect interpretations (Zychaluk and Foster, 2009). To solve this problem, Zychaluk and Foster (2009) implemented a non-parametric model to estimate the psychometric function, i.e. the psychometric function is modeled locally without any assumption of a true function. Therefore, the method proposed by Zychaluk and Foster (2009) is used in our analysis. Below, the non-parametric estimation of the underlying psychometric function is briefly described, followed by the analysis of the results.
A generalized linear model (GLM) is usually used when fitting a psychometric function with parametric modeling. It consists of three components: a random component from the exponential family, a systematic component η, and a monotonic differentiable link function g that relates the two. Hence, the psychometric function P(x) can be modeled using Equation 5.1. The parameters of the GLM are estimated by maximizing the appropriate likelihood function (Zychaluk and Foster, 2009). The efficiency of the GLM relies on how well the chosen link function g approximates the true function.

η(x) = g[P(x)]    (5.1)
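For illustration, such a parametric fit can be sketched in MATLAB as below. The Weibull-shaped function (the link used later in Section 5.2.1), the 0.5 guess rate of a two alternative forced choice task, and the names x (stimulus levels), k (correct counts) and n (trials per level) are assumptions of this sketch; for an attribute such as distance, where performance falls as the level increases, the function would be mirrored.

    % Minimal sketch of a parametric (Weibull-type) psychometric fit by maximum
    % likelihood; `x`, `k` and `n` are hypothetical column vectors of stimulus
    % levels, correct counts and trial counts. The 0.5 floor is a 2AFC guess rate.
    P   = @(b, x) 1 - 0.5*exp(-(x./b(1)).^b(2));   % Weibull-shaped function
    nll = @(b) -sum(k.*log(P(b, x)) + (n - k).*log(1 - P(b, x)));
    bhat = fminsearch(nll, [150 2]);               % crude starting guesses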
In the non-parametric modelling, instead of fitting the link function g, the function η is fitted using a local linear method: for a given point x, the value η(u) at any point u in a neighbourhood of x is approximated by the first-order expansion in Equation 5.2 (Zychaluk and Foster, 2009).

η(u) ≈ η(x) + (u - x)η'(x)    (5.2)
where η'(x) is the first derivative of η. The actual estimate of η(x) is obtained by fitting this approximation to the data over the prescribed neighbourhood of x. Two choices are important for this purpose: the kernel K and the bandwidth h. A Gaussian kernel is preferred, as it has unbounded support and works best for widely spaced levels. An optimal bandwidth can be chosen using plug-in, bootstrap or cross validation methods (Zychaluk and Foster, 2009). As no method is guaranteed to always work, the bootstrap method with 30 replications was chosen in our analysis to find the optimal bandwidth. When the bootstrap method failed to find the optimal bandwidth, cross validation was used instead.
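The local linear idea can be sketched as follows; kernel-weighted least squares on a crude logit of the raw proportions stands in for the local likelihood fit actually used by Zychaluk and Foster (2009), and the names x, p and h are assumptions.

    % Minimal sketch of a local linear psychometric fit with a Gaussian kernel,
    % assuming stimulus levels `x`, proportions correct `p` (column vectors) and
    % a bandwidth `h` (hypothetical names; simplified relative to Zychaluk and
    % Foster, 2009, who fit by local likelihood rather than least squares).
    p    = min(max(p, 0.01), 0.99);           % keep the logit finite
    eta  = log(p ./ (1 - p));                 % logit of the raw proportions
    xq   = linspace(min(x), max(x), 200)';    % evaluation grid
    Pfit = zeros(size(xq));
    for i = 1:numel(xq)
        w  = exp(-0.5*((x - xq(i))/h).^2);    % Gaussian kernel weights
        sw = sqrt(w);
        X  = [ones(size(x)), x - xq(i)];      % local linear design matrix
        b  = (X .* [sw sw]) \ (eta .* sw);    % weighted least squares fit
        Pfit(i) = 1 / (1 + exp(-b(1)));       % back-transform eta(xq) to P(xq)
    end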
5.2 Analysis
5.2.1 Distance
Initially the psychometric function was fitted to the mean proportion of correct responses as a function of distance. Figures 5.1, 5.2 and 5.3 show the non-parametric modeling (local linear fit) and the parametric modeling of the blind participants' perceptual results as a function of distance for the recordings of the 5, 50 and 500ms signals in the anechoic and conference rooms.
[Two plots of the proportion of correct responses versus distance from the object (cm), panels (a) and (b), each showing the mean proportion correct, a Weibull fit and a local linear fit.]
Figure 5.1: The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 5ms recordings in the anechoic chamber. (b) For the 5ms recordings in the conference room.
[Two plots of the proportion of correct responses versus distance from the object (cm), panels (a) and (b), each showing the mean proportion correct, a Weibull fit and a local linear fit.]
Figure 5.2: The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 50ms recordings in the anechoic chamber. (b) For the 50ms recordings in the conference room.
[Two plots of the proportion of correct responses versus distance from the object (cm), panels (a) and (b), each showing the mean proportion correct, a Weibull fit and a local linear fit.]
Figure 5.3: The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 500ms recordings in the anechoic chamber. (b) For the 500ms recordings in the conference room.
The link function used for the parametric modeling was the Weibull function. As this link function was not appropriate, the parametric fit does not agree well with the perceptual results, whereas the local linear fit does. This demonstrates the advantage of the non-parametric modeling. It is to be noted that the mean of the proportions of correct responses of the participants was used for the psychometric fitting in this chapter. If the individual responses had been used, the individual participants' thresholds would vary, but the local linear fit would probably still agree well with the perceptual results. Hence, the results in the remainder of this chapter are based on the psychometric function obtained with the local linear fit. The Matlab implementation of the non-parametric model fitting by Zychaluk and Foster (2009) was used for this purpose.
The local linear fit needs at least 3 stimulus values to make a fit. As the recordings in the lecture room (Experiment 2) had only two stimulus values, i.e. at 100 and 150cm, it was not possible to make a psychometric fit for these recordings. When a subject's proportion of correct responses is 0.75, one can say that the subject can detect the object; hence the threshold values of the stimulus in this chapter were chosen at this proportion of correct responses. The term threshold refers to the subjective threshold, as the output of the auditory models depicts human hearing. The thresholds of loudness, repetition pitch and sharpness refer to the absolute threshold at which a participant can echolocate using the respective subjective attribute. The threshold of distance refers to the distance at which a person may detect an object with a certain probability. As the fitted psychometric function is discrete, the fit may not have a value at exactly 0.75. Hence, the threshold values were chosen by taking the mean over the proportions of correct responses between 0.73 and 0.75.
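Reading a threshold off the discrete fit can then be sketched as below, reusing the grid xq and fitted values Pfit from the sketch above (assumed names).

    % Minimal sketch of the threshold rule described above: average the stimulus
    % values where the fitted function lies between 0.73 and 0.75 correct.
    band = Pfit >= 0.73 & Pfit <= 0.75;   % samples in the target band (assumed non-empty)
    threshold = mean(xq(band));           % e.g. a distance threshold in cm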
The threshold values of the distance for which the blind and the sighted would be able
to detect the object using echolocation are tabulated in Table 5.1. The results show that
the blind participants could detect the object at farther distances than the sighted.
Room         Threshold (cm)
             5ms               50ms              500ms
             blind   sighted   blind   sighted   blind   sighted
Anechoic     150     130       166     160       172     166
Conference   158     121       176     147       247     207
Lecture      -a      -         -       -         -       -

a Non-parametric psychometric fit needs at least 3 inputs.

Table 5.1: Detection thresholds of object distance (cm) for duration, room, and listener groups. The threshold values were calculated from the psychometric functions of the blind and sighted participants' responses at mean proportions of correct responses between 0.73 and 0.75.
5.2.2 Loudness
Room         Threshold (sones)
             5ms               50ms              500ms
             blind   sighted   blind   sighted   blind   sighted
Anechoic     16.8    17.5      43.7    45.1      52.9    53.2
Conference   22.6    24.1      49.4    53.1      53.6    55.3
Lecture      -a      -         -       -         -       -

a Non-parametric psychometric fit needs at least 3 inputs.

Table 5.2: Threshold values of loudness (sones) for duration, room, and listener groups. The threshold values were calculated from the psychometric functions of the blind and sighted subjects' responses at mean proportions of correct responses between 0.73 and 0.75.
The threshold values of loudness at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.2. The tabulated data show that the blind subjects' threshold for loudness was lower than that of the sighted: roughly 1 sone less in the anechoic chamber and 2 sones less in the conference room. As the same loudness model was used for both the sighted and the blind, it is concluded that the lower threshold of the blind is due to their perceptual ability. This is further discussed in Chapter 6.
5.2.3 Pitch
The threshold values of pitch strength at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.3. The threshold varied for the blind and the sighted across signal durations and room conditions. One explanation for this variation is that for shorter duration signals a participant is more likely to miss the pitch information; hence it is assumed that the performance (percentage of correct responses) of the participants with the 5 and 50ms signals is based not only on the pitch strength but also on the attention of the participant. Owing to the influence of this attention factor on the perceptual results, the thresholds obtained from the 5 and 50ms duration signals in Table 5.3 cannot be used.
Schenkman and Nilsson (2011) showed in their study that when pitch information was present in the stimuli, the participants' performance was almost 100 percent. The 500ms recordings with the object at 50 and 100cm in Experiment 1 yielded almost 100 percent correct responses for both the blind and the sighted (cf. Schenkman and Nilsson, 2010). Therefore, in this condition it is assumed that the perceptual results depict the performance of the participants based solely on the pitch information, and that the risk of a participant missing the pitch information through lapses of attention can be neglected. Hence, the thresholds obtained from the 500ms duration signals in Table 5.3 were used to find the pitch strength thresholds for the blind and the sighted.
Room         Threshold (autocorrelation index)
             5ms               50ms              500ms
             blind   sighted   blind   sighted   blind   sighted
Anechoic     0.77    0.88      0.80    0.96      1.10    1.23
Conference   1.54    2.21      1.07    1.69      1.14    1.41
Lecture      -a      -         -       -         -       -

a Non-parametric psychometric fit needs at least 3 inputs.

Table 5.3: Threshold values of the pitch strength (autocorrelation index) for duration, room, and listener groups, calculated from the psychometric functions of the blind and sighted participants' responses at mean proportions of correct responses between 0.73 and 0.75.
If it is assumed that the auditory system analyses the pitch information absolutely, i.e. it does not compare the peak heights in the ACF between the recordings (when presented in a two alternative forced choice manner), then the results indicate that the absolute threshold for detecting the pitch based on autocorrelation theory should be greater than 1.10 and 1.23 (autocorrelation index) for the blind and the sighted, respectively. On the other hand, if it is assumed that the auditory system analyses the pitch information relatively, i.e. it compares the peak heights in the ACF between the recordings, then the results indicate that the relative threshold for detecting the pitch should be greater than 0.36 and 0.49 (autocorrelation index) for the blind and the sighted, respectively, presumably the absolute thresholds minus the mean no object pitch strength of about 0.74 (cf. Table 4.6).
5.2.4 Sharpness
The threshold values of sharpness at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.4. The tabulated data show that the blind and sighted participants' thresholds for sharpness were almost the same. However, unlike loudness and pitch strength, the sharpness need not always be greater in value for the recording with the object. For example, in Experiment 1 and Experiment 2 the participants were presented with two stimuli, one with and one without the object, in a two alternative forced choice manner. The participant distinguishes the recording with the object from the recording without the object by identifying the one with the higher loudness level or pitch strength.
However, if a participant uses sharpness to distinguish the recordings, it is not necessarily the recording with the object that has the higher sharpness value. It could be that the recording with the object is duller (has a lower sharpness) than the recording without the object, and the participant may use this information to identify the object.
Room         Threshold (acums)
             5ms               50ms              500ms
             blind   sighted   blind   sighted   blind   sighted
Anechoic     1.97    1.98      1.96    1.98      1.94    1.96
Conference   2.01    2.03      1.94    1.94      1.97    1.97
Lecture      -a      -         -       -         -       -

a Non-parametric psychometric fit needs at least 3 inputs.

Table 5.4: Threshold values of the median sharpness (acums) for duration, room, and listener groups, calculated from the psychometric functions of the blind and sighted subjects' responses at mean proportions of correct responses between 0.73 and 0.75.
A detailed discussion on whether the sharpness information is useful for the participants to echolocate is presented in Chapter 6.
Chapter 6
Discussion
As stated in the introduction, one recent focus of human echolocation research is to find the causes of the variability in echolocation ability between the blind and the sighted. Although it is expected that the combination of neuroimaging and psychoacoustic methods can give us some insight into the high echolocating ability of the blind, these methods do not reveal which information in the acoustic stimulus determines it (at least when that information is not known beforehand), nor how this information is represented in the human auditory system. The auditory models for human echolocation were implemented mainly to address this issue: to find the important information that causes the variability in echolocation ability between the blind and the sighted, and to study how this information might be represented in the human auditory system.
Initially, the signal analysis presented in Chapter 3 was performed to find the physical information that is useful for echolocation and to analyze the influence of the room acoustics on human echolocation. Sound pressure level, autocorrelation and spectral centroid analyses were performed on the recordings, and the results demonstrate that the acoustics of the room does affect the stimuli and thereby the physical attributes that depend on them. However, as the representation of information in the auditory system is complex, auditory models from the literature were used to study how the perceptual attributes corresponding to sound pressure level, autocorrelation and spectral centroid are represented in the auditory system.
The results suggest that repetition pitch, loudness and sharpness provide potential information for the listeners to echolocate at distances below 200cm. The results also show that at longer distances sharpness information may influence human echolocation. A detailed discussion of how loudness, pitch and sharpness are essential for human echolocation and how they might be represented in the auditory system is presented in Sections 6.1, 6.2 and 6.3. A discussion of how the room acoustics and binaural information affect human echolocation is presented in Sections 6.4 and 6.5, followed by a discussion of the advantages of using auditory models in understanding human echolocation and the theoretical implications of the thesis in Sections 6.6 and 6.7, respectively.
6.1 Echolocation and loudness
The loudness model of Glasberg and Moore (2002) was used in our analysis as it gives a good fit to the equal loudness contours of ISO 2006. The results of the model were compared with the proportions of correct responses of the listeners. The model results are tabulated in Tables 4.1 to 4.3 of Chapter 4, and a comparison of these with the participants' perceptual responses is shown in Table 5.2 of Chapter 5.
The differences in loudness between the loudness threshold of the sighted and the loudness of the recording without the object for the 5, 50 and 500ms duration signals were approximately 4.2, 5 and 5 sones in the anechoic room and 5, 8 and 3 sones in the conference room, respectively (cf. Tables 4.1 to 4.3 and Table 5.2). This difference in loudness is sufficient to be used by the participants to echolocate, which shows that loudness is one potential source of information for echolocation. When comparing the loudness thresholds of the sighted and the blind, the threshold for the blind was lower (cf. Table 5.2). As the same model was used for both groups, it is not obvious what causes this perceptual difference. However, if it is assumed that the loudness information is encoded in the same manner for both the blind and the sighted, then the results show that the blind can echolocate at lower loudness levels than the sighted.
6.2 Echolocation and pitch
Repetition pitch is one of the important sources of information that listeners use to detect an object at shorter distances. However, it is not clear how this information is represented in the auditory system. To find out how the repetition pitch is perceived in the auditory system, a dual profile analysis was performed in Section 4.2.2.1 of Chapter 4. The results suggested that the repetition pitch could be explained by the peaks in the temporal profile rather than the peaks in the spectral profile of the autocorrelation function. This is in agreement with the study of Yost (1996), in which the peaks in the temporal domain of the autocorrelation are the basis for the explanation of repetition pitch perception.
However, the dual profile analysis was not sufficient to determine the strength of the perceived pitch, as the peaks were rather randomly distributed in the temporal profile of the autocorrelation function. A pitch strength measure was used to solve this problem (cf. Equation 4.10). The results are tabulated in Tables 4.4 to 4.6 of Chapter 4 and Table 5.3 of Chapter 5. The pitch strength results indicate a threshold of 1 for the participants to detect the pitch from the peak heights in the temporal profile of the autocorrelation function. Regarding the pitch strength thresholds of the sighted and the blind, the threshold for the blind was lower. As the auditory models were used with unchanged parameters for the analysis, it is not evident what determines this perceptual difference. In this thesis it is assumed that the pitch information is encoded in the same manner for both the blind and the sighted. In light of this assumption, the results show that the blind, compared to the sighted, can echolocate efficiently using pitch information of lower pitch strength.
6.3 Echolocation and sharpness
Sharpness is a measure of the extent to which a sound is perceived as dull or sharp. To find out how sharpness information is useful for the participants to echolocate, the weighted centroid of the specific loudness was computed using the code from Psysound3. Pedrielli, Carletti, and Casazza (2008) showed in their analysis that the just noticeable difference for sharpness was 0.04 acum. The tabulated results of our analysis (cf. Tables 4.7 to 4.9) show that the difference in sharpness was greater than 0.04 acum for the recordings with the object at 50, 100, 150 and 200cm. However, at these distances the loudness or pitch information is more prominent. Hence, at these distances sharpness might not be the major source of information for the participants to echolocate, but this has to be verified.
Interestingly, for the 500ms recordings in the anechoic chamber with the object at 400cm and 500cm, the sharpness difference compared to the recording without the object was slightly greater than 0.04 acum (cf. Table B.13), i.e. greater than the just noticeable difference for sharpness reported by Pedrielli, Carletti, and Casazza (2008). Hence, this may be the vital information that the participants used to detect the object at 400cm in Experiment 1. Performing a further experiment that controls the sharpness information of the stimuli might give more insight into how this attribute of sound is helpful for echolocation.
6.4 Echolocation and room acoustics
Loudness, pitch and sharpness provide the participants with useful information to echolocate. These attributes depend on the physiology of the auditory system, but they also depend on the acoustics of the room and the type of stimuli used. The results for the recordings of Experiment 1 and Experiment 2 illustrate this.
For example, the conference room of Experiment 1 enhanced the pitch strength and hence enabled the participants to echolocate at farther distances, whereas the lecture room in Experiment 2 diminished the pitch strength, so that the participants had to rely on other information, such as loudness, to echolocate in this room, causing a deterioration in performance. One cause of this deterioration may be the difference in the recording setups of Experiment 2 and Experiment 1: the loudspeaker was on the chest of the artificial head in Experiment 1 but behind the artificial head in Experiment 2. Another cause might be the room acoustics itself: the reverberation time was 0.4s in the Experiment 1 conference room and 0.6s in the Experiment 2 lecture room.
Another example of the influence of room acoustics on echolocation is given by the recordings in the anechoic room of Experiment 1. The recordings with the object at 400cm and 500cm had no reflections other than that from the object. This may be the cause of the slight sharpness difference, which might help the participants to detect the object. These results show that by careful design of the room acoustics one can improve the echolocation ability of listeners in that environment.
6.5 Echolocation and binaural information
Binaural information may provide additional information for the participants to echolocate. As mentioned in Chapter 3, past studies show that interaural level differences and interaural time differences provide information for echolocating. For example, in the study of Papadopoulos et al. (2011) the information for obstacle discrimination was found in the frequency dependent interaural level differences (ILD), especially in the range from 5.5 to 6.5kHz. Recently, Nilsson and Schenkman (2015) found that blind people used the ILD more efficiently than the sighted.
As the recordings of Experiment 1 and Experiment 2 were static, binaural information was not considered in this thesis. The static nature of the recordings might be one cause of the lower performance of the participants. In a real situation, blind persons would use their own sounds and would also move their heads and bodies. It is reasonable to conclude that such sounds offer more information to the blind.
6.6 Advantages and disadvantages of the auditory model approach to human echolocation
Research on human echolocation has mostly used psychoacoustic experiments, in which a physical stimulus is presented to the participants in a controlled manner. This helps the researcher to identify the underlying cause of the participants' echolocation. However, in some cases, although the stimuli are presented in a controlled manner, the underlying cause of the echolocation is not evident. This was the case in the experiments of Schenkman and Nilsson (2010), where the blind participants performed better than the sighted but the underlying cause of the high performance could not be determined.
As discussed in the introduction of this thesis, scanning the participant's brain using functional magnetic resonance imaging and locating which areas of the brain are activated when the participant detects an object can help the researcher to understand whether physiological adaptation is the cause of the high echolocation ability of the blind. However, one disadvantage of such an analysis is that it does not fully reveal how the information necessary for the high echolocation ability is represented in the auditory system.
To address this problem, the binaural loudness model of Moore and Glasberg (2007), the auditory image model of Patterson, Allerhand, and Giguere (1995) and the sharpness model of Fastl and Zwicker (2007) were implemented in this thesis. The loudness model of Moore and Glasberg (2007) was chosen because it agrees well with the equal loudness contours of ISO 2006 and gives an accurate representation of binaural loudness (Moore, 2014). One reason for choosing the auditory image model is that, instead of using two different modules for frequency selectivity and compression, it uses a single dynamic compressive gammachirp filterbank (dcGC) module to depict the frequency selectivity and compression of the basilar membrane.
The analysis performed using the AIM showed that the peaks in the temporal information are the source of repetition pitch perception. The sharpness analysis showed that the blind participants might use this attribute to detect objects at longer distances, and that both temporal and spectral information is required to encode this attribute. The results suggest that the auditory models do explain how the information necessary for the high echolocation ability of the blind is represented in the auditory system.
In order to know whether the high echolocation ability is due to physiological differences, one should vary the parameters of the models so that the model results fit the participants' perceptual results. This was not considered in this thesis; instead it was assumed that the high echolocation ability is due to high perceptual ability. In light of the advantages and disadvantages mentioned above, it would be most effective for a researcher to use psychoacoustic experiments, neuroimaging and auditory model analysis in conjunction with signal analysis to understand human echolocation.
6.7 Theoretical implications of the thesis
The signal analysis performed on the physical stimuli showed how sound pressure level, autocorrelation and spectral centroid vary between the recordings. Hence, signal analysis is a vital tool for finding the physical information that is necessary for human hearing. Furthermore, as the auditory models were developed on the basis of research in the physiology and psychology of the human auditory system, they depict human hearing. The auditory analysis done on the recordings of Experiment 1 and Experiment 2 agrees with the study of Yost (1996) in that the information necessary for pitch perception is represented temporally in the auditory system.
Assuming that one cause of high echolocation ability is perceptual, the subjective thresholds for the blind and the sighted participants were obtained by comparing the auditory model results with their perceptual results. The results indicate that the blind participants have lower detection thresholds and hence are better than the sighted at echolocating.
Regarding the implications of the thesis for human echolocation, the auditory analysis confirmed that repetition pitch and loudness are important sources of information for listeners to echolocate at shorter distances, in agreement with the results of Schenkman and Nilsson (2010, 2011) and Kolarik, Cirstea, Pardhan, and Moore (2014). Sharpness information was also analyzed, and it was found that it can be important at both short and long distances. No previous research in human echolocation has investigated the usefulness of sharpness for human echolocation. Performing psychoacoustic experiments might give further insight into the usefulness of timbre qualities such as sharpness for echolocation.
Chapter 7
General Conclusion
7.1 Conclusions
The aim of implementing the auditory models for human echolocation was to find the information that determines high echolocation ability and how this information is represented in the auditory system. As for the information necessary for high echolocation ability, three subjective attributes known to be of importance were considered in this thesis: loudness, pitch and sharpness. To study how these subjective attributes are represented in the human auditory system, a number of auditory models were used.
To analyze how loudness is useful for echolocation, the binaural loudness model of Moore and Glasberg (2007) was used, as it gives a good fit to the equal loudness contours of ISO 2006 (Moore, 2014). The auditory image model of Bleeck, Ives, and Patterson (2004b) was used to analyze the repetition pitch phenomenon, which is known to be useful for echolocation at shorter distances. One reason for using the auditory image model for the repetition pitch analysis was its dynamic compressive gammachirp filterbank, which is physiologically inspired and depicts the frequency selectivity and compression of the basilar membrane. Finally, to analyze sharpness, the loudness model of Glasberg and Moore (2002) was used, and the sharpness information was obtained from the weighted centroid of the specific loudness (Fastl and Zwicker, 2007).
The analysis showed that at shorter distances repetition pitch, loudness and sharpness provide the information for the participants to echolocate. At longer distances sharpness information might be used by the subjects to echolocate; this conclusion has to be verified by a further experiment with control over the sharpness attribute of the stimuli. Regarding how the useful information for human echolocation might be represented in the auditory system, the analysis confirmed that the repetition pitch is represented by the peaks in the temporal profile rather than the spectral profile (Yost, 1996), and, as the sharpness information is computed from the centroid of the specific loudness, it is represented using both spectral and temporal information.
Although the auditory analyses in this thesis were done using different auditory models for the loudness, pitch and sharpness attributes, the same model was always used to compare the perceptual results of the blind and the sighted (e.g. the same loudness model was used for both groups). Hence, it is assumed in this thesis that the high echolocation ability of the blind is due to their perceptual ability, and it was therefore justified to compute the thresholds for the blind and the sighted in the same way. The analysis showed that the blind had lower thresholds than the sighted and could echolocate at lower loudness and pitch strength levels. It should be noted that the recordings in Experiment 1 and Experiment 2 were made at static positions. In real
life situations the listeners would be using their own sounds and both the listener and
the reflecting object may be moving. Probably the thresholds would be even lower for
such situations.
In conclusion, the thesis has shown the importance of understanding the roles of pitch,
loudness and timbre for human echolocation. The specific roles and interactions of these
three aspects have to be studied in more detail. Especially, the role of timbre is a topic
worthy of deeper understanding.
7.2 Future work
In this thesis it was assumed that the information is represented in a similar way for both the blind and the sighted. However, this presupposition may not be true, i.e. the high echolocation ability of the blind may be due to physiological differences. As part of future work to investigate this, the parameters of the auditory models should be varied and the results analyzed in parallel with neuroimaging, psychoacoustic experiments and various methods of signal analysis. Neuroimaging may help to identify whether the high echolocation ability is related to the listeners' physiology. Once it is established that the underlying ability of the listeners is physiological, the parameters of the auditory models can be varied until the model results agree with the psychoacoustic results. In this way neuroimaging, psychoacoustic experiments, auditory models and signal analysis together may help us to understand how the information necessary for the high ability of the blind is represented and perceived.
Bibliography
ANSI, 1994 “American national standard acoustical terminology, ansi s1.1-1994” American National
Standard Institute, New York
Arias C, Ramos O A, 1997 “Psychoacoustic tests for the study of human echolocation ability” Applied
Acoustics 51 399–419
ASA, 1960 “Acoustical terminology si, 1–1960” American Standards Association, New York
ASA, 1973 “American national psychoacoustical terminology, s3.20–1973” American Standards Association, New York
Bassett I G, Eastmond E J, 1964 “Echolocation: Measurement of pitch versus distance for sounds
reflected from a flat surface” The Journal of the Acoustical Society of America 36 911
Bilsen F, 1966 “Repetition pitch: monaural interaction of a sound with the repetition of the same,
but phase shifted sound” Acustica 17 295–300
Bilsen F, Ritsma R, 1969 “Repetition pitch and its implication for hearing theory” Acustica 22 63–73
Bleeck S, 2011 “Aim-mat” [Online; accessed 25-April-2016]
URL https://code.soundsoftware.ac.uk/projects/aimmat
Bleeck S, Ives T, Patterson R D, 2004a “Aim-mat” [Online; accessed 25-April-2016]
URL http://w3.pdn.cam.ac.uk/groups/cnbh/aimmanual/download/downloadframeset.htm
Bleeck S, Ives T, Patterson R D, 2004b “Aim-mat: the auditory image model in matlab” Acta Acustica
United with Acustica 90 781–787
Cabrera D, 2014 “Psysound3” [Online; accessed 25-April-2016]
URL http://www.psysound.org
Cabrera D, Ferguson S, Schubert E, 2007 “’psysound3’: Software for acoustical and psychoacoustical
analysis of sound recordings” in “Proceedings of the 19th International Conference on Auditory
Display (ICAD 2007)”, pp. 356–363
Cotzin M, Dallenbach K M, 1950 “‘Facial vision’: The rôle of pitch and loudness in the perception of obstacles by the blind” The American Journal of Psychology 63 485–515
Supa M, Cotzin M, Dallenbach K M, 1944 “‘Facial vision’: The perception of obstacles by the blind” The American Journal of Psychology 57 133–183
De Boer E, 1956 On the “residue” in hearing Ph.D. thesis Uitgeverij Excelsior
De Cheveigné A, 2010 “Pitch perception” in C J Plack, ed., “Oxford Handbook of Auditory Science
– Auditory Perception”, pp. 71–104 (Oxford University Press, Oxford)
Dufour A, Després O, Candas V, 2005 “Enhanced sensitivity to echo cues in blind subjects” Experimental Brain Research 165 515–519
Fastl H, Zwicker E, 2007 Psychoacoustics: Facts and Models volume 22 (Springer Science & Business
Media, Berlin)
Glasberg B R, Moore B C, 2002 “A model of loudness applicable to time-varying sounds” Journal of
the Audio Engineering Society 50 331–342
Goldstein J L, 1973 “An optimum processor theory for the central formation of the pitch of complex
tones” The Journal of the Acoustical Society of America 54 1496
Irino T, Patterson R D, 1997 “A time-domain, level-dependent auditory filter: The gammachirp” The
Journal of the Acoustical Society of America 101 412–419
Irino T, Patterson R D, 2006 “A dynamic compressive gammachirp auditory filterbank” Audio, Speech,
and Language Processing, IEEE Transactions on 14 2222–2232
Kellogg W N, 1962 “Sonar system of the blind: New research measures their accuracy in detecting the texture, size, and distance of objects ‘by ear’” Science 137 399–404
Köhler I, 1964 “Orientation by aural clues” American Foundation for the Blind Research Bulletin 4 14–53
Kolarik A J, Cirstea S, Pardhan S, 2013 “Evidence for enhanced discrimination of virtual auditory distance among blind listeners using level and direct-to-reverberant cues” Experimental Brain Research
224 623–633
Kolarik A J, Cirstea S, Pardhan S, Moore B C, 2014 “A summary of research investigating echolocation
abilities of blind and sighted humans” Hearing Research 310 60–68
Licklider J C, 1951 “A duplex theory of pitch perception” Cellular and Molecular Life Sciences 7
128–134
Miura T, Ueda K, Muraoka T, Ino S, Ifukube T, 2008 “Object’s width and distance distinguished by
the blind using auditory sense while they are walking” Journal of the Acoustical Society of America
123 3859
Moore B C, 2013 An Introduction to the Psychology of Hearing volume 6 (Academic press, San Diego)
Moore B C, 2014 “Development and current status of the cambridge loudness models” Trends in
Hearing 18 1–29
Moore B C, Glasberg B R, 2007 “Modeling binaural loudness” The Journal of the Acoustical Society
of America 121 1604–1612
Nilsson M E, Schenkman B N, 2015 “Blind people are more sensitive than sighted people to binaural
sound-location cues, particularly inter-aural level differences” Hearing Research
Papadopoulos T, Edwards D S, Rowan D, Allen R, 2011 “Identification of auditory cues utilized in
human echolocation-objective measurement results” Biomedical Signal Processing and Control 6
280–290
Patterson R D, Allerhand M H, Giguere C, 1995 “Time-domain modeling of peripheral auditory
processing: A modular architecture and a software platform” The Journal of the Acoustical Society
of America 98 1890
Patterson R D, Handel S, Yost W A, Datta A J, 1996 “The relative strength of the tone and noise
components in iterated rippled noise” The Journal of the Acoustical Society of America 100 3286
Patterson R D, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M, 1992 “Complex
sounds and auditory images” Auditory Physiology and Perception 83 429–446
Patterson R D, Unoki M, Irino T, 2003 “Extending the domain of center frequencies for the compressive
gammachirp auditory filter” The Journal of the Acoustical Society of America 114 1529–1542
Pedrielli F, Carletti E, Casazza C, 2008 “Just noticeable differences of loudness and sharpness for earth moving machines” in “Proceedings of Acoustics 08 Conference, Paris”, pp. 2205–2210
Peeters G, Giordano B L, Susini P, Misdariis N, McAdams S, 2011 “The timbre toolbox: Extracting
audio descriptors from musical signals” The Journal of the Acoustical Society of America 130 2902–
2916
Pelegrin Garcia D, Roozen B, Glorieux C, 2013 “Calculation of human echolocation cues by means of
the boundary element method” in “Proceedings of the 19th International Conference on Auditory
Display (ICAD 2013)”, pp. 253–259
Rice C E, Feinstein S H, Schusterman R J, 1965 “Echo-detection ability of the blind: Size and distance
factors” Journal of Experimental Psychology 70 246–251
Rojas J A M, Hermosilla J A, Montero R S, Espi P L L, 2009 “Physical analysis of several organic
signals for human echolocation: oral vacuum pulses” Acta Acustica united with Acustica 95 325–330
Rojas J A M, Hermosilla J A, Montero R S, Espí P L L, 2010 “Physical analysis of several organic signals for human echolocation: hand and finger produced pulses” Acta Acustica united with Acustica 96 1069–1077
Rowan D, Papadopoulos T, Edwards D, Holmes H, Hollingdale A, Evans L, Allen R, 2013 “Identification of the lateral position of a virtual object based on echoes by humans” Hearing Research
Schenkman B, 1985 Human echolocation: The detection of objects by the blind Ph.D. thesis Uppsala
University
Schenkman B, Nilsson M E, Grbic N, 2011 “Human echolocation using click trains and continuous
noise” in “Fechner Day 2011: Proceedings of the 27th Annual Meeting of the International Society
for Psychophysics”, pp. 13–18
Schenkman B N, Nilsson M E, 2010 “Human echolocation: Blind and sighted persons’ ability to detect
sounds recorded in the presence of a reflecting object” Perception 39 483
Schenkman B N, Nilsson M E, 2011 “Human echolocation: Pitch versus loudness information” Perception 40 840
Schnupp J, Nelken I, King A, 2011 Auditory Neuroscience (The MIT Press, Cambridge Massachusetts)
Seki Y, Ifukube T, Tanaka Y, 1994 “Relation between the reflected sound localization and the obstacle
sense of the blind” Journal of Acoustical Society of Japan 50 289–295
Teng S, Puri A, Whitney D, 2012 “Ultrafine spatial acuity of blind expert human echolocators”
Experimental Brain Research 216 483–488
Teng S, Whitney D, 2011 “The acuity of echolocation: Spatial resolution in the sighted compared to
expert performance” Journal of Visual Impairment & Blindness 105 20
Terhardt E, 1974 “Pitch, consonance, and harmony” The Journal of the Acoustical Society of America
55 1061
Thaler L, Arnott S R, Goodale M A, 2011 “Neural correlates of natural human echolocation in early
and late blind echolocation experts” PLoS One 6 e20162
Thaler L, Milne J L, Arnott S R, Kish D, Goodale M A, 2014 “Neural correlates of motion processing through echolocation, source hearing, and vision in blind echolocation experts and sighted
echolocation novices” Journal of Neurophysiology 111 112–127
Vestergaard M, Bleeck S, Patterson R, 2011 “AIM2006 documentation” [Online; accessed 24 April 2016]
URL http://www.acousticscale.org/wiki/index.php/AIM2006_Documentation
Wallmeier L, Geßele N, Wiegrebe L, 2013 “Echolocation versus echo suppression in humans” Proceedings of the Royal Society B: Biological Sciences 280
Wightman F L, 1973 “The pattern-transformation model of pitch” The Journal of the Acoustical
Society of America 54 407
Yost W, 2007 Fundamentals of Hearing: An Introduction (Elsevier Academic Press, San Diego)
Yost W A, 1996 “Pitch strength of iterated rippled noise” The Journal of the Acoustical Society of
America 100 3329
Zychaluk K, Foster D H, 2009 “Model-free estimation of the psychometric function” Attention, Perception, & Psychophysics 71 1414–1425
Appendix A
Room acoustics
A.1 Calibration Constant
The reference sound pressure levels (SPL) used to calculate the calibration constants in the
anechoic, conference and lecture rooms were documented in dB(A) as 77, 79 and 79
dB(A), respectively (Schenkman and Nilsson, 2010; Schenkman, Nilsson, and Grbic,
2011). Hence, to calculate the calibration constant the recordings should be A-weighted.
However, at the time of documentation it was found that equation A.1 had been used to
find the calibration constant (CC) instead of equation A.2.
CC = 10^((SPL - 20*log10(rms(signal)/(20*10^-6)))/20)                    (A.1)

CC = 10^((SPL - 20*log10(rms(Aweighting(signal))/(20*10^-6)))/20)        (A.2)
To find out the difference between equations A.1 and A.2, the calibrated levels with
and without A-weighting were calculated for the 9th version of the left ear, 500ms, no
object first recording in the anechoic and conference rooms, and for the 9th version of
the left ear, 500ms, no object recording in the lecture room. The results are tabulated
in Table A.1.
              With A weighting   Without A weighting
Anechoic      77.46              77.00
Conference    79.51              78.99
Lecture       79.29              78.99
Table A.1: Calibrated levels with and without A weighting for the 9th version of the left ear 500ms
no object first recording in the anechoic and conference rooms, and for the 9th version of the left ear
500ms no object recording in the lecture room.
The results suggest that finding the calibration constant from the A-weighted signal
would increase the calibrated level by less than approximately 0.5 dB, which is small.
Hence, although equation A.1 was used instead of equation A.2 to calculate the calibration
constants, the difference between them is small, and the calibration constants calculated
from equation A.1 were therefore used to calibrate all the recordings in this thesis.
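For concreteness, the calculation in equations A.1 and A.2 can be sketched as below. This is
a minimal sketch assuming the recording is available as a NumPy array; the a_weight argument
stands for an A-weighting filter and is a hypothetical placeholder, not a function from any
particular library.

import numpy as np

def calibration_constant(signal, spl_ref, a_weight=None):
    # Sketch of equations A.1 (a_weight=None) and A.2 (a_weight given).
    # signal: recorded waveform, spl_ref: documented reference level in dB(A).
    if a_weight is not None:
        signal = a_weight(signal)            # hypothetical A-weighting filter
    rms = np.sqrt(np.mean(np.asarray(signal, dtype=float) ** 2))
    level = 20 * np.log10(rms / 20e-6)       # level re 20 micropascals
    return 10 ** ((spl_ref - level) / 20)

Scaling a recording by the returned constant reproduces the documented reference level;
evaluating the function with and without A-weighting gives the two calibrated levels that
are compared in Table A.1.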
A.2 Sound Pressure Level
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    78.354     78.873     86.316      78.055       78.282       78.146       78.243       78.194
Ver2    79.504     79.910     85.274      79.212       79.479       79.331       79.428       79.364
Ver3    75.639     76.276     83.197      75.465       75.657       75.493       75.540       75.496
Ver4    79.177     79.705     86.097      78.872       79.120       78.964       79.074       78.999
Ver5    79.269     79.757     85.623      78.872       79.209       79.000       79.152       79.098
Ver6    78.826     79.251     83.799      78.495       78.765       78.620       78.768       78.678
Ver7    78.260     78.757     85.329      77.959       78.240       78.060       78.156       78.112
Ver8    76.552     77.177     83.303      76.323       76.568       76.457       76.438       77.698
Ver9    77.852     78.149     83.158      77.494       77.703       77.681       77.775       77.755
Ver10   76.254     76.894     83.295      75.977       76.281       76.098       76.108       76.011
Mean    77.969     78.475     84.539      77.673       77.931       77.785       77.868       77.941
Table A.2: SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with
5ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    79.049     78.563     88.572      79.405       79.211       79.382       79.179       79.212
Ver2    80.219     79.755     88.017      80.526       80.381       80.518       80.346       80.384
Ver3    76.185     75.684     86.074      76.676       76.439       76.637       76.353       76.414
Ver4    79.742     79.201     89.413      79.989       79.903       80.082       79.891       79.897
Ver5    80.017     79.478     89.294      80.416       80.280       80.403       80.161       80.233
Ver6    79.280     78.821     87.578      79.672       79.442       79.566       79.435       79.457
Ver7    79.067     78.516     87.492      79.347       79.230       79.399       79.204       79.253
Ver8    77.087     76.531     87.286      77.316       77.271       77.467       77.238       78.493
Ver9    78.498     78.049     85.596      78.862       78.610       78.831       78.632       78.700
Ver10   77.061     76.461     86.860      77.256       77.286       77.424       77.207       77.215
Mean    78.620     78.106     87.618      78.947       78.805       78.971       78.765       78.926
Table A.3: SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with
5ms duration signal
        NoObject1  Object100cm  Object150cm
Ver1    73.832     74.361       74.141
Ver2    71.237     71.686       71.213
Ver3    72.262     72.767       72.536
Ver4    72.784     73.255       73.167
Ver5    70.988     71.563       71.182
Ver6    72.732     73.146       72.883
Ver7    71.606     72.089       71.807
Ver8    75.142     75.616       75.227
Ver9    73.013     73.501       73.036
Ver10   70.779     71.127       70.735
Mean    72.437     72.911       72.593
Table A.4: SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with
5ms duration signal
        NoObject1  Object100cm  Object150cm
Ver1    73.957     76.450       74.128
Ver2    71.567     72.820       71.594
Ver3    72.382     74.638       72.557
Ver4    73.078     75.320       73.243
Ver5    71.414     73.516       71.489
Ver6    72.774     74.667       72.912
Ver7    72.005     74.211       72.105
Ver8    75.163     77.810       75.257
Ver9    73.148     75.351       73.150
Ver10   70.995     72.393       70.977
Mean    72.648     74.718       72.741
Table A.5: SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with
5ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    80.438     80.413     89.311      80.223       80.279       80.340       80.379       80.443
Ver2    80.437     80.402     89.232      80.221       80.268       80.341       80.377       80.444
Ver3    80.433     80.396     89.175      80.215       80.256       80.335       80.376       80.432
Ver4    80.434     80.393     89.168      80.218       80.241       80.332       80.366       80.421
Ver5    80.434     80.385     89.169      80.214       80.234       80.334       80.364       80.409
Ver6    80.432     80.377     89.176      80.217       80.234       80.331       80.364       80.394
Ver7    80.429     80.354     89.181      80.215       80.231       80.328       80.368       80.392
Ver8    80.429     80.339     89.179      80.213       80.218       80.326       80.370       80.389
Ver9    80.432     80.340     89.171      80.213       80.203       80.329       80.363       80.385
Ver10   80.430     80.306     89.160      80.206       80.192       80.326       80.366       80.375
Mean    80.433     80.370     89.192      80.216       80.236       80.332       80.369       80.409
Table A.6: SPL values (dBA) for 10 versions of the left ear recordings in the conference room with
5ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    79.779     79.687     89.084      79.883       79.771       79.853       79.789       79.725
Ver2    79.776     79.677     89.086      79.885       79.760       79.858       79.787       79.715
Ver3    79.772     79.670     89.068      79.880       79.757       79.853       79.789       79.693
Ver4    79.776     79.661     89.079      79.877       79.743       79.852       79.782       79.683
Ver5    79.774     79.655     89.058      79.869       79.740       79.853       79.780       79.680
Ver6    79.774     79.653     89.073      79.868       79.737       79.850       79.781       79.673
Ver7    79.773     79.631     89.060      79.866       79.734       79.849       79.781       79.664
Ver8    79.776     79.579     89.048      79.865       79.724       79.849       79.783       79.654
Ver9    79.775     79.497     89.038      79.862       79.708       79.856       79.779       79.650
Ver10   79.772     79.358     89.040      79.854       79.666       79.849       79.785       79.648
Mean    79.775     79.607     89.063      79.871       79.734       79.852       79.784       79.679
Table A.7: SPL values (dBA) for 10 versions of the right ear recordings in the conference room
with 5ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    77.496     77.955     85.654      82.085       77.411       77.236       77.536       77.311
Ver2    74.845     75.192     83.337      79.567       74.784       74.584       74.767       74.697
Ver3    76.418     76.839     84.884      81.857       76.340       76.198       76.372       76.241
Ver4    77.436     77.776     84.873      82.412       77.314       77.185       77.346       77.263
Ver5    77.466     77.915     85.147      82.157       77.447       77.275       80.044       77.306
Ver6    77.561     77.914     85.211      82.127       77.415       77.308       77.506       77.419
Ver7    77.275     77.704     84.687      81.104       77.237       77.054       77.164       77.094
Ver8    77.695     78.168     86.040      81.829       77.621       77.492       77.603       77.513
Ver9    77.418     77.798     85.682      82.018       77.316       77.160       77.331       77.252
Ver10   76.461     76.816     84.667      81.621       76.332       76.227       76.432       76.298
Mean    77.007     77.408     85.018      81.678       76.922       76.772       77.210       76.839
Table A.8: SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with
50ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    78.199     77.715     88.589      82.631       78.421       78.518       78.441       78.362
Ver2    75.433     74.989     85.930      80.359       75.618       75.728       75.555       75.614
Ver3    77.092     76.604     87.915      81.934       77.311       77.428       77.281       77.258
Ver4    78.019     77.563     88.257      82.806       78.197       78.298       78.136       78.168
Ver5    78.220     77.720     88.238      82.870       78.427       78.538       80.726       78.392
Ver6    78.135     77.680     88.193      82.773       78.287       78.445       78.280       78.311
Ver7    77.960     77.480     87.764      82.212       78.166       78.236       78.085       78.104
Ver8    78.377     77.895     88.938      82.768       78.553       78.717       78.509       78.545
Ver9    78.062     77.580     88.614      82.569       78.246       78.368       78.195       78.235
Ver10   77.060     76.584     87.443      81.904       77.173       77.359       77.222       77.216
Mean    77.656     77.181     87.988      82.283       77.840       77.964       78.043       77.820
Table A.9: SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with
50ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    79.427     79.428     88.367      83.272       79.933       79.375       79.432       79.428
Ver2    79.429     79.427     88.375      83.258       79.939       79.366       79.428       79.427
Ver3    79.422     79.422     88.370      83.244       79.946       79.367       79.434       79.425
Ver4    79.423     79.421     88.373      83.252       79.960       79.367       79.430       79.425
Ver5    79.422     79.421     88.364      83.261       79.951       79.370       79.431       79.425
Ver6    79.415     79.419     88.348      83.276       79.950       79.368       79.425       79.423
Ver7    79.417     79.421     88.346      83.278       79.925       79.363       79.432       79.423
Ver8    79.417     79.420     88.346      83.278       79.953       79.366       79.425       79.416
Ver9    79.415     79.418     88.353      83.274       79.927       79.365       79.420       79.414
Ver10   79.414     79.418     88.347      83.282       79.942       79.360       79.424       79.418
Mean    79.420     79.422     88.359      83.267       79.943       79.367       79.428       79.422
Table A.10: SPL values (dBA) for 10 versions of the left ear recordings in the conference room
with 50ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    79.307     79.315     88.257      82.959       80.217       79.422       79.374       79.283
Ver2    79.304     79.313     88.276      82.965       80.221       79.418       79.372       79.283
Ver3    79.302     79.311     88.254      82.958       80.228       79.414       79.371       79.279
Ver4    79.299     79.307     88.252      82.973       80.235       79.409       79.373       79.282
Ver5    79.302     79.307     88.239      82.972       80.228       79.411       79.376       79.279
Ver6    79.302     79.305     88.241      82.977       80.238       79.409       79.372       79.279
Ver7    79.297     79.304     88.240      82.968       80.217       79.413       79.374       79.280
Ver8    79.298     79.303     88.238      82.961       80.238       79.417       79.368       79.277
Ver9    79.295     79.305     88.251      82.953       80.221       79.412       79.366       79.274
Ver10   79.298     79.306     88.243      82.948       80.236       79.413       79.367       79.275
Mean    79.300     79.308     88.249      82.963       80.228       79.414       79.371       79.279
Table A.11: SPL values (dBA) for 10 versions of the right ear recordings in the conference room
with 50ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    77.795     78.264     86.080      82.249       77.706       77.583       77.693       77.619
Ver2    77.517     77.965     85.668      82.085       77.501       77.329       77.423       77.346
Ver3    77.020     77.478     85.124      81.621       77.039       76.819       76.917       76.843
Ver4    76.485     76.922     84.469      81.355       76.416       76.272       76.387       76.329
Ver5    77.047     77.506     84.844      81.971       77.003       76.865       76.942       76.901
Ver6    77.280     77.712     84.995      82.227       77.217       77.058       77.187       77.124
Ver7    77.563     77.986     85.480      82.461       77.495       77.339       77.470       77.435
Ver8    76.939     77.356     85.089      81.811       76.847       76.700       76.848       76.772
Ver9    77.000     77.407     85.092      81.666       76.919       77.127       76.871       76.799
Ver10   76.889     77.327     84.976      81.329       76.829       76.661       76.775       76.704
Mean    77.153     77.592     85.182      81.877       77.097       76.975       77.051       76.987
Table A.12: SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with
500ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    78.535     78.039     88.979      83.138       78.685       78.853       78.661       78.699
Ver2    78.231     77.741     88.651      82.881       78.464       78.568       78.362       78.396
Ver3    77.789     77.284     88.126      82.332       78.007       78.109       77.915       77.946
Ver4    77.205     76.708     87.529      81.845       77.378       77.523       77.328       77.380
Ver5    77.802     77.306     87.905      82.590       77.984       78.140       77.924       77.989
Ver6    77.986     77.501     88.112      82.869       78.148       78.289       78.107       78.153
Ver7    78.233     77.758     88.485      83.048       78.413       78.540       78.356       78.436
Ver8    77.595     77.124     88.167      82.347       77.760       77.902       77.718       77.759
Ver9    77.693     77.190     88.160      82.389       77.861       78.307       77.791       77.837
Ver10   77.588     77.087     88.042      82.066       77.737       77.883       77.696       77.735
Mean    77.866     77.374     88.216      82.550       78.044       78.211       77.986       78.033
Table A.13: SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber
with 500ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    79.009     78.988     87.550      82.818       79.587       78.941       79.019       79.011
Ver2    79.007     78.988     87.539      82.863       79.625       78.931       79.025       79.003
Ver3    79.004     78.988     87.525      82.836       79.588       78.919       79.020       79.008
Ver4    79.006     78.990     87.538      82.825       79.556       78.936       79.012       79.003
Ver5    78.998     78.996     87.537      82.805       79.659       78.928       79.006       79.006
Ver6    79.002     78.997     87.546      82.824       79.598       78.917       79.016       79.014
Ver7    79.007     78.998     87.548      82.807       79.564       78.916       79.015       79.014
Ver8    79.002     78.992     87.529      82.822       79.602       78.925       79.011       79.005
Ver9    79.000     78.995     87.542      82.828       79.640       78.924       79.015       79.014
Ver10   78.998     78.998     87.533      82.843       79.559       78.926       79.019       79.012
Mean    79.003     78.993     87.539      82.827       79.598       78.926       79.016       79.009
Table A.14: SPL values (dBA) for 10 versions of the left ear recordings in the conference room
with 500ms duration signal
        NoObject1  NoObject2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    78.812     78.829     87.501      82.340       79.478       78.914       78.858       78.801
Ver2    78.812     78.825     87.471      82.394       79.493       78.902       78.858       78.796
Ver3    78.814     78.823     87.418      82.407       79.477       78.892       78.859       78.799
Ver4    78.817     78.826     87.452      82.384       79.443       78.904       78.856       78.797
Ver5    78.814     78.827     87.490      82.335       79.520       78.901       78.853       78.801
Ver6    78.819     78.828     87.436      82.381       79.473       78.889       78.865       78.797
Ver7    78.823     78.822     87.426      82.361       79.460       78.893       78.864       78.797
Ver8    78.816     78.817     87.483      82.378       79.483       78.898       78.859       78.794
Ver9    78.818     78.821     87.469      82.391       79.523       78.892       78.862       78.799
Ver10   78.820     78.821     87.425      82.393       79.463       78.894       78.868       78.801
Mean    78.817     78.824     87.457      82.377       79.481       78.898       78.860       78.798
Table A.15: SPL values (dBA) for 10 versions of the right ear recordings in the conference room
with 500ms duration signal
        NoObject1  Object100cm  Object150cm
Ver1    79.719     80.188       79.968
Ver2    78.588     79.106       78.812
Ver3    78.855     79.205       79.156
Ver4    78.867     79.239       79.225
Ver5    79.097     79.553       79.350
Ver6    79.445     79.852       79.719
Ver7    78.255     78.630       78.468
Ver8    79.791     80.284       80.028
Ver9    80.036     80.431       80.209
Ver10   79.000     79.453       79.187
Mean    79.165     79.594       79.412
Table A.16: SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with
500ms duration signal
        NoObject1  Object100cm  Object150cm
Ver1    80.164     82.120       80.222
Ver2    79.051     81.088       79.149
Ver3    79.307     81.139       79.444
Ver4    79.224     81.159       79.327
Ver5    79.490     81.377       79.577
Ver6    79.854     81.852       79.949
Ver7    78.719     80.516       78.818
Ver8    80.197     82.339       80.321
Ver9    80.346     82.393       80.461
Ver10   79.417     81.466       79.543
Mean    79.577     81.545       79.681
Table A.17: SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with
500ms duration signal
A.3 Spectral Centroid
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.1: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 5ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.2: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 5ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.3: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 5ms recording in the conference room (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.4: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 5ms recording in the conference room (Experiment 1).
[Figure: three panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject, Object at 100cm and Object at 150cm.]
Figure A.5: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 5ms recording in the lecture room (Experiment 2).
[Figure: three panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject, Object at 100cm and Object at 150cm.]
Figure A.6: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 5ms recording in the lecture room (Experiment 2).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.7: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 50ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.8: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 50ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.9: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 50ms recording in the conference room (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.10: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 50ms recording in the conference room (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.11: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 500ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.12: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 500ms recording in the anechoic chamber (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.13: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 500ms recording in the conference room (Experiment 1).
[Figure: eight panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject rec1, NoObject rec2 and Object at 50, 100, 200, 300, 400 and 500 cm.]
Figure A.14: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 500ms recording in the conference room (Experiment 1).
[Figure: three panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject, Object at 100cm and Object at 150cm.]
Figure A.15: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the left ear 500ms recording in the lecture room (Experiment 2).
[Figure: three panels of spectral centroid (frequency in Hz) versus time (sec), for NoObject, Object at 100cm and Object at 150cm.]
Figure A.16: The spectral centroid as a function of time for the 10 versions (marked in different colors
for each subplot) of the right ear 500ms recording in the lecture room (Experiment 2).
Appendix B
Auditory models
B.1 Loudness
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    12.467        12.497        20.979      20.200       13.608       12.406       12.509       12.458
Ver2    15.203        15.104        22.566      22.348       16.274       15.217       15.221       15.199
Ver3    11.987        11.965        18.869      18.251       13.044       11.971       12.019       11.976
Ver4    14.699        14.591        22.169      21.259       15.803       14.682       14.742       14.690
Ver5    14.204        14.125        21.649      20.584       15.254       14.217       14.273       14.242
Ver6    14.704        14.643        21.813      21.344       15.636       14.753       14.758       14.731
Ver7    12.458        12.354        19.535      19.407       13.437       12.387       12.390       12.373
Ver8    12.716        12.647        19.822      19.483       13.839       12.758       12.739       13.426
Ver9    13.687        13.647        20.489      20.708       14.792       13.680       13.702       13.695
Ver10   11.444        11.386        18.850      18.361       12.353       11.402       11.435       11.412
Mean    13.357        13.296        20.674      20.194       14.404       13.347       13.379       13.420
Table B.1: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
anechoic chamber with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    41.511        41.452        66.048      53.907       41.790       41.726       41.656       41.499
Ver2    35.716        35.647        57.284      46.762       35.917       35.860       35.768       35.739
Ver3    38.842        38.801        62.587      51.569       39.090       39.078       38.920       38.824
Ver4    40.318        40.251        63.893      53.316       40.586       40.502       40.361       40.299
Ver5    41.149        41.114        65.020      53.592       41.506       41.396       41.714       41.161
Ver6    40.951        40.861        64.267      53.255       41.121       41.135       41.040       40.965
Ver7    40.601        40.536        62.560      51.682       40.796       40.748       40.604       40.587
Ver8    41.978        41.931        67.099      53.775       42.231       42.281       42.051       41.978
Ver9    40.236        40.171        65.030      53.180       40.591       40.412       40.313       40.242
Ver10   39.596        39.462        62.933      52.029       39.574       39.779       39.709       39.594
Mean    40.090        40.023        63.672      52.307       40.320       40.292       40.213       40.089
Table B.2: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
anechoic chamber with 50ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    49.689        49.609        78.660      63.467       49.893       49.944       49.744       49.677
Ver2    49.617        49.576        78.586      64.006       49.914       49.913       49.681       49.640
Ver3    47.484        47.382        75.840      61.475       47.784       47.779       47.542       47.485
Ver4    46.822        46.789        73.507      60.473       47.059       47.016       46.873       46.795
Ver5    47.513        47.424        74.836      61.174       47.709       47.741       47.559       47.486
Ver6    48.206        48.146        76.411      62.485       48.432       48.388       48.233       48.192
Ver7    49.550        49.515        77.657      64.087       49.684       49.756       49.594       49.539
Ver8    47.659        47.637        75.250      61.816       47.819       47.870       47.694       47.669
Ver9    47.632        47.565        75.547      61.719       47.797       47.884       47.693       47.623
Ver10   47.198        47.177        75.132      60.891       47.440       47.481       47.261       47.204
Mean    48.137        48.082        76.143      62.159       48.353       48.377       48.187       48.131
Table B.3: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
anechoic chamber with 500ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    19.309        19.349        26.688      24.389       21.485       19.635       19.981       19.531
Ver2    19.288        19.358        26.692      24.354       21.427       19.625       19.969       19.531
Ver3    19.298        19.357        26.690      24.341       21.515       19.646       19.956       19.524
Ver4    19.314        19.370        26.690      24.412       21.561       19.645       19.957       19.527
Ver5    19.308        19.380        26.682      24.337       21.600       19.645       19.961       19.522
Ver6    19.350        19.385        26.731      24.382       21.588       19.652       19.978       19.522
Ver7    19.335        19.384        26.701      24.349       21.544       19.668       19.997       19.526
Ver8    19.342        19.399        26.712      24.396       21.554       19.671       19.988       19.531
Ver9    19.333        19.389        26.734      24.382       21.504       19.667       19.988       19.532
Ver10   19.323        19.391        26.752      24.432       21.593       19.658       19.974       19.541
Mean    19.320        19.376        26.707      24.377       21.537       19.651       19.975       19.529
Table B.4: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
conference room with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    45.024        45.081        69.618      55.674       47.552       45.150       45.269       45.041
Ver2    45.021        45.073        69.656      55.666       47.573       45.143       45.252       45.045
Ver3    45.006        45.075        69.625      55.635       47.597       45.147       45.273       45.036
Ver4    45.012        45.069        69.626      55.674       47.637       45.141       45.252       45.037
Ver5    44.998        45.070        69.587      55.694       47.634       45.126       45.260       45.044
Ver6    44.989        45.062        69.569      55.722       47.663       45.127       45.242       45.047
Ver7    44.991        45.065        69.586      55.708       47.594       45.138       45.257       45.044
Ver8    44.980        45.070        69.590      55.690       47.665       45.132       45.243       45.036
Ver9    44.980        45.076        69.616      55.674       47.606       45.126       45.224       45.030
Ver10   44.986        45.078        69.592      55.683       47.666       45.117       45.219       45.049
Mean    44.999        45.072        69.607      55.682       47.619       45.135       45.249       45.041
Table B.5: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
conference room with 50ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    52.453        52.465        78.748      63.607       54.483       52.402       52.569       52.506
Ver2    52.438        52.461        78.704      63.668       54.623       52.401       52.577       52.469
Ver3    52.449        52.471        78.586      63.669       54.589       52.382       52.582       52.481
Ver4    52.448        52.484        78.654      63.557       54.484       52.419       52.562       52.493
Ver5    52.456        52.490        78.702      63.493       54.649       52.390       52.529       52.485
Ver6    52.443        52.504        78.655      63.553       54.621       52.372       52.577       52.519
Ver7    52.446        52.501        78.629      63.540       54.498       52.360       52.570       52.519
Ver8    52.438        52.500        78.671      63.544       54.591       52.380       52.567       52.497
Ver9    52.435        52.502        78.664      63.566       54.689       52.381       52.571       52.510
Ver10   52.432        52.496        78.575      63.539       54.576       52.378       52.587       52.540
Mean    52.444        52.487        78.659      63.574       54.580       52.387       52.569       52.502
Table B.6: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
conference room with 500ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    14.495        16.730       15.386
Ver2    16.846        18.145       17.283
Ver3    15.568        16.890       16.028
Ver4    15.419        16.902       15.897
Ver5    14.604        15.876       15.154
Ver6    16.211        18.083       17.141
Ver7    14.969        16.553       15.741
Ver8    16.825        18.971       17.734
Ver9    15.637        17.293       16.205
Ver10   14.391        16.158       15.226
Mean    15.497        17.160       16.179
Table B.7: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
lecture room with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    37.333        39.974       37.993
Ver2    39.586        42.003       40.029
Ver3    37.340        39.772       38.108
Ver4    37.542        40.061       38.036
Ver5    38.969        41.484       39.708
Ver6    39.385        41.827       40.155
Ver7    37.625        39.851       37.979
Ver8    39.139        41.983       39.842
Ver9    40.873        43.319       41.447
Ver10   38.681        41.378       39.241
Mean    38.647        41.165       39.254
Table B.8: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
lecture room with 5ms duration, 32 clicks signal. The last row indicates the mean over the 10
versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    45.479        48.055       45.780
Ver2    47.913        50.425       48.312
Ver3    45.689        47.900       46.236
Ver4    45.490        47.804       46.061
Ver5    46.258        48.771       46.563
Ver6    46.939        49.504       47.112
Ver7    44.233        46.610       44.669
Ver8    47.865        50.673       48.356
Ver9    48.594        51.714       49.227
Ver10   46.518        49.176       46.921
Mean    46.498        49.063       46.924
Table B.9: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
lecture room with 5ms duration, 64 clicks signal. The last row indicates the mean over the 10
versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    50.897        53.594       51.273
Ver2    53.347        56.330       53.633
Ver3    50.800        53.047       51.226
Ver4    50.633        53.038       51.280
Ver5    51.920        54.317       52.409
Ver6    52.236        55.310       52.673
Ver7    49.713        52.142       50.079
Ver8    53.366        56.965       54.012
Ver9    55.000        57.813       55.522
Ver10   52.216        54.564       52.557
Mean    52.013        54.712       52.466
Table B.10: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the
lecture room with 500ms duration signal. The last row indicates the mean over the 10 versions.
B.2 Sharpness
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.892         1.922         2.108       2.142        1.937        1.926        1.914        1.896
Ver2    1.884         1.910         2.119       2.166        1.933        1.867        1.853        1.862
Ver3    1.907         1.906         2.062       2.154        1.929        1.906        1.884        1.882
Ver4    1.894         1.894         1.996       2.078        1.912        1.906        1.896        1.881
Ver5    1.890         1.907         1.954       2.088        1.921        1.920        1.912        1.904
Ver6    1.847         1.869         1.982       2.115        1.896        1.915        1.891        1.897
Ver7    1.878         1.899         2.020       2.192        1.898        1.893        1.884        1.905
Ver8    1.876         1.882         2.068       2.173        1.942        1.896        1.893        1.884
Ver9    1.895         1.886         2.111       2.098        1.914        1.890        1.895        1.877
Ver10   1.916         1.922         2.101       2.169        1.926        1.941        1.884        1.907
Mean    1.888         1.900         2.052       2.138        1.921        1.906        1.891        1.889
Table B.11: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room
(Experiment 1) with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.868         1.894         2.112       2.127        1.902        1.910        1.874        1.852
Ver2    1.899         1.893         2.080       2.155        1.927        1.907        1.865        1.852
Ver3    1.876         1.909         2.075       2.183        1.911        1.913        1.909        1.886
Ver4    1.897         1.901         2.046       2.152        1.898        1.894        1.846        1.898
Ver5    1.880         1.925         2.002       2.170        1.889        1.894        1.875        1.859
Ver6    1.883         1.904         2.016       2.113        1.928        1.895        1.870        1.884
Ver7    1.894         1.902         2.042       2.087        1.909        1.896        1.854        1.879
Ver8    1.928         1.905         2.094       2.119        1.924        1.902        1.880        1.896
Ver9    1.896         1.906         2.113       2.141        1.919        1.922        1.883        1.927
Ver10   1.872         1.867         2.095       2.161        1.910        1.911        1.888        1.875
Mean    1.889         1.901         2.068       2.141        1.912        1.904        1.874        1.881
Table B.12: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room
(Experiment 1) with 50ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.870         1.878         2.104       2.096        1.880        1.856        1.826        1.840
Ver2    1.861         1.878         2.110       2.074        1.901        1.860        1.835        1.832
Ver3    1.857         1.874         2.123       2.090        1.890        1.854        1.825        1.831
Ver4    1.865         1.894         2.098       2.140        1.897        1.862        1.831        1.832
Ver5    1.865         1.883         2.105       2.144        1.893        1.858        1.828        1.829
Ver6    1.855         1.878         2.115       2.143        1.889        1.843        1.821        1.832
Ver7    1.857         1.880         2.133       2.167        1.894        1.850        1.837        1.834
Ver8    1.857         1.882         2.122       2.141        1.893        1.871        1.839        1.837
Ver9    1.857         1.888         2.121       2.107        1.887        1.856        1.831        1.846
Ver10   1.862         1.890         2.127       2.084        1.899        1.865        1.837        1.835
Mean    1.861         1.882         2.116       2.119        1.892        1.858        1.831        1.835
Table B.13: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room
(Experiment 1) with 500ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.929         1.988         2.037       2.042        2.052        1.988        1.979        1.964
Ver2    1.964         1.968         2.038       2.034        2.033        1.990        1.981        1.996
Ver3    1.955         1.963         2.013       2.029        1.985        2.038        1.977        2.027
Ver4    1.962         2.011         2.037       2.038        1.981        2.044        1.967        1.993
Ver5    1.979         1.975         2.028       2.035        1.993        1.972        1.995        1.978
Ver6    1.983         1.998         2.040       2.025        1.996        2.002        1.989        1.987
Ver7    1.980         2.002         2.040       2.027        2.011        2.022        2.014        1.977
Ver8    1.989         1.964         2.047       2.024        2.008        2.009        1.951        1.995
Ver9    1.991         1.973         2.027       2.025        2.015        2.019        1.990        1.972
Ver10   1.989         1.993         2.008       2.037        1.959        2.005        1.981        1.967
Mean    1.972         1.983         2.032       2.032        2.003        2.009        1.982        1.986
Table B.14: Median of the sharpness in acums of 10 versions for the recordings in the conference room
(Experiment 1) with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.893         1.896         1.980       1.974        1.940        1.905        1.918        1.863
Ver2    1.892         1.908         1.968       1.960        1.927        1.922        1.916        1.891
Ver3    1.898         1.896         1.957       1.954        1.941        1.910        1.925        1.892
Ver4    1.881         1.892         1.954       1.927        1.923        1.909        1.911        1.871
Ver5    1.886         1.893         1.964       1.940        1.938        1.893        1.923        1.899
Ver6    1.904         1.883         1.966       1.945        1.944        1.917        1.921        1.863
Ver7    1.901         1.890         1.966       1.947        1.937        1.924        1.932        1.889
Ver8    1.897         1.895         1.954       1.968        1.933        1.921        1.903        1.888
Ver9    1.883         1.893         1.967       1.933        1.934        1.917        1.907        1.911
Ver10   1.893         1.898         1.960       1.951        1.944        1.923        1.914        1.911
Mean    1.893         1.894         1.964       1.950        1.936        1.914        1.917        1.888
Table B.15: Median of the sharpness in acums of 10 versions for the recordings in the conference room
(Experiment 1) with 50ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  NoObjectrec2  Object50cm  Object100cm  Object200cm  Object300cm  Object400cm  Object500cm
Ver1    1.937         1.939         2.100       2.040        1.967        1.950        1.953        1.940
Ver2    1.931         1.940         2.098       2.043        1.966        1.948        1.951        1.941
Ver3    1.935         1.938         2.091       2.049        1.968        1.952        1.945        1.946
Ver4    1.938         1.942         2.093       2.043        1.969        1.943        1.952        1.942
Ver5    1.934         1.937         2.099       2.041        1.967        1.951        1.948        1.944
Ver6    1.934         1.938         2.092       2.046        1.968        1.954        1.947        1.936
Ver7    1.932         1.935         2.090       2.043        1.967        1.948        1.952        1.936
Ver8    1.933         1.936         2.102       2.038        1.967        1.951        1.951        1.943
Ver9    1.933         1.937         2.095       2.045        1.965        1.953        1.943        1.942
Ver10   1.938         1.936         2.088       2.042        1.968        1.955        1.946        1.944
Mean    1.935         1.938         2.095       2.043        1.967        1.950        1.949        1.941
Table B.16: Median of the sharpness in acums of 10 versions for the recordings in the conference room
(Experiment 1) with 500ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    1.817         1.729        1.773
Ver2    1.873         1.820        1.886
Ver3    1.773         1.763        1.752
Ver4    1.959         1.776        1.902
Ver5    1.826         1.855        1.825
Ver6    1.729         1.601        1.667
Ver7    1.861         1.900        1.958
Ver8    1.892         1.973        1.894
Ver9    1.905         1.754        1.863
Ver10   1.853         1.614        1.824
Mean    1.849         1.778        1.834
Table B.17: Median of the sharpness in acums of 10 versions for the recordings in the lecture room
(Experiment 2) with 5ms duration signal. The last row indicates the mean over the 10 versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    2.051         2.153        2.083
Ver2    1.992         2.105        2.032
Ver3    2.015         2.108        2.022
Ver4    2.013         2.125        2.048
Ver5    2.006         2.103        2.030
Ver6    1.982         2.097        2.006
Ver7    2.044         2.145        2.080
Ver8    2.006         2.059        2.049
Ver9    2.025         2.099        2.043
Ver10   2.031         2.142        2.053
Mean    2.017         2.114        2.045
Table B.18: Median of the sharpness in acums of 10 versions for the recordings in the lecture
room (Experiment 2) with 5ms duration, 32 clicks signal. The last row indicates the mean over the
10 versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    2.075         2.188        2.112
Ver2    2.035         2.150        2.068
Ver3    2.066         2.175        2.091
Ver4    2.069         2.182        2.098
Ver5    2.074         2.197        2.109
Ver6    2.067         2.181        2.102
Ver7    2.070         2.169        2.098
Ver8    2.078         2.192        2.117
Ver9    2.061         2.169        2.094
Ver10   2.065         2.178        2.101
Mean    2.066         2.178        2.099
Table B.19: Median of the sharpness in acums of 10 versions for the recordings in the lecture room
(Experiment 2) with 5ms duration, 64 clicks signal. The last row indicates the mean over the 10
versions.
        NoObjectrec1  Object100cm  Object150cm
Ver1    2.075         2.210        2.123
Ver2    2.061         2.184        2.095
Ver3    2.071         2.202        2.112
Ver4    2.081         2.206        2.120
Ver5    2.064         2.197        2.105
Ver6    2.085         2.218        2.124
Ver7    2.068         2.181        2.101
Ver8    2.083         2.206        2.118
Ver9    2.062         2.194        2.093
Ver10   2.069         2.202        2.108
Mean    2.072         2.200        2.110
Table B.20: Median of the sharpness in acums of 10 versions for the recordings in the lecture room
(Experiment 2) with 500ms duration signal. The last row indicates the mean over the 10 versions.
B.3 Pitch strength using strobe temporal integration
Figure B.1 shows the temporal profile of the stabilised auditory image for a 500ms
recording in the conference room. As stated in Chapter 4, the stabilised auditory image
was implemented using two modules, namely sf2003 and ti2003. A brief description of
these is given below.
[Figure: four panels of amplitude versus time interval (sec) — (a) No object, (b) Object at 50 cm, (c) Object at 100 cm, (d) Object at 200 cm. The highest peak in each panel is annotated with its frequency and pitch strength; the annotations across the panels are 738 Hz: 0.29, 76 Hz: 0.07, 232 Hz: 0.23 and 100 Hz: 0.10.]
Figure B.1: The temporal profiles of the stabilised auditory image for a 500ms signal recorded in the
conference room (Experiment 1) at the 495ms time frame. The blue dot indicates the highest peak and
the corresponding values indicate the pitch strength (calculated using equation 4.10) and frequency
in Hz (calculated by using the inverse relationship of time and frequency, f = 1/t).
Initially, the sf2003 module uses an adaptive strobe threshold to issue strobes on the
NAP. After a strobe is initiated, the threshold first rises along a parabolic path and
then returns to a linear decay, to avoid spurious strobes (cf. Figure 4.2). Once the
strobes have been computed for each frequency channel of the NAP, the ti2003 module
uses them to initiate temporal integration.
The time interval between the strobe and a NAP value determines the position at which
that NAP value is entered in the SAI. For example, if a strobe is identified in the
200Hz channel of the NAP at the 5ms time instant, then the level of the NAP sample at
5ms is added to the 1st position of the 200Hz channel in the SAI. The next
sample of the NAP is added to the 2nd position of the SAI. This process of adding the
levels of the NAP samples continues for 35ms and terminates if no further strobes are
identified.
When several strobes are detected within the 35ms interval, each strobe initiates its own
temporal integration process. To keep the shape of the SAI close to that of the NAP,
ti2003 uses a weighting scheme: new strobes are initially weighted high (with the weights
normalized so that they sum to 1), so that older strobes contribute relatively less to the
SAI. In this way the time axis of the NAP is converted into a time-interval axis of the SAI.
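As a rough illustration of this process, the sketch below converts a single NAP channel
into one SAI channel. It is only a sketch and not the sf2003/ti2003 implementation: the
adaptive threshold is reduced to a plain exponential decay and the strobe weighting to a
simple age-dependent weight, both assumptions made here for illustration.

import numpy as np

def sai_channel(nap, fs, window_ms=35.0):
    # Sketch of strobed temporal integration for one NAP channel.
    # nap: neural activity pattern of a single frequency channel (1-D array).
    n_lag = int(fs * window_ms / 1000)       # 35ms time-interval axis
    sai = np.zeros(n_lag)
    threshold = 0.0
    strobes = []                             # sample indices of active strobes
    for i, x in enumerate(nap):
        threshold *= 0.995                   # stand-in for the parabolic rise
                                             # and linear decay of sf2003
        if x > threshold:                    # issue a strobe
            strobes.append(i)
            threshold = x
        strobes = [s for s in strobes if i - s < n_lag]
        if strobes:
            # newer strobes weighted higher; weights normalized to sum to 1
            w = np.array([n_lag - (i - s) for s in strobes], dtype=float)
            w /= w.sum()
            for s, wk in zip(strobes, w):
                sai[i - s] += wk * x         # NAP level entered at its time interval
    return sai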
The temporal profiles in the subfigures of Figure B.1 were generated by summing the SAI
along the centre frequencies. The results in Figure B.1 show that the recordings with
the object, when compared with the recording without the object, had a pitch strength
greater than 0.1 at the frequencies corresponding to the repetition pitch. However,
whether this holds for all the recordings remains to be verified.
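In code, this readout amounts to something like the following sketch, where sai_frame is
assumed to be a (channels x time-intervals) array for one frame and fs the sampling rate;
the peak height is used here only as a stand-in for the pitch strength of equation 4.10.

profile = sai_frame.sum(axis=0)        # sum the SAI over centre-frequency channels
lag = np.argmax(profile[1:]) + 1       # highest peak, skipping the zero interval
pitch_hz = fs / lag                    # f = 1/t, with t = lag / fs seconds
pitch_strength = profile[lag]          # stand-in for equation 4.10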
Since previous research (Yost, 1996; Patterson et al., 1996) quantified the perception of
repetition pitch using autocorrelation theory, this thesis followed in their footsteps and
assumed that autocorrelation is the way repetition pitch is perceived. The autocorrelation
results in Chapter 5 justified this assumption. To quantify how strobe temporal integration
could explain the pitch perception that is known to be useful for human echolocation, a
detailed analysis using the strobe temporal integration module of the AIM would be needed.
This is left for future work.