Master Thesis
Electrical Engineering
Thesis no:
December 2015

IMPLEMENTATION AND EVALUATION OF AUDITORY MODELS FOR HUMAN ECHOLOCATION

VIJAY KIRAN GIDLA

Department of Applied Signal Processing
Blekinge Institute of Technology
371 79 Karlskrona, Sweden

This thesis is submitted to the Department of Applied Signal Processing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering.

Contact Information
Author: Vijay Kiran Gidla, E-mail: [email protected]
University advisor: Docent Bo Schenkman, Department of Applied Signal Processing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden
Internet: www.bth.se/ing
Phone: +46 455 385000

Abstract

Blind people use echoes to detect objects and to find their way, an ability known as human echolocation. Previous research has identified some of the conditions that favor the detection of an object, but many factors remain to be analyzed and quantified. Studies have also shown that blind people echolocate more efficiently than sighted people, with performance varying among individuals. This has motivated research in human echolocation to move in a new direction, seeking a fuller understanding of the high detection performance of the blind. Psychoacoustic experiments alone cannot determine whether the superior echo detection of blind listeners should be attributed to perceptual or physiological causes. Along with the perceptual results, it is vital to know how the sounds are processed in the auditory system. Hearing research has led to the development of several auditory models that combine physiological and psychological results with signal analysis methods; these models try to describe how the auditory system processes signals. Hence, to analyze how sounds are processed to yield the high detection performance of the blind, auditory models available in the literature were used in this thesis. The results suggest that repetition pitch is useful at shorter distances and is determined from the peaks in the temporal profile of the autocorrelation function computed on the neural activity pattern. The loudness attribute also provides information that listeners can use to echolocate at shorter distances. At longer distances, timbre aspects such as sharpness might be used by listeners to detect objects. It was also found that the repetition pitch, loudness and sharpness attributes in turn depend on the room acoustics and on the type of stimuli used. These results show the fruitfulness of combining results from different disciplines through the mathematical framework given by signal analysis.

Keywords: Human echolocation, Psychoacoustics, Physiology, Signal analysis, Auditory models.

Acknowledgment

Firstly, I would like to express my sincere gratitude to my advisor, Docent Bo Schenkman, who supported me throughout my master thesis. I would not have been able to complete my thesis without his support, patience and motivation. His guidance helped me to think critically about the results of the analyses in this thesis, and his valuable comments during the writing helped me to organize the analysis into a coherent framework. I am truly grateful to have had such an advisor for my master thesis. Besides my advisor, I would like to thank my examiner, Sven Johansson, who was patient and cooperative with the submission of my thesis. I would like to thank Professor Brian C. J. Moore and Professor Jan Schnupp for allowing me to use figures from their books in my thesis.
My sincere thanks also go to my senior Abel Gladstone Mangam, who recommended me to my advisor for research in human echolocation. I thank the staff at the BTH library and IT help desk, who were very supportive in providing me with the literature and software I needed for my thesis. Last but not least, I would like to thank my parents, who supported me throughout my thesis.

Contents

Abstract
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
2 Physiology and Perception
  2.1 Physiology of hearing
    2.1.1 Auditory periphery
    2.1.2 Central auditory nervous system
  2.2 Perception
    2.2.1 Loudness
    2.2.2 Pitch
    2.2.3 Timbre
3 Room acoustics
  3.1 Review of studies analyzing acoustic signals
  3.2 Sound recordings
  3.3 Signal analysis
    3.3.1 Sound Pressure Level (SPL)
    3.3.2 Autocorrelation Function (ACF)
    3.3.3 Spectral Centroid (SC)
4 Auditory models
  4.1 Description of the auditory image model
    4.1.1 Pre Cochlear Processing (PCP)
    4.1.2 Basilar Membrane Motion (BMM)
    4.1.3 Neural Activity Pattern (NAP)
    4.1.4 Strobe Temporal Integration (STI)
    4.1.5 Autocorrelation Function (ACF)
  4.2 Auditory analysis
    4.2.1 Loudness analysis
    4.2.2 Autocorrelation analysis for pitch perception
    4.2.3 Sharpness analysis for timbre perception
5 Analysis of the perceptual results
  5.1 Description of the non-parametric modeling
  5.2 Analysis
    5.2.1 Distance
    5.2.2 Loudness
    5.2.3 Pitch
    5.2.4 Sharpness
6 Discussion
  6.1 Echolocation and loudness
  6.2 Echolocation and pitch
  6.3 Echolocation and sharpness
  6.4 Echolocation and room acoustics
  6.5 Echolocation and binaural information
  6.6 Advantages or disadvantages of the auditory model approach to human echolocation
  6.7 Theoretical implications of the thesis
7 General Conclusion
  7.1 Conclusions
  7.2 Future work
Bibliography
Appendices
A Room acoustics
  A.1 Calibration Constant
  A.2 Sound Pressure Level
  A.3 Spectral Centroid
B Auditory models
  B.1 Loudness
  B.2 Sharpness
  B.3 Pitch strength using strobe temporal integration

List of Figures

2.1 Anatomy of the human ear.
2.2 Cochlea unrolled, in cross section.
2.3 Cross section of the cochlea, and a schematic view of the organ of Corti.
2.4 An illustration of the most important pathways and nuclei from the ear to the auditory cortex.
2.5 Basic structure of the models used for the calculation of loudness.
2.6 A simulation of the basilar membrane motion for a 200 Hz sinusoid.
2.7 A simulation of the basilar membrane motion for a 500 ms iterated ripple noise with gain = 1, delay = 10 ms and number of iterations = 2.
3.1 Sound recordings made in the anechoic, conference and lecture rooms.
3.2 The autocorrelation function of a 5 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm.
3.3 The autocorrelation function of a 500 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm.
3.4 The autocorrelation function of a 5 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm.
3.5 The autocorrelation function of a 500 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm.
3.6 The autocorrelation function of a 5 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm.
3.7 The autocorrelation function of a 500 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm.
3.8 The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recording in the anechoic chamber (Experiment 1).
3.9 The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recording in the conference room (Experiment 1).
3.10 The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recording in the lecture room (Experiment 2).
4.1 The frequency response used to design the gm2002 filter of the PCP module in the AIM.
4.2 The NAP of a 200 Hz signal in the 1209 Hz frequency channel.
4.3 The dual profile of a 5 ms signal recorded in the anechoic room (Experiment 1).
4.4 The dual profile of a 5 ms signal recorded in the conference room (Experiment 1).
4.5 The dual profile of a 5 ms signal recorded in the lecture room (Experiment 2).
4.6 The dual profile of a 50 ms signal recorded in the anechoic room (Experiment 1).
4.7 The dual profile of a 50 ms signal recorded in the conference room (Experiment 1).
4.8 The dual profile of a 500 ms signal recorded in the anechoic room (Experiment 1).
4.9 The dual profile of a 500 ms signal recorded in the conference room (Experiment 1).
4.10 The dual profile of a 500 ms signal recorded in the lecture room (Experiment 2).
4.11 An example illustrating the pitch strength measure computed using the pitch strength module of the AIM.
5.1 The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 5 ms recordings in the anechoic chamber. (b) For the 5 ms recordings in the conference room.
5.2 The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 50 ms recordings in the anechoic chamber. (b) For the 50 ms recordings in the conference room.
5.3 The parametric (Weibull fit) and non-parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 500 ms recordings in the anechoic chamber. (b) For the 500 ms recordings in the conference room.
A.1 The spectral centroid as a function of time for the 10 versions (marked in different colors in each subplot) of the left ear 5 ms recording in the anechoic chamber (Experiment 1).
A.2 The spectral centroid as a function of time for the 10 versions of the right ear 5 ms recording in the anechoic chamber (Experiment 1).
A.3 The spectral centroid as a function of time for the 10 versions of the left ear 5 ms recording in the conference room (Experiment 1).
A.4 The spectral centroid as a function of time for the 10 versions of the right ear 5 ms recording in the conference room (Experiment 1).
A.5 The spectral centroid as a function of time for the 10 versions of the left ear 5 ms recording in the lecture room (Experiment 2).
A.6 The spectral centroid as a function of time for the 10 versions of the right ear 5 ms recording in the lecture room (Experiment 2).
A.7 The spectral centroid as a function of time for the 10 versions of the left ear 50 ms recording in the anechoic chamber (Experiment 1).
A.8 The spectral centroid as a function of time for the 10 versions of the right ear 50 ms recording in the anechoic chamber (Experiment 1).
A.9 The spectral centroid as a function of time for the 10 versions of the left ear 50 ms recording in the conference room (Experiment 1).
A.10 The spectral centroid as a function of time for the 10 versions of the right ear 50 ms recording in the conference room (Experiment 1).
A.11 The spectral centroid as a function of time for the 10 versions of the left ear 500 ms recording in the anechoic chamber (Experiment 1).
A.12 The spectral centroid as a function of time for the 10 versions of the right ear 500 ms recording in the anechoic chamber (Experiment 1).
A.13 The spectral centroid as a function of time for the 10 versions of the left ear 500 ms recording in the conference room (Experiment 1).
A.14 The spectral centroid as a function of time for the 10 versions of the right ear 500 ms recording in the conference room (Experiment 1).
A.15 The spectral centroid as a function of time for the 10 versions of the left ear 500 ms recording in the lecture room (Experiment 2).
A.16 The spectral centroid as a function of time for the 10 versions of the right ear 500 ms recording in the lecture room (Experiment 2).
B.1 The temporal profiles of the stabilised auditory image for a 500 ms signal recorded in the conference room (Experiment 1).

List of Tables

3.1 Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the anechoic and conference rooms of Experiment 1.
3.2 Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the lecture room of Experiment 2.
4.1 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.
4.2 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.
4.3 Mean of the maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.
4.4 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.
4.5 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.
4.6 Mean of the pitch strength (autocorrelation index) of 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.
4.7 Mean of the 10 versions of the median of the sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal.
4.8 Mean of the 10 versions of the median of the sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal.
4.9 Mean of the 10 versions of the median of the sharpness (acums) for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal.
5.1 Detection thresholds of object distance (cm) for duration, room, and listener groups.
5.2 Threshold values of loudness (sones) for duration, room, and listener groups.
5.3 Threshold values of the pitch strength (autocorrelation index) for duration, room, and listener groups.
5.4 Threshold values of the mean of the mean of median sharpness (acums) for duration, room, and listener groups.
A.1 Calibrated levels with and without A-weighting.
A.2 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 5 ms duration signal.
A.3 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 5 ms duration signal.
A.4 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with the 5 ms duration signal.
A.5 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with the 5 ms duration signal.
A.6 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 5 ms duration signal.
A.7 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 5 ms duration signal.
A.8 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 50 ms duration signal.
A.9 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 50 ms duration signal.
A.10 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 50 ms duration signal.
A.11 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 50 ms duration signal.
A.12 SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with the 500 ms duration signal.
A.13 SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with the 500 ms duration signal.
A.14 SPL values (dBA) for 10 versions of the left ear recordings in the conference room with the 500 ms duration signal.
A.15 SPL values (dBA) for 10 versions of the right ear recordings in the conference room with the 500 ms duration signal.
A.16 SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with the 500 ms duration signal.
A.17 SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with the 500 ms duration signal.
B.1 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 5 ms duration signal.
B.2 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 50 ms duration signal.
B.3 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with the 500 ms duration signal.
B.4 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 5 ms duration signal.
B.5 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 50 ms duration signal.
B.6 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with the 500 ms duration signal.
B.7 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration signal.
B.8 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration, 32 clicks signal.
B.9 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 5 ms duration, 64 clicks signal.
B.10 Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with the 500 ms duration signal.
B.11 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 5 ms duration signal.
B.12 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 50 ms duration signal.
B.13 Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with the 500 ms duration signal.
B.14 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 5 ms duration signal.
B.15 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 50 ms duration signal.
B.16 Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with the 500 ms duration signal.
B.17 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration signal.
B.18 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration, 32 clicks signal.
B.19 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 5 ms duration, 64 clicks signal.
B.20 Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with the 500 ms duration signal.

Abbreviations

ACF: Autocorrelation Function
AIM-MAT: Auditory Image Model in Matlab
AIM: Auditory Image Model
BMM: Basilar Membrane Motion
CC: Calibrated Constant
ELC: Equal Loudness Contour
ERB: Equivalent Rectangular Bandwidth
FIR: Finite Impulse Response
GLM: Generalized Linear Models
H-C-L: Halfwave rectification, Compression, Lowpass filtering
H-L: Halfwave rectification, Lowpass filtering
HP-AF: High Pass Asymmetric Function
ILD: Interaural Level Difference
IRN: Iterated Rippled Noise
ITD: Interaural Time Difference
MAF: Minimum Audible Field
MAP: Minimum Audible Pressure
NAP: Neural Activity Pattern
PCP: Pre Cochlear Processing
PS: Pitch Strength
RMS: Root Mean Square
RP: Repetition Pitch
SAI: Stabilized Auditory Image
SF: Strobe Finding
SPL: Sound Pressure Level
STI: Strobe Temporal Integration
TI: Temporal Integration
autocorr: Autocorrelation Module in the Auditory Image Model
cGC: Compressive Gamma Chirp
dcGC: Dynamic Compressive Gamma Chirp
fMRI: Functional Magnetic Resonance Imaging
gm2002: Glasberg and Moore 2002
pGC: Passive Gamma Chirp
sf2003: Strobe Finding 2003
ti2003: Temporal Integration 2003

Chapter 1
Introduction

Human echolocation, formerly known as "facial vision" or "obstacle sense", is the ability of the blind to detect objects in their environment, audition being the sensory basis for this ability (Dallenbach and Supa, 1944; Dallenbach and Cotzin, 1950). A blind person may use his or her self-generated sounds, e.g. the voice, but it is also common to use sounds generated by mechanical means, such as the shoes, a cane, or some device like a clicker, to detect an object (Schenkman and Nilsson, 2010).
Several factors influence this ability of the blind, and researchers have over the years performed various experiments to understand it. The discriminating power of this ability was studied first, and it was found that both blind and sighted listeners could detect and discriminate objects (Kellogg, 1962; Köhler, 1964; Rice, Feinstein, and Schusterman, 1965; as cited in Arias and Ramos, 1997). Later, the effect of various factors influencing the echolocation ability of the blind was studied by e.g. Schenkman (1985), who concluded that self-made vocalizations and clicks were the most effective echolocation signals and that an auditory analysis similar to the autocorrelation function (Bilsen and Ritsma, 1969; Yost, 1996) could represent the underlying psychophysical mechanism. The influence of the precedence effect on human echolocation was investigated by Seki, Ifukube, and Tanaka (1994), who performed a localization task in the vertical plane and found that the blind were more resistant to the precedence effect, with performance accuracy decreasing with decreasing distance of the (reflected) sound source. Studies were also made of the influence of exploratory movements on echolocation; it was found that, for some distances, participants were somewhat more accurate when moving than when stationary (Miura et al., 2008). Later studies by Rowan et al. (2013) and Wallmeier, Geßele, and Wiegrebe (2013) also showed that binaural information is useful in locating objects when echolocating.

Experiments were also done to find the environmental conditions and the types of signals that would favour echolocation. Schenkman and Nilsson (2010) analysed the effect of reverberation on the performance of the blind by using signals recorded in an anechoic chamber and in a conference room. They found that the blind performed better at longer distances in the latter case. However, Kolarik et al. (2014) note that the reverberation time in the study of Schenkman and Nilsson (2010) was rather low (T60 = 0.4 s), and that longer reverberation times could lead to impaired rather than improved performance. The effects of reverberation time on echolocation performance have yet to be quantified.

Regarding the types of signals favourable for echolocation, Rojas et al. (2009, 2010) suggested that short sounds generated at the palate are the most effective. On the other hand, Schenkman and Nilsson (2010) reported that longer duration signals are beneficial for echolocation. Therefore, to find which types of signals favour echolocation, Schenkman, Nilsson, and Grbic (2011) studied the influence of click trains and longer duration noise signals on echolocation performance. They found that detection of the object at 100 cm was best with both the 32 clicks/500 ms and the 500 ms noise signals, and at 150 cm with the 32 clicks/500 ms rather than the 500 ms noise signal, contradicting the results of their previous experiment, which had favored the longer duration signals. Schenkman, Nilsson, and Grbic (2011) assumed that the decrease in performance was due to the difference in the experimental setup.
In order to clarify the cause of the decrease in performance, a physical analysis was made of the stimuli used in the experiments of Schenkman, Nilsson, and Grbic (2011); it is presented in the room acoustics chapter of this thesis. Although the analysis was made to explain the decrease in performance, it should be noted that the experiments of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) excluded exploratory movements, which are probably advantageous for the blind (Miura et al., 2008). Hence, more experimental testing, taking all these factors into account, is required to conclude which types of signals are favourable for echolocation.

Another aspect that has been the focus of recent research in human echolocation is the variability of echolocation ability among the blind and the sighted. Several studies have reported that blind participants have echolocation abilities superior to those of sighted participants (Dufour et al., 2005; Schenkman and Nilsson, 2010; Schenkman and Nilsson, 2011; Kolarik et al., 2013), with variability among individuals (Schenkman and Nilsson, 2010; Teng and Whitney, 2011; Teng et al., 2012). However, the results of the psychoacoustic experiments could not explain whether the high echolocation ability of the blind is due to extensive practice, brain plasticity, or both. In some cases, even the characteristics of the acoustic stimulus that determine the detection performance of the blind are not known.

To discover whether physiological differences are the cause of the high detection performance of the blind, several researchers have analyzed the brain activity of participants. Thaler, Arnott and Goodale (2011) conducted a study using functional magnetic resonance imaging (fMRI) in one early and one late blind participant and demonstrated that echolocation activates occipital, not auditory, cortical areas, with stronger activation in the early blind participant. A more recent study by the same authors (Thaler et al., 2014) suggests that the echo-motion response in blind experts may represent reorganization rather than exaggeration of responses observed in sighted novices, and that this reorganization may involve the recruitment of visual cortical areas. However, the extent to which such recruitment contributes to the echolocation abilities of the blind remains unclear, and a combined study using neuroimaging techniques and psychoacoustic methods may give clearer insight into the role of physiology in the high echolocation ability of the blind.

Although the combination of neuroimaging and psychoacoustic methods can be expected to give some insight into the high echolocating ability of the blind, these methods do not reveal which information in the acoustic stimulus determines it (at least when that information is not already known), nor how this information is represented in the human auditory system. A reasonable way to find the information necessary for the high echolocation ability of the blind is to perform a signal analysis on the acoustic stimulus. However, such an analysis does not show us how the information is represented in the human auditory system. To solve this problem, one may use auditory models from the literature, which try to mimic human hearing. Analyzing the acoustic stimulus using these models may give insight into the causes of the high echolocation ability of the blind.
It is vital to use both signal analysis and auditory models in order to understand the differences between listeners in human echolocation, since one needs to consider the transmission of the acoustic sound from the source to the internal representation of the listener. As the sound travels, it is transformed by the room acoustics, so one should first understand which information reaches the human ear. This is where signal analysis comes into play, as we can analyze how the characteristics of the acoustic sound are transformed under various room conditions. The second step is to analyze how the characteristic of the acoustic sound that carries the information is represented in the auditory system. This is where the auditory models come into play: the information is transformed in a way similar to how the auditory system might process it. Therefore, by tracking the information from the outer ear to the central nervous system, one may understand the causes of the differences between participants; this is the research strategy of this thesis.

To model the auditory analysis performed by the human auditory system, the auditory image model of Patterson, Allerhand, and Giguere (1995), the loudness models of Glasberg and Moore (2002, 2007) and the sharpness model of Fastl and Zwicker (2007) were considered in this thesis. Matlab was chosen as the implementation environment. The auditory image model was implemented in Matlab by Bleeck, Ives, and Patterson (2004b) and the current version is known as AIM-MAT. The loudness and sharpness models were implemented in PsySound3 (Cabrera, Ferguson, and Schubert, 2007), a GUI-driven Matlab environment for the analysis of audio recordings. AIM-MAT and PsySound3 were downloaded from https://code.soundsoftware.ac.uk/projects/aimmat and http://www.psysound.org, respectively, and used in the thesis.

AIMS OF THE THESIS: (1) To find the information in the acoustic stimulus that determines the high echolocation ability of the blind. (2) To find out how this acoustic information might be represented in the human auditory system. For this we use the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011), denoted Experiment 1 and 2, respectively.

OUTLINE OF THE THESIS: The thesis is organized as follows. As the auditory models are developed on the basis of research in physiology and perception, a review of the relevant parts of these subjects is first presented in Chapter 2. Chapter 3 presents the signal analysis performed on the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) to find the information used to detect the objects. Chapter 4 describes how the auditory models were designed and implemented; the analysis of the recordings using the auditory models is also presented in this chapter. The results from the auditory models are compared with the perceptual results in Chapter 5. A discussion of the results is presented in Chapter 6, followed by the conclusions in Chapter 7.

Chapter 2
Physiology and Perception

A signal processing model of the human auditory system is designed on the basis of research in the physiology and psychology of hearing.
Therefore, it is vital to give a background to the physiological and psychological aspects of hearing in order to understand how the models may explain human echolocation.

2.1 Physiology of hearing

The auditory system consists of the auditory periphery, which encodes the acoustic sound, and the central auditory nervous system, which processes it. A brief description of how this is done is presented below.

2.1.1 Auditory periphery

The peripheral part of the auditory system consists of the ear, which transduces the sound waves from the environment into neural responses and amplifies the sound along the way. Figure 2.1 shows the structure of the human ear, which is subdivided into the outer, middle and inner ear.

Figure 2.1: Anatomy of the human ear. Figure adapted from Chittka L, Brockmann A [CC-BY-2.5 (http://creativecommons.org/licenses/by/2.5)], via Wikimedia Commons.

When sound reaches the listener, the head, torso and pinna attenuate it in a frequency dependent manner, decreasing the sound pressure at high frequencies. After this attenuation, the sound travels through the external auditory canal via the concha (the cavity that helps to funnel sound into the canal). Since the resonance frequency of the concha is close to 5 kHz and the resonance frequency of the external auditory canal is about 2.5 kHz, the concha and external auditory canal together increase the sound pressure level (SPL) by about 10 to 15 dB in the frequency range 1.5 kHz to 7 kHz. The tympanic membrane vibrates as a result of the sound waves travelling in the external auditory canal, and the vibrations are passed along the ossicular chain (Yost, 2007).

The middle ear consists of the ossicular chain (malleus, incus and stapes), which provides an effective means of delivering sound to the inner ear, where the neural process of hearing begins. Due to the difference in surface area between the tympanic membrane and the stapes footplate, and also due to the lever action of the ossicles, the pressure level increases between the eardrum and the inner ear by 30 dB or more; the actual pressure transformation depends on the frequency of the stimulus (Yost 2007, pp 75-79). Thus the middle ear works a little like a thumbtack, collecting pressure over a large area on the blunt, thumb end, and concentrating it on the sharp end (Schnupp, Nelken, and King, 2011).
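To put a rough number on this mechanical advantage, one can combine the area ratio and the ossicular lever; the figures below are typical textbook values used purely for illustration, not measurements from this thesis:

    20 \log_{10}\big((A_{tm}/A_{fp}) \cdot l\big) \approx 20 \log_{10}(17 \times 1.3) \approx 27 dB,

where A_{tm}/A_{fp} ≈ 17 is an assumed ratio of the effective area of the tympanic membrane to that of the stapes footplate, and l ≈ 1.3 is an assumed ossicular lever ratio. This is of the same order as the 30 dB quoted above; as noted, the actual transformation varies with frequency.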
The vibratory patterns representing the acoustic message reach the cochlea via the stapes. Along the entire length of the cochlea runs a structure known as the basilar membrane, which is narrow and stiff at the basal end of the cochlea (i.e. near the oval and round windows), but wide and floppy at the far, apical end. The basilar membrane subdivides the fluid-filled spaces inside the cochlea into upper compartments (the scala vestibuli and scala media) and a lower compartment (the scala tympani). Thus the cochlea is equipped with two sources of mechanical resistance, one provided by the stiffness of the basilar membrane, the other by the inertia of the cochlear fluids. The stiffness gradient decreases with distance from the oval window, but the inertial gradient increases. As the inertial resistance is frequency dependent, the path of overall lowest resistance depends on the frequency: it is long for low frequencies, which are less affected by inertia (path B in Figure 2.2), and increasingly short for high frequencies (path A in Figure 2.2). Hence, every time the stapes pushes against the oval window, low frequencies cause vibrations at the apex of the basilar membrane and high frequencies cause vibrations at the base. This property makes the cochlea operate as a mechanical frequency analyser.

Figure 2.2: Cochlea unrolled, in cross section. The grey shading represents the inertial gradient of the fluids and the stiffness gradient of the basilar membrane. Note that the gradients run in opposite directions. Figure redrawn with permission from Schnupp, Nelken, and King (2011).

However, it is to be noted that the cochlea does not have a sharp frequency resolution, and it is perhaps more useful to think of the cochlea as a set of mechanical filters (Schnupp, Nelken, and King 2011, pp 55-64).
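This filterbank view can be made concrete with a minimal sketch. The following Matlab fragment implements a single gammatone channel, a common first approximation to one such mechanical filter; it is only illustrative and is not the dynamic compressive gammachirp filterbank that AIM-MAT uses in Chapter 4 (the centre frequency, filter order and bandwidth constants are standard textbook choices, not values from this thesis):

    % One channel of a crude gammatone filterbank: a rough stand-in for the
    % mechanical filter formed by one place on the basilar membrane.
    fs = 16000;                                % sampling rate (Hz)
    t  = (0:round(0.05*fs)-1)'/fs;             % 50 ms time axis
    x  = sin(2*pi*200*t);                      % 200 Hz tone, as in Figure 2.6
    cf = 1209;                                 % centre frequency of the channel (Hz)
    erb = 24.7*(4.37*cf/1000 + 1);             % equivalent rectangular bandwidth (Hz)
    g  = t.^3 .* exp(-2*pi*1.019*erb*t) .* cos(2*pi*cf*t);  % 4th-order gammatone impulse response
    g  = g / sum(abs(g));                      % crude normalisation
    bmm = conv(x, g, 'same');                  % basilar-membrane-like output of this channel

Running one such filter for each of a set of centre frequencies spaced along the ERB scale yields simulations of the kind shown later in Figures 2.6 and 2.7.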
Another important phenomenon that the basilar membrane exhibits is the travelling wave. However, Schnupp, Nelken, and King (2011) say that describing the travelling wave as a manifestation of the sound energy can be misleading, and suggest that it is probably more accurate to imagine the mechanical vibrations as travelling along the membrane only in the sense that they travel mostly through the fluid next to the membrane and then pass through the membrane as they approach the point of lowest resistance. The travelling wave may then be mostly a curious side effect of the fact that the mechanical filters created by each small piece of the basilar membrane, together with the associated cochlear fluid columns, all happen to be slightly out of phase with each other.

The mechanical vibrations of the basilar membrane are transduced into electrical potentials by the shearing of the stereocilia in the organ of Corti against the tectorial membrane (cf. Figure 2.3). This happens as follows. A structure in the cochlea, the stria vascularis, leaks K+ ions from the bloodstream into the scala media; it also sets up an electrical voltage gradient across the basilar membrane. The stereocilia in each bundle are not all of the same length, and their tips are connected to each other by fine protein strands known as "tip links". Ion channels open in response to stretch (increased tension) on the tip links, allowing K+ ions to flow into the hair cells. The hair cells form glutamatergic, excitatory synaptic contacts with the spiral ganglion neurons at their lower end. These neurons form the long axons that travel through the auditory nerve and reach the cochlear nucleus (Schnupp, Nelken, and King, 2011).

Figure 2.3: Cross section of the cochlea, and a schematic view of the organ of Corti. Figure redrawn with permission from Schnupp, Nelken, and King (2011).

As can be seen in Figure 2.3, there are two types of hair cells, outer and inner. The inner and outer hair cells are connected to type I and type II fibers, respectively. Anatomically, type II fibers are unsuited to providing fast throughput of the encoded information (Schnupp, Nelken, and King, 2011). Hence, only the inner hair cells are known to be the biological transducers. Although the outer hair cells do not provide any neural transduction, they exhibit motility, which causes the non-linear cochlear amplification. A detailed description of how this non-linear cochlear amplification can be modeled using signal processing techniques is presented in Chapter 4.

2.1.2 Central auditory nervous system

As discussed in the section above, the auditory periphery transduces the acoustic sound. However, hearing involves more than the neural coding of sound: the encoded sound must also be processed. This processing is done by the central auditory nervous system, which consists of the cochlear nucleus, the superior olivary complex, the inferior colliculus, the medial geniculate body and the auditory cortex, among other structures; Figure 2.4 illustrates this.

Figure 2.4: An illustration of the most important pathways and nuclei from the ear to the auditory cortex. The nuclei illustrated are located in the brain stem. Figure redrawn with permission from Moore (2013).

There is evidence that many cells in the dorsal cochlear nucleus react in a manner that suggests a lateral inhibition network, which helps sharpen the neural representation of the spectral information (Yost 2007, p. 240). As the information from the left and right ears converges at the olivary nuclei, these are assumed to process the spatial perception of sound (Schnupp, Nelken, and King, 2011). The spectral and spatial information from the cochlear nucleus and the superior olivary complex is further processed and combined by the inferior colliculus. Finally, the auditory cortex processes complex sounds.

2.2 Perception

The physiological background is one main inspiration for the auditory models, but they are also based on how the physical and perceptual attributes of the acoustic sound are encoded in the auditory system. Loudness, pitch and timbre are three subjective attributes of acoustic sound that are relevant for human echolocation. Therefore, this section discusses how these attributes are encoded in the auditory system.

2.2.1 Loudness

Loudness is the perceptual attribute of intensity, defined as that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud (ASA, 1973). There is no full understanding of the mechanisms underlying the perception of loudness. The dynamic range of the auditory system is wide, and different mechanisms play a role in intensity discrimination. Psychophysical experiments suggest that neuron firing rates, spread of excitation and phase locking play a role in intensity perception, but the latter two may not always be essential.
A disadvantage with neuron firing rates is that, although the responses of single neurons in the auditory nerve can account for intensity discrimination, they do not explain why intensity discrimination is not better than observed; this suggests that discrimination is limited by the capacity of higher levels of the auditory system, which may also play a role in intensity discrimination (Moore, 2013).

Several models (cf. Moore, 2013, pp 139-140) have been proposed to calculate the average loudness that would be perceived by a large group of listeners. Figure 2.5 shows the basic structure of such a model. The model applies the outer and middle ear transformations and then calculates the excitation pattern. The excitation pattern is transformed into specific loudness, which involves a compressive non-linearity. The total area under the specific loudness pattern is assumed to be proportional to the overall loudness.

Figure 2.5: Basic structure of the models used for the calculation of loudness (stimulus; fixed filter for the transfer of the outer/middle ear; transform spectrum to excitation pattern; transform excitation pattern to specific loudness; calculate the area under the specific loudness pattern). Figure redrawn from Moore (2013).

Therefore, whatever the mechanism underlying the perception of loudness may be, the excitation pattern seems to be the essential information that should be used to design an auditory model of loudness.
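The stages of Figure 2.5 map directly onto a signal processing chain. The following Matlab sketch shows the skeleton only: the outer/middle-ear filter is omitted, the "excitation pattern" is approximated by summing power in logarithmically spaced bands, and the compressive exponent is a placeholder. None of this is the Glasberg and Moore (2002) implementation that is actually used later via PsySound3:

    % Skeleton of a loudness model in the style of Figure 2.5 (illustrative only).
    fs = 32000;
    x  = randn(round(0.5*fs), 1);            % 500 ms noise burst as the stimulus
    % Stage 1 (omitted here): fixed filter for the outer/middle-ear transfer.
    X  = abs(fft(x)).^2;                     % power spectrum of the stimulus
    f  = (0:numel(X)-1)' * fs/numel(X);      % frequency axis (Hz)
    nChan = 40;                              % number of auditory channels (placeholder)
    edges = logspace(log10(50), log10(15000), nChan+1);
    E  = zeros(nChan, 1);
    for k = 1:nChan                          % Stage 2: crude excitation pattern
        E(k) = sum(X(f >= edges(k) & f < edges(k+1)));
    end
    N  = E.^0.2;                             % Stage 3: compressive specific loudness (placeholder exponent)
    loudness = sum(N);                       % Stage 4: area under the specific loudness pattern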
2.2.2 Pitch

Pitch is defined as "that attribute of auditory sensation in terms of which sounds may be ordered on a musical scale" (ASA, 1960). How pitch is encoded is still a matter of debate. One view is that, as the cochlea is assumed to perform a spectral analysis, the acoustic vibrations are transformed into a spectrum, coded as a profile of discharge rate across the auditory nerve. An alternative view proposes that the role of the cochlea is to transduce the acoustic vibrations into temporal patterns of neural firing. These two views are known as the place and time hypotheses. Figure 2.6 shows a simulation of the basilar membrane motion for a 200 Hz sinusoid, generated using the dynamic gammachirp filterbank module available in AIM-MAT; it can be seen that both the spectral and the temporal patterns are preserved.

Figure 2.6: A simulation of the basilar membrane motion for a 200 Hz sinusoid. The figure was generated using the dynamic gamma chirp filter bank module available in AIM-MAT. It can be seen that both the place and the temporal information is preserved.

According to the place hypothesis, pitch is determined from the position of maximum excitation along the basilar membrane within the cochlea. This explains how pitch is perceived for pure tones at low levels, but it fails for pure tones at higher levels: at higher levels, due to the non-linearity of the basilar membrane (described in the physiology section), the excitation peaks become broader and tend to shift towards a lower frequency place. This should lead to a decrease in pitch; however, psychophysical experiments show that the pitch is stable. Another case where the place hypothesis fails is its inability to explain the pitch of stimuli whose fundamental is absent. According to the paradox of the missing fundamental, the pitch evoked by a pure tone remains the same if we add additional tones with frequencies that are integer multiples of that of the original pure tone (harmonics). It also does not change if we then remove the original pure tone (the fundamental) (De Cheveigné, 2010).

On the other hand, since the time hypothesis states that pitch is derived from the periodic pattern of the acoustic waveform, it overcomes the problem of the missing fundamental. The main difficulty with the time hypothesis is that it is not easy to extract one pulse per period in a way that is reliable and fully general. Psychoacoustic studies also show that pitch exists for stimuli that are not periodic. An example of such a stimulus is iterated rippled noise (IRN), a stimulus that models some human echolocation signals (cf. Figure 2.7).

Figure 2.7: A simulation of the basilar membrane motion for a 500 ms iterated ripple noise with gain = 1, delay = 10 ms and number of iterations = 2. The figure was generated using the dynamic gamma chirp filter bank module available in AIM-MAT. It can be seen that there are no periodic repetitions to support the time hypothesis.

In order to overcome the limitations of the place and time hypotheses, two further theories were proposed: pattern matching (De Boer 1956, cited in De Cheveigné 2010) and a theory based on autocorrelation (Licklider 1951, cited in De Cheveigné 2010). De Boer (1956) described the concept of pattern matching in his thesis. It states that the fundamental partial is the necessary correlate of pitch, but that it may be absent if other parts of the pattern are present. In this way pattern matching supports the place hypothesis. Later, Goldstein (1973), Wightman (1973) and Terhardt (1974) described different models of pattern matching. One problem with the pattern matching theory is that it fails to account for the pitch of stimuli that have no resolved harmonics.

The autocorrelation hypothesis assumes temporal processing in the auditory system. It states that, instead of peaks being detected at regular intervals, the periodic neural pattern is processed by coincidence-detector neurons that calculate the equivalent of an autocorrelation function (Licklider 1951, cited in De Cheveigné 2010). The spike trains are delayed within the brain by various time lags (using neural delay lines) and are combined, or correlated, with the original. When the lag equals the time delay between spikes, the correlation is high and the outputs of the coincidence detectors tuned to that lag are strong. Spike trains in each frequency channel are processed independently and the results are combined into an aggregate pattern. However, De Cheveigné (2010) notes that the autocorrelation hypothesis works too well: it predicts that pitch should be equally salient for stimuli with resolved and unresolved partials, whereas this is not the case (De Cheveigné, 2010).
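A minimal demonstration of the autocorrelation account can be run on the waveform itself (rather than on simulated spike trains, so it sketches only the principle). Adding a delayed copy to white noise, as in the echolocation stimuli of this thesis, produces an autocorrelation peak at the delay, predicting a pitch of 1/delay:

    % Repetition pitch via autocorrelation: noise plus a delayed copy of itself
    % gives an ACF peak at the delay, i.e. a predicted pitch of 1/delay.
    fs = 44100;
    d  = round(0.010*fs);                    % 10 ms delay, as for the IRN in Figure 2.7
    x  = randn(round(0.5*fs), 1);            % 500 ms white noise
    y  = x + [zeros(d,1); x(1:end-d)];       % one delay-and-add iteration (gain = 1)
    maxLag = round(0.020*fs);                % search lags up to 20 ms
    acf = zeros(maxLag, 1);
    for L = 1:maxLag                         % unnormalised autocorrelation
        acf(L) = sum(y(1:end-L) .* y(1+L:end));
    end
    [~, Lpk] = max(acf);                     % the strongest non-zero lag...
    pitchHz = fs/Lpk                         % ...lands near 1/0.010 s = 100 Hz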
An alternative to a theory based on an autocorrelation-like function is the strobe temporal integration (STI) of Patterson et al. (1995). According to STI, the auditory image underlying the perception of pitch is obtained by triggered, quantised temporal integration instead of an autocorrelation-like function. STI works by finding strobes in the neural activity pattern and integrating it over a certain period.

To summarize, there is no full understanding of how pitch is perceived. Whether temporal, spectral or multiple mechanisms determine pitch perception, the underlying information that the auditory system uses to detect pitch is the excitation pattern. Hence, the excitation pattern remains the crucial information that should be simulated to design an auditory model of pitch perception.

2.2.3 Timbre

When the loudness and pitch of two sounds are similar, the subjective attribute used to distinguish them is timbre. Timbre has been defined as that attribute of auditory sensation which enables a listener to judge that two non-identical sounds, similarly presented and having the same loudness and pitch, are dissimilar (ANSI, 1994). One example is the difference between two musical instruments playing the same tone, e.g. a guitar and a piano. Timbre is a multidimensional percept and there is no single scale on which we can order it. One approach to quantifying timbre is to consider the overall distribution of spectral energy. Plomp and his colleagues showed that the perceptual differences between sounds were closely related to the levels in 18 1/3-octave bands, thus relating timbre to the relative level produced by the sound in each critical band. Hence, for both speech and non-speech sounds, the timbre of steady tones is generally determined by the magnitude spectrum, although the relative phases may play a small role (Plomp, as cited in Moore, 2013). For time-varying sounds, several factors influence the perception of timbre, including: (i) periodicity; (ii) variation in the envelope of the waveform; (iii) whether the spectrum changes over time; and (iv) what the preceding and following sounds are like. Timbre information can be assessed with auditory models from the levels in the spectral envelope and the variation of the temporal envelope. Another way to preserve the fine-grain time-interval information necessary for timbre perception is the strobe temporal integration of Patterson et al. (1995).

Chapter 3
Room acoustics

Before analyzing how an acoustic sound might be represented in the auditory system using auditory models, it is vital to study the physics and room acoustics of the sounds that determine human echolocation. Hence, this chapter begins with a review of studies analyzing such acoustic signals.

3.1 Review of studies analyzing acoustic signals

As discussed in Chapter 2, iterated rippled noise models some human echolocation signals. A brief review of studies of this stimulus is presented first, followed by a review of studies of other acoustic stimuli used for understanding human echolocation.

Bassett and Eastmond (1964) examined the physical variations in the sound field close to a reflecting wall. They used a loudspeaker generating Gaussian noise, placed more than 5 m from a large horizontal reflecting panel in an anechoic chamber. A microphone was placed at a number of points between the loudspeaker and the panel, and an interference pattern was observed. Bassett and Eastmond reported a perceived pitch caused by the interference of the direct and reflected sound at different distances from the wall, the pitch value being equal to the inverse of the delay. In a similar way, Small Jr and McClellan (as cited in Bilsen, 1966) delayed identical pulses and found that the perceived pitch was equal to the inverse of the delay, naming it time separation pitch. Later, Bilsen and Ritsma (1969) stated that when a sound and a repetition of that sound are heard, a subjective tone is perceived with a pitch corresponding to the reciprocal of the delay time, and termed this repetition pitch. Bilsen tried to explain the repetition pitch phenomenon by autocorrelation peaks or spectral peaks. Yost (1996) performed experiments using iterated rippled noise and concluded that autocorrelation is the underlying mechanism used by listeners to detect repetition pitch.
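The spectral counterpart of these observations is a comb filter: adding to the direct sound an equally strong reflection delayed by d seconds gives the magnitude response |1 + e^{-j2*pi*f*d}|, with peaks at multiples of 1/d. The interference pattern of Bassett and Eastmond (1964) and Bilsen's spectral-peak account both follow from this. A short Matlab sketch for an object at 1 m (an assumed, illustrative geometry):

    % Comb filter created by the direct sound plus one equally strong reflection.
    % An object at 1 m gives a round-trip delay of about 2/343 s, so spectral
    % peaks appear roughly every 343/2 = 171.5 Hz.
    c = 343;                                 % speed of sound (m/s)
    d = 2*1.0/c;                             % round-trip echo delay for 1 m (s)
    f = (0:8000)';                           % frequency axis (Hz)
    H = abs(1 + exp(-1j*2*pi*f*d));          % comb filter magnitude response
    plot(f, 20*log10(H + eps));              % peaks and dips spaced by 1/d Hz
    xlabel('Frequency (Hz)'); ylabel('Gain (dB)');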
Later, Bilsen and Ritsma (1969) stated that when a sound and the repetition of that sound are listened to, a subjective tone is perceived with a pitch corresponding to the reciprocal of the delay time, and termed the pitch perceived repetition pitch. Bilsen tried to explain the repetition pitch phenomenon using peaks in the autocorrelation function or in the spectrum. Yost (1996) performed experiments using iterated ripple noise stimuli and concluded that autocorrelation is the underlying mechanism used by the listeners to detect the repetition pitch phenomenon.

Regarding other acoustic stimuli used for understanding human echolocation, Rojas et al (2009, 2010) conducted a physical analysis of the acoustic characteristics of orally produced pulses and finger produced pulses, showing that the former were better for echolocation. Papadopoulos et al (2011) examined the acoustic signals used in the study of Dufour et al (2005) and stated that the information for obstacle discrimination was found in the frequency dependent interaural level differences (ILD), especially in the range from 5.5 to 6.5 kHz, rather than in interaural time differences (ITD). Pelegrin Garcia, Roozen, and Glorieux (2013) performed a study using the boundary element method and found that frequencies above 2 kHz provide information for localization of the object, whereas the lower frequency range would be used for size determination. A similar analysis was performed by Rowan et al (2013) using a virtual auditory space technique, which came to the same conclusion, viz. that performance was primarily based on information above 2 kHz. In view of the above studies, several analyses were performed for this thesis and are presented in the remaining part of this chapter to identify the information necessary for the detection of the object.

3.2 Sound recordings

The sound recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) are used in our study. A brief description of how the recordings were made is given here. In Schenkman and Nilsson (2010), the binaural sound recordings were conducted in an ordinary conference room and in an anechoic chamber using an artificial manikin. The object was a reflecting 1.5 mm thick aluminium disk with a diameter of 0.5 m. Recordings were conducted at 0.5, 1, 2, 3, 4, and 5 m distances between the microphones and the reflecting object. In addition, recordings were made with no obstacle in front of the artificial manikin.

Figure 3.1: Sound recordings made in Experiment 1, a) anechoic room, b) conference room, with the loudspeaker on the chest of the artificial manikin, and in Experiment 2, c) lecture room, with the loudspeaker behind the artificial manikin. The pictures are reproduced with permission from Bo Schenkman.

The following durations of the noise signal were used: 500, 50, and 5 ms; the shortest corresponds perceptually to a click. The electrical signal was white noise. However, the emitted sound was not perfectly white, because of the non-flat frequency response of the loudspeaker and the system. The sounds were generated by a loudspeaker resting on the chest of the artificial manikin. The sound recording set-ups can be seen in Figures 3.1(a) and 3.1(b). In Schenkman, Nilsson, and Grbic (2011), recordings were conducted in an ordinary lecture room, at 100 and 150 cm distances between the microphones and the reflecting object.
The emitted sounds were either bursts of 5 ms each, varying in rate from 1 to 64 bursts per 500 ms, or 500 ms white noise. These sounds were generated by a loudspeaker placed 1 m straight behind the center of the head of the artificial manikin. The sound recording set-up can be seen in Figure 3.1(c). From now on, the recordings of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011) will be referred to as Experiment 1 and Experiment 2 respectively. A detailed description of the recordings can be found in Schenkman and Nilsson (2010) and in Schenkman, Nilsson, and Grbic (2011).

3.3 Signal analysis

To find out the information used for detecting an object, and to analyze how the acoustics of the room affect human echolocation, a number of different analyses were performed, namely: sound pressure level, autocorrelation, and spectral centroid. Before being analyzed, the recordings were calibrated by calibration constants (CC), using equation 3.1.¹ Based on SPLs of 77, 79 and 79 dBA for the 500 ms recording without the object at the ear of the artificial manikin in the anechoic, conference and lecture rooms of Experiment 1 and Experiment 2, the CCs were calculated to be 2.4663, 2.6283 and 3.5021 respectively.

CC = 10^{(SPL − 20 log10(rms(signal)/(20×10⁻⁶)))/20}    (3.1)

As the recordings were binaural, both the left and right ear recordings were analyzed. The recordings in Experiment 1 and Experiment 2 had 10 versions of each duration and distance. It should be noted that the recordings vary over the versions, causing the term rms(signal) in equation 3.1 to vary, and thereby the calibration constants to vary with the versions. However, as the variation is very small, in this thesis only the 9th version of the first 500 ms recording without the object (NoObject rec1) in Experiment 1 and the 9th version of the 500 ms recording without the object in Experiment 2 were used to find the above calibration constants. Another reason to choose only the 9th version is that, although the other versions may not have the same CCs, they will be relatively calibrated with respect to the recording of version 9. For example, suppose the version 1 recording in the anechoic chamber had 67 dB SPL and version 9 had 66 dB SPL before calibration; then the levels obtained by calibrating the recordings to 77 dB SPL using the CC of the 9th version would be 78 dB SPL for version 1 and 77 dB SPL for version 9. In other words, they will give the same level difference, even after calibration.

¹ The A-weighting was not included in equation 3.1. However, the difference was found to be less than 0.5 dB and hence was neglected. See section A.1 of the appendix for more details.
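As an illustration, the following is a minimal MATLAB sketch of the calibration in equations 3.1 and 3.2 (the variable names are placeholders, not those of the analysis software, and white noise stands in for the reference recording so that the sketch runs on its own):

    % Compute a calibration constant CC from a reference recording with a
    % known SPL (equation 3.1), then use it to express levels (equation 3.2).
    p0     = 20e-6;                       % reference pressure (Pa)
    SPLref = 77;                          % known SPL of the reference (dBA)
    rmsf   = @(x) sqrt(mean(x.^2));       % root mean square
    ref    = randn(22050, 1);             % placeholder for NoObject rec1, version 9
    CC     = 10^((SPLref - 20*log10(rmsf(ref)/p0))/20);   % equation 3.1
    sig    = 0.9*ref;                     % placeholder for another recording
    SPL    = 20*log10(CC*rmsf(sig)/p0);   % equation 3.2
    fprintf('CC = %.4f, SPL = %.2f dB\n', CC, SPL);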
3.3.1 Sound Pressure Level (SPL)

The detection of the objects may to a certain extent be based on an intensity difference. Hence, the SPL in dBA was calculated using equation 3.2, where rms(signal) is the root mean square amplitude of the signal analyzed. The results for the 500 ms recordings in Experiment 1 and Experiment 2 are tabulated in Tables 3.1 and 3.2. A detailed analysis of the SPL values for all 10 versions of the 5, 50 and 500 ms recordings is presented in Tables A.2 to A.17 in Appendix A.

SPL = 20 log10(CC · rms(signal)/(20×10⁻⁶))    (3.2)

Recording       Anechoic chamber        Conference room
                Left ear   Right ear    Left ear   Right ear
NoObject rec1   77.153     77.866       79.003     78.817
NoObject rec2   77.592     77.374       78.993     78.824
Object50cm      85.182     88.216       87.539     87.457
Object100cm     81.877     82.550       82.827     82.377
Object200cm     77.097     78.044       79.598     79.481
Object300cm     76.975     78.211       78.926     78.898
Object400cm     77.051     77.986       79.016     78.860
Object500cm     76.987     78.033       79.009     78.798

Table 3.1: Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the anechoic and conference rooms of Experiment 1.

Recording       Lecture room
                Left ear   Right ear
NoObject        79.165     79.577
Object100cm     79.594     81.545
Object150cm     79.412     79.681

Table 3.2: Mean of the sound pressure levels (dBA) for the left and right ears over the 10 versions of the 500 ms duration signals in the lecture room of Experiment 2.

The tabulated SPL values in Tables 3.1 and 3.2 show the effect of room acoustics in the form of level differences, both between the ears and among the rooms. The level differences between the recording without the object and the recordings with the object at 100 and 150 cm were smaller in Experiment 2 than in Experiment 1. This may be due to the differences in experimental set-up (cf. Figure 3.1) and the acoustics of the rooms. However, the extent to which this information is used by the participants is not straightforward, as the loudness perceived by the human auditory system cannot be related directly to the SPL (Moore, 2013). This issue is further discussed in Chapter 4.

3.3.2 Autocorrelation Function (ACF)

Generally, intensity differences play a role in human echolocation. However, Schenkman and Nilsson (2011) showed that repetition pitch, rather than loudness, is the more important information used by the participants to detect the objects. As discussed in the pitch perception section of Chapter 2, pitch perception can often be explained using the peaks in the autocorrelation function; hence an autocorrelation analysis is performed in this section. The repetition pitch for the recordings in Experiment 1 and Experiment 2 can be theoretically calculated using equation 3.3.

RP = speed of sound / (2 × distance of the object)    (3.3)

The corresponding values for recordings with objects at 50, 100, 150, 200, 300, 400 and 500 cm would be approximately 344, 172, 114, 86, 57, 43 and 34.4 Hz (assuming a sound velocity of 344 m/s). As the theory based on autocorrelation uses temporal information, repetition pitch perceived at the above frequencies can be explained by peaks in the autocorrelation function (ACF) at the inverse of these frequencies, i.e. approximately 2.9, 5.8, 8.7, 11.6, 17.4, 23.2 and 29 ms respectively. Therefore, the autocorrelation analysis was performed using a 32 ms frame, which covers the required pitch periods. A 32 ms hop size was used to analyze the ACF at the next time instants, 64 ms, 96 ms, etc. In order to compare the peaks among all the recordings, the ACF was not normalized to the limits −1 to 1.
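A minimal MATLAB sketch of this framed, unnormalized autocorrelation analysis is given below (the recording variable is a placeholder; xcorr from the Signal Processing Toolbox is assumed):

    % Framed ACF: 32 ms frames with a 32 ms hop; for a recording with the
    % object at 1 m, a peak is expected near the 5.8 ms lag of equation 3.3.
    fs = 44100;
    x  = randn(round(0.455*fs), 1);   % placeholder for a calibrated recording
    N  = round(0.032*fs);             % 32 ms frame length (= hop size)
    for k = 1:floor(length(x)/N)
        frame = x((k-1)*N + (1:N));
        acf   = xcorr(frame);         % unnormalized ACF, lags -(N-1)..(N-1)
        acf   = acf(N:end);           % keep the non-negative lags
        % ... inspect acf for a peak at the expected pitch period
    end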
In Experiment 1 the participants performed well with the longer duration signals (cf. Schenkman and Nilsson (2010)). They assumed that the higher detection ability of the participants for the longer duration signals may be because, although a subject may miss the repetition pitch at the first repetition, they may perceive it in the later repetitions. This can be visualized using the ACFs in Figures 3.2 and 3.3, where for the 5 ms recording the peak was present only in the initial 32 ms frame, whereas for the 500 ms recording the peak was also present in frames at time instants greater than 32 ms. (Note that for each duration an additional 450 ms of silence was padded and presented to the participants; the ACFs were analyzed in the same manner, hence the 5 ms duration signal had a total duration of 455 ms and the 500 ms signal a total duration of 950 ms.) The assumption of Schenkman and Nilsson (2010) could explain the high echolocation ability of the participants for the longer duration signals in Experiment 1. However, in Experiment 2 the performance decreased, although the repetitions were present in the frames at time instants greater than 32 ms (cf. Figures 3.6 and 3.7). Therefore, the conclusion that longer duration signals are always beneficial for human echolocation cannot be drawn from the available results.

Comparing the peak heights at the pitch period for the recordings with the object at 100 cm, the peak height for the 5 ms duration signal is greater in the conference room than in the lecture room (cf. Figures 3.4 and 3.6). The 500 ms duration signal with the object at 100 cm in the lecture room had a greater peak height than the 5 ms recording in the conference room (cf. Figures 3.4 and 3.7), but its peak is not as distinct as that of the 500 ms duration signal in the conference room (cf. Figures 3.5 and 3.7). The reason for these differences in peak heights between the conference room and the lecture room may be the room acoustics. As the ACF depends on the spectrum of the signal, the acoustics of the room certainly influence the peaks in the ACF. The reverberation times T60 for the conference and lecture rooms were 0.4 and 0.6 seconds respectively, indicating that the acoustics of the room may influence the ACF and in turn the echolocation ability. How this peak information is represented in the auditory system is further discussed in Chapter 4.

3.3.3 Spectral Centroid (SC)

Detection of an object may also be based on the efficient use of the timbre information available in the stimuli. To test this hypothesis, one has to describe the attributes of the acoustic sound which contribute to timbre perception. One attribute that describes timbre perception is the spectral centroid (Peeters et al, 2011). The spectral centroid gives a time varying value characterizing the subjective center of the timbre of a sound. Therefore, a spectral centroid analysis was performed on the recordings and is presented in this section.

Figure 3.2: The autocorrelation function of a 5 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at the 32, 64, 96, 128, 160 and 192 ms time instants of the signal respectively. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.
Figure 3.3: The autocorrelation function of a 500 ms signal recorded in the anechoic chamber (Experiment 1) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at the 32, 64, 96, 128, 160 and 192 ms time instants of the signal respectively.

Figure 3.4: The autocorrelation function of a 5 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at 32, 64, 96, 128, 160 and 192 ms respectively. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.

Figure 3.5: The autocorrelation function of a 500 ms signal recorded in the conference room (Experiment 1) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at 32, 64, 96, 128, 160 and 192 ms respectively.

Figure 3.6: The autocorrelation function of a 5 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at 32, 64, 96, 128, 160 and 192 ms respectively. As the recording is only 5 ms in duration, the autocorrelation function is only present in the first 32 ms frame.
Figure 3.7: The autocorrelation function of a 500 ms signal recorded in the lecture room (Experiment 2) with the reflecting object at 100 cm. The sub plots show the autocorrelation function at 32, 64, 96, 128, 160 and 192 ms respectively.

To compute the spectral centroid, the recordings were analyzed using a 32 ms frame with a 2 ms overlap. The spectral centroid for each frame was computed using equation 3.4. As the spectral centroid over the frames is a time varying function, it is plotted as a function of time. The mean of the spectral centroid over the 10 versions at each condition for the 500 ms left ear recordings is plotted in Figures 3.8 to 3.10. A detailed analysis of all the recordings can be seen in section A.3 of the Appendix: Figures A.1 to A.14 show the spectral centroid for the left and right ear recordings in Experiment 1 and Figures A.5 to A.16 show the spectral centroid for the left and right ear recordings in Experiment 2.

SpectralCentroid = Σ(Frequency · |FFT(frame)|) / Σ|FFT(frame)|    (3.4)

In Experiment 1 the spectral centroid for all the recordings without the object was approximately below 5000 Hz. For the recordings with the object at 50 and 100 cm it was approximately above 5000 Hz (e.g. cf. Figure 3.8), which would provide some information to distinguish them from the recordings without the object. The recordings with the object at 200 to 500 cm did not vary much compared with the recordings without the object. In Experiment 2 the spectral centroid was approximately 6000 Hz for all recordings (cf. Figure 3.10), showing very small changes which may not be useful for detection. The analysis thus showed that there was variation in the spectral centroid in the recordings of Experiment 1 with the object at shorter distances (less than 200 cm), but for longer distances the difference in the spectral centroid was almost negligible.
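A minimal MATLAB sketch of this framed spectral centroid computation (equation 3.4) is given below, with white noise standing in for a recording:

    % Framed spectral centroid: 32 ms frames with a 2 ms overlap.
    fs   = 44100;
    x    = randn(round(0.5*fs), 1);   % placeholder for a recording
    N    = round(0.032*fs);           % 32 ms frame
    hop  = N - round(0.002*fs);       % 2 ms overlap between frames
    f    = (0:N-1)'*fs/N;             % FFT bin frequencies (Hz)
    nFrm = floor((length(x)-N)/hop) + 1;
    sc   = zeros(nFrm, 1);
    for k = 1:nFrm
        mag   = abs(fft(x((k-1)*hop + (1:N))));
        half  = 1:floor(N/2);         % non-redundant half of the spectrum
        sc(k) = sum(f(half).*mag(half))/sum(mag(half));   % equation 3.4
    end
    plot((0:nFrm-1)*hop/fs, sc);
    xlabel('Time (sec)'); ylabel('Frequency (Hz)');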
Figure 3.8: The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recordings in the anechoic chamber (Experiment 1). The panels show the recordings without the object (NoObject rec1 and rec2) and with the object at 50, 100, 200, 300, 400 and 500 cm.

On the other hand, the spectral analysis performed by the auditory system is more complex than the FFT which was used here to compute the spectral centroid. It will be shown later in Chapter 4 that the above conclusion will be modified when we take into account the results of auditory models. It should also be noted that there are other attributes that describe timbre perception. The spectral centroid is considered in this thesis because it is believed to be an important feature of timbre.

Figure 3.9: The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recordings in the conference room (Experiment 1).

Figure 3.10: The mean of the spectral centroid over the 10 versions as a function of time for the left ear 500 ms recordings in the lecture room (Experiment 2).

Chapter 4

Auditory models

4.1 Description of the auditory image model

The auditory image model (AIM) is a time-domain, functional model of the signal processing performed in the auditory pathway as the system converts a sound wave into the initial perception that we experience when presented with that sound.
This representation is referred to as an auditory image, by analogy with the visual image of a scene that we experience in response to optical stimulation (Patterson et al, 1992; Patterson et al, 1995). As discussed in Chapter 2, in order to simulate the internal representation of an acoustic sound in the human auditory system, one should simulate the mechanisms of both the peripheral and the central auditory system. The AIM simplifies these concepts into different modules. How the modules are implemented using different signal processing strategies is described below.

4.1.1 Pre Cochlear Processing (PCP)

The outer and middle ear transformation of the acoustic sound is simulated in AIM using the PCP module. The PCP module consists of four different FIR filters, designed for different applications: (i) minimum audible field (MAF), which is suitable for signals presented in the free field; (ii) minimum audible pressure (MAP), which is suitable for systems which produce a flat frequency response; (iii) equal loudness contour (ELC); and (iv) Glasberg and Moore 2002 (gm2002). The ELC and gm2002 filters are almost the same and include the factors associated with the extra internal noise at low and high frequencies; however, gm2002 uses the more recent data of Glasberg and Moore (2002). The MAF, MAP and ELC filters are designed using the Parks-McClellan optimal equiripple FIR filter design algorithm, and gm2002 is designed using a frequency sampling method. An example of the frequency response used to generate a PCP filter is shown in Figure 4.1.

Figure 4.1: The frequency response used to design the gm2002 filter of the PCP module in the AIM. The frequency response was obtained from the frontal field to cochlea correction data of Glasberg and Moore (2002).

The transmission of the acoustic sound through the PCP filter can be modelled using equation 4.1, where Signal_input is the input to the AIM and Signal_pcp is the filtered output of the corresponding PCP filter.

Signal_pcp = filter(PCP_filter, Signal_input)    (4.1)

4.1.2 Basilar Membrane Motion (BMM)

An important feature of the peripheral auditory system is the non-linear spectral response of the basilar membrane. This is implemented in the AIM using a dynamic compressive gammachirp filter bank, dcGC (Irino and Patterson, 2006). Two properties of the BMM are the asymmetry of the auditory filters and their compression in proportion to the level. These properties are designed using a compressive gammachirp filter. The compressive gammachirp (cGC) filter is a generalized form of the gammatone filter, which was derived with operator techniques (Irino and Patterson, 1997). The development of both the gammatone and gammachirp filters is described in Patterson, Unoki, and Irino (2003), Appendix A. The cGC is simulated by cascading a passive gammachirp filter (pGC) with a high pass asymmetric function (HP-AF). The asymmetry is simulated by the pGC filter, and the output of the pGC is used to adjust the level dependency of the active part, i.e. the HP-AF. There are also other options available for generating the BMM in AIM, namely the gammatone function and the pole-zero filter cascade. However, the gammatone function does not depict the non-linearity of the basilar membrane. The default filterbank, dcGC, was used to simulate the BMM in this thesis. The transformation of the BMM can be modelled using equations 4.2 and 4.3.
Here Signal_pGC(f_c) is the filtered output of the pGC filterbank, f_c is the centre frequency of the filter, ACF(f_c) is the high pass asymmetric compensation filter, and Signal_cGC(f_c) is the final compressed output of the BMM stage. For a detailed description of the pGC and cGC the reader is advised to refer to Irino and Patterson (2006).

Signal_pGC(f_c) = filter(pGC(f_c), Signal_pcp)    (4.2)
Signal_cGC(f_c) = filter(ACF(f_c), Signal_pGC(f_c))    (4.3)

4.1.3 Neural Activity Pattern (NAP)

The basilar membrane motion is transduced into an electrical potential by the inner hair cells. As discussed in Chapter 2, only the stretch in the tip links of the stereocilia causes the K+ ions to flow through them; therefore the NAP can be simulated using the signal processing concept of half wave rectification. This is implemented in AIM using half wave rectification followed by low pass filtering. The low pass filtering is done because phase locking is not possible at high frequencies.

There are three modules to generate the NAP, i.e. (i) half wave rectification followed by compression followed by low pass filtering (H-C-L); (ii) half wave rectification followed by low pass filtering (H-L); and (iii) a two dimensional adaptive threshold (the same as H-C-L but with adaptation, which is more realistic). The choice of NAP module depends on the choice of BMM module. As the dcGC filter bank was used in this thesis, the compression of the basilar membrane is already simulated by it. Therefore, H-L was chosen to generate the NAP, and this transformation can be modeled using equation 4.4, where abs(Signal_bmm(f_c)) is the rectified signal of the basilar membrane, f_c is the centre frequency of the filter, LPF is the low pass filter and Signal_nap(f_c) is the modeled NAP.

Signal_nap(f_c) = filter(LPF, abs(Signal_bmm(f_c)))    (4.4)
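The following is a minimal MATLAB sketch of the H-L transformation in equation 4.4 for a single channel (a Butterworth low pass filter and its cutoff value stand in for AIM's LPF, and random noise stands in for a dcGC output channel; butter is from the Signal Processing Toolbox):

    % H-L NAP stage: rectification of one basilar membrane channel
    % followed by low pass filtering, as in equation 4.4.
    fs   = 44100;
    bmm  = randn(round(0.1*fs), 1);   % placeholder for one dcGC output channel
    rect = abs(bmm);                  % rectification, as written in equation 4.4
    fcut = 1200;                      % illustrative cutoff (Hz); phase locking
                                      % weakens at high frequencies
    [b, a] = butter(2, fcut/(fs/2));  % simple low pass filter
    nap  = filter(b, a, rect);        % the modeled NAP of this channel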
4.1.4 Strobe Temporal Integration (STI)

The next stage in the AIM is the processing done by the central nervous system. Perceptual research suggests that fine grain temporal information is needed to preserve the timbre information. General auditory models time-average the NAP, which loses this fine grain information. To prevent this, AIM uses a mechanism known as STI. This is subdivided into two modules in AIM, i.e. (i) strobe finding (SF) and (ii) temporal integration (TI).

Strobe Finding (SF): AIM uses a sub module named sf2003 to find the strobes in the NAP. The sf2003 module uses an adaptive strobe threshold to issue a strobe, and the time of the strobe is that associated with the peak of the NAP pulse. After a strobe is issued, the threshold initially rises along a parabolic path and then returns to a linear decay, to avoid spurious strobes. The duration of the parabola is proportional to the centre frequency of the channel, and its height to the height of the strobe. After the parabolic section of the adaptive threshold, its level decreases linearly to zero in 30 ms. An additional feature of sf2003 is the inter channel interaction, i.e. a strobe in one channel reduces the threshold in the neighboring channels. An example of how the threshold varies and the strobes are calculated can be seen in Figure 4.2.

Figure 4.2: The NAP of a 200 Hz pure tone in the 253 Hz frequency channel. The green line shows the threshold variation and the red dots indicate the calculated strobes.

Temporal Integration (TI): The temporal integration is implemented in AIM using a module called the stabilized auditory image (SAI). The SAI uses a sub module called ti2003 to do this. The ti2003 module changes the time dimension of the NAP into a time interval dimension. This works as follows. Initially, a temporal integration is initiated when a strobe is detected. If no further strobes are detected, the process continues for 35 ms and then stops. If strobes are detected within the 35 ms interval, each strobe initiates a temporal integration process. To keep the shape of the SAI close to that of the NAP, ti2003 uses a weighting concept, i.e. new strobes are initially weighted high (and the weights are normalized such that they sum to 1), so that older strobes contribute relatively less to the SAI.

4.1.5 Autocorrelation Function (ACF)

The AIM also offers an alternative module, named autocorr, to implement the processing done by the central nervous system. The module takes the NAP as input and computes the ACF on each center frequency channel of the NAP, using a duration of 70 ms, a hop time of 10 ms and a maximum delay of 35 ms. By using the autocorr module one can implement the autocorrelation hypothesis of Licklider (1951) mentioned in Chapter 2. This is how the AIM forms the internal representation of the acoustic sound. A detailed description of each module of AIM can be found at http://www.acousticscale.org/wiki/index.php/AIM2006_Documentation. The above mentioned modules were used to analyze the recordings in this thesis. All the processing modules of AIM were written in Matlab, and the current version is referred to as AIM-MAT. It can be downloaded from https://code.soundsoftware.ac.uk/projects/aimmat. The autocorr module was only present in the 2003 version of AIM and can be downloaded from http://w3.pdn.cam.ac.uk/groups/cnbh/aimmanual/download/downloadframeset.htm. The Matlab code from AIM-MAT was used as the implementation of the AIM for the analysis of the recordings in this thesis.
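As an illustration of what the autocorr module computes, the following rough MATLAB sketch applies a running, per-channel ACF with the stated parameters (70 ms segments, 10 ms hop, 35 ms maximum delay); a random matrix stands in for a multi-channel NAP, and this is only an outline of the principle, not the module's actual code:

    % Running per-channel ACF of a NAP: 70 ms segments, 10 ms hop,
    % 35 ms maximum delay, each frequency channel processed independently.
    fs   = 44100;
    nap  = rand(50, fs);              % placeholder NAP: 50 channels x 1 s
    N    = round(0.070*fs);           % segment length
    hop  = round(0.010*fs);           % hop time
    L    = round(0.035*fs);           % maximum delay in samples
    nFrm = floor((size(nap,2)-N)/hop) + 1;
    acf  = zeros(size(nap,1), L+1, nFrm);
    for k = 1:nFrm
        seg = nap(:, (k-1)*hop + (1:N));
        for ch = 1:size(nap,1)
            r = xcorr(seg(ch,:), L);       % lags -L..L
            acf(ch, :, k) = r(L+1:end);    % keep lags 0..L
        end
    end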
4.2 Auditory analysis

4.2.1 Loudness analysis

In the room acoustics chapter, the sound pressure level analysis was made to get a general picture of how the amplitude of the acoustic sound may affect human echolocation ability. In this section a similar analysis is made using the loudness model of Glasberg and Moore (2002), as it takes account of human hearing: loudness depends not only on frequency selectivity but also on the bandwidth and duration of the sound. The reason for choosing the model of Glasberg and Moore (2002) over AIM for the loudness analysis is clarified next. The loudness model of Glasberg and Moore (2002) computes the frequency selectivity and compression of the basilar membrane in two stages, i.e. by computing the excitation pattern and the specific loudness of the input signal. However, physiologically they are interlinked, and a time domain filter bank which simulates both the selectivity and the compression might be more appropriate. Although there are different time domain models of the level dependent auditory filters available in AIM (e.g. dcGC), they do not give a sufficiently good fit to the equal loudness contours in ISO 2006 (Moore, 2014). This was the main reason for not choosing the AIM to model loudness in this thesis. Therefore, we instead use the model of Glasberg and Moore (2002).

As discussed in the perception of loudness section in Chapter 2, a loudness model should consider the outer and middle ear filtering, the non-linearity of the basilar membrane and the temporal integration of the auditory system. The loudness model of Glasberg and Moore (2002) estimates the loudness of steady and time varying sounds by accounting for the above mentioned features of the human auditory system. Each stage of the model is described briefly below.

Outer and middle ear transformation: The outer and middle ear transformation was modeled using an FIR filter with 4097 coefficients, and the response at the inner ear can be represented using equation 4.5, where x and y_omt are the signals before and after the transformation and h is the impulse response of the filter.

y_omt = filter(h, x)    (4.5)

Excitation pattern: The excitation pattern is defined as the magnitude of the output of each auditory filter plotted as a function of filter centre frequency. To compute the excitation pattern from the time domain signal, Glasberg and Moore (2002) used six FFTs in parallel, based on Hanning-windowed segments with durations of 2, 4, 8, 16, 32, and 64 ms, all aligned at their temporal centres. The windowed segments are zero padded, and all FFTs are based on 2048 sample points. All FFTs are updated at 1 ms intervals. Each FFT was used to calculate the spectral magnitudes in a specific frequency range; values outside the range were discarded. The running spectrum was given as the input to the auditory filters, and the outputs of the auditory filters were calculated at centre frequencies spaced at 0.25 equivalent rectangular bandwidth (ERB) intervals, taking into account the known variation of the auditory filter shape with centre frequency and level. The excitation pattern is then the output of the auditory filters as a function of centre frequency (Glasberg and Moore, 2002). This can be represented using equation 4.6, where W(f_c) is the frequency response of the auditory filter at centre frequency f_c, Y_omt is the power spectrum of y_omt calculated over a 1 ms interval using the six parallel FFTs as mentioned above, and E(f_c) is the magnitude of the output of the auditory filter with centre frequency f_c.

E(f_c) = Y_omt · W(f_c)    (4.6)

Specific loudness (SL): To model the non-linearity of the basilar membrane, the excitation pattern has to be converted to specific loudness. This was done in Glasberg and Moore (2002) using three conditions (cf. equation 4.7):

SL(f_c) = C · (2E(f_c)/(E(f_c)+T_Q(f_c)))^1.5 · ((G·E(f_c)+A)^α − A^α)   if E(f_c) ≤ T_Q(f_c)
SL(f_c) = C · ((G·E(f_c)+A)^α − A^α)                                     if T_Q(f_c) ≤ E(f_c) ≤ 10^10    (4.7)
SL(f_c) = C · (E(f_c)/(1.04×10^6))^0.5                                   if E(f_c) ≥ 10^10

Here T_Q(f_c) is the threshold of excitation, which is frequency dependent. G represents the low level gain in the cochlear amplifier, relative to the gain at 500 Hz and above, and is also frequency dependent. The parameter A is used to bring the input-output function close to linear around the absolute threshold. α is a compressive exponent which varies between 0.27 and 0.2. C is a constant which scales the loudness to conform to the sone scale, where the loudness of a 1 kHz tone at 40 dB SPL is 1 sone; C is equal to 0.047.
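A minimal MATLAB sketch of the piecewise mapping in equation 4.7 is given below, written as a small function (to be saved as specificLoudness.m); the parameter values passed in are placeholders, since in the model T_Q, G, A and α all vary with centre frequency:

    % Specific loudness of one excitation value E, following equation 4.7.
    function sl = specificLoudness(E, TQ, G, A, alpha)
    C = 0.047;                            % scales loudness to the sone scale
    if E <= TQ                            % low level branch
        sl = C*(2*E/(E + TQ))^1.5 * ((G*E + A)^alpha - A^alpha);
    elseif E <= 1e10                      % mid level branch
        sl = C*((G*E + A)^alpha - A^alpha);
    else                                  % very high levels
        sl = C*(E/1.04e6)^0.5;
    end
    end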
Loudness depends not only on the intensity and bandwidth of the sound but also on other factors, especially the duration of the sound. The influence of duration on loudness was modeled by Glasberg and Moore (2002) using three concepts, namely instantaneous loudness, short term loudness and long term loudness. These depict the temporal integration of loudness in the auditory system and are described below.

Instantaneous loudness (IL): The area under the specific loudness pattern is summed to give the instantaneous loudness. If the hearing is binaural, the specific loudness patterns at the two ears are summed, and the area under the sum of the specific loudness patterns is again summed to give the instantaneous loudness. It is to be noted that the instantaneous loudness is an intervening variable used for calculation; it is not available for conscious perception.

Short Term Loudness (STL): The short term loudness is calculated by averaging the instantaneous loudness using an attack constant α_a = 0.045 and a decay constant α_r = 0.02 (cf. equation 4.8). The values of α_a and α_r were chosen so that the model gives reasonable predictions for the variation of loudness with duration and for amplitude modulated sounds (Moore, 2014).

STL(n) = α_a · IL(n) + (1 − α_a) · STL(n−1)   if IL(n) ≥ STL(n−1)    (4.8)
STL(n) = α_r · IL(n) + (1 − α_r) · STL(n−1)   if IL(n) < STL(n−1)

Long Term Loudness (LTL): The long term loudness is calculated by averaging the short term loudness using an attack constant α_a1 = 0.01 and a decay constant α_r1 = 0.0005 (cf. equation 4.9). The values of α_a1 and α_r1 were chosen so that the model gives reasonable predictions for the overall loudness of sounds that are amplitude modulated at low rates (Moore, 2014).

LTL(n) = α_a1 · STL(n) + (1 − α_a1) · LTL(n−1)   if STL(n) ≥ LTL(n−1)    (4.9)
LTL(n) = α_r1 · STL(n) + (1 − α_r1) · LTL(n−1)   if STL(n) < LTL(n−1)
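A minimal MATLAB sketch of the asymmetric attack/decay averaging in equations 4.8 and 4.9 is given below, applied to a placeholder instantaneous loudness series with one value per 1 ms step:

    % Short term and long term loudness from instantaneous loudness,
    % following equations 4.8 and 4.9.
    IL  = [linspace(0, 10, 100), zeros(1, 400)];   % placeholder IL (sones)
    aa  = 0.045;  ar  = 0.02;              % STL attack and decay constants
    aa1 = 0.01;   ar1 = 0.0005;            % LTL attack and decay constants
    STL = zeros(size(IL));  LTL = zeros(size(IL));
    for n = 2:numel(IL)
        if IL(n) >= STL(n-1)                        % equation 4.8
            STL(n) = aa*IL(n)  + (1-aa)*STL(n-1);
        else
            STL(n) = ar*IL(n)  + (1-ar)*STL(n-1);
        end
        if STL(n) >= LTL(n-1)                       % equation 4.9
            LTL(n) = aa1*STL(n) + (1-aa1)*LTL(n-1);
        else
            LTL(n) = ar1*STL(n) + (1-ar1)*LTL(n-1);
        end
    end
    plot(1:numel(IL), IL, 1:numel(IL), STL, 1:numel(IL), LTL);
    legend('IL', 'STL', 'LTL');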
Another important characteristic that affects the loudness of a sound is the influence of the intensity at the two ears. To model binaural loudness, several psychoacoustic results have been considered (for details see Moore, 2014). Some early results found that the level difference required for equal loudness (LDEL) of monaurally and diotically presented sounds was approximately 10 dB. As the loudness of a sound doubles with every 10 dB rise in intensity, it was assumed in the loudness model of Glasberg and Moore (2002) that loudness sums across the ears. However, more recent results suggest that the LDEL is rather 5 to 6 dB. Moore and Glasberg described a model to account for these results using the concept of inhibition, whereby a strong input to one ear can inhibit the internal response evoked by a weaker input to the other ear (Moore, 2014). Moore and Glasberg implemented the inhibition between the ears by using gain functions. First the specific loudness patterns were smoothed using a Gaussian weighting function, and the relative values of the smoothed functions at the two ears were used to compute the gain functions of the ears. The gains were then applied to the specific loudness patterns of the two ears. The loudness at each ear was computed by summing the specific loudness over the centre frequencies, and the binaural loudness was obtained by summing the loudness values across the two ears (Moore, 2014). This procedure was used to compute the binaural loudness in this thesis.

The binaural loudness model of Moore and Glasberg is implemented in PsySound3, a GUI-driven Matlab environment for the analysis of audio recordings. The software can be downloaded from http://www.psysound.org. This Matlab code was used to calculate the loudness of our recordings.

Glasberg and Moore (2002) assumed that the loudness of a brief sound is determined by the maximum of the short term loudness, while the long term loudness may correspond to the memory of the loudness of an event, which can last for several seconds. It is to be noted that for a time varying sound (e.g. an amplitude modulated tone) it is appropriate to consider the long term loudness as a function of time to calculate the time varying loudness. However, in this thesis, as the stimuli presented to the participants were noise bursts that can be considered steady and brief, we follow the assumption of Glasberg and Moore (2002) and use the maximum of the short term loudness as a measure of the loudness of the recordings. The results of the maximum of the short term loudness for the recordings in Experiment 1 and Experiment 2 are tabulated in Tables B.1 to B.10 in Appendix B.

Recording       Anechoic   Conference   Lecture
NoObjectrec1    13.357     19.320       15.497
NoObjectrec2    13.296     19.376       -
Object50cm      20.674     26.707       -
Object100cm     20.194     24.377       17.160
Object150cm     -          -            16.179
Object200cm     14.404     21.537       -
Object300cm     13.347     19.651       -
Object400cm     13.379     19.975       -
Object500cm     13.420     19.529       -

Table 4.1: Mean of the maximum of the Short Term Loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal. The blank cells indicate that there were no recordings made at those distances.

Recording       Anechoic   Conference   Lecture
NoObjectrec1    40.090     44.999       -
NoObjectrec2    40.023     45.072       -
Object50cm      63.672     69.607       -
Object100cm     52.307     55.682       -
Object150cm     -          -            -
Object200cm     40.320     47.619       -
Object300cm     40.292     45.135       -
Object400cm     40.213     45.249       -
Object500cm     40.089     45.041       -

Table 4.2: Mean of the maximum of the Short Term Loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal. The blank cells indicate that there were no recordings made at those distances.

Recording       Anechoic   Conference   Lecture
NoObjectrec1    48.137     52.444       52.013
NoObjectrec2    48.082     52.487       -
Object50cm      76.143     78.659       -
Object100cm     62.159     63.574       54.712
Object150cm     -          -            52.466
Object200cm     48.353     54.580       -
Object300cm     48.377     52.387       -
Object400cm     48.187     52.569       -
Object500cm     48.131     52.502       -

Table 4.3: Mean of the maximum of the Short Term Loudness in sones over the 10 versions for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal. The blank cells indicate that there were no recordings made at those distances.

The means of the maximum of the STL over the 10 versions for the 5, 50 and 500 ms recordings in the different room conditions are presented in Tables 4.1, 4.2 and 4.3. From the tabulated data, the loudness difference between the recording without the object and the recording with the object at 100 cm was smaller in the lecture room than in the anechoic or conference room. This may be the reason for the low performance of the participants in the lecture room. Another comparison is that the loudness values follow the same trend as in the sound pressure level analysis of the room acoustics chapter (cf. Tables 3.2 and 4.3). However, the values in Tables 4.1 to 4.3 are psychophysical and depict not only the acoustics of the rooms but also take into account relevant aspects of human hearing. A detailed comparison of the loudness results with the performance of the participants is made in Chapter 5.
4.2.2 Autocorrelation analysis for pitch perception

4.2.2.1 Dual profile:

As discussed in the room acoustics chapter, one of the phenomena that echolocators use in detecting objects is the repetition pitch. The repetition pitch is generally perceived at a frequency equal to the inverse of the delay time between the sound and its reflection (Bilsen and Ritsma, 1969). In the experiments of Schenkman and Nilsson (2010) and Schenkman, Nilsson, and Grbic (2011), the objects were at 50, 100, 150, 200, 300, 400 and 500 cm. These distances correspond to delays of 2.9, 5.8, 8.7, 11.6, 17.4, 23.2 and 29 ms. The frequencies of the pitch perceived for these delays would be 344, 172, 114, 86, 57, 43 and 34 Hz. However, it is to be noted that the actual delays may vary due to different factors such as the recording set-up, the speed of sound, etc.

To test for the presence of repetition pitch at these frequencies, and to study how this information would be represented in the auditory system, the PCP, BMM and NAP modules mentioned in the description of the AIM (cf. section 4.1) were used to analyze the recordings. Most of the previous research done to explain the repetition pitch perception of iterated rippled noise stimuli states that the peaks in the autocorrelation function are the basis for the repetition pitch perception (Yost 1996; Patterson et al 1996). Hence, instead of using the strobe finding and temporal integration modules, the autocorr module of the AIM was used as the final stage in this thesis to quantify the repetition pitch information. The reader should note that not choosing the strobe temporal integration as the final stage in this thesis does not mean that it is not the way in which the pitch information is represented in the auditory system. As previous research on iterated rippled noise has quantified the repetition pitch perception using the autocorrelation theory, this thesis follows in its footsteps and quantifies the repetition pitch that is known to be useful for echolocation using the same principle of autocorrelation. To know whether or not the strobe temporal integration is the way in which this repetition pitch is represented in the auditory system, a further analysis is needed, but this is left as future work. For the interested reader, an example figure of the results obtained using the strobe temporal integration module is presented in Appendix B.3.

After generating the ACF using the autocorr module, the dual profile development module in AIM sums the ACF along both the temporal and the spectral dimensions. This is relevant to human hearing in depicting how the temporal and spectral information might be represented. An important feature of the dual profile module is that it plots both the temporal and the spectral sum on the frequency axis, in a single plot. For this, the temporal and spectral profiles were scaled, and the inverse relation of time versus frequency (f = 1/t) was used to plot both time and frequency on a frequency scale. As these features of the dual profile module are useful for analyzing the repetition pitch, this module was used to analyze the temporal and spectral results. The recordings with the object at 300 to 500 cm in Experiment 1, and the 5 ms click trains with 2, 4, 8, 16, 32 and 64 bursts in Experiment 2, do not provide any additional information and were not analyzed.
It is to be noted that the temporal profile (the blue line in the figures below) is calculated by summing the ACF output across 100 critical bands (50 Hz to 8000 Hz) at each time delay, and the spectral profile (the red line in the figures below) is calculated by summing the ACF output in each critical band over the 35 ms of time delays. Therefore, the temporal profile consists of 35 ms of delay samples and the spectral profile consists of 100 samples. When the recordings were presented to the participants, they were presented as 5, 50 or 500 ms of sound plus an additional 450 ms of silence. Hence all the analyses of the recordings in this thesis were done on the same principle, i.e. the whole signal was analyzed (e.g. the 5 ms recordings consisted of 5 ms of sound plus 450 ms of silence). However, for presenting the figures, the first 70 ms time interval of the recordings was used.
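A rough MATLAB sketch of this dual profile computation is given below; the ACF array, with 100 channels and 35 ms of lags, is a random placeholder, and the channel centre frequencies are assumed here to be log-spaced between 50 Hz and 8000 Hz:

    % Dual profile: sum a per-channel ACF across channels (temporal
    % profile) and across lags (spectral profile), and plot both against
    % frequency using f = 1/t for the lag axis.
    fs    = 44100;
    L     = round(0.035*fs);          % 35 ms of lags
    acf   = rand(100, L);             % placeholder: 100 channels x L lags
    tProf = sum(acf, 1);              % temporal profile, one value per lag
    sProf = sum(acf, 2);              % spectral profile, one value per channel
    lagF  = fs./(1:L);                % lag axis converted to frequency
    cfs   = logspace(log10(50), log10(8000), 100);   % assumed centre freqs
    semilogx(lagF, tProf/max(tProf), 'b', cfs, sProf/max(sProf), 'r');
    xlabel('Frequency [Hz]');
    legend('temporal profile', 'spectral profile');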
In the analysis of the 5 ms recordings, peaks were identified both in the temporal profile (blue line) and in the spectral profile (red line) (cf. Figures 4.3 to 4.5). Note that the amplitude scale of the y axis is different in each sub figure of a particular figure. As the investigated attribute in this section is pitch, the sub figures should be compared with the No object sub figure of that figure; a distinct peak in any other sub figure which is absent in the No object sub figure indicates the possibility of a pitch perception. There were small spectral differences, but these do not indicate any pitch information. In the temporal profile, peaks were identified approximately at the theoretical repetition pitch frequency of 86 Hz for the recordings with the object at 200 cm in the conference room (Experiment 1), and at 172 and 114 Hz for the recordings with the object at 100 and 150 cm in the lecture room (Experiment 2) (cf. Figures 4.4(d), 4.5(b) and 4.5(c)).

In the 50 ms and 500 ms recordings, distinct peaks that could explain a pitch perception were absent in the spectral profile (cf. Figures 4.6 to 4.10). The temporal profiles of Figures 4.6 to 4.10 might have some peaks approximately around the theoretical frequencies of the repetition pitch, but these are not clearly visible due to the scaling of the figures. Therefore, from the dual profile analysis it can be concluded that the spectral profile (red line) does not provide any information for pitch perception. On the other hand, it is not certain from this analysis that it is the temporal profile (blue line) that is necessary for detecting the objects based on repetition pitch, as the peaks were not clearly visible. A further analysis which quantifies the peaks in the temporal profile is needed. To determine whether it is the temporal information that is necessary for detecting the objects based on repetition pitch, the pitch strength development module of AIM, which measures the pitch perceived based on the peak strength, was used. This is further discussed in the next subsection, where it will be shown that the temporal profile has peaks at the theoretical frequencies of the repetition pitch, which explains the perception of the repetition pitch phenomenon.

Figure 4.3: The dual profile of a 5 ms signal recorded in the anechoic room (Experiment 1), for (a) no object and the object at (b) 50 cm, (c) 100 cm and (d) 200 cm. The blue line is the sum of the ACF along the spectral axis and the red line is the sum of the ACF along the time delay axis, for a 70 ms time interval. The temporal and spectral profiles are scaled to be compared to each other. The x axis of the temporal profile is converted using the inverse relationship of time and frequency, f = 1/t. Note that the amplitude scale of the y axis is different in each sub figure; as the investigated attribute is pitch, the sub figures should be compared with the No object sub figure. A distinct peak in any other sub figure which is absent in the No object sub figure indicates the possibility of a pitch perception.

Figure 4.4: The dual profile of a 5 ms signal recorded in the conference room (Experiment 1); sub figures, scaling and axes as in Figure 4.3. In sub figure (d) a peak appears approximately at the theoretical frequency (86 Hz) of the repetition pitch.
Figure 4.5: The dual profile of a 5 ms signal recorded in the lecture room (Experiment 2), for (a) no object and the object at (b) 100 cm and (c) 150 cm; scaling and axes as in Figure 4.3. In sub figures (b) and (c) peaks appear approximately at the theoretical frequencies (172 Hz and 115 Hz) of the repetition pitch.

Figure 4.6: The dual profile of a 50 ms signal recorded in the anechoic room (Experiment 1); sub figures, scaling and axes as in Figure 4.3.
Figure 4.7: The dual profile of a 50 ms signal recorded in the conference room (Experiment 1); sub figures, scaling and axes as in Figure 4.3.

Figure 4.8: The dual profile of a 500 ms signal recorded in the anechoic room (Experiment 1); sub figures, scaling and axes as in Figure 4.3.

Figure 4.9: The dual profile of a 500 ms signal recorded in the conference room (Experiment 1); sub figures, scaling and axes as in Figure 4.3.
Figure 4.9: The dual profile of a 500 ms signal recorded in the conference room (Experiment 1). Panels: (a) No object; (b) Object at 50 cm; (c) Object at 100 cm; (d) Object at 200 cm. Line conventions, scaling and axes are as in Figure 4.5.

Figure 4.10: The dual profile of a 500 ms signal recorded in the lecture room (Experiment 2). Panels: (a) No object; (b) Object at 100 cm; (c) Object at 150 cm. Line conventions, scaling and axes are as in Figure 4.5.

4.2.2.2 Pitch strength

As the peaks were randomly distributed in the temporal profile of the autocorrelation function computed using the dual profile module of AIM, it is not obvious which peak corresponds to a pitch. To solve this issue, the auditory image model contains a pitch strength module which calculates a pitch strength to determine whether a particular peak is valid or not. The pitch strength module first identifies the local maxima and their corresponding local minima. The ratio of the peak height to the peak width of a peak (local maximum) is subtracted from the mean of the peak height between the two adjacent local minima to obtain the pitch strength (PS) of that peak. Two modifications were made to the pitch strength algorithm to improve its performance for the analysis in this thesis: 1) the low pass filtering was removed, as it smooths out the peaks, and 2) the pitch strength was measured using equation 4.10. Figure 4.11 illustrates the present pitch strength algorithm. The peak with the greatest height has the greatest pitch strength and gives the perceived frequency of the repetition pitch.

Pitch strength = Peak height - MEAN(Peak height between two adjacent local minima). (4.10)
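A minimal Python sketch of the modified measure in equation 4.10, using SciPy peak picking as a stand-in for the AIM pitch strength module (function and variable names are illustrative, and "peak height between two adjacent local minima" is read here as the ACF values between the two flanking minima):

    import numpy as np
    from scipy.signal import argrelextrema

    def pitch_strengths(acf, lags_s):
        """Pitch strength of each local maximum of a temporal ACF
        profile, per equation 4.10: peak height minus the mean of the
        ACF between the two flanking local minima (no low pass
        smoothing). Returns a list of (frequency_hz, strength) pairs.
        """
        acf = np.asarray(acf, dtype=float)
        maxima = argrelextrema(acf, np.greater)[0]
        minima = argrelextrema(acf, np.less)[0]
        out = []
        for m in maxima:
            left = minima[minima < m]
            right = minima[minima > m]
            if len(left) == 0 or len(right) == 0:
                continue  # skip peaks without two flanking minima
            lo, hi = left[-1], right[0]
            strength = acf[m] - acf[lo:hi + 1].mean()
            out.append((1.0 / lags_s[m], strength))
        return out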
Figure 4.11: An example illustrating the pitch strength measure computed using the pitch strength module of the AIM. The marked peak at 233 Hz has a pitch strength of 0.48. The blue dot indicates the local maximum and the two red dots are the corresponding local minima. The vertical pink line is the pitch strength calculated using equation 4.10. The frequency in Hz was computed by inverting the time delay, f = 1/t.

The results of the calculated pitch strengths for the recordings of Experiment 1 and Experiment 2 are tabulated in Tables 4.4 to 4.6.

                  Anechoic   Conference   Lecture
    NoObjectrec1  0.37       0.78         0.19
    NoObjectrec2  0.40       0.80         -
    Object50cm    2.54       9.65         -
    Object100cm   1.06       2.43         0.29
    Object150cm   -          -            0.28
    Object200cm   0.35       0.85         -

Table 4.4: Mean of the pitch strength (autocorrelation index) over the 10 versions of the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal. A dash indicates that no recording was made at that distance.

                  Anechoic   Conference   Lecture
    NoObjectrec1  0.47       0.55         -
    NoObjectrec2  0.44       0.52         -
    Object50cm    4.17       6.59         -
    Object100cm   1.67       2.22         -
    Object150cm   -          -            -
    Object200cm   0.42       0.54         -

Table 4.5: Mean of the pitch strength (autocorrelation index) over the 10 versions of the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal. A dash indicates that no recording was made at that distance.

                  Anechoic   Conference   Lecture
    NoObjectrec1  0.71       0.84         1.30
    NoObjectrec2  0.78       0.90         -
    Object50cm    4.75       7.74         -
    Object100cm   2.44       2.91         1.36
    Object150cm   -          -            1.42
    Object200cm   0.70       1.35         -

Table 4.6: Mean of the pitch strength (autocorrelation index) over the 10 versions of the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal. A dash indicates that no recording was made at that distance.

It should be noted that peaks were also identified for the recordings without the object, which should not give rise to any pitch perception. This is because the pitch strength algorithm identifies local maxima and minima and hence calculates a pitch strength for all random peaks (local maxima). The unit of pitch strength in this analysis is the autocorrelation index, as it is computed on the autocorrelation function. The tabulated data show that for the 5 ms and 50 ms duration signals the pitch strength was greater than 1 at 50 and 100 cm in the anechoic and conference rooms (cf. Tables 4.4 and 4.5). For the 500 ms duration signal the strength was greater than 1 at 50 and 100 cm in the anechoic room and at 50, 100 and 200 cm in the conference room. Although the lecture room also had a pitch strength greater than 1 in this condition, the computed pitch strength was not consistent over a single frequency and lasted only for 4 to 8 time frames (the time frames covered 35 ms of time delay, computed from a 70 ms interval of the NAP signal, with a hop time of 10 ms). This was not the case for the anechoic and conference rooms, which had high pitch strengths at a particular frequency that lasted for 14 to 18 such time frames.
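The frame bookkeeping behind this comparison can be sketched as follows (a simplified Python illustration; it assumes that the frequency of the strongest ACF peak has already been extracted for each 35 ms frame, and the 5 Hz tolerance is an assumption of the sketch, not a value from the thesis):

    def peak_persistence(frame_peak_freqs_hz, tolerance_hz=5.0):
        """Longest run of consecutive analysis frames (35 ms windows,
        10 ms hop) whose strongest ACF peak stays at roughly the same
        frequency. A long stable run (14-18 frames, as in the anechoic
        and conference rooms) suggests a consistent repetition pitch;
        a short run (4-8 frames, as in the lecture room) does not.
        """
        if not frame_peak_freqs_hz:
            return 0
        best = run = 1
        for prev, cur in zip(frame_peak_freqs_hz, frame_peak_freqs_hz[1:]):
            run = run + 1 if abs(cur - prev) <= tolerance_hz else 1
            best = max(best, run)
        return best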
The perceptual results in Experiment 1 and Experiment 2 show that the participants were able to detect the objects with a high proportion of correct responses at 50 and 100 cm in the anechoic room and at 50, 100 and 200 cm in the conference room (cf. Schenkman and Nilsson 2010; Schenkman, Nilsson, and Grbic 2011). As discussed above, the pitch strength was greater than 1 in these conditions.

Assuming that pitch is the underlying information that the participants used to detect the objects at these distances, the comparison above suggests that there might be a perceptual threshold of 1 (autocorrelation index) for pitch strength, and that a peak of that pitch strength must persist for a certain number of time frames in order for the participants to perceive the repetition pitch. This persistence is determined by the acoustics of the room. A further comparison of the pitch strength results with the performance of the participants is made in Chapter 5.

4.2.3 Sharpness analysis for timbre perception

In the room acoustics chapter the spectral centroid was used as a measure related to timbre perception. However, the spectral centroid was computed on the time varying Fourier transform. To take the characteristics of human hearing into account, Fastl and Zwicker (2007) computed a weighted centroid of the specific loudness rather than of the Fourier transform. This measure is known as sharpness, a measure of the extent to which a sound is perceived as varying from dull to sharp. The sharpness analysis of our recordings was made using code available from Psysound. As the sharpness varies over time, its median is used to depict the perceived sharpness. The means over the 10 versions of the median perceived sharpness in the anechoic, conference and lecture rooms for the 5, 50 and 500 ms duration signals are tabulated in Tables 4.7 to 4.9. The results for all the recordings can be seen in Appendix B, Tables B.11 to B.20.

According to Pedrielli, Carletti, and Casazza (2008), their participants had a just noticeable difference for sharpness of 0.04 acum. The results in Tables 4.7 to 4.9 show that for Experiment 1 the difference in median sharpness between the recordings with the object at 50 and 100 cm and the recording without the object was greater than 0.04 acum. For Experiment 2 the differences between the recordings with and without the object were smaller than in Experiment 1, but they were still greater than 0.04 acum. However, at smaller distances (less than 200 cm) repetition pitch and loudness might be more relevant sources of information for the participants to echolocate than sharpness. The recordings of Experiment 1 with objects at distances of 200, 300, 400 and 500 cm for the 5 ms (anechoic, conference), 50 ms (anechoic, conference) and 500 ms (conference) signal durations had differences in median sharpness of less than 0.04 acum when compared with the recording without the object. For the 500 ms signal duration in the anechoic room, the recordings with the object at 400 cm and 500 cm had differences in sharpness greater than 0.04 acum when compared with the recording without the object (cf. Table B.13 in the appendix). This might be the information that the blind participants in Experiment 1 used to identify the object at distances of 400 cm and longer. A detailed analysis of the results together with the performance of the participants is made in Chapter 5.

                  Anechoic   Conference   Lecture
    NoObjectrec1  1.888      1.972        1.849
    NoObjectrec2  1.900      1.983        -
    Object50cm    2.052      2.032        -
    Object100cm   2.138      2.032        1.778
    Object150cm   -          -            1.834
    Object200cm   1.921      2.003        -
    Object300cm   1.906      2.009        -
    Object400cm   1.891      1.982        -
    Object500cm   1.889      1.986        -

Table 4.7: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 5 ms duration signal. A dash indicates that no recording was made at that distance.

                  Anechoic   Conference   Lecture
    NoObjectrec1  1.889      1.893        -
    NoObjectrec2  1.901      1.894        -
    Object50cm    2.068      1.964        -
    Object100cm   2.141      1.950        -
    Object150cm   -          -            -
    Object200cm   1.912      1.936        -
    Object300cm   1.904      1.914        -
    Object400cm   1.874      1.917        -
    Object500cm   1.881      1.888        -

Table 4.8: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 50 ms duration signal. A dash indicates that no recording was made at that distance.

                  Anechoic   Conference   Lecture
    NoObjectrec1  1.861      1.935        2.072
    NoObjectrec2  1.882      1.938        -
    Object50cm    2.116      2.095        -
    Object100cm   2.119      2.043        2.200
    Object150cm   -          -            2.110
    Object200cm   1.892      1.967        -
    Object300cm   1.858      1.950        -
    Object400cm   1.831      1.949        -
    Object500cm   1.835      1.941        -

Table 4.9: Mean over the 10 versions of the median sharpness (acum) for the recordings in the anechoic, conference and lecture rooms with the 500 ms duration signal. A dash indicates that no recording was made at that distance.
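The sharpness values in Tables 4.7 to 4.9 were obtained with Psysound3; the underlying computation can be sketched as follows in Python (a minimal sketch of the Fastl and Zwicker (2007) weighted centroid, using one common fit of the weighting g(z); the exact weighting and calibration used in Psysound3 may differ):

    import numpy as np

    def sharpness_acum(specific_loudness, dz=0.1):
        """Zwicker type sharpness: the weighted centroid of the specific
        loudness N'(z) over the Bark scale, 0 to 24 Bark sampled every
        dz Bark. The weighting g(z) emphasizes the highest critical
        bands; below about 16 Bark it is 1.
        """
        n_prime = np.asarray(specific_loudness, dtype=float)
        z = np.arange(len(n_prime)) * dz
        g = np.where(z <= 16.0, 1.0, 0.066 * np.exp(0.171 * z))
        total = n_prime.sum() * dz                  # total loudness (sone)
        if total <= 0:
            return 0.0
        return 0.11 * (n_prime * g * z).sum() * dz / total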
Chapter 5

Analysis of the perceptual results

5.1 Description of the non parametric modeling:

A psychometric function is used in psychoacoustics to relate the perceptual results to the physical parameters of the stimulus. Traditionally the psychometric function is estimated using parametric fitting, i.e. assuming a true function that can be described by a specific parametric model and then estimating the parameters of that model by maximizing the likelihood. However, in practice the correct parametric model underlying the psychometric function is unknown, and estimating the psychometric function from such a model may lead to incorrect interpretations (Zychaluk and Foster, 2009). To solve this problem, Zychaluk and Foster (2009) implemented a non parametric model to estimate the psychometric function, i.e. the psychometric function is modeled locally without any assumption of a true function. Therefore, the method proposed by Zychaluk and Foster (2009) is used in our analysis. Below, a brief description of the non parametric model for estimating the underlying psychometric function is given, followed by the analysis of the results.

A generalized linear model (GLM) is usually used when fitting a psychometric function with parametric modeling. It consists of three components: a random component from the exponential family, a systematic component η, and a monotonic differentiable link function g that relates the two. Hence, the psychometric function P(x) can be modeled using equation 5.1. The parameters of the GLM are estimated by maximizing the appropriate likelihood function (Zychaluk and Foster, 2009). The efficiency of the GLM relies on how well the chosen link function g approximates the true function.

η(x) = g[P(x)] (5.1)

In the non parametric modeling, instead of fitting through the link function g, the function η is fitted using a local linear method, i.e. for a given point x, the value η(u) at any point u in a neighbourhood of x is approximated by the first order expansion in equation 5.2 (Zychaluk and Foster, 2009),

η(u) ≈ η(x) + (u − x)η′(x) (5.2)

where η′(x) is the first derivative of η. The actual estimate of the value of η(x) is obtained by fitting this approximation to the data over the prescribed neighbourhood of x. Two features are important for this purpose: the kernel K and the bandwidth h. A Gaussian kernel is preferred as it has unbounded support and is best for widely spaced levels. An optimal bandwidth can be chosen using plug-in, bootstrap or cross validation methods (Zychaluk and Foster, 2009).
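A simplified Python sketch of this local linear idea, together with the threshold read-out used later in this chapter (the published method maximizes a local likelihood through a link function, and the thesis used the authors' Matlab implementation; here plain kernel weighted least squares on the observed proportions is used, and the bandwidth h is fixed rather than chosen by bootstrap or cross validation):

    import numpy as np

    def local_linear_fit(x, p, x_eval, h):
        """Kernel weighted local linear regression of proportion correct.

        x, p   : stimulus levels and observed proportions correct
        x_eval : points at which the fitted function is evaluated
        h      : bandwidth of the Gaussian kernel
        """
        x, p = np.asarray(x, float), np.asarray(p, float)
        fit = []
        for x0 in x_eval:
            w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel
            X = np.stack([np.ones_like(x), x - x0], axis=1)
            XtW = X.T * w                            # weighted design
            beta = np.linalg.solve(XtW @ X, XtW @ p)
            fit.append(beta[0])                      # local intercept = fit at x0
        return np.array(fit)

    def distance_threshold(x_eval, fit, criterion=0.75):
        """Largest distance at which the fitted proportion correct still
        reaches the criterion (the thesis averages the values bracketing
        0.73 to 0.75)."""
        above = np.where(fit >= criterion)[0]
        return x_eval[above[-1]] if len(above) else float('nan')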
As no method is guaranteed to always work, the bootstrap method with 30 replications was chosen in our analysis to find the optimal bandwidth. When the bootstrap method failed to find the optimal bandwidth, cross validation was used instead.

5.2 Analysis

5.2.1 Distance

Initially the psychometric function was fitted to the mean proportion of correct responses as a function of distance. Figures 5.1, 5.2 and 5.3 show the non parametric (local linear fit) and parametric modeling of the blind participants' perceptual results with respect to distance for the recordings of the 5, 50 and 500 ms signals in the anechoic and conference rooms.

Figure 5.1: The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 5 ms recordings in the anechoic chamber. (b) For the 5 ms recordings in the conference room.

Figure 5.2: The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 50 ms recordings in the anechoic chamber. (b) For the 50 ms recordings in the conference room.

Figure 5.3: The parametric (Weibull fit) and non parametric (local linear fit) modeling of the mean proportion of correct responses of the blind participants as a function of distance. (a) For the 500 ms recordings in the anechoic chamber. (b) For the 500 ms recordings in the conference room.

The link function used for the parametric modeling was the Weibull function. As this link function was not appropriate, the parametric fit does not correlate well with the perceptual results, whereas the local linear fit does. This demonstrates the advantage of using non parametric modeling. It is to be noted that the means of the proportions of correct responses of the participants were used for the psychometric fitting in this chapter. If the individual responses had been used, the thresholds of the individual participants would vary, but the local linear fit would probably still correlate well with the perceptual results.
Hence, the results in the remainder of this chapter are based on the psychometric function obtained with the local linear fit. The Matlab implementation of the non parametric model fitting by Zychaluk and Foster (2009) was used for this purpose. The local linear fit needs at least 3 stimulus values to make the fit. As the recordings in the lecture room (Experiment 2) had only two stimulus values, i.e. at 100 and 150 cm, it was not possible to make a psychometric fit for these recordings.

When a subject's proportion of correct responses is 0.75, one can say that the subject can detect the object. Hence the threshold values of the stimulus in this chapter were chosen at this proportion of correct responses. The term threshold refers to the subjective threshold, as the output of the auditory models depicts human hearing. The thresholds of loudness, repetition pitch and sharpness refer to the absolute threshold at which a participant can echolocate using the respective subjective attribute. The threshold of distance refers to the distance at which a person may detect an object with a certain probability. As the fitted psychometric function is discrete, it may not have a value at exactly 0.75. Hence, the threshold values were chosen by taking the mean between the proportions of correct responses of 0.73 and 0.75.

The threshold values of the distance at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.1. The results show that the blind participants could detect the object at farther distances than the sighted.

                  Threshold (cm)
                  5 ms               50 ms              500 ms
    Room          blind   sighted    blind   sighted    blind   sighted
    Anechoic      150     130        166     160        172     166
    Conference    158     121        176     147        247     207
    Lecture       -a      -          -       -          -       -

    a Non parametric psychometric fit needs at least 3 stimulus values.

Table 5.1: Detection thresholds of object distance (cm) for each duration, room, and listener group. The threshold values were calculated from the psychometric functions of the blind and sighted participants' responses at the mean proportion of correct responses of 0.73 to 0.75.

5.2.2 Loudness

                  Threshold (sones)
                  5 ms               50 ms              500 ms
    Room          blind   sighted    blind   sighted    blind   sighted
    Anechoic      16.8    17.5       43.7    45.1       52.9    53.2
    Conference    22.6    24.1       49.4    53.1       53.6    55.3
    Lecture       -a      -          -       -          -       -

    a Non parametric psychometric fit needs at least 3 stimulus values.

Table 5.2: Threshold values of loudness (sones) for each duration, room, and listener group. The threshold values were calculated from the psychometric functions of the blind and sighted subjects' responses at the mean proportion of correct responses of 0.73 to 0.75.

The threshold values of the loudness at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.2. The tabulated data show that the blind subjects' loudness threshold was low compared with that of the sighted: roughly 1 sone lower in the anechoic chamber and 2 sones lower in the conference room. As the same loudness model was used for both the sighted and the blind, it is concluded that the lower threshold of the blind is due to their perceptual ability. This is further discussed in Chapter 6.

5.2.3 Pitch

The threshold values of the pitch strength at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.3. The threshold varied between blind and sighted across signal durations and room conditions.
One explanation for this variation in the threshold is that for shorter duration signals the participant is more likely to miss the pitch information. Hence it is assumed that the performance (proportion of correct responses) of the participants with the 5 and 50 ms signals is based not only on the pitch strength but also on the attention of the participant. Due to the influence of this attention factor on the perceptual results, the thresholds obtained from the 5 and 50 ms duration signals in Table 5.3 cannot be used.

Schenkman and Nilsson (2011) showed in their study that when pitch information was present in the stimuli the participants' performance was almost 100 percent. The 500 ms recordings with the object at 50 and 100 cm in Experiment 1 had almost 100 percent correct responses for both the blind and the sighted (cf. Schenkman and Nilsson, 2010). Therefore, in this condition it is assumed that the perceptual results depict the performance of the participants based solely on the pitch information, and the attention factor, i.e. the participant missing the pitch information, can be neglected. Hence, the thresholds obtained from the 500 ms duration signals in Table 5.3 were used to find the pitch strength thresholds for the blind and sighted.

                  Threshold (autocorrelation index)
                  5 ms               50 ms              500 ms
    Room          blind   sighted    blind   sighted    blind   sighted
    Anechoic      0.77    0.88       0.80    0.96       1.10    1.23
    Conference    1.54    2.21       1.07    1.69       1.14    1.41
    Lecture       -a      -          -       -          -       -

    a Non parametric psychometric fit needs at least 3 stimulus values.

Table 5.3: Threshold values of the pitch strength (autocorrelation index) for each duration, room, and listener group, calculated from the psychometric functions of the blind and sighted participants' responses at the mean proportion of correct responses of 0.73 to 0.75.

If it is assumed that the auditory system analyses the pitch information absolutely, i.e. it does not compare the peak heights in the ACF between the recordings (when these are presented in a two alternative forced choice manner), then the results indicate that the absolute threshold for detecting the pitch based on the autocorrelation theory should be greater than 1.10 and 1.23 (autocorrelation index) for the blind and the sighted, respectively. On the other hand, if it is assumed that the auditory system analyses the pitch information relatively, i.e. it compares the peak heights in the ACF between the recordings, then the results indicate that the relative threshold for detecting the pitch based on the autocorrelation theory should be greater than 0.36 and 0.49 (autocorrelation index) for the blind and the sighted, respectively.
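The relative values appear to follow from subtracting the mean pitch strength of the corresponding recordings without the object from the absolute thresholds; a worked reading of the numbers, assuming the 500 ms anechoic no-object baseline of Table 4.6 (this derivation is our reading and is not stated explicitly above):

$$\mathrm{PS}_{\mathrm{rel}} = \mathrm{PS}_{\mathrm{abs}} - \overline{\mathrm{PS}}_{\mathrm{no\,object}}$$
$$1.10 - \tfrac{0.71 + 0.78}{2} \approx 0.36 \ \text{(blind)}, \qquad 1.23 - \tfrac{0.71 + 0.78}{2} \approx 0.49 \ \text{(sighted)}$$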
5.2.4 Sharpness

The threshold values of the sharpness at which the blind and the sighted would be able to detect the object using echolocation are tabulated in Table 5.4. The tabulated data show that the sharpness thresholds of the blind and the sighted participants were almost the same. However, unlike loudness and pitch strength, the sharpness does not always have to be greater in value for the participants to detect the object. For example, in Experiment 1 and Experiment 2 the participants were presented with two stimuli, one with the object and one without, in a two alternative forced choice manner. A participant distinguishes the recording with the object from the recording without the object by identifying the one with the higher loudness level or pitch strength. However, if a participant uses sharpness to distinguish the recordings, the recording with the object need not have the higher sharpness value. It could be that the recording with the object is duller (has a lower sharpness value) than the recording without the object, and the participant may use this information to identify the object.

                  Threshold (acums)
                  5 ms               50 ms              500 ms
    Room          blind   sighted    blind   sighted    blind   sighted
    Anechoic      1.97    1.98       1.96    1.98       1.94    1.96
    Conference    2.01    2.03       1.94    1.94       1.97    1.97
    Lecture       -a      -          -       -          -       -

    a Non parametric psychometric fit needs at least 3 stimulus values.

Table 5.4: Threshold values of the mean over versions of the median sharpness (acums) for each duration, room, and listener group, calculated from the psychometric functions of the blind and sighted subjects' responses at the mean proportion of correct responses of 0.73 to 0.75.

A detailed discussion of whether or not the sharpness information is useful for the participants to echolocate is presented in Chapter 6.

Chapter 6

Discussion

As stated in the introduction, one recent focus of human echolocation research is to find the causes of the variability in echolocation ability between the blind and the sighted. Although it is expected that the combination of neuroimaging and psychoacoustic methods can give some insight into the high echolocating ability of the blind, these methods do not reveal which information in the acoustic stimulus determines it (at least when the information is not known) nor how this information is represented in the human auditory system. The auditory models for human echolocation were implemented mainly to solve this issue: to find the important information that causes the variability in echolocation ability between the blind and the sighted, and to examine how this information might be represented in the human auditory system.

Initially, signal analysis was performed and presented in Chapter 3 to find the physical information that is useful for echolocation and to analyze the influence of the room acoustics on human echolocation. Sound pressure level, autocorrelation and spectral centroid analyses were performed on the recordings, and the results demonstrate that the acoustics of the room does affect the stimuli and thereby the physical attributes that depend on it. However, as the information represented in the auditory system is complex, auditory models available in the literature were used to study how the perceptual attributes corresponding to sound pressure level, autocorrelation and spectral centroid are represented in the auditory system. The results suggest that repetition pitch, loudness and sharpness provide potential information for the listeners to echolocate at distances below 200 cm. The results also show that at longer distances sharpness information may influence human echolocation. A detailed discussion of how loudness, pitch and sharpness are essential for human echolocation and how they might be represented in the auditory system is presented in sections 6.1, 6.2 and 6.3. A discussion of how the room acoustics and binaural information affect human echolocation is presented in sections 6.4 and 6.5, followed by a discussion of the advantages of using auditory models in understanding human echolocation and the theoretical implications of the thesis in sections 6.6 and 6.7, respectively.
6.1 Echolocation and loudness

The loudness model of Glasberg and Moore (2002) was used in our analysis as it gives a good fit to the equal loudness contours in ISO 2006. The results of the model were compared with the proportion of correct responses of the listeners. The model results are tabulated in Tables 4.1 to 4.3 of Chapter 4, and a comparison of these with the participants' perceptual responses is shown in Table 5.2 of Chapter 5.

The differences in loudness between the loudness threshold of the sighted and the loudness of the recording without the object for the 5, 50 and 500 ms duration signals were approximately 4.2, 5 and 5 sones in the anechoic room and 5, 8 and 3 sones in the conference room, respectively (cf. Tables 4.1 to 4.3 and Table 5.2). This difference in loudness is sufficient to be used by the participants to echolocate, which shows that loudness is one potential source of information for echolocation. When comparing the loudness thresholds of the sighted and the blind, the threshold of the blind was lower (cf. Table 5.2). As the same model was used for both groups, it is not obvious what causes this perceptual difference. However, if it is assumed that the loudness information is encoded in the same manner for both the blind and the sighted, then the results show that the blind can echolocate at lower loudness levels than the sighted.

6.2 Echolocation and pitch

Repetition pitch is one of the important sources of information that listeners use to detect an object at shorter distances. However, it is not clear how this information is represented in the auditory system. To find out how the repetition pitch is perceived in the auditory system, a dual profile analysis was performed in section 4.2.2.1 of Chapter 4. The results suggested that the repetition pitch can be explained by the peaks in the temporal profile rather than the peaks in the spectral profile of the autocorrelation function. This is in agreement with the study of Yost (1996), in which the peaks in the temporal domain of the autocorrelation were the basis for explaining the perception of repetition pitch. However, the dual profile analysis was not sufficient to find the strength of the perceived pitch, as the peaks were rather randomly distributed in the temporal profile of the autocorrelation function. A pitch strength measure was used to solve this problem (cf. equation 4.10). The results are tabulated in Tables 4.4 to 4.6 of Chapter 4 and Table 5.3 of Chapter 5.

The pitch strength results show that there is a threshold of about 1 (autocorrelation index) for the participants to detect the pitch from the peak heights in the temporal profile of the autocorrelation function. Regarding the pitch strength thresholds of the sighted and the blind, the threshold of the blind was lower. As the auditory models were used with unchanged parameters in the analysis, it is not evident what determines this perceptual difference. In this thesis it is assumed that the pitch information is encoded in the same manner for both the blind and the sighted. In light of this assumption, the results show that the blind, compared with the sighted, can echolocate efficiently by using pitch information of lower pitch strength.

6.3 Echolocation and sharpness

Sharpness is a measure of the extent to which a sound is perceived as varying from dull to sharp.
To find out how the sharpness information is useful for the participants to echolocate, the weighted centroid of the specific loudness was computed using code from Psysound3. Pedrielli, Carletti, and Casazza (2008) showed in their analysis that the just noticeable difference for sharpness is 0.04 acum. The tabulated results of our analysis (cf. Tables 4.7 to 4.9) show that the difference in sharpness was greater than 0.04 acum for the recordings with the object at 50, 100, 150 and 200 cm. However, at these distances the loudness or pitch information is more prominent. Hence, at these distances sharpness might not be the major source of information for the participants to echolocate, although this remains to be verified.

Interestingly, for the 500 ms recordings in the anechoic chamber with the object at 400 cm and 500 cm, the sharpness difference relative to the recording without the object was greater than approximately 0.04 acum (cf. Table B.13), i.e. greater than the just noticeable difference for sharpness reported by Pedrielli, Carletti, and Casazza (2008). Hence, this may be the vital information that the participants used to detect the object at 400 cm in Experiment 1. Performing a further experiment that controls the sharpness information of the stimuli might give more insight into how this attribute of sound is helpful for echolocation.

6.4 Echolocation and room acoustics

Loudness, pitch and sharpness provide the participants with useful information to echolocate. These attributes depend on the physiology of the auditory system, but they also depend on the acoustics of the room and on the type of stimuli used. The results for the recordings of Experiment 1 and Experiment 2 illustrate this. For example, the conference room of Experiment 1 enhanced the pitch strength and hence enabled the participants to echolocate at farther distances, whereas the lecture room in Experiment 2 diminished the pitch strength, so the participants had to rely on other information, such as loudness, to echolocate in that room, causing a deterioration in their performance. One cause of this deterioration may be the difference in the recording setups of Experiment 2 and Experiment 1, i.e. the loudspeaker was on the chest of the artificial head in Experiment 1 but behind the artificial head in Experiment 2. Another cause might be the room acoustics itself, i.e. the reverberation time was 0.4 s in the Experiment 1 conference room and 0.6 s in the Experiment 2 lecture room. Another example of the influence of room acoustics on echolocation is the recordings in the anechoic room from Experiment 1. The recordings with the object at 400 cm and 500 cm had no reflections from the room other than those from the object. This may be the cause of the slight sharpness difference, which might be favorable for the participants in detecting the object. These results show that by careful design of the room acoustics one can improve the echolocation ability of listeners in that environment.

6.5 Echolocation and binaural information

Binaural information may provide additional information for the participants to echolocate. As mentioned in Chapter 3, past studies show that interaural level differences and interaural time differences provide information for echolocating. For example, in the study of Papadopoulos et al.
(2011), the information for obstacle discrimination was found in the frequency dependent interaural level differences (ILD), especially in the range from 5.5 to 6.5 kHz. Recently, in the study of Nilsson and Schenkman (2015), it was found that blind people used the ILD more efficiently than the sighted. As the recordings of Experiment 1 and Experiment 2 were static, binaural information was not considered in this thesis. The static nature of the recordings might be one cause of the lower performance of the participants in echolocating. However, in a real situation blind persons would use their own sounds and would also be moving their heads and bodies. It is reasonable to conclude that such sounds offer more information to the blind.

6.6 Advantages and disadvantages of the auditory model approach to human echolocation

The research done to understand human echolocation has mostly used psychoacoustic experiments, where a physical stimulus is presented to the participants in a controlled manner. This helps the researcher to identify the underlying cause of the participants' echolocation. However, in some cases, although the stimuli are presented in a controlled manner, the underlying cause of the echolocation is not evident. This is the case with the experiments of Schenkman and Nilsson (2010), where the blind participants performed better than the sighted but the underlying cause of the high performance could not be determined. As discussed in the introduction of this thesis, scanning the participants' brains using functional magnetic resonance imaging and locating which areas of the brain are activated when a participant detects an object can help the researcher to understand whether physiological adaptation is the cause of the high echolocation ability of the blind. However, one disadvantage of such an analysis is that it does not fully reveal how the information necessary for the high echolocation ability is represented in the auditory system. To address this problem, the binaural loudness model of Moore and Glasberg (2007), the auditory image model of Patterson, Allerhand, and Giguere (1995) and the sharpness model of Fastl and Zwicker (2007) were implemented in this thesis.

The reason for choosing the loudness model of Moore and Glasberg (2007) was that it agrees well with the equal loudness contours of ISO 2006 and also gives an accurate representation of binaural loudness (Moore, 2014). One reason for choosing the auditory image model is that, instead of using two different modules to depict frequency selectivity and compression, it uses a dynamic compressive gammachirp filterbank (dcGC) module to depict both the frequency selectivity and the compression of the basilar membrane. The analysis performed using the AIM showed that the peaks in the temporal information are the source of the repetition pitch perception. The sharpness analysis performed using the sharpness model showed that the blind participants might use this attribute to detect objects at longer distances and that both temporal and spectral information are required to encode this attribute. The results suggest that the auditory models do explain how the information necessary for the high echolocation ability of the blind is represented in the auditory system. In order to know whether the high echolocation ability is due to physiological differences or not, one should vary the parameters of the models such that the model results fit the participants' perceptual results.
This was not considered in this thesis; instead, the assumption was made that the high echolocation ability is due to high perceptual ability. In light of the advantages and disadvantages mentioned above, it would be most efficient for a researcher to use psychoacoustic experiments, neuroimaging and auditory model analysis in conjunction with signal analysis to understand human echolocation.

6.7 Theoretical implications of the thesis

The signal analysis performed on the physical stimuli showed how the sound pressure level, autocorrelation and spectral centroid vary across the recordings. Hence, signal analysis is a vital tool for finding the physical information that is necessary for human hearing. Furthermore, as the auditory models were developed on the basis of research in the physiology and psychology of the human auditory system, they depict human hearing. The auditory analysis done on the recordings of Experiment 1 and Experiment 2 agrees with the study of Yost (1996) in that the information necessary for pitch perception is represented temporally in the auditory system. Assuming that one cause of the high echolocation ability is perceptual, the subjective thresholds for the blind and the sighted participants were obtained by comparing the auditory model results with the perceptual results of the blind and the sighted participants. The results indicate that the blind participants have lower detection thresholds and hence are better than the sighted at echolocating.

Regarding the implications of the thesis for human echolocation, the auditory analysis confirmed that repetition pitch and loudness are important sources of information for listeners echolocating at shorter distances, in agreement with the results of Schenkman and Nilsson (2010, 2011) and Kolarik, Cirstea, Pardhan, and Moore (2014). Sharpness information was also analyzed, and it was found that it can be important both at short and at long distances. No previous research in human echolocation has investigated the usefulness of sharpness for human echolocation. Performing psychoacoustic experiments might give further insight into the usefulness of timbre qualities such as sharpness for echolocation.

Chapter 7

General Conclusion

7.1 Conclusions

The aim in implementing the auditory models for human echolocation was to find the information that determines high echolocation ability and how this information is represented in the auditory system. As for the information necessary for high echolocation ability, three subjective attributes known to be of importance were considered in this thesis, namely loudness, pitch and sharpness. To study how these subjective attributes are represented in the human auditory system, a number of auditory models were used. To analyze how loudness is useful for echolocation, the binaural loudness model of Moore and Glasberg (2007) was used, as it gives a good fit to the equal loudness contours in ISO 2006 (Moore, 2014). The auditory image model of Bleeck, Ives, and Patterson (2004b) was used to analyze the repetition pitch phenomenon, which is known to be useful for echolocation at shorter distances. One reason for using the auditory image model for the repetition pitch analysis was its dynamic compressive gammachirp filterbank, which is physiologically inspired and depicts the frequency selectivity and compression of the basilar membrane.
Finally, to analyze sharpness, the loudness model of Glasberg and Moore (2002) was used and the sharpness information was obtained from the weighted centroid of the specific loudness (Fastl and Zwicker, 2007). The analysis showed that at shorter distances repetition pitch, loudness and sharpness provide the information for the participants to echolocate. At longer distances, sharpness information might be used by the subjects to echolocate. This conclusion has to be verified by performing a further experiment with control over the sharpness attribute of the stimuli. Regarding how the useful information for human echolocation might be represented in the auditory system, the analysis confirmed that the repetition pitch is represented by the peaks in the temporal profile rather than the spectral profile (Yost, 1996), and, as the sharpness information is computed using the centroid of the specific loudness, it is represented using both spectral and temporal information.

Although the auditory analyses in this thesis were done using different auditory models for the loudness, pitch and sharpness attributes, the same model was used when comparing the perceptual results of the blind and the sighted (e.g. the same loudness model was used for both groups). Hence, it is assumed in this thesis that the high echolocation ability of the blind is due to their perceptual ability, and it was therefore justified to compute the thresholds for the blind and the sighted in the same way. The analysis showed that the blind had lower thresholds than the sighted and could echolocate at lower loudness and pitch strength levels. It is to be noted that the recordings in Experiment 1 and Experiment 2 were made at static positions. In real life situations the listeners would be using their own sounds, and both the listener and the reflecting object may be moving. The thresholds would probably be even lower in such situations.

In conclusion, the thesis has shown the importance of understanding the roles of pitch, loudness and timbre for human echolocation. The specific roles and interactions of these three aspects have to be studied in more detail. Especially the role of timbre is a topic worthy of deeper understanding.

7.2 Future work

In this thesis it was assumed that the information is represented in a similar way for both the blind and the sighted. However, this presupposition may not be true, i.e. the high echolocation ability of the blind may be due to physiological differences. To investigate this as part of future work, the parameters of the auditory models should be varied and the results analyzed in parallel with neuroimaging, psychoacoustic experiments and various methods of signal analysis. Neuroimaging may help to identify whether the high echolocation ability is related to the listeners' physiology. Once it is established that the underlying ability of the listeners is physiological, the parameters of the auditory models can be varied until the results from the auditory models agree with the psychoacoustic results. In this way neuroimaging, psychoacoustic experiments, auditory models and signal analysis together may help us to understand how the information necessary for the high ability of the blind is represented and perceived.
Bibliography

ANSI, 1994 "American national standard acoustical terminology, ANSI S1.1-1994" American National Standards Institute, New York
Arias C, Ramos O A, 1997 "Psychoacoustic tests for the study of human echolocation ability" Applied Acoustics 51 399-419
ASA, 1960 "Acoustical terminology SI, 1-1960" American Standards Association, New York
ASA, 1973 "American national psychoacoustical terminology, S3.20-1973" American Standards Association, New York
Bassett I G, Eastmond E J, 1964 "Echolocation: Measurement of pitch versus distance for sounds reflected from a flat surface" The Journal of the Acoustical Society of America 36 911
Bilsen F, 1966 "Repetition pitch: monaural interaction of a sound with the repetition of the same, but phase shifted sound" Acustica 17 295-300
Bilsen F, Ritsma R, 1969 "Repetition pitch and its implication for hearing theory" Acustica 22 63-73
Bleeck S, 2011 "aim-mat" [Online; accessed 25 April 2016] URL https://code.soundsoftware.ac.uk/projects/aimmat
Bleeck S, Ives T, Patterson R D, 2004a "aim-mat" [Online; accessed 25 April 2016] URL http://w3.pdn.cam.ac.uk/groups/cnbh/aimmanual/download/downloadframeset.htm
Bleeck S, Ives T, Patterson R D, 2004b "aim-mat: the auditory image model in MATLAB" Acta Acustica united with Acustica 90 781-787
Cabrera D, 2014 "Psysound3" [Online; accessed 25 April 2016] URL http://www.psysound.org
Cabrera D, Ferguson S, Schubert E, 2007 "'Psysound3': Software for acoustical and psychoacoustical analysis of sound recordings" in "Proceedings of the 13th International Conference on Auditory Display (ICAD 2007)", pp. 356-363
Cotzin M, Dallenbach K M, 1950 "'Facial vision': The rôle of pitch and loudness in the perception of obstacles by the blind" The American Journal of Psychology 63 485-515
Supa M, Cotzin M, Dallenbach K M, 1944 "'Facial vision': The perception of obstacles by the blind" The American Journal of Psychology 57 133-183
De Boer E, 1956 On the "residue" in hearing, Ph.D. thesis, Uitgeverij Excelsior
De Cheveigné A, 2010 "Pitch perception" in C J Plack, ed., "Oxford Handbook of Auditory Science: Auditory Perception", pp. 71-104 (Oxford University Press, Oxford)
Dufour A, Després O, Candas V, 2005 "Enhanced sensitivity to echo cues in blind subjects" Experimental Brain Research 165 515-519
Fastl H, Zwicker E, 2007 Psychoacoustics: Facts and Models, volume 22 (Springer Science & Business Media, Berlin)
Glasberg B R, Moore B C, 2002 "A model of loudness applicable to time-varying sounds" Journal of the Audio Engineering Society 50 331-342
Goldstein J L, 1973 "An optimum processor theory for the central formation of the pitch of complex tones" The Journal of the Acoustical Society of America 54 1496
Irino T, Patterson R D, 1997 "A time-domain, level-dependent auditory filter: The gammachirp" The Journal of the Acoustical Society of America 101 412-419
Irino T, Patterson R D, 2006 "A dynamic compressive gammachirp auditory filterbank" IEEE Transactions on Audio, Speech, and Language Processing 14 2222-2232
Kellogg W N, 1962 "Sonar system of the blind: New research measures their accuracy in detecting the texture, size, and distance of objects by ear" Science 137 399-404
Köhler I, 1964 "Orientation by aural clues" American Foundation for the Blind Research Bulletin 4 14-53
Kolarik A J, Cirstea S, Pardhan S, 2013 "Evidence for enhanced discrimination of virtual auditory distance among blind listeners using level and direct-to-reverberant cues" Experimental Brain Research 224 623-633
Kolarik A J, Cirstea S, Pardhan S, Moore B C, 2014 "A summary of research investigating echolocation abilities of blind and sighted humans" Hearing Research 310 60-68
Licklider J C, 1951 "A duplex theory of pitch perception" Cellular and Molecular Life Sciences 7 128-134
Miura T, Ueda K, Muraoka T, Ino S, Ifukube T, 2008 "Object's width and distance distinguished by the blind using auditory sense while they are walking" Journal of the Acoustical Society of America 123 3859
Moore B C, 2013 An Introduction to the Psychology of Hearing, 6th edition (Academic Press, San Diego)
Moore B C, 2014 "Development and current status of the Cambridge loudness models" Trends in Hearing 18 1-29
Moore B C, Glasberg B R, 2007 "Modeling binaural loudness" The Journal of the Acoustical Society of America 121 1604-1612
Nilsson M E, Schenkman B N, 2015 "Blind people are more sensitive than sighted people to binaural sound-location cues, particularly inter-aural level differences" Hearing Research
Papadopoulos T, Edwards D S, Rowan D, Allen R, 2011 "Identification of auditory cues utilized in human echolocation: objective measurement results" Biomedical Signal Processing and Control 6 280-290
Patterson R D, Allerhand M H, Giguere C, 1995 "Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform" The Journal of the Acoustical Society of America 98 1890
Patterson R D, Handel S, Yost W A, Datta A J, 1996 "The relative strength of the tone and noise components in iterated rippled noise" The Journal of the Acoustical Society of America 100 3286
Patterson R D, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M, 1992 "Complex sounds and auditory images" Auditory Physiology and Perception 83 429-446
Patterson R D, Unoki M, Irino T, 2003 "Extending the domain of center frequencies for the compressive gammachirp auditory filter" The Journal of the Acoustical Society of America 114 1529-1542
Pedrielli F, Carletti E, Casazza C, 2008 "Just noticeable differences of loudness and sharpness for earth moving machines" in "Proceedings of the Acoustics '08 Conference, Paris", pp. 2205-2210
Peeters G, Giordano B L, Susini P, Misdariis N, McAdams S, 2011 "The timbre toolbox: Extracting audio descriptors from musical signals" The Journal of the Acoustical Society of America 130 2902-2916
Pelegrin Garcia D, Roozen B, Glorieux C, 2013 "Calculation of human echolocation cues by means of the boundary element method" in "Proceedings of the 19th International Conference on Auditory Display (ICAD 2013)", pp. 253-259
Rice C E, Feinstein S H, Schusterman R J, 1965 "Echo-detection ability of the blind: Size and distance factors" Journal of Experimental Psychology 70 246-251
Rojas J A M, Hermosilla J A, Montero R S, Espí P L L, 2009 "Physical analysis of several organic signals for human echolocation: oral vacuum pulses" Acta Acustica united with Acustica 95 325-330
Rojas J A M, Hermosilla J A, Montero R S, Espí P L L, 2010 "Physical analysis of several organic signals for human echolocation: hand and finger produced pulses" Acta Acustica united with Acustica 96 1069-1077
Rowan D, Papadopoulos T, Edwards D, Holmes H, Hollingdale A, Evans L, Allen R, 2013 "Identification of the lateral position of a virtual object based on echoes by humans" Hearing Research
Schenkman B, 1985 Human Echolocation: The Detection of Objects by the Blind, Ph.D. thesis, Uppsala University
Schenkman B, Nilsson M E, Grbic N, 2011 "Human echolocation using click trains and continuous noise" in "Fechner Day 2011: Proceedings of the 27th Annual Meeting of the International Society for Psychophysics", pp. 13-18
Schenkman B N, Nilsson M E, 2010 "Human echolocation: Blind and sighted persons' ability to detect sounds recorded in the presence of a reflecting object" Perception 39 483
Schenkman B N, Nilsson M E, 2011 "Human echolocation: Pitch versus loudness information" Perception 40 840
Schnupp J, Nelken I, King A, 2011 Auditory Neuroscience (The MIT Press, Cambridge, Massachusetts)
Seki Y, Ifukube T, Tanaka Y, 1994 "Relation between the reflected sound localization and the obstacle sense of the blind" Journal of the Acoustical Society of Japan 50 289-295
Teng S, Puri A, Whitney D, 2012 "Ultrafine spatial acuity of blind expert human echolocators" Experimental Brain Research 216 483-488
Teng S, Whitney D, 2011 "The acuity of echolocation: Spatial resolution in the sighted compared to expert performance" Journal of Visual Impairment & Blindness 105 20
Terhardt E, 1974 "Pitch, consonance, and harmony" The Journal of the Acoustical Society of America 55 1061
Thaler L, Arnott S R, Goodale M A, 2011 "Neural correlates of natural human echolocation in early and late blind echolocation experts" PLoS One 6 e20162
Thaler L, Milne J L, Arnott S R, Kish D, Goodale M A, 2014 "Neural correlates of motion processing through echolocation, source hearing, and vision in blind echolocation experts and sighted echolocation novices" Journal of Neurophysiology 111 112-127
Vestergaard M, Bleeck S, Patterson R, 2011 "AIM2006 documentation" [Online; accessed 24 April 2016] URL http://www.acousticscale.org/wiki/index.php/AIM2006_Documentation
Wallmeier L, Geßele N, Wiegrebe L, 2013 "Echolocation versus echo suppression in humans" Proceedings of the Royal Society B: Biological Sciences 280
Wightman F L, 1973 "The pattern-transformation model of pitch" The Journal of the Acoustical Society of America 54 407
Yost W, 2007 Fundamentals of Hearing: An Introduction (Elsevier Academic Press, San Diego)
Yost W A, 1996 "Pitch strength of iterated rippled noise" The Journal of the Acoustical Society of America 100 3329
Zychaluk K, Foster D H, 2009 "Model-free estimation of the psychometric function" Attention, Perception, & Psychophysics 71 1414-1425

Appendix A

Room acoustics

A.1 Calibration Constant

The reference sound pressure levels (SPL) used to calculate the calibration constants in the anechoic, conference and lecture rooms were documented in dB(A), i.e. 77, 79 and 79 dB(A) respectively (Schenkman and Nilsson, 2010; Schenkman, Nilsson, and Grbic, 2011).
Hence, to calculate the calibration constant the recordings should be A weighted. However, at the time of documentation it was found that, instead of equation A.2, equation A.1 had been used to find the calibration constant (CC):

$$CC = 10^{\left(SPL - 20\log_{10}\left(\mathrm{rms(signal)}/(20\times10^{-6})\right)\right)/20} \quad (A.1)$$

$$CC = 10^{\left(SPL - 20\log_{10}\left(\mathrm{rms(Aweighting(signal))}/(20\times10^{-6})\right)\right)/20} \quad (A.2)$$

To find the difference between equation A.1 and equation A.2, the calibrated levels with and without A weighting were calculated for the 9th version of the left ear, 500 ms, no object first recording in the anechoic and conference rooms, and for the 9th version of the left ear, 500 ms, no object recording in the lecture room. The results are tabulated in Table A.1.

                           Anechoic   Conference   Lecture
    With A weighting       77.46      79.51        79.29
    Without A weighting    77.00      78.99        78.99

Table A.1: Calibrated levels with and without A weighting for the 9th version of the left ear, 500 ms, no object first recording in the anechoic and conference rooms, and for the 9th version of the left ear, 500 ms, no object recording in the lecture room.

The results suggest that finding the calibration constant from the A weighted signal would increase the calibrated level by less than approximately 0.5 dB, which is small. Hence, although equation A.1 was used instead of equation A.2, it was concluded that, since the difference between them is small, the calibration constants calculated from equation A.1 could be used to calibrate all the recordings in this thesis.
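A minimal Python sketch of these equations (a_weighting is a hypothetical placeholder for a real A-weighting filter, e.g. one designed according to the relevant standard; it is not implemented here):

    import numpy as np

    def a_weighting(signal):
        """Hypothetical placeholder: apply an A-weighting filter to the
        signal (the real analysis would use a proper filter design)."""
        raise NotImplementedError

    def calibration_constant(signal, spl_ref_db, a_weighted=False):
        """Calibration constant per equation A.1 (a_weighted=False) or
        equation A.2 (a_weighted=True): the factor that scales the
        recording so that its level matches the documented reference SPL.
        """
        x = a_weighting(signal) if a_weighted else np.asarray(signal, float)
        rms = np.sqrt(np.mean(x ** 2))
        level_db = 20.0 * np.log10(rms / 20e-6)   # level re 20 micropascal
        return 10.0 ** ((spl_ref_db - level_db) / 20.0)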
73.013 73.501 73.036 Ver10 70.779 71.127 70.735 Mean 72.437 72.911 72.593 Table A.4: SPL values (dBA) for 10 versions of the left ear recordings in the lecture chamber with 5ms duration signal NoObject1 Object100cm Object150cm Ver1 73.957 76.450 74.128 Ver2 71.567 72.820 71.594 Ver3 72.382 74.638 72.557 Ver4 73.078 75.320 73.243 Ver5 71.414 73.516 71.489 Ver6 72.774 74.667 72.912 Ver7 72.005 74.211 72.105 Ver8 75.163 77.810 75.257 Ver9 73.148 75.351 73.150 Ver10 70.995 72.393 70.977 Mean 72.648 74.718 72.741 Table A.5: SPL values (dBA) for 10 versions of the right ear recordings in the lecture chamber with 5ms duration signal APPENDIX A. ROOM ACOUSTICS NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm Ver1 80.438 80.413 89.311 80.223 80.279 80.340 80.379 80.443 Ver2 80.437 80.402 89.232 80.221 80.268 80.341 80.377 80.444 Ver3 80.433 80.396 89.175 80.215 80.256 80.335 80.376 80.432 62 Ver4 80.434 80.393 89.168 80.218 80.241 80.332 80.366 80.421 Ver5 80.434 80.385 89.169 80.214 80.234 80.334 80.364 80.409 Ver6 80.432 80.377 89.176 80.217 80.234 80.331 80.364 80.394 Ver7 80.429 80.354 89.181 80.215 80.231 80.328 80.368 80.392 Ver8 80.429 80.339 89.179 80.213 80.218 80.326 80.370 80.389 Ver9 80.432 80.340 89.171 80.213 80.203 80.329 80.363 80.385 Ver10 80.430 80.306 89.160 80.206 80.192 80.326 80.366 80.375 Mean 80.433 80.370 89.192 80.216 80.236 80.332 80.369 80.409 Table A.6: SPL values (dBA) for 10 versions of the left ear recordings in the conference chamber with 5ms duration signal NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm Ver1 79.779 79.687 89.084 79.883 79.771 79.853 79.789 79.725 Ver2 79.776 79.677 89.086 79.885 79.760 79.858 79.787 79.715 Ver3 79.772 79.670 89.068 79.880 79.757 79.853 79.789 79.693 Ver4 79.776 79.661 89.079 79.877 79.743 79.852 79.782 79.683 Ver5 79.774 79.655 89.058 79.869 79.740 79.853 79.780 79.680 Ver6 79.774 79.653 89.073 79.868 79.737 79.850 79.781 79.673 Ver7 79.773 79.631 89.060 79.866 79.734 79.849 79.781 79.664 Ver8 79.776 79.579 89.048 79.865 79.724 79.849 79.783 79.654 Ver9 79.775 79.497 89.038 79.862 79.708 79.856 79.779 79.650 Ver10 79.772 79.358 89.040 79.854 79.666 79.849 79.785 79.648 Mean 79.775 79.607 89.063 79.871 79.734 79.852 79.784 79.679 Table A.7: SPL values (dBA) for 10 versions of the right ear recordings in the conference chamber with 5ms duration signal NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm Ver1 77.496 77.955 85.654 82.085 77.411 77.236 77.536 77.311 Ver2 74.845 75.192 83.337 79.567 74.784 74.584 74.767 74.697 Ver3 76.418 76.839 84.884 81.857 76.340 76.198 76.372 76.241 Ver4 77.436 77.776 84.873 82.412 77.314 77.185 77.346 77.263 Ver5 77.466 77.915 85.147 82.157 77.447 77.275 80.044 77.306 Ver6 77.561 77.914 85.211 82.127 77.415 77.308 77.506 77.419 Ver7 77.275 77.704 84.687 81.104 77.237 77.054 77.164 77.094 Ver8 77.695 78.168 86.040 81.829 77.621 77.492 77.603 77.513 Ver9 77.418 77.798 85.682 82.018 77.316 77.160 77.331 77.252 Ver10 76.461 76.816 84.667 81.621 76.332 76.227 76.432 76.298 Mean 77.007 77.408 85.018 81.678 76.922 76.772 77.210 76.839 Table A.8: SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with 50ms duration signal NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm Ver1 78.199 77.715 88.589 82.631 78.421 78.518 78.441 78.362 Ver2 75.433 74.989 85.930 80.359 75.618 75.728 75.555 75.614 Ver3 77.092 76.604 
NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   78.199 77.715 88.589 82.631 78.421 78.518 78.441 78.362
Ver2   75.433 74.989 85.930 80.359 75.618 75.728 75.555 75.614
Ver3   77.092 76.604 87.915 81.934 77.311 77.428 77.281 77.258
Ver4   78.019 77.563 88.257 82.806 78.197 78.298 78.136 78.168
Ver5   78.220 77.720 88.238 82.870 78.427 78.538 80.726 78.392
Ver6   78.135 77.680 88.193 82.773 78.287 78.445 78.280 78.311
Ver7   77.960 77.480 87.764 82.212 78.166 78.236 78.085 78.104
Ver8   78.377 77.895 88.938 82.768 78.553 78.717 78.509 78.545
Ver9   78.062 77.580 88.614 82.569 78.246 78.368 78.195 78.235
Ver10  77.060 76.584 87.443 81.904 77.173 77.359 77.222 77.216
Mean   77.656 77.181 87.988 82.283 77.840 77.964 78.043 77.820
Table A.9: SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with 50ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   79.427 79.428 88.367 83.272 79.933 79.375 79.432 79.428
Ver2   79.429 79.427 88.375 83.258 79.939 79.366 79.428 79.427
Ver3   79.422 79.422 88.370 83.244 79.946 79.367 79.434 79.425
Ver4   79.423 79.421 88.373 83.252 79.960 79.367 79.430 79.425
Ver5   79.422 79.421 88.364 83.261 79.951 79.370 79.431 79.425
Ver6   79.415 79.419 88.348 83.276 79.950 79.368 79.425 79.423
Ver7   79.417 79.421 88.346 83.278 79.925 79.363 79.432 79.423
Ver8   79.417 79.420 88.346 83.278 79.953 79.366 79.425 79.416
Ver9   79.415 79.418 88.353 83.274 79.927 79.365 79.420 79.414
Ver10  79.414 79.418 88.347 83.282 79.942 79.360 79.424 79.418
Mean   79.420 79.422 88.359 83.267 79.943 79.367 79.428 79.422
Table A.10: SPL values (dBA) for 10 versions of the left ear recordings in the conference room with 50ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   79.307 79.315 88.257 82.959 80.217 79.422 79.374 79.283
Ver2   79.304 79.313 88.276 82.965 80.221 79.418 79.372 79.283
Ver3   79.302 79.311 88.254 82.958 80.228 79.414 79.371 79.279
Ver4   79.299 79.307 88.252 82.973 80.235 79.409 79.373 79.282
Ver5   79.302 79.307 88.239 82.972 80.228 79.411 79.376 79.279
Ver6   79.302 79.305 88.241 82.977 80.238 79.409 79.372 79.279
Ver7   79.297 79.304 88.240 82.968 80.217 79.413 79.374 79.280
Ver8   79.298 79.303 88.238 82.961 80.238 79.417 79.368 79.277
Ver9   79.295 79.305 88.251 82.953 80.221 79.412 79.366 79.274
Ver10  79.298 79.306 88.243 82.948 80.236 79.413 79.367 79.275
Mean   79.300 79.308 88.249 82.963 80.228 79.414 79.371 79.279
Table A.11: SPL values (dBA) for 10 versions of the right ear recordings in the conference room with 50ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   77.795 78.264 86.080 82.249 77.706 77.583 77.693 77.619
Ver2   77.517 77.965 85.668 82.085 77.501 77.329 77.423 77.346
Ver3   77.020 77.478 85.124 81.621 77.039 76.819 76.917 76.843
Ver4   76.485 76.922 84.469 81.355 76.416 76.272 76.387 76.329
Ver5   77.047 77.506 84.844 81.971 77.003 76.865 76.942 76.901
Ver6   77.280 77.712 84.995 82.227 77.217 77.058 77.187 77.124
Ver7   77.563 77.986 85.480 82.461 77.495 77.339 77.470 77.435
Ver8   76.939 77.356 85.089 81.811 76.847 76.700 76.848 76.772
Ver9   77.000 77.407 85.092 81.666 76.919 77.127 76.871 76.799
Ver10  76.889 77.327 84.976 81.329 76.829 76.661 76.775 76.704
Mean   77.153 77.592 85.182 81.877 77.097 76.975 77.051 76.987
Table A.12: SPL values (dBA) for 10 versions of the left ear recordings in the anechoic chamber with 500ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   78.535 78.039 88.979 83.138 78.685 78.853 78.661 78.699
Ver2   78.231 77.741 88.651 82.881 78.464 78.568 78.362 78.396
Ver3   77.789 77.284 88.126 82.332 78.007 78.109 77.915 77.946
Ver4   77.205 76.708 87.529 81.845 77.378 77.523 77.328 77.380
Ver5   77.802 77.306 87.905 82.590 77.984 78.140 77.924 77.989
Ver6   77.986 77.501 88.112 82.869 78.148 78.289 78.107 78.153
Ver7   78.233 77.758 88.485 83.048 78.413 78.540 78.356 78.436
Ver8   77.595 77.124 88.167 82.347 77.760 77.902 77.718 77.759
Ver9   77.693 77.190 88.160 82.389 77.861 78.307 77.791 77.837
Ver10  77.588 77.087 88.042 82.066 77.737 77.883 77.696 77.735
Mean   77.866 77.374 88.216 82.550 78.044 78.211 77.986 78.033
Table A.13: SPL values (dBA) for 10 versions of the right ear recordings in the anechoic chamber with 500ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   79.009 78.988 87.550 82.818 79.587 78.941 79.019 79.011
Ver2   79.007 78.988 87.539 82.863 79.625 78.931 79.025 79.003
Ver3   79.004 78.988 87.525 82.836 79.588 78.919 79.020 79.008
Ver4   79.006 78.990 87.538 82.825 79.556 78.936 79.012 79.003
Ver5   78.998 78.996 87.537 82.805 79.659 78.928 79.006 79.006
Ver6   79.002 78.997 87.546 82.824 79.598 78.917 79.016 79.014
Ver7   79.007 78.998 87.548 82.807 79.564 78.916 79.015 79.014
Ver8   79.002 78.992 87.529 82.822 79.602 78.925 79.011 79.005
Ver9   79.000 78.995 87.542 82.828 79.640 78.924 79.015 79.014
Ver10  78.998 78.998 87.533 82.843 79.559 78.926 79.019 79.012
Mean   79.003 78.993 87.539 82.827 79.598 78.926 79.016 79.009
Table A.14: SPL values (dBA) for 10 versions of the left ear recordings in the conference room with 500ms duration signal.

NoObject1 NoObject2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   78.812 78.829 87.501 82.340 79.478 78.914 78.858 78.801
Ver2   78.812 78.825 87.471 82.394 79.493 78.902 78.858 78.796
Ver3   78.814 78.823 87.418 82.407 79.477 78.892 78.859 78.799
Ver4   78.817 78.826 87.452 82.384 79.443 78.904 78.856 78.797
Ver5   78.814 78.827 87.490 82.335 79.520 78.901 78.853 78.801
Ver6   78.819 78.828 87.436 82.381 79.473 78.889 78.865 78.797
Ver7   78.823 78.822 87.426 82.361 79.460 78.893 78.864 78.797
Ver8   78.816 78.817 87.483 82.378 79.483 78.898 78.859 78.794
Ver9   78.818 78.821 87.469 82.391 79.523 78.892 78.862 78.799
Ver10  78.820 78.821 87.425 82.393 79.463 78.894 78.868 78.801
Mean   78.817 78.824 87.457 82.377 79.481 78.898 78.860 78.798
Table A.15: SPL values (dBA) for 10 versions of the right ear recordings in the conference room with 500ms duration signal.

NoObject1 Object100cm Object150cm
Ver1   79.719 80.188 79.968
Ver2   78.588 79.106 78.812
Ver3   78.855 79.205 79.156
Ver4   78.867 79.239 79.225
Ver5   79.097 79.553 79.350
Ver6   79.445 79.852 79.719
Ver7   78.255 78.630 78.468
Ver8   79.791 80.284 80.028
Ver9   80.036 80.431 80.209
Ver10  79.000 79.453 79.187
Mean   79.165 79.594 79.412
Table A.16: SPL values (dBA) for 10 versions of the left ear recordings in the lecture room with 500ms duration signal.

NoObject1 Object100cm Object150cm
Ver1   80.164 82.120 80.222
Ver2   79.051 81.088 79.149
Ver3   79.307 81.139 79.444
Ver4   79.224 81.159 79.327
Ver5   79.490 81.377 79.577
Ver6   79.854 81.852 79.949
Ver7   78.719 80.516 78.818
Ver8   80.197 82.339 80.321
Ver9   80.346 82.393 80.461
Ver10  79.417 81.466 79.543
Mean   79.577 81.545 79.681
Table A.17: SPL values (dBA) for 10 versions of the right ear recordings in the lecture room with 500ms duration signal.
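For concreteness, the calibration step of equations A.1 and A.2 and the calibrated A-weighted SPL values tabulated above can be sketched in Python as follows. This is a minimal sketch, not the tool chain actually used for the thesis: the function names are ours, spl_meter stands for the reference SPL term in equations A.1/A.2, and the A-weighting filter is the standard analogue prototype mapped to discrete time with a bilinear transform.

```python
import numpy as np
from scipy.signal import bilinear, lfilter

P_REF = 20e-6  # reference pressure in Pa

def a_weighting(fs):
    """Digital A-weighting filter from the standard analogue prototype
    (pole frequencies per IEC 61672), via a bilinear transform."""
    f1, f2, f3, f4 = 20.598997, 107.65265, 737.86223, 12194.217
    A1000 = 1.9997  # normalisation so that the gain at 1 kHz is 0 dB
    num = [(2 * np.pi * f4) ** 2 * 10 ** (A1000 / 20), 0, 0, 0, 0]
    den = np.polymul([1, 4 * np.pi * f4, (2 * np.pi * f4) ** 2],
                     [1, 4 * np.pi * f1, (2 * np.pi * f1) ** 2])
    den = np.polymul(np.polymul(den, [1, 2 * np.pi * f3]),
                     [1, 2 * np.pi * f2])
    return bilinear(num, den, fs)

def calibration_constant(signal, spl_meter, fs, a_weight=False):
    """Equation A.1 (a_weight=False) or A.2 (a_weight=True): the scale
    factor that maps the recorded waveform onto the reference SPL."""
    x = lfilter(*a_weighting(fs), signal) if a_weight else signal
    rms = np.sqrt(np.mean(x ** 2))
    return 10 ** ((spl_meter - 20 * np.log10(rms / P_REF)) / 20)

def spl_dba(signal, cc, fs):
    """A-weighted SPL (dBA) of a recording scaled by the calibration
    constant cc, as tabulated in Tables A.2-A.17."""
    x = lfilter(*a_weighting(fs), cc * signal)
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) / P_REF)
```

Running calibration_constant with and without a_weight on the same recording reproduces the kind of small (below 0.5 dB) level difference reported in Table A.1.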
A.3 Spectral Centroid

[Figures A.1-A.16 each plotted the spectral centroid (Hz) as a function of time (sec), with one subplot per recording condition; only the captions are reproduced below.]

Figure A.1: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the anechoic chamber (Experiment 1).

Figure A.2: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the anechoic chamber (Experiment 1).

Figure A.3: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the conference room (Experiment 1).

Figure A.4: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the conference room (Experiment 1).

Figure A.5: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 5ms recording in the lecture room (Experiment 2).

Figure A.6: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 5ms recording in the lecture room (Experiment 2).

Figure A.7: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50ms recording in the anechoic chamber (Experiment 1).

Figure A.8: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50ms recording in the anechoic chamber (Experiment 1).

Figure A.9: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 50ms recording in the conference room (Experiment 1).

Figure A.10: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 50ms recording in the conference room (Experiment 1).

Figure A.11: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the anechoic chamber (Experiment 1).

Figure A.12: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the anechoic chamber (Experiment 1).

Figure A.13: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the conference room (Experiment 1).
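The curves in these figures can be reproduced in outline with a short-time spectral centroid, i.e. the magnitude-weighted mean frequency of each STFT frame (cf. Section 3.3.3). A minimal sketch follows; the frame length and overlap are our assumptions, not necessarily the settings used for the figures.

```python
import numpy as np
from scipy.signal import stft

def spectral_centroid_track(x, fs, nperseg=1024, noverlap=512):
    """Short-time spectral centroid: for each STFT frame, the mean
    frequency weighted by spectral magnitude, returned together with
    the frame centre times for plotting against time."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    mag = np.abs(Z)
    sc = (f[:, None] * mag).sum(axis=0) / np.maximum(mag.sum(axis=0), 1e-12)
    return t, sc
```

Plotting sc against t for each of the 10 versions of a recording gives curves of the kind shown in Figures A.1-A.16.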
Figure A.14: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the conference room (Experiment 1).

Figure A.15: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the left ear 500ms recording in the lecture room (Experiment 2).

Figure A.16: The spectral centroid as a function of time for the 10 versions (marked in different colors for each subplot) of the right ear 500ms recording in the lecture room (Experiment 2).

Appendix B
Auditory models

B.1 Loudness

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   12.467 12.497 20.979 20.200 13.608 12.406 12.509 12.458
Ver2   15.203 15.104 22.566 22.348 16.274 15.217 15.221 15.199
Ver3   11.987 11.965 18.869 18.251 13.044 11.971 12.019 11.976
Ver4   14.699 14.591 22.169 21.259 15.803 14.682 14.742 14.690
Ver5   14.204 14.125 21.649 20.584 15.254 14.217 14.273 14.242
Ver6   14.704 14.643 21.813 21.344 15.636 14.753 14.758 14.731
Ver7   12.458 12.354 19.535 19.407 13.437 12.387 12.390 12.373
Ver8   12.716 12.647 19.822 19.483 13.839 12.758 12.739 13.426
Ver9   13.687 13.647 20.489 20.708 14.792 13.680 13.702 13.695
Ver10  11.444 11.386 18.850 18.361 12.353 11.402 11.435 11.412
Mean   13.357 13.296 20.674 20.194 14.404 13.347 13.379 13.420
Table B.1: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 5ms duration signal. The last row indicates the mean over the 10 versions.
NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   41.511 41.452 66.048 53.907 41.790 41.726 41.656 41.499
Ver2   35.716 35.647 57.284 46.762 35.917 35.860 35.768 35.739
Ver3   38.842 38.801 62.587 51.569 39.090 39.078 38.920 38.824
Ver4   40.318 40.251 63.893 53.316 40.586 40.502 40.361 40.299
Ver5   41.149 41.114 65.020 53.592 41.506 41.396 41.714 41.161
Ver6   40.951 40.861 64.267 53.255 41.121 41.135 41.040 40.965
Ver7   40.601 40.536 62.560 51.682 40.796 40.748 40.604 40.587
Ver8   41.978 41.931 67.099 53.775 42.231 42.281 42.051 41.978
Ver9   40.236 40.171 65.030 53.180 40.591 40.412 40.313 40.242
Ver10  39.596 39.462 62.933 52.029 39.574 39.779 39.709 39.594
Mean   40.090 40.023 63.672 52.307 40.320 40.292 40.213 40.089
Table B.2: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 50ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   49.689 49.609 78.660 63.467 49.893 49.944 49.744 49.677
Ver2   49.617 49.576 78.586 64.006 49.914 49.913 49.681 49.640
Ver3   47.484 47.382 75.840 61.475 47.784 47.779 47.542 47.485
Ver4   46.822 46.789 73.507 60.473 47.059 47.016 46.873 46.795
Ver5   47.513 47.424 74.836 61.174 47.709 47.741 47.559 47.486
Ver6   48.206 48.146 76.411 62.485 48.432 48.388 48.233 48.192
Ver7   49.550 49.515 77.657 64.087 49.684 49.756 49.594 49.539
Ver8   47.659 47.637 75.250 61.816 47.819 47.870 47.694 47.669
Ver9   47.632 47.565 75.547 61.719 47.797 47.884 47.693 47.623
Ver10  47.198 47.177 75.132 60.891 47.440 47.481 47.261 47.204
Mean   48.137 48.082 76.143 62.159 48.353 48.377 48.187 48.131
Table B.3: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the anechoic chamber with 500ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   19.309 19.349 26.688 24.389 21.485 19.635 19.981 19.531
Ver2   19.288 19.358 26.692 24.354 21.427 19.625 19.969 19.531
Ver3   19.298 19.357 26.690 24.341 21.515 19.646 19.956 19.524
Ver4   19.314 19.370 26.690 24.412 21.561 19.645 19.957 19.527
Ver5   19.308 19.380 26.682 24.337 21.600 19.645 19.961 19.522
Ver6   19.350 19.385 26.731 24.382 21.588 19.652 19.978 19.522
Ver7   19.335 19.384 26.701 24.349 21.544 19.668 19.997 19.526
Ver8   19.342 19.399 26.712 24.396 21.554 19.671 19.988 19.531
Ver9   19.333 19.389 26.734 24.382 21.504 19.667 19.988 19.532
Ver10  19.323 19.391 26.752 24.432 21.593 19.658 19.974 19.541
Mean   19.320 19.376 26.707 24.377 21.537 19.651 19.975 19.529
Table B.4: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 5ms duration signal. The last row indicates the mean over the 10 versions.
NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   45.024 45.081 69.618 55.674 47.552 45.150 45.269 45.041
Ver2   45.021 45.073 69.656 55.666 47.573 45.143 45.252 45.045
Ver3   45.006 45.075 69.625 55.635 47.597 45.147 45.273 45.036
Ver4   45.012 45.069 69.626 55.674 47.637 45.141 45.252 45.037
Ver5   44.998 45.070 69.587 55.694 47.634 45.126 45.260 45.044
Ver6   44.989 45.062 69.569 55.722 47.663 45.127 45.242 45.047
Ver7   44.991 45.065 69.586 55.708 47.594 45.138 45.257 45.044
Ver8   44.980 45.070 69.590 55.690 47.665 45.132 45.243 45.036
Ver9   44.980 45.076 69.616 55.674 47.606 45.126 45.224 45.030
Ver10  44.986 45.078 69.592 55.683 47.666 45.117 45.219 45.049
Mean   44.999 45.072 69.607 55.682 47.619 45.135 45.249 45.041
Table B.5: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 50ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   52.453 52.465 78.748 63.607 54.483 52.402 52.569 52.506
Ver2   52.438 52.461 78.704 63.668 54.623 52.401 52.577 52.469
Ver3   52.449 52.471 78.586 63.669 54.589 52.382 52.582 52.481
Ver4   52.448 52.484 78.654 63.557 54.484 52.419 52.562 52.493
Ver5   52.456 52.490 78.702 63.493 54.649 52.390 52.529 52.485
Ver6   52.443 52.504 78.655 63.553 54.621 52.372 52.577 52.519
Ver7   52.446 52.501 78.629 63.540 54.498 52.360 52.570 52.519
Ver8   52.438 52.500 78.671 63.544 54.591 52.380 52.567 52.497
Ver9   52.435 52.502 78.664 63.566 54.689 52.381 52.571 52.510
Ver10  52.432 52.496 78.575 63.539 54.576 52.378 52.587 52.540
Mean   52.444 52.487 78.659 63.574 54.580 52.387 52.569 52.502
Table B.6: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the conference room with 500ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   14.495 16.730 15.386
Ver2   16.846 18.145 17.283
Ver3   15.568 16.890 16.028
Ver4   15.419 16.902 15.897
Ver5   14.604 15.876 15.154
Ver6   16.211 18.083 17.141
Ver7   14.969 16.553 15.741
Ver8   16.825 18.971 17.734
Ver9   15.637 17.293 16.205
Ver10  14.391 16.158 15.226
Mean   15.497 17.160 16.179
Table B.7: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   37.333 39.974 37.993
Ver2   39.586 42.003 40.029
Ver3   37.340 39.772 38.108
Ver4   37.542 40.061 38.036
Ver5   38.969 41.484 39.708
Ver6   39.385 41.827 40.155
Ver7   37.625 39.851 37.979
Ver8   39.139 41.983 39.842
Ver9   40.873 43.319 41.447
Ver10  38.681 41.378 39.241
Mean   38.647 41.165 39.254
Table B.8: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration, 32 clicks signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   45.479 48.055 45.780
Ver2   47.913 50.425 48.312
Ver3   45.689 47.900 46.236
Ver4   45.490 47.804 46.061
Ver5   46.258 48.771 46.563
Ver6   46.939 49.504 47.112
Ver7   44.233 46.610 44.669
Ver8   47.865 50.673 48.356
Ver9   48.594 51.714 49.227
Ver10  46.518 49.176 46.921
Mean   46.498 49.063 46.924
Table B.9: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 5ms duration, 64 clicks signal. The last row indicates the mean over the 10 versions.
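The loudness model that produced these tables is described in Chapter 4 and is not reproduced here. Assuming its short-term loudness output is available as a time series per recording, the tabulated values and a simple object-versus-no-object summary reduce to the following sketch (the function names are ours):

```python
import numpy as np

def max_short_term_loudness(stl):
    """Maximum of a short-term loudness time series in sones, i.e. one
    cell of Tables B.1-B.10 (stl is assumed to come from a time-varying
    loudness model such as Glasberg & Moore, 2002)."""
    return float(np.max(stl))

def loudness_cue(max_stl_object, max_stl_no_object):
    """Object-minus-no-object difference in maximum short-term loudness,
    averaged over the 10 versions: a simple summary of the loudness
    information available for echolocation at a given distance."""
    diff = np.asarray(max_stl_object) - np.asarray(max_stl_no_object)
    return float(np.mean(diff))
```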
NoObjectrec1 Object100cm Object150cm
Ver1   50.897 53.594 51.273
Ver2   53.347 56.330 53.633
Ver3   50.800 53.047 51.226
Ver4   50.633 53.038 51.280
Ver5   51.920 54.317 52.409
Ver6   52.236 55.310 52.673
Ver7   49.713 52.142 50.079
Ver8   53.366 56.965 54.012
Ver9   55.000 57.813 55.522
Ver10  52.216 54.564 52.557
Mean   52.013 54.712 52.466
Table B.10: Maximum of the Short Term Loudness in sones of 10 versions for the recordings in the lecture room with 500ms duration signal. The last row indicates the mean over the 10 versions.

B.2 Sharpness

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.892 1.922 2.108 2.142 1.937 1.926 1.914 1.896
Ver2   1.884 1.910 2.119 2.166 1.933 1.867 1.853 1.862
Ver3   1.907 1.906 2.062 2.154 1.929 1.906 1.884 1.882
Ver4   1.894 1.894 1.996 2.078 1.912 1.906 1.896 1.881
Ver5   1.890 1.907 1.954 2.088 1.921 1.920 1.912 1.904
Ver6   1.847 1.869 1.982 2.115 1.896 1.915 1.891 1.897
Ver7   1.878 1.899 2.020 2.192 1.898 1.893 1.884 1.905
Ver8   1.876 1.882 2.068 2.173 1.942 1.896 1.893 1.884
Ver9   1.895 1.886 2.111 2.098 1.914 1.890 1.895 1.877
Ver10  1.916 1.922 2.101 2.169 1.926 1.941 1.884 1.907
Mean   1.888 1.900 2.052 2.138 1.921 1.906 1.891 1.889
Table B.11: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 5ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.868 1.894 2.112 2.127 1.902 1.910 1.874 1.852
Ver2   1.899 1.893 2.080 2.155 1.927 1.907 1.865 1.852
Ver3   1.876 1.909 2.075 2.183 1.911 1.913 1.909 1.886
Ver4   1.897 1.901 2.046 2.152 1.898 1.894 1.846 1.898
Ver5   1.880 1.925 2.002 2.170 1.889 1.894 1.875 1.859
Ver6   1.883 1.904 2.016 2.113 1.928 1.895 1.870 1.884
Ver7   1.894 1.902 2.042 2.087 1.909 1.896 1.854 1.879
Ver8   1.928 1.905 2.094 2.119 1.924 1.902 1.880 1.896
Ver9   1.896 1.906 2.113 2.141 1.919 1.922 1.883 1.927
Ver10  1.872 1.867 2.095 2.161 1.910 1.911 1.888 1.875
Mean   1.889 1.901 2.068 2.141 1.912 1.904 1.874 1.881
Table B.12: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 50ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.870 1.878 2.104 2.096 1.880 1.856 1.826 1.840
Ver2   1.861 1.878 2.110 2.074 1.901 1.860 1.835 1.832
Ver3   1.857 1.874 2.123 2.090 1.890 1.854 1.825 1.831
Ver4   1.865 1.894 2.098 2.140 1.897 1.862 1.831 1.832
Ver5   1.865 1.883 2.105 2.144 1.893 1.858 1.828 1.829
Ver6   1.855 1.878 2.115 2.143 1.889 1.843 1.821 1.832
Ver7   1.857 1.880 2.133 2.167 1.894 1.850 1.837 1.834
Ver8   1.857 1.882 2.122 2.141 1.893 1.871 1.839 1.837
Ver9   1.857 1.888 2.121 2.107 1.887 1.856 1.831 1.846
Ver10  1.862 1.890 2.127 2.084 1.899 1.865 1.837 1.835
Mean   1.861 1.882 2.116 2.119 1.892 1.858 1.831 1.835
Table B.13: Median of the sharpness in acums of 10 versions for the recordings in the anechoic room (Experiment 1) with 500ms duration signal. The last row indicates the mean over the 10 versions.
NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.929 1.988 2.037 2.042 2.052 1.988 1.979 1.964
Ver2   1.964 1.968 2.038 2.034 2.033 1.990 1.981 1.996
Ver3   1.955 1.963 2.013 2.029 1.985 2.038 1.977 2.027
Ver4   1.962 2.011 2.037 2.038 1.981 2.044 1.967 1.993
Ver5   1.979 1.975 2.028 2.035 1.993 1.972 1.995 1.978
Ver6   1.983 1.998 2.040 2.025 1.996 2.002 1.989 1.987
Ver7   1.980 2.002 2.040 2.027 2.011 2.022 2.014 1.977
Ver8   1.989 1.964 2.047 2.024 2.008 2.009 1.951 1.995
Ver9   1.991 1.973 2.027 2.025 2.015 2.019 1.990 1.972
Ver10  1.989 1.993 2.008 2.037 1.959 2.005 1.981 1.967
Mean   1.972 1.983 2.032 2.032 2.003 2.009 1.982 1.986
Table B.14: Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 5ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.893 1.896 1.980 1.974 1.940 1.905 1.918 1.863
Ver2   1.892 1.908 1.968 1.960 1.927 1.922 1.916 1.891
Ver3   1.898 1.896 1.957 1.954 1.941 1.910 1.925 1.892
Ver4   1.881 1.892 1.954 1.927 1.923 1.909 1.911 1.871
Ver5   1.886 1.893 1.964 1.940 1.938 1.893 1.923 1.899
Ver6   1.904 1.883 1.966 1.945 1.944 1.917 1.921 1.863
Ver7   1.901 1.890 1.966 1.947 1.937 1.924 1.932 1.889
Ver8   1.897 1.895 1.954 1.968 1.933 1.921 1.903 1.888
Ver9   1.883 1.893 1.967 1.933 1.934 1.917 1.907 1.911
Ver10  1.893 1.898 1.960 1.951 1.944 1.923 1.914 1.911
Mean   1.893 1.894 1.964 1.950 1.936 1.914 1.917 1.888
Table B.15: Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 50ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 NoObjectrec2 Object50cm Object100cm Object200cm Object300cm Object400cm Object500cm
Ver1   1.937 1.939 2.100 2.040 1.967 1.950 1.953 1.940
Ver2   1.931 1.940 2.098 2.043 1.966 1.948 1.951 1.941
Ver3   1.935 1.938 2.091 2.049 1.968 1.952 1.945 1.946
Ver4   1.938 1.942 2.093 2.043 1.969 1.943 1.952 1.942
Ver5   1.934 1.937 2.099 2.041 1.967 1.951 1.948 1.944
Ver6   1.934 1.938 2.092 2.046 1.968 1.954 1.947 1.936
Ver7   1.932 1.935 2.090 2.043 1.967 1.948 1.952 1.936
Ver8   1.933 1.936 2.102 2.038 1.967 1.951 1.951 1.943
Ver9   1.933 1.937 2.095 2.045 1.965 1.953 1.943 1.942
Ver10  1.938 1.936 2.088 2.042 1.968 1.955 1.946 1.944
Mean   1.935 1.938 2.095 2.043 1.967 1.950 1.949 1.941
Table B.16: Median of the sharpness in acums of 10 versions for the recordings in the conference room (Experiment 1) with 500ms duration signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   1.817 1.729 1.773
Ver2   1.873 1.820 1.886
Ver3   1.773 1.763 1.752
Ver4   1.959 1.776 1.902
Ver5   1.826 1.855 1.825
Ver6   1.729 1.601 1.667
Ver7   1.861 1.900 1.958
Ver8   1.892 1.973 1.894
Ver9   1.905 1.754 1.863
Ver10  1.853 1.614 1.824
Mean   1.849 1.778 1.834
Table B.17: Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration signal. The last row indicates the mean over the 10 versions.
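The sharpness values in Tables B.11-B.20 are medians over time frames of a frame-wise sharpness. As a sketch, one common Zwicker-type formulation computes sharpness from the specific-loudness pattern N'(z) on the Bark axis; the weighting g(z) below is a frequently quoted variant, and the exact implementation used for these tables may differ.

```python
import numpy as np

def sharpness_acum(specific_loudness, dz=0.1):
    """Zwicker-type sharpness in acum from a specific-loudness pattern
    N'(z) sampled every dz Bark over 0-24 Bark (one common formulation,
    not necessarily identical to the thesis's tool chain)."""
    n = np.asarray(specific_loudness, dtype=float)
    z = dz * (np.arange(n.size) + 0.5)  # Bark-scale bin centres
    # weighting: unity up to about 15.8 Bark, rising exponentially above
    g = np.where(z <= 15.8, 1.0, 0.066 * np.exp(0.171 * z))
    total_loudness = n.sum() * dz       # total loudness in sones
    return 0.11 * (n * g * z).sum() * dz / max(total_loudness, 1e-12)

# The tabulated values would then be the median over frames, e.g.:
# median_sharpness = np.median([sharpness_acum(frame) for frame in frames])
```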
NoObjectrec1 Object100cm Object150cm
Ver1   2.051 2.153 2.083
Ver2   1.992 2.105 2.032
Ver3   2.015 2.108 2.022
Ver4   2.013 2.125 2.048
Ver5   2.006 2.103 2.030
Ver6   1.982 2.097 2.006
Ver7   2.044 2.145 2.080
Ver8   2.006 2.059 2.049
Ver9   2.025 2.099 2.043
Ver10  2.031 2.142 2.053
Mean   2.017 2.114 2.045
Table B.18: Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration, 32 clicks signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   2.075 2.188 2.112
Ver2   2.035 2.150 2.068
Ver3   2.066 2.175 2.091
Ver4   2.069 2.182 2.098
Ver5   2.074 2.197 2.109
Ver6   2.067 2.181 2.102
Ver7   2.070 2.169 2.098
Ver8   2.078 2.192 2.117
Ver9   2.061 2.169 2.094
Ver10  2.065 2.178 2.101
Mean   2.066 2.178 2.099
Table B.19: Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 5ms duration, 64 clicks signal. The last row indicates the mean over the 10 versions.

NoObjectrec1 Object100cm Object150cm
Ver1   2.075 2.210 2.123
Ver2   2.061 2.184 2.095
Ver3   2.071 2.202 2.112
Ver4   2.081 2.206 2.120
Ver5   2.064 2.197 2.105
Ver6   2.085 2.218 2.124
Ver7   2.068 2.181 2.101
Ver8   2.083 2.206 2.118
Ver9   2.062 2.194 2.093
Ver10  2.069 2.202 2.108
Mean   2.072 2.200 2.110
Table B.20: Median of the sharpness in acums of 10 versions for the recordings in the lecture room (Experiment 2) with 500ms duration signal. The last row indicates the mean over the 10 versions.

B.3 Pitch strength using strobe temporal integration

Figure B.1 shows the temporal profile of the stabilised auditory image for a 500ms recording in the conference room. As stated in Chapter 4, the stabilised auditory image was implemented using two modules, sf2003 and ti2003. A brief description of these is given below.

[Figure: four temporal profiles (amplitude versus time interval, 0-0.015 s): (a) No object, highest peak 76 Hz: 0.07; (b) Object at 50 cm, 738 Hz: 0.29; (c) Object at 100 cm, 232 Hz: 0.23; (d) Object at 200 cm, 100 Hz: 0.10.]

Figure B.1: The temporal profiles of the stabilised auditory image for a 500ms signal recorded in the conference room (Experiment 1) at the 495ms time frame. The blue dot indicates the highest peak, and the corresponding values indicate the pitch strength (calculated using equation 4.10) and the frequency in Hz (calculated using the inverse relationship of time and frequency, f = 1/t).

Initially, the sf2003 module uses an adaptive strobe threshold to issue strobes on the NAP. After a strobe is initiated, the threshold first rises along a parabolic path and then returns to a linear decay, to avoid spurious strobes (cf. Figure 4.2). Once the strobes have been computed for each frequency channel of the NAP, the ti2003 module uses them to initiate temporal integration. The time interval between the strobe and a NAP value determines the position at which that NAP value is entered in the SAI. For example, if a strobe is identified in the 200Hz channel of the NAP at the 5ms time instant, then the level of the NAP sample at 5ms is added to the 1st position of the 200Hz channel in the SAI, and the next sample of the NAP is added to the 2nd position of the SAI.
This process of adding the levels of the NAP samples continues for 35ms and terminates if no further strobes are identified. If strobes are detected within the 35ms interval, each strobe initiates its own temporal integration process. To preserve the shape of the SAI relative to that of the NAP, ti2003 uses a weighting scheme, i.e., new strobes are initially weighted high (with the weights normalized so that they sum to 1), so that older strobes contribute relatively less to the SAI. In this way the time axis of the NAP is converted into a time-interval axis in the SAI. The temporal profile in the subfigures of Figure B.1 was generated by summing the SAI across the centre frequencies. The results in Figure B.1 show that, compared with the recording without the object, the recordings with the object had a pitch strength greater than 0.1 at the frequencies corresponding to the repetition pitch. However, whether this is the case for all the recordings has to be verified. As previous research (Yost, 1996; Patterson et al., 1996) quantified repetition pitch perception using autocorrelation theory, this thesis followed in their footsteps, assuming that autocorrelation is the way repetition pitch is perceived. The autocorrelation results in Chapter 5 justified this assumption. To quantify how strobe temporal integration could explain the pitch perception that is known to be useful for human echolocation, a detailed analysis using the strobe temporal integration module of the AIM has to be done. This is left for future work.
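To make the above description concrete, the following toy sketch mimics, in a highly simplified way, the ti2003-style strobe weighting and the temporal-profile peak picking used for Figure B.1. It is our sketch, not the AIM code: toy_sai collapses the evolving auditory image to a single frame, and profile_peak reports the raw peak height, which is only a crude stand-in for the pitch-strength measure of equation 4.10.

```python
import numpy as np

def toy_sai(nap, strobes, n_intervals, decay=0.5):
    """One channel of a toy SAI frame: a weighted sum of the NAP segments
    that follow each strobe, with newer strobes weighted more strongly
    and the weights normalised to sum to one (cf. ti2003).

    nap         : 1-D neural activity pattern of one centre-frequency channel
    strobes     : strobe sample indices, in ascending order of time
    n_intervals : length of the time-interval axis (e.g. 35 ms of samples)
    """
    w = decay ** np.arange(len(strobes))[::-1].astype(float)  # newest -> 1
    w /= w.sum()
    sai = np.zeros(n_intervals)
    for t0, wi in zip(strobes, w):
        seg = np.asarray(nap)[t0:t0 + n_intervals]
        sai[:seg.size] += wi * seg  # interval axis starts at the strobe
    return sai

def temporal_profile(sai_channels):
    """Sum a multi-channel SAI (channels x intervals) over centre
    frequencies, as done for the subfigures of Figure B.1."""
    return np.asarray(sai_channels).sum(axis=0)

def profile_peak(profile, fs, min_lag=0.002):
    """Highest peak of the temporal profile beyond min_lag seconds;
    the frequency uses f = 1/t as in Figure B.1."""
    i0 = int(min_lag * fs)
    i = i0 + int(np.argmax(profile[i0:]))
    return fs / i, float(profile[i])
```

Applied to the profile of panel (b) in Figure B.1, for instance, profile_peak would be expected to return a frequency near 738 Hz.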