Studies on Bird Vocalization Detection and Classification of Species

Seppo Fagerlund

Department of Signal Processing and Acoustics
Aalto University
DOCTORAL DISSERTATIONS
Aalto University publication series
DOCTORAL DISSERTATIONS 166/2014
Studies on Bird Vocalization Detection
and Classification of Species
Seppo Fagerlund
A doctoral dissertation completed for the degree of Doctor of
Science (Technology) to be defended, with the permission of the
Aalto University School of Electrical Engineering, at a public
examination held at the lecture hall S1 of the school on 17 November
2014 at 12.
Aalto University
School of Electrical Engineering
Department of Signal Processing and Acoustics
Supervising professor
Prof. Unto K. Laine
Thesis advisor
Prof. Unto K. Laine
Preliminary examiners
Dr. Rolf Bardeli, Fraunhofer Institute Intelligent Analysis and
Information Systems IAIS, Germany
Dr. Peter Jančovič, University of Birmingham, U.K.
Opponent
Prof. Gernot Kubin, Graz University of Technology, Austria
Aalto University publication series
DOCTORAL DISSERTATIONS 166/2014
© Seppo Fagerlund
ISBN 978-952-60-5921-1 (printed)
ISBN 978-952-60-5922-8 (pdf)
ISSN-L 1799-4934
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-60-5922-8
Unigrafia Oy
Helsinki 2014
Finland
Publication orders (printed book):
[email protected]
Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto www.aalto.fi
Author
Seppo Fagerlund
Name of the doctoral dissertation
Studies on Bird Vocalization Detection and Classification of Species
Publisher School of Electrical Engineering
Unit Department of Signal Processing and Acoustics
Series Aalto University publication series DOCTORAL DISSERTATIONS 166/2014
Field of research Acoustics and Audio Signal Processing
Manuscript submitted 12 June 2014
Date of the defence 17 November 2014
Permission to publish granted (date) 21 October 2014
Language English
Article dissertation (summary + original articles)
Abstract
The topic of this thesis is automatic identification of bird vocalization and bird species based
on the sounds they produce. Two main approaches for recording bird sounds are presented:
active recording and passive continuous recording. The aim of the active recording method is
to capture sounds of a particular bird species or an individual. On the other hand, passive
continuous recordings – which can be captured without human presence – are used in
acoustical monitoring and are intended to include all sounds in the local environment.
The automatic identification system begins by segmenting distinct sound events from the
recordings. The purpose of segmentation is to detect syllables in bird sounds. Active recordings
with one bird individual typically have a high signal-to-noise ratio that helps in the
task. Segmentation of passive continuous recordings is more demanding due to the possibility
of many simultaneous sound sources and a varying signal level of sound events. Audio events
in recordings, comprising sounds from many sources, are also often overlapping, which adds
complexity to the segmentation phase.
After the audio events have been segmented, feature extraction and classification are
performed. Within feature extraction the audio signals are represented with a low number of
attributes (compared to the original data) that characterize particular sound events. Feature
extraction performs dimension reduction by removing redundant information from the
original data. Suitable features depend on the data and should be selected so that they
discriminate sounds from different sources. The classification phase decides which class
each sound event belongs to, based on the feature representation.
The main focus of this thesis is to develop and examine feature representations for different
types of bird sounds suitable for automatic classification. Special attention has been given to
birds that produce inharmonic and noisy sounds due to the diverse structure of their
vocalizations. A method based on short time-domain structures was found to be efficient for
many different types of sounds. It also proved efficient for sound event detection in
continuous recordings.
Keywords bird song, bioacoustics, sound event detection, feature extraction, pattern
recognition, automated recognition
ISBN (printed) 978-952-60-5921-1
ISBN (pdf) 978-952-60-5922-8
ISSN-L 1799-4934
ISSN (printed) 1799-4934
ISSN (pdf) 1799-4942
Location of publisher Helsinki
Pages 123
Location of printing Helsinki Year 2014
urn http://urn.fi/URN:ISBN:978-952-60-5922-8
Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto www.aalto.fi
Author
Seppo Fagerlund
Name of the doctoral dissertation
Detection of bird sounds and identification of bird species
Publisher School of Electrical Engineering
Unit Department of Signal Processing and Acoustics
Series Aalto University publication series DOCTORAL DISSERTATIONS 166/2014
Field of research Acoustics and Audio Signal Processing
Manuscript submitted 12.06.2014
Permission to publish granted (date) 21.10.2014
Date of the defence 17.11.2014
Language English
Article dissertation (summary + original articles)
Abstract
The topic of this dissertation is the automatic detection of bird sounds and the identification of bird species based on the sounds the birds produce. The work presents the main ways of recording bird sounds, which are active and passive recording. In active recording, the aim is to record the sounds of a particular bird species or individual. Passive recording, which can also be carried out without human presence, is used for acoustic monitoring, and the goal is to record all sounds occurring in the area.
The first phase in an automatic recognition system is the segmentation of distinct sound events. In this work, the goal of segmentation is to separate the individual syllables of bird sounds. In active recordings the signal-to-noise ratio is typically high, which eases the task. Segmentation of continuous passive recordings is more challenging because they can contain sounds from several sound sources and the signal-to-noise ratio varies between sound events. In recordings containing several sound sources, distinct sound events often overlap, which makes segmentation more difficult.
After the segmentation of sound events, the next phases are feature extraction and classification. In feature extraction, the sound events are represented with a small number (compared to the original amount of data) of descriptive attributes. Feature extraction reduces the original amount of data by removing redundant information. The features suitable for each situation depend on the type of the sounds and should be chosen so that they discriminate sounds coming from different sources. In the classification phase, the sound events described by the feature representation are assigned to different classes.
The main purpose of this dissertation is to develop and study feature representations suitable for different types of bird sounds for their automatic classification. Special attention has been given to birds that produce inharmonic and noise-like sounds, due to the diverse structure of their vocalizations. A method describing the short-time structure of the sounds has proven to be an efficient representation for many types of sounds. This method has also proven efficient for detecting individual sound events in continuous recordings.
Keywords bird song, bioacoustics, sound event detection, feature extraction, pattern recognition, automated recognition
ISBN (printed) 978-952-60-5921-1
ISBN (pdf) 978-952-60-5922-8
ISSN-L 1799-4934
ISSN (printed) 1799-4934
ISSN (pdf) 1799-4942
Location of publisher Helsinki
Pages 123
Location of printing Helsinki
Year 2014
urn http://urn.fi/URN:ISBN:978-952-60-5922-8
Preface
This work has been carried out at the Department of Signal Processing
and Acoustics at the Aalto University School of Electrical Engineering in
Espoo, Finland. The work was funded by the Academy of Finland, the EU
IST Sixth Framework Programme, and Aalto University. The work was
also supported by the Finnish Cultural Foundation, the Nokia Foundation, and the Research Foundation of the Helsinki University of Technology.
I wish to express my gratitude to my supervisor and instructor, Professor Unto K. Laine, for making this thesis possible. Countless discussions with Unto have been inspiring, and his enthusiasm has encouraged me in my work. Additionally, Unto has provided ideas for the permutation transformation and the smoothing of the permutation statistic. I
also want to thank Unto for the patience and confidence he has shown
in me during this project, especially during the times when research work
did not progress as expected. My gratitude also belongs to Dr. Aki Härmä,
Prof. Juha Tanttu and Dr. Jari Turunen, who introduced me to this topic
in the first place and also to Dr. Panu Somervuo for his contribution during the early stages of this work. I wish to thank the pre-examiners of
this thesis, Dr. Rolf Bardeli and Dr. Peter Jančovič, for their valuable
comments regarding the manuscript.
I am indebted to Dr. Toomas Altosaar for proofreading my manuscripts
and his considerable support in many different aspects during the past
years. I am also sincerely glad for our numerous discussion sessions covering many different topics, some of them very far from research and
science. I also wish to thank Dr. Okko Räsänen for reading and commenting on my manuscripts. Other former and current members of our
research team - Petri, Jouni, Heikki, Sofoklis, Juho and Shreyas - deserve
my gratitude for the positive working environment.
I am proud of being able to work in a fantastic environment within the
community of the acoustics lab. I want to thank Lea, Mara, Hynde, Mirja,
Heidi, Markku, Ulla and Paula for taking care of many administrative
and practical issues. Special thanks to Tomppa, Mairas, Carlo (In Memoriam 1980-2008), Hannu, Jouni, Marko T. and Sami; in addition to being
great colleagues I have had nice discussions and experiences with you in
sports or outdoor activities. Also, my thanks go to Cumhur, Henkka,
Heidi-Maria, Jussi, Antti, Mikko-Ville, Olli, Tapani, Ilkka, Tuomo, Marko
H., Julia, and many other former and current labsters for the great atmosphere and unforgettable moments in various events.
I also wish to thank my parents Aino and Reino, and my brother Pekka
with his family, for their support and understanding. I am also grateful to
all my friends, especially for all the outdoor activities we have experienced
together.
Finally, and most importantly, I want to express my gratitude to my
lovely wife Anu for all the love and support she has given me over the
years. I am also grateful to our daughter Meri and son Justus for always
being there. You have ensured that life has never been monotonous or
empty. There are no words to express how much I love you all.
Espoo, October 28, 2014,
Seppo Fagerlund
Contents
Preface
Contents
List of Publications
Author's Contribution
List of Symbols
List of Abbreviations
1. Introduction
   1.1 Related work
   1.2 Scope and outline of the thesis
2. Acoustical Monitoring of Bird Sounds
   2.1 Natural audio scenes
   2.2 Organization and structure of bird sounds
   2.3 Outdoor sound recording
3. Signal Processing Tools
   3.1 Segmentation
       3.1.1 Frame based segmentation
   3.2 Parametric representations
       3.2.1 Spectral features
       3.2.2 Sinusoidal features
       3.2.3 Low-level signal features
       3.2.4 Time domain structures
   3.3 Classification
       3.3.1 k-nearest neighbour classification
       3.3.2 Support vector machines
4. Summary and Main Results of the Publications
   4.1 Publication I: Parametrization of Inharmonic Bird Sounds for Automatic Recognition
   4.2 Publication II: Parametric Representations of Bird Sounds for Automatic Species Recognition
   4.3 Publication III: Bird Species Recognition Using Support Vector Machines
   4.4 Publication IV: Monitoring of Capercaillie Courting Display
   4.5 Publication V: Classification of Audio Events Using Permutation Transformation
   4.6 Publication VI: New Parametric Representations of Bird Sounds for Automatic Classification
5. Conclusions and Future Directions
Appendix
Errata
References
Publications
List of Publications
This thesis consists of an overview and of the following publications, which
are referred to in the text by their Roman numerals.
I Seppo Fagerlund, Aki Härmä. Parametrization of Inharmonic Bird Sounds
for Automatic Recognition. In 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, Sep 2005.
II Panu Somervuo, Aki Härmä and Seppo Fagerlund. Parametric representations of bird sounds for automatic species recognition. IEEE Trans.
Audio, Speech and Language Processing, Vol. 14, no. 6, pp. 2252 - 2263,
2006.
III Seppo Fagerlund. Bird Species Recognition Using Support Vector Machines. EURASIP Journal on Advances in Signal Processing, Article ID
38637, 2007.
IV Seppo Fagerlund. Monitoring of Capercaillie Courting Display. In
19th International Congress on Sound and Vibration (ICSV19), Vilnius,
Lithuania, Jul 2012.
V Seppo Fagerlund, Unto K. Laine. Classification of audio events using
permutation transformation. Applied Acoustics, Vol. 83, pp. 57-63,
2014.
VI Seppo Fagerlund, Unto K. Laine. New parametric representations of
bird sounds for automatic classification. In 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014.
Author’s Contribution
Publication I: “Parametrization of Inharmonic Bird Sounds for
Automatic Recognition”
The present author is responsible for the experiments in this work. The
author wrote the main parts of the article and edited it based on comments of the second author.
Publication II: “Parametric representations of bird sounds for
automatic species recognition”
The present author implemented the segmentation algorithm in collaboration with the second author and compiled the database used in this
work. The author is also responsible for the Mel-cepstrum model and descriptive parameters. The author wrote the corresponding parts of the
article.
Publication III: “Bird Species Recognition Using Support Vector
Machines”
The present author is the sole author of this publication.
Publication IV: “Monitoring of Capercaillie Courting Display”
The present author is the sole author of this publication.
Publication V: “Classification of audio events using permutation
transformation”
The present author planned this work in collaboration with the second author. The original idea of the algorithmic distance d_K came from the second author. The author was responsible for implementing the algorithms
and performing the experiments. The author wrote the article in collaboration with the second author.
Publication VI: “New parametric representations of bird sounds for
automatic classification”
The present author developed the method described in chapter 3 in collaboration with the second author. The present author conducted the research in this work. The author wrote the article and edited it based on
comments by the second author.
List of Symbols
A_m              Arithmetic mean of X(n)
b                Bias
d_E              Euclidean distance measure
d_KL             Kullback-Leibler divergence distance measure
d_M              Mahalanobis distance measure
f(x)             Decision boundary of the classifier
G_m              Geometric mean of X(n)
I(π_i, π_j)      Mutual information
K(x_i, x)        Kernel function
K                Number of filter bank channels
M                Number of frequency bins
MFCC_i           Mel-frequency cepstral coefficients
n_sv             Number of support vectors
p(π_i, π_j)      Joint probability distribution of π_i and π_j
p(π_i), p(π_j)   Probability distributions of π_i and π_j
T_h              Threshold for spectral rolloff frequency measure
X(n)             Power spectrum of the signal x(n)
X_k              Logarithmic energy of the kth channel on the mel scale
x, y             Feature vectors
y_i              Class labels
α_i              Weights of support vectors
π_i, π_j         Temporal patterns
Σ                Covariance matrix
List of Abbreviations
ANN    Artificial Neural Network
ARU    Autonomous Recording Unit
ASR    Automatic Speech Recognition
BW     Signal Bandwidth
CASA   Computational Auditory Scene Analysis
CASR   Computational Auditory Scene Recognition
EN     Signal Energy
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
kNN    k-Nearest Neighbour Classifier
LDA    Linear Discriminant Analysis
MFCC   Mel-frequency Cepstral Coefficients
MI     Mutual Information Measure
PPF    Permutation Pair Frequency matrix
SC     Spectral Centroid
SF     Spectral Flux
SFM    Spectral Flatness
SOM    Self Organizing Maps
SRF    Spectral Rolloff Frequency
SVM    Support Vector Machine
ZCR    Zero Crossing Rate
1. Introduction
Observing our surrounding environment has always interested humans.
Obtaining correct observations from the environment is no longer essential for our everyday living; however, this has not decreased our interest in monitoring wildlife. Especially during springtime, many people are keen to follow the progress of the season, which includes the arrival of migratory birds. Nature has also been an important
source of inspiration, e.g., for many composers and poets. Nature also
provides a place for relaxing and recreation. Recently, environmental observations have improved the awareness of our surroundings and have
increased our motivation to acquire even more information regarding our
ecosystem’s conditions. Acoustical monitoring has especially gained much
attention during the past few years.
The technical development of portable devices, such as modern mobile phones, provides an easy way to make observations about our surroundings. Mobile phones are equipped with sufficiently good quality microphones and adequate storage capacity to make short recordings outdoors. Phones and other portable devices even have ample computational power to perform some on-site analyses of the recordings. Moreover, advancements in other portable recording equipment, especially in storage capacity, have made it possible to also perform long-term recordings without
human presence at the recording site. In fact, the most significant challenges for long-term monitoring scenarios are how to supply electrical
power to the recording equipment and how the equipment can be protected
from changing weather conditions.
Acoustical monitoring of nature offers many possibilities for scientific
research. Birds are a good indicator of the state of our surrounding environment: they are widely distributed and react quickly to changes
in environmental conditions such as climate change [1]. For nature conservation purposes it can be sufficient to monitor only one or a few target
species. Acoustical monitoring is also widely used for tracking bird migration and for estimating populations of bird species. This work is traditionally performed by a large number of ornithologists, biologists as well
as many enthusiastic and skilled volunteer birdwatchers. This is an area
where automated monitoring could offer huge advantages.
Generally speaking, monitoring is a very laborious, time consuming
and expensive task when performed by human listeners and even more
so when monitoring large areas since more experienced observers are
needed. Even more observers are needed if monitoring needs to be performed round-the-clock. Also, observations made by different people may
not be entirely comparable due to differences in their expertise, causing a bias in the results [2]. Especially in areas that are difficult to access, human activity can influence the behavior of monitored animals;
these are also areas of particular interest for making observations. Automated acoustical monitoring is possible in multiple locations simultaneously. Automatic or semi-automatic monitoring can be a valuable tool for
making observations, especially in large and elusive areas. However, it
cannot always replace manual scanning of the recordings. An experienced
listener may still make more accurate observations than an automated
analysis system [3]. This is especially important for birds with a low vocalization rate.
Motivations for bird species identification are numerous, and it has several possible applications in addition to the surveillance of birds. First, many people are interested in bird watching, and it would be helpful to have technology that can identify bird species by their sounds. Often birds can be heard but are difficult to see, especially at night [4]. In these cases, recognition by sound would be a powerful tool for identifying birds. Secondly, automatic identification of bird species near airports could prevent collisions with aircraft if birds are detected early enough [5, 6, 7]. Finally, identification could also prevent unwanted human-animal encounters, for example, animals causing harm to agriculture [8].
1.1 Related work
Detection and classification of bird sounds is a typical audio classification problem with many similarities to, e.g., automatic speech recognition (ASR) and computational auditory scene analysis (CASA) and recognition (CASR). In general, ASR operates on speech samples such as words
or sentences. CASR typically omits the sound event segmentation phase
and scene recognition is performed continuously or at regular time intervals [9]. The extracted events are then represented with a certain parametric representation and compared to predefined event models. Finally,
the sound event is associated with the class that most closely resembles the models of the target sound. Hence, many parametric representations and classification algorithms that are common to ASR and CASR are applied in bird species recognition.
Automatic monitoring of environmental audio events and identification
of animal species is a relatively small area of research compared to ASR
and CASR. However, it has gathered increasingly more interest during
recent years and is currently a very active area of research. The largest concentration of interest in recognizing animals by their sounds has focused on birds. Traditionally, bird vocalization identification has been performed
by visual inspection of sound spectrograms; the first attempts to perform
automated analysis of bird sounds were based on spectrogram template
matching [10, 11]. Some recent studies have exploited spectrogram images for animal sound identification [12] and retrieval of similar sound
events from sound databases [13].
Segmentation of distinct sound events is a crucial part of the automatic
identification system. Segmentation of bird sounds is a relatively sparsely
covered area in the literature and in most studies segmentation is performed by hand based on the energy content of the recordings. Energy
based segmentation degrades when many birds sing simultaneously and
their sounds become overlapped in time. This is a very common situation
found in continuous unsupervised field recordings. However, temporally
overlapping sounds in different frequency bands can be separated from
spectrogram images of the recordings [14, 15]. Segmentation methods
based on image processing of spectrograms also start by dividing sounds into overlapping frames. Segments of bird vocalization are detected from spectrogram images by searching for similar areas in predefined time-frequency masks of bird syllables. While the masks of the syllables are
needed for each syllable type, the algorithm also performs bird versus
non-bird classification. Another study focusing on tonal bird sounds aims
to detect sinusoidal components from the frames and combine those into
frequency tracks [16, 17].
Several different parametric representations of bird sounds have been
proposed for classification of bird species or even individuals. The majority of the features operate in the frequency or time-frequency domain and
include Mel-frequency cepstral coefficients (MFCC) [2, 18, 19, 20, 21, 22,
23], linear predictive coefficients (LPC) [24, 22, 25] or wavelets [26]. Even
though MFCC is adapted to the central properties of the human auditory
system, it has proven to be a sufficient representation for many different
types of bird sounds, especially those with wideband characteristics. Recently, cepstral features using a filter bank whose frequency resolution
was designed for bird hearing rather than human hearing have also been
proposed [27, 8].
Modeling of sinusoidal components has been used efficiently to represent tonal and harmonic bird sounds [28, 29, 6, 30, 16]. Frequency tracks of sinusoidal components can be used for segmentation and also as features for classification purposes [30, 16]. Especially in noisy conditions, classification of tonal bird sounds with sinusoidal features performs better than with MFCC features [30]. This is probably due to the
fact that MFCC covers a wide spectrum while sinusoidal models operate
in a narrow frequency band.
Low-level descriptive acoustical signal parameters, such as spectral centroid (SC) and signal bandwidth (BW), provide suitable representations
for different types of bird sounds [15]. Also, combinations of different feature sets have been used for identification of bird species, e.g., Lopes et al.
[7] use MFCC together with low-level acoustical features.
Lee et al. [31] use different representations of bird sounds that exploit
the spectral and temporal characteristics of signals using fixed duration
birdsong segments. Each segment is transformed into an image based on
predefined image basis functions. A study by E. D. Chesmore [32] uses
pre-defined time dependent waveshape patterns to transform bird sounds
into a series of codes. Histograms of these code-pairs were used to classify
different species.
Several different classification algorithms have been applied in the context of bird vocalization recognition including Gaussian Mixture Models
(GMM) [19, 31, 17], Hidden Markov Models (HMM) [2, 18, 33, 22, 34], different types of Artificial Neural Networks (ANN) [35, 26, 36, 37, 24] and
Self Organizing Maps (SOM) [38, 26].
In addition to birds, species recognition by sounds has been a subject of
study also for other animals, including at least frogs [39, 12], alarm calls of prairie dogs [40], grasshoppers [41] and elephants [42]. A generalized LPC model that extracts perceptually relevant features for a particular species has also been developed and applied to elephants and whales [42].
One large area of study is the sound analysis and identification of marine
mammal sounds [43].
1.2 Scope and outline of the thesis
This thesis presents methods for automated identification of specific audio
events from recordings. The main focus is identification of bird vocalization for species classification. First, the characteristics of bird sounds and
typical circumstances encountered while performing recordings in nature
are discussed. Sound recording equipment for outdoor use in nature depends much on what purpose the recording needs to fulfill. For example,
long-term acoustical monitoring in a specific location differs significantly
from the spontaneous recording of a nearby bird. Secondly, the signal processing tools and algorithms required to generate interpretations from the
recordings are presented. The main focus of this thesis is to investigate
different parametric representations of bird sounds.
This thesis consists of an introductory part followed by six articles that
have been published in scientific journals and conference proceedings.
The introductory part is organized as follows. Chapter 2 focuses on the
characteristics of bird vocalization and considerations related to natural
audio scenes and recording outdoors. Chapter 3 presents the necessary
signal processing procedures for automatic classification. Chapter 4 summarizes the main results in the publications while Chapter 5 concludes
the introductory part.
2. Acoustical Monitoring of Bird Sounds
This chapter describes the considerations required to perform recordings
of bird sounds in their natural environment. Besides bird vocalizations,
audio scenes can contain sounds from multiple other sources. In order to
identify bird sounds within the recordings it is necessary to understand
the specific characteristics of bird vocalizations as well as natural audio
scenes. Considerations related to recording equipment are also covered in
this chapter.
2.1 Natural audio scenes
Natural audio scenes can be extremely multiform depending on the geographical location, time of day and weather conditions. For automatic
identification of sound events, the simplest desirable natural audio scene
would include only one sound source without any background noise. However, this is rarely the case when making recordings in nature. Typically
multiple sound sources are present simultaneously and their sounds can
overlap in time and frequency [44]. Figure 2.1 shows an example of the
sounds from Capercaillie (Tetrao Urogallus, TETURO) during the courting season from two different audio environments. The first recording is
clean and only the sounds of Capercaillie are present while the second
recording includes sounds from other bird species as well. In the second
recording, the sounds from different birds are overlapping but for a human listener it is easy to concentrate on a particular sound event even
with single channel recordings. Furthermore, for human listeners it is
also possible to identify sound events from different sources even when
they are overlapping (in time or frequency) or emanating from the same
direction.
Figure 2.1. Example of Capercaillie sounds in two different audio scenes.
In addition to multiple sound sources, recordings are influenced by the
environmental conditions of the recording site. The topography of the
land, trees and other vegetation influences sound propagation [45]. Vegetation and topography cause sound echoes and reverberation, but they also attenuate acoustic wave propagation. This may become relevant especially when birds are at different distances from the recording device. When sound travels through air, high-frequency components are attenuated more than low-frequency components. Therefore, sounds originating from longer distances contain less high-frequency information than those arriving from shorter distances.
Furthermore, weather conditions and human actions also influence the
recordings. The two main natural noise sources, wind and rain, affect recordings in two different ways: first, by adding noise to the recordings, and second, by changing animal behavior. In addition to natural noise sources, human-induced sounds, e.g., sounds from airplanes and other vehicles, factories, etc., may decrease the signal-to-noise ratio.
Some birds also change their repertoire and vocalization according to
the surrounding audio scene [46, 47]. Birds, when closer to urban areas,
tend to sing at higher frequencies and have a higher minimum frequency
limit to their songs. Urban birds also sing simpler songs than those in the
forest [48]. Birds can also change their vocalization according to the situation they are in, e.g., during conflict situations when other birds intrude
into a bird’s territory. The vocalization complexity during aggressive encounters may also vary.
2.2 Organization and structure of bird sounds
Bird sounds can be categorized into songs and call-sounds [49]. Singing
is limited to songbirds, but they make up only about half of the total number of bird species. Non-songbirds also utilize sounds to communicate, and for them sounds are just as important as for songbirds [50]. Generally speaking, the sounds of songbirds are more complex. Songbirds also have a larger repertoire than non-songbirds since they have more sophisticated control of their sound production mechanism [51]. Some species even have two independently vibrating membranes in their vocal organ, which enables the production of two independent sounds simultaneously [52].
Usually songs are longer and include more complex vocalizations than
call sounds. Songs are spontaneous vocalizations while calls are associated with some meaning. In most species singing is limited to males, but
for a few species females also sing and some species even participate in
duets [53]. Female songs tend to be simpler than the songs produced by
males. Most species sing at a certain time of the year, but birds also have a particular daily schedule for singing. The best time for observing bird song is during the spring breeding season. Especially during this time, the male bird song has two main functions: i) to attract females and ii) to
repel rival males entering the territory.
In section 2.1 it was noted that birds vocalize at a higher frequency in urban areas. In addition to this, birds have regional variations in song repertoire and syllable structure [49, 54]. A single individual of the passerine
species can have a larger repertoire than just one type of song and in
some species the repertoire increases with the age of the bird. Larger
repertoires play an important role in mate selection as female birds prefer
males with larger repertoires. Larger repertoires also form a territorial
signal to other individuals.
The hierarchical levels in bird sounds are phrases, syllables and elements [53]. Figure 2.2 illustrates the hierarchical levels of a Thrush
Nightingale (Luscinia Luscinia, LUSLUS) song. A phrase is a series of
syllables that occur in a particular order in the song. Typically, syllables
in the phrase are similar but can also be different.
Figure 2.2. A song from the Thrush Nightingale. The song is composed of four phrases, each containing a different number of syllables. The song also illustrates the wide spectrum of different types of sounds birds can produce.
Syllables are constructed from elements; an element is defined as the smallest separable
unit in the spectrogram. Elements can overlap in time and frequency and
thus the separation of elements can be difficult and ambiguous. Therefore,
the syllable is generally considered the basic building block for automatic identification. Thus, bird sound recognition is typically performed
based on the classification of individual syllables rather than elements. A
series of syllables that occur together in a particular pattern is called a
phrase. Syllables in a phrase are typically similar to each other but they
can also be different as can be seen in the first and the last phrase of the
song in figure 2.2. The number of syllables in the phrase may also vary.
In addition to the sounds produced by the vocalization mechanism, birds
also produce sounds from physical activity, e.g., sounds from wings during flight. Sounds from physical activity can have special roles in bird
communication [55], but they do not always carry a particular meaning
for birds. Nevertheless, nonvocal sounds can be used for bird identification in some cases.
Birds can produce a large and diverse set of different sounds. The characteristics of a simple voiced sound are its fundamental frequency and
possibly, but not always, its harmonics. A typical song may contain components which are purely sinusoidal, harmonic, non-harmonic, broadband
and noisy in structure [52]. Sound is often modulated in amplitude or
frequency or even both together (coupled modulation) [56]. The frequency
range of bird sounds typically lies between 100 Hz and 8 kHz but the variation between different species is large. Figure 2.2 of the thrush nightingale song also illustrates the large spectrum of different sounds that even
a single bird species can produce. The song includes syllables that can be
regarded at least as being harmonic and broadband in nature.
2.3 Outdoor sound recording
The recording equipment depends on the purpose of the recording, environmental conditions at the recording site and the intended length of the
recording. In order to make recordings in nature the equipment must include at least one microphone or a set of microphones and a recorder. Microphones for recorders may consist of internal or external microphones,
the latter being the case when they must be of high quality or in case they
need a separate pre-amplifier. Also, for the case of external microphones,
a separate energy source for the pre-amplifier might be required. Currently, the recorder is in practice always a digital recorder and it should
capture the waveform in an uncompressed format.
Typically, most of the microphones that are designed for audio recordings have a sufficient frequency response for recording birds in nature.
More relevant for outdoor recordings are characteristics such as a microphone’s self-noise level and its directivity pattern. Directional microphones are preferred in active or supervised nature recordings since
sound direction can be localized. Directional microphones are useful for
increasing the signal-to-noise ratio of interesting sounds; they help to obtain good quality recordings of the target when the sound source is
in the direction of the microphone. Parabolic reflectors can be used for
increasing the directional sensitivity even further but they can alter the
original sound wave. Since parabolas amplify the sound pressure level,
the size of the reflector should be selected according to the wavelength of
the sound to be recorded. Longer wavelengths (lower frequency) require
larger reflectors.
Omnidirectional microphones capture sounds coming from all directions
and thus there is no need to point them in any particular direction. This
is a desirable property for passive recordings where the direction of the
interesting sound is unknown or sought-after sounds arrive from
multiple directions. Also, omnidirectional microphones are useful if the
sound source is likely to move which is often the case with birds.
Portable recording devices with internal or external microphones can
be used for long-term (one or two days maximum) recordings without
human presence at the recording site [22].
Figure 2.3. Recording site and recording device for Capercaillie courting display.
An advantage of such devices is that they are lightweight and easy to move from one location to
another; they can still provide much more useful audio data than human observation can. Additionally, passive recordings are not affected by
human presence at the recording site. In figure 2.3 an example of the
setup for recording Capercaillie courting sounds is shown. For long-term
monitoring, the energy requirements of the recording setup as well as
storage capacity become critical issues, especially in remote areas. Autonomous recording units (ARU) have been used to monitor seabirds in
the Aleutian Archipelago [57] and woodpeckers [3] in Florida. ARUs enable even longer-term recordings than conventional portable recorders and also make studies of changes in seasonal behavior possible, as used, e.g., in marine mammal studies [58].
One challenge when recording in nature occurs when multiple audio
source signals exist simultaneously. One option for separating individual source signals is to use a set of microphones [59]. Microphones can be placed in an array, which enables the separation of signals coming from different directions based on the time of arrival at each individual microphone. The configuration of the microphones may vary depending on the purpose of the recording. They can be placed, e.g., in a line with constant spacing [60] or in a cross configuration [61], which provides better coverage of all directions. In all array configurations, the individual microphones are relatively close to each other.
Another possibility for source separation in nature recordings is a spatially dispersed group of microphones or microphone arrays that enable
improved spatial localization of the sound sources at the recording site.
The spatial location of sound sources can be evaluated from the time delays and intensity differences of the signals arriving at each microphone.
Spatially dispersed microphone setups can be especially useful for evaluating populations and interactions between individual birds.
To add to the challenge of outdoor recordings, environmental conditions
must also be taken into account. Directly influencing the recordings are
sources of noise such as wind, rain and sounds from human activity. Wind
effects can be avoided to some extent by using wind shields. The sound of rain, however, has a wide spectrum, making it harder to remove. Especially with ARUs, changes in environmental conditions may cause problems with the recording devices. These changes include temperature and humidity fluctuations, and therefore the equipment needs to be protected against these elements. Undesired influences, such as animals and humans becoming interested in the recording equipment, must also be taken into account. Disturbances from these sources need to be planned for by protecting the recording equipment when necessary.
3. Signal Processing Tools
This section describes the different phases of the sound event classification system. Figure 3.1 presents a block diagram of the main components of the classification system. Preprocessing can include, e.g., noise removal, filtering certain frequency bands from the signal, and emphasizing or attenuating certain frequency bands. The segmentation process extracts individual audio events or a series of events from continuous recordings. Hence, segmentation also performs data reduction
by omitting periods of time in continuous recordings when audio events
are not present. Depending on the application, the ordering of preprocessing and segmentation may also be interchanged. Feature extraction
then transforms audio signal segments into parametric representations.
The objective in feature extraction is to reduce the information rate of the
input signal into a compact set of features that can be used to separate
audio events belonging to different classes. Finally, classification assigns
each parametric representation to one of the target classes, or else gives
a probability for the audio event belonging to one of the target classes.
Figure 3.1. A block diagram of the species classification system.
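To make the data flow of Figure 3.1 concrete, the sketch below chains the stages into one routine. The stage functions (preprocess, segment, extract_features, classify) are placeholders standing in for the methods described in the following sections, not an implementation taken from the publications.

```python
import numpy as np

def classify_recording(x, fs, preprocess, segment, extract_features, classify):
    """Run the chain of Figure 3.1 on one recording: preprocessing,
    segmentation into events, feature extraction and classification.
    Each stage is passed in as a function; returns (event, label) pairs."""
    x = preprocess(x, fs)                       # e.g. band-pass filtering
    events = segment(x, fs)                     # list of (start, end) in seconds
    labels = []
    for start, end in events:
        event = x[int(start * fs):int(end * fs)]
        features = extract_features(event, fs)  # parametric representation
        labels.append(classify(features))       # class decision (or probability)
    return list(zip(events, labels))
```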
3.1 Segmentation
The aim of audio segmentation is to identify relevant boundaries for the
distinct audio events existing within the input signal; it is often an essential processing phase and can be found in several audio processing applications. Accurate segmentation is also a crucial step in the entire classification system since inaccurate segmentation can increase the noise in
features and lead to misclassifications. Segmentation of bird sounds in
natural audio recordings can encompass two phases depending upon the
input. The first phase is used for long-term continuous recordings that may include long periods devoid of bird sounds. The goal in this
phase is to find regions where birds are present and vocalizing. The goal
in the second phase is to extract individual syllables in songs or a series
of calls (see section 2.2 for the structure of bird sounds). Syllables are
generally considered to be the basic unit used in classification.
Segmentation is always based on some change in the characteristics of the audio signal, and it can be carried out in the temporal or time-frequency
domain. Segmentation is typically performed after the preprocessing stage
and, depending on the application, it can be carried out by focusing only
on a certain frequency band [13].
3.1.1 Frame based segmentation
Syllables in songs of single bird individuals are often temporally distinct;
therefore, segmentation by the temporal characteristics of the signal is justified. Segmentation methods based on frame energy are popular in audio event detection; they are straightforward, computationally light and easy to implement. Segmentation typically starts by dividing the signal into temporally overlapping frames. The energy level is calculated for each frame, and when the energy surpasses a threshold level, the frame is considered to be part of a sound event. In the next phase, adjacent frames are connected together while too-short candidates are removed. Since the background noise level changes in audio recordings, it is common to use an adaptive threshold for syllable detection.
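As an illustration of this kind of frame-energy segmentation, the sketch below uses a noise-floor percentile plus a fixed offset as the adaptive threshold; the frame length, hop size, offset and minimum duration are illustrative values rather than the settings used in the publications.

```python
import numpy as np

def energy_segments(x, fs, frame_len=0.01, hop=0.005, offset_db=10.0, min_dur=0.02):
    """Detect sound events as runs of frames whose log-energy exceeds an
    adaptive threshold (noise-floor estimate plus an offset). Returns a
    list of (start_time, end_time) tuples in seconds."""
    n, h = int(frame_len * fs), int(hop * fs)
    frames = [x[i:i + n] for i in range(0, len(x) - n, h)]
    # Log-energy of each frame (small constant avoids log of zero).
    e = np.array([10.0 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames])
    # Adaptive threshold: a low percentile tracks the background noise level.
    thr = np.percentile(e, 20) + offset_db
    active = e > thr
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            t0, t1 = start * hop, i * hop
            if t1 - t0 >= min_dur:            # drop too-short candidates
                segments.append((t0, t1))
            start = None
    if start is not None:
        segments.append((start * hop, len(active) * hop))
    return segments
```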
However, energy based segmentation methods are often sensitive to the
noise level even when adaptive noise level estimation is used. Another
problem that surfaces with energy based methods is that it is difficult to
distinguish elements with very different energy levels and signal-to-noise
ratios. To avoid the deficiencies of energy based segmentation methods it
is possible to calculate some other characteristic measure for the frames,
e.g., their spectral characteristics [44].
In Publication IV, a mutual information (MI) measure of time domain
structures was used for segmentation. The mutual information measure of a frame is calculated as

I(\pi_i, \pi_j) = p(\pi_i, \pi_j) \log \frac{p(\pi_i, \pi_j)}{p(\pi_i)\, p(\pi_j)}    (3.1)

where p(\cdot) are the distributions of the different rankings of five signal values obtained by a moving analysis window inside the frame under study (this is further discussed in Section 3.2.4). Mutual information measures the structural richness of the signal and has a higher value for a signal with some structure than for background noise alone.
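A minimal sketch of how such a frame statistic could be estimated is given below; the pattern length, the lag between pattern pairs and the summation over all observed pattern pairs are assumptions made for illustration and do not reproduce the exact procedure of Publication IV.

```python
import numpy as np
from math import factorial
from itertools import permutations

def ordinal_patterns(frame, m=5):
    """Map each length-m window to the ranking (permutation) of its samples."""
    codes = {p: k for k, p in enumerate(permutations(range(m)))}
    return np.array([codes[tuple(np.argsort(frame[i:i + m]))]
                     for i in range(len(frame) - m + 1)])

def frame_mutual_information(frame, m=5, lag=1):
    """Mutual information between the ordinal patterns pi_i of a frame and
    the patterns pi_j occurring `lag` steps later, estimated from counts."""
    p = ordinal_patterns(frame, m)
    a, b = p[:-lag], p[lag:]
    n = factorial(m)
    joint = np.zeros((n, n))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
```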
3.2 Parametric representations
This part introduces the different parametric representations of the audio
signals. Features are divided into spectral features, sinusoidal features,
low-level signal parameters and time-domain structural features. Different representations have advantages and disadvantages depending on
the sound type they are trying to represent. Even though the production
mechanism of specific sounds across species is quite well known, most of
the features do not directly rely on the production mechanism of the bird
sounds but rather try to characterize the signals so that discrimination
between classes is maximized. The number of possible different features
that could be extracted from a signal at a specific time is enormous and
hence it is practical to select only the most efficient features when possible. However, if the underlying sound type and suitable features are not
known beforehand it is common to calculate many features and reduce
their number later on by discarding some of them according to their discriminative power or by transforming the features to a lower dimension.
Feature dimension reduction by linear discriminant analysis [62, p. 155]
was studied in Publication I.
Many features are calculated on a frame basis where the feature vector is calculated for each frame separately. The feature extraction phase
thus produces a sequence of feature vectors. The processing of feature
vectors in classifiers varies. The sequence of feature vectors can be used
as the input of the classifier, as in Publication V, or the classifier input may also be the average of the feature vectors, as in Publication I, Publication III and Publication VI. When features are obtained for the
entire extracted segment, a single feature vector is used as the input of
the classifier.
3.2.1 Spectral features
The computation of frequency or time-frequency domain features begins with the Fourier transformation of the signal. These features measure the frequency content of the signals. The Fourier transform can be calculated from the entire signal sample or on a frame-by-frame basis, providing time-varying characteristics of the input signal.
Mel-frequency cepstral coefficients (MFCC) [63] are a widely used feature representation in different audio recognition systems, and they are popular in the context of bird identification as well. The construction of MFCC parameters begins with the calculation of the power spectrum, which is then transformed onto the mel-frequency scale using a filter bank of triangular filters. The dynamic range of the spectrum is then reduced by taking the logarithm of each channel. Then the ith MFCC coefficient is calculated by

\mathrm{MFCC}_i = \sum_{k=1}^{K} X_k \cos\left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right]    (3.2)

where X_k is the logarithmic energy of the kth mel-scale channel and K is the total number of bands. The discrete cosine transform (DCT) decorrelates the features and reduces the dimensionality of the feature vector. In addition to static MFCC features, it is common to incorporate the time derivatives and second-order derivatives into the feature representation.
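A compact sketch of this computation for a single windowed frame is shown below; the construction of the triangular filter bank is simplified (uniform spacing on the mel scale) and the numbers of channels and coefficients are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_mels=26, n_coeffs=12):
    """MFCCs of one windowed frame, following Eq. (3.2)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    # Triangular filter bank spaced uniformly on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2))
    X = np.zeros(n_mels)
    for k in range(n_mels):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)
        X[k] = np.log(np.sum(spectrum * np.minimum(up, down)) + 1e-12)
    # DCT as in Eq. (3.2): MFCC_i = sum_k X_k cos(i (k - 1/2) pi / K).
    k = np.arange(1, n_mels + 1)
    return np.array([np.sum(X * np.cos(i * (k - 0.5) * np.pi / n_mels))
                     for i in range(1, n_coeffs + 1)])
```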
3.2.2 Sinusoidal features
A large number of bird sounds are tonal or composed of a small number of
sinusoidal components [64]. Syllables of tonal and harmonic bird sounds
can be efficiently modeled by a single or a small number of time-varying
amplitude and frequency trajectories consisting of sinusoidal components
[28]. Sinusoidal modeling starts by finding trajectories of the sinusoidal
components in the short time spectrum of the signal. These components
are found by searching for peaks in the power spectrum of the frames.
Furthermore, the dynamic structure of tonal bird sounds can be incorporated into the model by evaluating the delta features of the trajectories [30].
Originally, sinusoidal modeling [65] was developed for the analysis and synthesis of speech. In sinusoidal modeling, the underlying signal is represented with a small number of sinusoidal components. For classification
of sound events it is adequate to represent sounds with one sinusoidal
component or for harmonic sounds with one sinusoidal component per
harmonic component as in [28] and Publication II.
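As a rough sketch of this idea, the function below tracks only the single strongest spectral peak per frame and its frame-to-frame delta; full sinusoidal modeling with multiple amplitude and frequency trajectories, as in [28] and Publication II, is more involved.

```python
import numpy as np

def frequency_trajectory(x, fs, frame_len=512, hop=256):
    """Track the strongest sinusoidal component: for each frame, take the
    frequency of the largest power-spectrum peak. Returns the frequency
    trajectory and its frame-to-frame delta."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    track = []
    for i in range(0, len(x) - frame_len, hop):
        spectrum = np.abs(np.fft.rfft(x[i:i + frame_len] * window)) ** 2
        track.append(freqs[np.argmax(spectrum)])
    track = np.array(track)
    delta = np.diff(track, prepend=track[0])   # local dynamics of the track
    return track, delta
```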
3.2.3 Low-level signal features
In many applications in the field of audio signal processing, the specific
signal model is unknown and the spectral characteristics may vary.
This is typical especially in the field of animal and natural sounds. In
these applications it is common to use descriptive measures of the signals
to parameterize sounds [15]. Low-level signal parameters include temporal and spectral domain features and features that are calculated on a
frame basis but also over the entire signal segment.
Equations for low-level feature calculation are presented in Table
3.1. Spectral features include spectral centroid, signal bandwidth, spectral rolloff frequency, spectral flux, spectral flatness and the frequency
range of the signal. Spectral centroid (SC) and signal bandwidth (BW)
describe the center point of the spectrum and width of the frequency band
around the SC. Spectral rolloff frequency (SRF) describes the slope of the
skewness of the spectral shape whereas spectral flux (SF) measures spectral differences between adjacent frames. Finally, spectral flatness (SFM)
characterizes the distribution of the energy spread over the frequency
band of the signal.
Temporal domain features include zero-crossing rate (ZCR), signal energy and the duration of the underlying signal. ZCR measures the rate of sign changes of the signal. It correlates with the SC although ZCR is calculated
in the temporal domain. Signal energy (EN) measures the log-energy content of the signal. The temporal duration of the signal, together with the
frequency range, defines the temporal and spectral boundaries of the underlying signal.
Table 3.1. Formulas for the calculation of low-level features. In the equations, x(n) and X(n) denote the time signal and the power spectrum of the signal, respectively, while M denotes the maximum frequency bin. T_h is a threshold value between 0 and 1; a commonly used value is T_h = 0.95.

Spectral centroid:           SC = \sum_{n=0}^{M} n |X(n)|^2 / \sum_{n=0}^{M} |X(n)|^2
Signal bandwidth:            BW = ( \sum_{n=0}^{M} (n - SC)^2 |X(n)|^2 / \sum_{n=0}^{M} |X(n)|^2 )^{1/2}
Spectral rolloff frequency:  SRF = max{ h : \sum_{n=0}^{h} |X(n)|^2 < T_h \sum_{n=0}^{M} |X(n)|^2 }
Spectral flux:               SF = \sum_{n=0}^{M} ( |X_i(n)| - |X_{i+1}(n)| )^2
Spectral flatness:           SFM = 10 \log_{10} [ ( \prod_{i=0}^{M} |X_i| )^{1/M} / ( (1/M) \sum_{i=0}^{M} |X_i| ) ]
Zero-crossing rate:          ZCR = \sum_{n=0}^{M-1} | sgn(x(n)) - sgn(x(n+1)) |
Short time energy:           EN = 20 \log_{10} |x(n)|

Features derived from the modulation spectrum were also used as low-level signal descriptors. The modulation spectrum describes the modulation frequency and modulation index of the signal. First, the envelope of the signal is calculated as

E_H(n) = | x(n) + \imath H(x(n)) |    (3.3)
where x(n) + \imath H(x(n)) is the analytic signal and H denotes the Hilbert transform of the signal [66]. The modulation spectrum was obtained by calculating the Fourier transform of the envelope signal. The frequency and magnitude of the maximum peak of the modulation spectrum were used
as the final features. These correspond to the modulation frequency and
modulation index of the underlying signal. Low-level descriptive features
have been utilized in Publication I, Publication II and Publication III.
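The sketch below computes a subset of these descriptors for one signal segment, following the definitions of Table 3.1 and Eq. (3.3); frame-based computation is omitted, and the rolloff threshold is the commonly used value mentioned above.

```python
import numpy as np
from scipy.signal import hilbert

def low_level_features(x, fs, rolloff_th=0.95):
    """A subset of the low-level descriptors of Table 3.1 plus the
    modulation-spectrum features, computed over one signal segment."""
    X = np.abs(np.fft.rfft(x)) ** 2                       # power spectrum
    n = np.arange(len(X))
    sc = np.sum(n * X) / np.sum(X)                        # spectral centroid (in bins)
    bw = np.sqrt(np.sum((n - sc) ** 2 * X) / np.sum(X))   # bandwidth around SC
    cumulative = np.cumsum(X)
    srf = np.searchsorted(cumulative, rolloff_th * cumulative[-1])  # rolloff bin
    zcr = np.sum(np.abs(np.diff(np.sign(x))))             # zero-crossing measure (Table 3.1)
    # Modulation spectrum: Fourier transform of the Hilbert envelope, Eq. (3.3).
    envelope = np.abs(hilbert(x))
    mod = np.abs(np.fft.rfft(envelope - np.mean(envelope)))
    mod_freqs = np.fft.rfftfreq(len(envelope), 1.0 / fs)
    peak = np.argmax(mod[1:]) + 1                         # skip the DC bin
    return {"SC": sc, "BW": bw, "SRF": srf, "ZCR": zcr,
            "mod_freq": mod_freqs[peak], "mod_index": mod[peak]}
```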
3.2.4 Time domain structures
Time domain feature representations have not gained much interest since
most of the feature representations operate in the frequency or time-frequency domain. Recently, however, more attention has focused on signal analysis and pattern recognition in time series based on temporal
structures. Most of the applications utilizing temporal patterns concern
medical signals that are generally characterized by a non-stationary structure [67, 68, 69, 70]. These types of signals are often difficult to identify,
e.g., when using spectral methods.
The main advantage of using temporal features is a potentially shorter
analysis window when compared to spectral methods. The frequency resolution in spectral methods depends on the length of the analysis window.
The underlying signal is assumed to be sufficiently stationary within the
analysis window; however, this assumption does not hold for many environmental signals [71]. Also, some bird sounds have a highly dynamic
structure, and local temporal dynamics have been found to be important for the
identification of these sounds [34, 16].
Time-domain structural features describe the waveshapes in the underlying signal [32]. The number of possible waveforms in the signals is enormous, and therefore model waveshapes are stored in a codebook. The signal waveform is mapped to one of the wave shapes in the codebook, while each waveshape corresponds to a unique codeword. The drawback of this
method is that the codebook must be developed separately for each application.
A different approach to avoid codebook creation is to use the rankings of
the signal amplitude values as a description of the waveform. In an analysis window consisting of N samples the amplitude values can be ordered
in N -factorial (N !) different ways. The analysis window can therefore be
coded with one of the N ! codes. In this approach the number of possible
codes grows enormously as the size of the analysis window N increases.
However, in real-world signals only a small number of all possible rankings exist or are required for accurate analysis. Also, very short time windows have proven to be sufficient for representing the statistical structure
of signals. Another desirable property, when using this approach, is the
robustness to some types of noise. Also, the method offers an invariance
to the scaling level of the underlying signal because it is based on the mutual relations between the samples in the analysis window [72]. As long
as noise does not alter the ranking of the signal values, the corresponding code does not change. Furthermore, since feature representations are
based on statistics over the entire signal, the effect of a single analysis
window to the statistic is low. Time domain structural features have been
used for the feature representation of sound events in Publication V and
Publication VI.
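The following sketch illustrates the rank-based coding idea for a single analysis window. The window length, the Lehmer-style encoding of the permutation into an integer code, and the function name are illustrative assumptions; any bijective mapping from rank patterns to the N! codes would serve the same purpose.

```python
# Sketch of rank-order (permutation) coding of a short analysis window.
# Window length and encoding scheme are illustrative assumptions.
import numpy as np
from math import factorial

def permutation_code(window):
    """Map an analysis window to the lexicographic index of its rank pattern."""
    ranks = np.argsort(np.argsort(window))    # rank of each sample, 0..N-1
    code = 0
    remaining = list(range(len(ranks)))
    for r in ranks:                           # Lehmer-code style encoding
        code += remaining.index(r) * factorial(len(remaining) - 1)
        remaining.remove(r)
    return code                               # one of N! possible codes

# The window below and a scaled copy of it receive the same code,
# illustrating the amplitude-scale invariance discussed above.
w = np.array([0.1, -0.4, 0.0, 0.7, -0.9])
print(permutation_code(w), permutation_code(3.0 * w))
```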
3.3
Classification
The role of a classifier is to decide which class a test pattern belongs to, based on a training set. This is accomplished by finding the best match between a test sample and a class model or target pattern. In classification tasks, the feature vector space is divided into regions that correspond to different classes. In the classification of bird syllables, each class (representing a bird species) may be composed of different types of feature representations, each corresponding to a different type of syllable of that species. In this case some models may have the same class label.
3.3.1
k-nearest neighbour classification
The k-nearest neighbor (kNN) algorithm is a nonlinear classifier that decides on a class label by using the k nearest neighbors of a test vector among the vectors in the training data set. In nearest neighbor classification (k = 1), the class that exhibits the minimum distance to the test vector is chosen. In the kNN method, the test vector is assigned to the class which is most often represented among the k nearest neighbors [62, p. 45]. The kNN method does not need any tuning of classifier parameters, since the feature vectors in the training data set themselves comprise the classifier. However, the method may exhibit a bias when different classes have different numbers of samples or are unequally dispersed in the feature space.
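A minimal sketch of the kNN decision rule, assuming Euclidean distances and illustrative training data; it is not the classifier implementation used in the publications.

```python
# Minimal sketch of the kNN decision rule with Euclidean distances.
# Training data, labels, and k are illustrative assumptions.
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Assign x to the class most represented among its k nearest neighbours."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training vector
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
train_y = np.array(["species_A", "species_A", "species_B", "species_B"])
print(knn_classify(np.array([0.1, 0.2]), train_X, train_y, k=3))  # -> "species_A"
```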
Different metrics can be used to measure the distance between a test
sample and feature vectors in the training data. The Euclidean distance
measure is a commonly used metric and it is calculated for two feature
vectors x and y as
d_E(x, y) = \|x - y\| = \sqrt{(x - y)^T (x - y)}    (3.4)
The major drawback with the Euclidean distance appears with correlated features and features with different dynamic ranges; both typically
reduce recognition accuracy. This problem may be resolved by using the
Mahalanobis distance measure when calculating the distance between
feature vectors. The Mahalanobis distance between vectors is calculated
as
d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}    (3.5)
where Σ is the covariance matrix of the training data.
When the feature vector comprises a distribution of some aspect of the signal, the Kullback-Leibler (K-L) divergence is a suitable measure for evaluating the distance between two feature vectors. The K-L divergence is calculated as
d_{KL}(x, y) = \sum p(x) \log \frac{p(x)}{p(y)}    (3.6)
The K-L divergence was used in Publication V.
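The following sketch collects the three distance measures of Eqs. (3.4)-(3.6); estimating the covariance matrix from the training data and normalizing the histograms before the K-L divergence are assumptions of this illustration.

```python
# Sketches of the distance measures in Eqs. (3.4)-(3.6). Estimating the
# covariance matrix from the training data is an assumption here.
import numpy as np

def euclidean(x, y):
    d = x - y
    return np.sqrt(d @ d)                                   # Eq. (3.4)

def mahalanobis(x, y, cov):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)              # Eq. (3.5)

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence between two discrete distributions (e.g. normalized histograms)."""
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))        # Eq. (3.6)

train_X = np.random.randn(100, 4)                           # illustrative training features
cov = np.cov(train_X, rowvar=False)
x, y = train_X[0], train_X[1]
print(euclidean(x, y), mahalanobis(x, y, cov))
print(kl_divergence(np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])))
```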
3.3.2
Support vector machines
The support vector machine (SVM) is a classification technique that first nonlinearly transforms the feature representations into a higher-dimensional feature space, where the classification problem can be solved with a linear hyperplane. The SVM utilizes the feature vectors near the decision boundary in model construction, making the classifier both locally and globally optimal [73, p. 325]. The mapping to the higher-dimensional feature space is conducted implicitly by kernel functions, and thus the calculations are not performed directly in the higher-dimensional space.
The SVM classifier is optimal in the sense that it maximizes the margin
between classes. The SVM solution for a classification problem is denoted
by
f(x) = \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x) + b    (3.7)
where αi and yi are the weights and class labels of the support vectors (actual observations in the training set), xi is the corresponding feature vector, b is a constant bias term, and x is the test feature vector. The term nsv is the number of support vectors in the classifier, and K(xi, x) is the kernel function that enables the calculations in the higher-dimensional space.
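A sketch of evaluating the decision function of Eq. (3.7) with an RBF kernel; the support vectors, weights, labels, bias and kernel width are made-up values rather than a trained model.

```python
# Sketch of evaluating the SVM decision function of Eq. (3.7) with an RBF kernel.
# Support vectors, weights alpha_i, labels y_i, bias b and kernel width gamma
# are illustrative values, not trained model parameters.
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma=1.0):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b  (Eq. 3.7)."""
    f = sum(a * y * rbf_kernel(sv, x, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return f + b

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
alphas = np.array([0.7, 0.5, 0.2])
labels = np.array([-1, +1, +1])
b = 0.1
x = np.array([0.8, 0.9])
print("class:", +1 if svm_decision(x, support_vectors, alphas, labels, b) >= 0 else -1)
```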
Since SVM classifiers are constructed directly from feature vectors, prior knowledge of the structure of the classifier is not required; the structure is formed during the training of the model. SVMs have good generalization properties both locally and globally [74, p. 52]. SVMs maximize the margin between classes and are suitable for variously shaped classes as well as for classes composed of many clusters in the feature space. The SVM classifier also alleviates the problem of differently sized data sets per class, since only training samples close to the decision boundary are used to form the classifier and the weight αi of the other samples is set to zero.
Fundamentally, SVMs are binary classifiers, where y ∈ {−1, +1} in equation (3.7); this is often inadequate, since real classification problems are mostly multiclass in nature. The two main approaches to extend SVMs to multiclass cases are methods that use one decision function and methods that combine multiple binary classifiers [75]. In the latter approach, one strategy is one-against-rest, where each class is trained against the rest of the classes; the test vector is then assigned to the class that has the largest margin to the decision boundary. Another strategy constructs one-against-one classifiers for all combinations of classes, leading to a decision tree structure for the entire classifier. A one-against-one classification strategy was used in Publication III.
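The sketch below illustrates the one-against-one idea by training a binary SVM for every pair of classes and classifying by majority vote; scikit-learn, the synthetic data and the voting scheme are assumptions of this example, not the decision tree topology used in Publication III.

```python
# Sketch of a one-against-one multiclass strategy built from binary SVMs.
# scikit-learn is assumed to be available; data and labels are placeholders.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
y = np.repeat(["sp_A", "sp_B", "sp_C"], 20)

# Train one binary classifier per pair of classes and classify by majority vote.
classifiers = {}
for a, b in combinations(np.unique(y), 2):
    mask = (y == a) | (y == b)
    classifiers[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])

def classify(x):
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers.values()]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

print(classify(np.array([1.1, 0.9])))   # expected to vote for "sp_B"
```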
4. Summary and Main Results of the
Publications
This chapter summarizes the main results of the thesis. The identification of bird species from syllables was studied in Publications I, II, III and VI. Table 4.1 summarizes the number of species and the methods used in these papers, as well as the average classification score of bird species classification by syllables. Only the results obtained in Publication I and for species set 1 of Publication III are fully comparable, due to identical data sets.
Publication I: 6 species; features: low-level, MFCC; classifiers: kNN, kNN + dM(x, y); definitive classification score: 74% (MFCC).
Publication II: 14 species; features: low-level, MFCC, sinusoidal; classifiers: GMM, HMM, DTW; definitive classification score: 52% (MFCC, Δ, ΔΔ).
Publication III: set 1: 6 species, set 2: 8 species; features: low-level, MFCC; classifiers: SVM; definitive classification score: set 1: 91% (MFCC, low-level), set 2: 98% (MFCC, low-level).
Publication VI: set 1: 6 species, set 2: 10 species; features: PPF, MFCC; classifiers: kNN; definitive classification score: set 1: 65% (PPF), set 2: 63% (PPF).
Table 4.1. Summary of the methods and results. The definitive classification score indicates the best overall classification result for the species set in question, together with the feature set that was used to obtain it.
4.1
Publication I: Parametrization of Inharmonic Bird Sounds for
Automatic Recognition
Publication I focuses on the classification of bird species that produce mostly inharmonic sounds. Such bird species include, for example, the Common Raven (Corvus corax, CORRAX) and the Hooded Crow (Corvus corone cornix, CORNIC), whose vocalizations are considered inharmonic for more than 95% of the time. Two different parametric representations were studied in this work.
A total of 19 low-level descriptive parameters provide the descriptive characteristics of the bird vocalization, while a set of MFCC features provides a robust representation of the spectrum of the sounds. Linear discriminant analysis (LDA) was applied to evaluate the importance of single descriptive features and optionally used to reduce the length of the feature vector.
Recognition results using the kNN classifier exhibit better species identification accuracy with MFCC features, especially when the Euclidean distance measure was used. The difference in accuracy was much smaller with the Mahalanobis distance measure, presumably because the Mahalanobis distance decorrelates the features, which increases the accuracy with the descriptive features. LDA analysis shows that individual features have different discriminative power, and LDA can be used for dimensionality reduction of the feature vector. However, features with low discriminative power did not impair classification accuracy when they were included in the feature vectors.
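As a hedged illustration of such an LDA-based analysis, the sketch below fits scikit-learn's LDA to synthetic feature vectors and projects them to at most (number of classes − 1) dimensions; the data, class count and feature dimensionality are placeholders, not the material of Publication I.

```python
# Sketch of using LDA for dimensionality reduction of a feature vector,
# in the spirit of the LDA analysis described above. Data is synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n_classes, n_per_class, n_features = 6, 30, 19          # e.g. 19 low-level descriptors
X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, n_features))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

lda = LinearDiscriminantAnalysis(n_components=n_classes - 1).fit(X, y)
X_reduced = lda.transform(X)                            # at most (n_classes - 1) dimensions
print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", lda.explained_variance_ratio_)
```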
4.2
Publication II: Parametric Representations of Bird Sounds for
Automatic Species Recognition
In Publication II bird species classification was studied for 14 passerine birds that produce mostly voiced (tonal and harmonic) sounds. In addition to the features of Publication I, classification was also tested with sinusoidal models of the syllables. Four different classes of syllables were used, characterized by: i) pure tonal sounds, and harmonic sounds whose strongest sinusoidal component was ii) the fundamental frequency, iii) the first harmonic of the fundamental frequency, or iv) the second harmonic of the fundamental frequency. In addition to the MFCC features of Publication I, the first- and second-order delta parameters were also used. GMM and HMM classifiers were used for the classification of syllables and songs. Additionally, dynamic time warping (DTW) was tested in syllable-based recognition in order to achieve a better fit between syllables of different lengths.
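A minimal DTW sketch for comparing two frame-level feature sequences of different lengths is given below; the frame counts, feature dimensionality and local distance are illustrative assumptions, not the exact configuration of Publication II.

```python
# Minimal dynamic time warping (DTW) sketch for comparing two frame-level
# feature sequences of different lengths.
import numpy as np

def dtw_distance(A, B):
    """A, B: (num_frames, num_features) arrays, e.g. MFCC frames of two syllables."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])     # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

syll_a = np.random.randn(40, 12)    # 40 frames, 12 MFCCs (illustrative)
syll_b = np.random.randn(55, 12)    # a longer syllable
print(dtw_distance(syll_a, syll_b))
```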
Recognition results show that the best average accuracy was obtained when MFCC features were used together with low-level descriptive features. DTW improved recognition accuracy compared to the case where features were obtained on a frame basis. However, there are species-specific variations in the results, since different features capture different aspects of the syllables. For example, the sinusoidal model produced the best classification accuracy for species that generate tonal sounds. Song-level results exhibit considerably better performance than classification based on single syllables. This suggests that the dynamics of the syllables in the song are also important for the identification of bird species.
4.3
Publication III: Bird Species Recognition Using Support Vector
Machines.
Publication III applies a support vector machine (SVM) classifier to the identification of bird syllables for two different sets of bird species. The first set (six species) included the same material that was used in Publication I, and the second set (eight species) contained the same data as in [26]. Bird syllables were represented using MFCC features, including the delta and delta-delta parameters, and low-level descriptive features. In this work a decision tree topology of binary SVM classifiers was used to achieve a multiclass classifier.
Recognition results show improved accuracy for both sets of bird species when compared to the reference methods. In particular, the results for the first set exhibit considerably better accuracy for syllable classification. In addition, for both sets of species the best accuracy was achieved when all available features were used for classification. This result suggests that the training of an SVM classifier scales the features so that even features with low discriminative power can improve the recognition accuracy.
4.4
Publication IV: Monitoring of Capercaillie Courting Display
Publication IV presents methods for the detection and classification of Capercaillie breeding sounds from continuous wildlife recordings. Three different types of vocalized sounds that Capercaillies produce during the courting season are presented in Figure 4.1. The recording devices were set up at the courting ground before courting started and collected after courting had ended; the recordings were therefore performed without human disturbance at the recording site. Due to the size of the courting site, birds were located at different distances from the recorder and the signal-to-noise level varied considerably. Segmentation of the sound events was performed based on a mutual information measure of the distribution of structural patterns in short time windows. Figure 4.2 depicts an example of the segmentation in a very low signal-to-noise-ratio situation.
Figure 4.1. Three different types of Capercaillie sounds during the courting season.
Figure 4.2. Segmentation of a Capercaillie sound event in the case of a low signal-to-noise ratio. The duration of the sound event is approximately 20 ms. The detected sounds are similar to those in the middle panel of Figure 4.1.
The mutual information measure also captures sound events other than Capercaillie breeding sounds. The segmented sound events were finally detected by utilizing MFCC features, achieving 93% accuracy for correctly detected sound events and a 9% false acceptance rate.
4.5
Publication V: Classification of Audio Events Using
Permutation Transformation
Publication V introduces novel time-domain features for representing audio events. The method is based on the statistics of short-term ordinal structures existing in the waveform. The method first transforms the sample values in a short-time window into rank numbers according to their amplitudes. Each permutation of rank numbers is then represented with a unique code. Finally, the frequencies of code-pair occurrences (codes of adjacent windows) are accumulated to construct a permutation pair frequency (PPF) matrix that acts as the feature representation of the sound segment. Figure 4.3 shows an example of a temporal pattern of an arbitrary signal.
Figure 4.3. Example of a temporal pattern (continuous black line) of five sample values. The corresponding rank numbers for these amplitude values are (4 2 3 5 1).
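The sketch below accumulates a PPF matrix from the permutation codes of adjacent analysis windows, reusing the rank-coding idea of Section 3.2.4. A window of five samples, as in Figure 4.3, gives 5! = 120 possible codes and hence a 120 x 120 matrix, which matches the summation limits in the errata; the hop size, normalization, and handling of window boundaries are assumptions of this illustration.

```python
# Sketch of accumulating a permutation pair frequency (PPF) matrix from the
# codes of adjacent analysis windows. Window length 5 gives 5! = 120 possible
# codes; hop size and normalization are illustrative assumptions.
import numpy as np
from math import factorial

def permutation_code(window):
    ranks = np.argsort(np.argsort(window))
    code, remaining = 0, list(range(len(ranks)))
    for r in ranks:
        code += remaining.index(r) * factorial(len(remaining) - 1)
        remaining.remove(r)
    return code

def ppf_matrix(x, win=5, hop=5):
    codes = [permutation_code(x[i:i + win]) for i in range(0, len(x) - win + 1, hop)]
    n_codes = factorial(win)
    ppf = np.zeros((n_codes, n_codes))
    for c1, c2 in zip(codes[:-1], codes[1:]):   # count code pairs of adjacent windows
        ppf[c1, c2] += 1
    return ppf / max(ppf.sum(), 1)              # normalize to a joint frequency matrix

x = np.random.randn(4000)
print(ppf_matrix(x).shape)                      # (120, 120)
```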
PPF matrices were tested with two different classification tasks. The first task was to separate four different randomly generated noise sequences that have equal spectra but exhibit differences in the temporal domain due to the manner in which they were created. The second task was to identify stop consonants in human speech on the basis of the burst part of the consonant. The first task has proven to be difficult for conventional frequency-domain methods due to the almost equivalent spectra. Likewise, stop consonant bursts are difficult to classify with conventional methods due to their short duration and diverse structure. The recognition results for both tasks showed that the proposed method is an efficient feature representation for non-stationary and noisy sounds.
4.6
Publication VI: New Parametric Representations of Bird
Sounds for Automatic Classification
In Publication VI the same feature extraction method that was used in Publication V was applied to the classification of bird species based on their sounds. Classification was tested with two groups of bird species; the first group consisted of bird species that produce mostly noisy and inharmonic sounds. The second group included, in addition to the species of the first group, four passerine birds whose sounds are characterized by tonal components. Classification was tested for single syllables and for entire songs, and the results were compared to those obtained with MFCC features as a baseline method. The kNN classifier with the Euclidean distance measure was applied to the classification of the bird syllables. Recognition results showed significantly better classification rates with the proposed method than with MFCC features.
5. Conclusions and Future Directions
This thesis presents systems and methods for acoustical monitoring in natural environments and for the identification of specific sound events. The main focus is on bird vocalization classification, which encompasses several challenges. First, bird vocalizations are retrieved from recordings that may contain the vocalization of one individual bird or potentially of several birds. Continuous recordings in particular can contain several different noise sources and unwanted sounds, as well as many bird species and individuals vocalizing simultaneously. Furthermore, the extracted multiform bird sounds are classified based on parametric representations of the sounds.
This introductory part of the thesis covers the necessary background information for a bird vocalization identification system. The main attention in this thesis is on different parametric representations of bird sounds and their performance in the classification of different types of bird sounds. Traditionally, sounds with a wide spectrum and a noise-like structure have received less attention than voiced sounds, and hence these sounds have had a significant role in this thesis.
While the technical development of recording equipment and computers has provided improved resources for longer recordings and more computational power for analysing the sounds and training the models, the main challenge in developing an autonomous classification system is still to find enough annotated sound examples of birds. This is a problem especially with rare species as well as with birds that vocalize infrequently or have a very rich spectrum of different vocalizations. It becomes significant especially when the system should identify species from a very large number of possible candidates, which would be the case, e.g., in commercial applications. Recently, more bird sound databases have become available for public use and now provide a sufficient amount of data to form useful corpora. Also, efforts on the digitization of large analogue animal sound archives have increased the amount of data available [13].
Even though the automatic identification of birds and other animals has been an active area of research for at least the last decade, the bioacoustic community still lacks a common, well-established framework for bioacoustical signal classification [59] covering all the stages of the classification system. A wireless sensor network with the potential to support a varying number of sensor nodes [76, 77] would be ideal for many recording scenarios of bioacoustic signals. Most available research has focused on some predefined problem setting, and the data has been collected specifically for that purpose and is generally not available for others to use. Even now there is no well-established public database available for general purposes. As long as such a database does not exist, comparing different methods will remain difficult. Alternatively, future bioacoustic challenges could provide a suitable platform for comparing different systems [78].
A remarkable potential application for automated acoustical monitoring is the surveillance of certain target species simultaneously in several different locations. Automated monitoring may not entirely substitute for monitoring by human listeners, but it enables monitoring in more locations. Automated monitoring combined with other methods would probably produce more accurate surveillance of migrating birds. Also, surveillance of bird species populations and monitoring of acoustical scenes in certain locations would be suitable for monitoring the environmental conditions in those locations. Automated monitoring would immediately reveal changes in acoustical scenes or bird species populations and potentially enable a fast response to the cause of the change.
The publications in this thesis and other studies in the field have utilized several different parametric representations for different types of sound events. A future goal would be to integrate different feature representations. This would potentially enable a more universal system for the classification of different types of sounds. Also, robust segmentation of bird vocalizations is still an open question and requires more attention in the future. The main challenge in the segmentation of syllables is still the large spectrum of different vocalizations, especially in terms of their temporal and spectral coverage.
Features based on time-domain structures, utilized in Publication IV, Publication V and Publication VI, have already produced interesting results and confirmed the potential of temporal-domain feature representations. A potential area for future research is the continued development of temporal structural features. More investigation is required to determine how this method behaves, especially when multiple sound sources occur simultaneously; a more systematic evaluation of different types of signals and noise conditions is also required.
Appendix
Additional notes and clarifications for the included
publications
The MFCC features were utilized in the classification of bird species in many of the included publications. In all classification experiments the average MFCC feature vector was calculated over the syllable duration.
In Publication IV, Publication V and Publication VI the notation for a single permutation was changed after the introduction of how a single permutation is formed. The correspondence between the notations is πi,j ≡ πnτ(t), satisfying i, j = t. The notation was changed in order to make the sequential equations more explicit.
Publication I
The presented paper handles bird sounds that cannot be regarded as tonal or harmonic. In a preceding study [28], bird sounds were divided into four classes based on their harmonic structure. However, a large number of bird sounds do not fit into these classes. In this paper F0, F1 and F2 refer to the harmonic components; e.g., F0 is the fundamental frequency, and F1 and F2 are the first and second harmonic components, respectively.
Publication IV
This publication concerns the detection of Capercaillie courting sounds in long-term outdoor recordings. Capercaillie courting sounds have a wide spectrum, and in outdoor recordings the signal-to-noise ratio may vary considerably. Thus, the mutual information of short-time structures was used for the segmentation of the sound events. Mutual information measures the structural information of a signal rather than just its energy content. This segmentation method identifies as part of sound events those frames whose mutual information measure exceeds twice the manually assigned threshold for the mutual information of the background noise.
After the segmentation phase, the MFCC features, as explained in Publication IV, were calculated for those frames that were detected as part of a sound event. The Euclidean distances between the MFCC features of the frames to be classified and the models of the Capercaillie sounds were divided by the mutual information (MI) measure before the final classification. Since in the studied case the largest MI values occurred for the instances of Capercaillie, this method naturally improved the classification of the Capercaillie sounds and reduced the misclassification of other sound events. However, one has to remember that this combination of MI values with distance measures works well only in cases where the sound to be classified has a stronger structure (in MI terms) than all the competing sound events.
For the experimental evaluation, a sound sample of 10 minutes of a typical Capercaillie courting display was used. This section included, in addition to Capercaillie sounds, also sounds from other sources, most notably from other birds. However, the amount of other sounds varied during different parts of the recording. When measured manually, 40% of the sound events on average were not produced by Capercaillies. The background noise level remained relatively stable throughout the signal selected for this evaluation. The classification accuracy indicates the rate of correctly detected sounds (hit rate), and the misclassification rate indicates the rate of sounds that were detected but found not to be Capercaillie sounds (false acceptance).
Publication V
In the synthetic noise sequence classification experiments, four different noise sequences were created from a random pattern of length 20 samples and its three variants. These four patterns were different in the temporal domain, but their power spectra were equal. This explains why Fourier analysis cannot classify these signals. A set of signals (200 of each of the four types) of length 4000 samples was created from the four patterns by adding them at random positions. Half of the signals were assigned to the training set and the other half to the test set.
The classification results showed low recognition accuracy with PPF matrices when the number of patterns was low. The reason for this is that when the number of added patterns is low, the signals include a large number of zeros, and due to the handling of equal values in the permutation window, the permutation code sequence is mostly composed of identical codes. In natural signals this phenomenon is extremely unlikely. When the number of patterns grows sufficiently large, the randomness in the signals increases and the temporal structure decreases, because the patterns are added at random positions.
In the stop consonant classification tests, the TIMIT labels were used to indicate the starting and ending points of the burst. Recognition tests were performed iteratively for each consonant in the test set. The classification began with a window length of 19 ms, which was increased by 5 ms on each iteration round until the end of the burst.
Publication VI
The average of the MFCC feature vectors was used as the feature representation for syllables. It may not be entirely comparable to the permutation-based features, because the permutation features are not calculated on a frame basis. MFCC features were chosen because they are widely used and were utilized in preceding studies as well. However, in the identification of the songs the results obtained from the syllable recognition were utilized. The average MFCC feature vector was not calculated for the entire song segment; instead, the results from syllable recognition were integrated by assigning the song to the class that was most often represented among the syllables of the particular song.
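The song-level integration described above amounts to a majority vote over the per-syllable decisions, as in the following sketch.

```python
# Sketch of the song-level decision described above: classify each syllable
# independently and assign the song to the most frequent syllable label.
from collections import Counter

def classify_song(syllable_labels):
    """syllable_labels: list of per-syllable class decisions for one song."""
    return Counter(syllable_labels).most_common(1)[0][0]

print(classify_song(["corvus_corax", "corvus_cornix", "corvus_corax", "corvus_corax"]))
```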
Errata
Publication V
Equation (9) should be written as
d_{KL}(P, Q) = \sum_{i,j=1}^{120} p(P) \log \frac{p(P)}{p(Q)}
The heading of Chapter 4 should be "Stop consonant classification".
References
[1] R. Virkkala and A. Lehikoinen, “Patterns of climate-induced density shifts
of species: poleward shifts faster in northern boreal birds than in southern
birds,” Global Change Biology, vol. 20, no. 10, pp. 2995–3003, 2014.
[2] J. A. Kogan and D. Margoliash, “Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden
Markov models: A comparative study,” The Journal of the Acoustical Society
of America., vol. 103, no. 4, pp. 2185–2196, 1998.
[3] K. A. Swiston and D. J. Mennill, “Comparison of manual and automated
methods for identifying target sounds in audio recordings of pileated, palebilled, and putative ivory-billed woodpeckers,” Journal of Field Ornithology,
vol. 80, no. 1, pp. 42–50, 2009.
[4] A. Farnsworth and R. W. Russell, “Monitoring flight calls of migrating birds
from an oil platform in the northern Gulf of Mexico,” Journal of Field Ornithology, vol. 78, no. 3, pp. 279–289, 2007.
[5] C. Kwan, G. Mei, X. Zhao, Z. Ren, R. Xu, V. Stanford, C. Rochet, J. Aube,
and K.C. Ho, “Bird classification algorithms: theory and experimental results,” in IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP
2004), Montreal, Canada, May 2004.
[6] J. R. Heller and J. D. Pinezich, “Automatic recognition of harmonic bird
sounds using a frequency tract extraction algorithm,” The Journal of the
Acoustical Society of America, vol. 124, no. 3, pp. 1830–1837, 2008.
[7] M. T. Lopes, L. L. Gioppo, T. T. Higushi, C. A. A. Kaestner, C. N. Silla, and
A. L. Koerich, “Automatic bird species identification for large number of
species,” in IEEE International Symposium on Multimedia (ISM), Dec 2011,
pp. 117–122.
[8] K. A. Steen, O. R. Therkildsen, H. Karstoft, and O. Green, “A vocal-based
analytical method for goose behaviour recognition,” Sensors, vol. 12, no. 3,
pp. 3773–3788, 2012.
[9] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund,
T. Sorsa, G. Lorho, and J. Huopaniemi, “Audio-based context recognition,”
IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no.
1, pp. 321–329, 2006.
[10] S. E. Anderson, A. S. Dave, and D. Margoliash, “Template-based automatic
recognition of birdsong syllables from continuous recordings,” The Journal
of the Acoustical Society of America, vol. 100, no. 2, pp. 1209–1219, 1996.
[11] K. Ito, K. Mori, and S. Iwasaki, “Application of dynamic programming
matching to classification of budgerigar contact calls,” The Journal of the
Acoustical Society of America, vol. 100, no. 6, pp. 3947–3956, 1996.
[12] T. S. Brandes, P. Naskrecki, and H. K. Figueroa, “Using image processing to
detect and classify narrow-band cricket and frog calls,” The Journal of the
Acoustical Society of America, vol. 120, no. 5, pp. 2950–2957, 2006.
[13] R. Bardeli, “Similarity search in animal sound databases,” IEEE Transactions on Multimedia, vol. 11, no. 1, pp. 68–76, 2009.
[14] L. Neal, F. Briggs, R. Raich, and X.Z. Fern, “Time-frequency segmentation of
bird song in noisy acoustic environments,” in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 2012–
2015.
[15] F. Briggs, B. Lakshminarayanan, L. Neal, X. Z. Fern, R. Raich, S. J. K.
Hadley, A. S. Hadley, and M. G. Betts, “Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach,” The
Journal of the Acoustical Society of America, vol. 131, no. 6, pp. 4640–4650,
2012.
[16] P. Jančovič, M. Köküer, and M. Russell, “Bird species recognition from field
recordings using hmm-based modelling of frequency tracks,” in IEEE Int.
Conf. Acoustics, Speech, Signal Processing (ICASSP 2014), Sep 2014.
[17] P. Jančovič, M. Köküer, M. Zakeri, and M. Russell, “Unsupervised discovery
of acoustic patterns in bird vocalisations employing dtw and clustering,” in
21st European Signal Processing Conference (EUSIPCO), Sep 2013, pp. 1–5.
[18] M. B. Trawicki, M. T. Johnson, and T. S. Osiejuk, “Automatic song-type
classification and speaker identification of norwegian ortolan bunting (emberiza hortulana) vocalizations,” in IEEE Workshop on Machine Learning
for Signal Processing, Sep 2005, pp. 277–282.
[19] C. Kwan, K. C. Ho, G. Mei, Y. Li, Z. Ren, R. Xu, Y. Zhang, D. Lao, M. Stevenson, and V. Stanford, “An automated acoustic system to monitor and classify
birds,” EURASIP Journal on Applied Signal Processing, vol. 2006, pp. Article ID 96706, 2006.
[20] C. Lee, C. Han, and C. Chuang, “Automatic classification of bird species
from their sounds using two-dimensional cepstral coefficients,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 8, pp. 1541–
1550, 2008.
[21] M. Marcarini, G.A. Williamson, and L. de Sisternes Garcia, “Comparison of
methods for automated recognition of avian nocturnal flight calls,” in IEEE
International Conference on Acoustics, Speech and Signal Processing, 2008
(ICASSP), Mar 2008, pp. 2029–2032.
[22] V. M. Trifa, A. N. G. Kirschel, C. E. Taylor, and E. E. Vallejo, “Automated
species recognition of antbirds in a mexican rainforest using hidden markov
models,” The Journal of the Acoustical Society of America, vol. 123, no. 4,
pp. 2424–2431, 2008.
[23] N. Harte, S. Murphy, D. J. Kelly, and N. M. Marples, “Identifying new bird
species from differences in birdsong,” in 14th Annual Conference of the
International Speech Communication Association (INTERSPEECH), Lyon,
France, Aug 2013.
[24] C.-F. Juang and T.-M. Chen, “Birdsong recognition using prediction-based
recurrent neural fuzzy networks,” Neurocomputing, vol. 71, no. 1-3, pp. 121
– 130, 2007.
[25] W. Chu and D. T. Blumstein, “Noise robust bird song detection using syllable
pattern-based hidden markov models,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 345–348.
[26] A. Selin, J. Turunen, and J. T. Tanttu, “Wavelets in automatic recognition
of bird sounds,” EURASIP Journal on Signal Processing Special Issue on
Multirate Systems and Applications, vol. 2007, no. 1, pp. Article ID 51806,
2007.
[27] W. Chu and A. Alwan, “FBEM: A filter bank EM algorithm for the joint optimization of features and acoustic model parameters in bird call classification,” in IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Mar 2012, pp. 1993–1996.
[28] A. Härmä and P. Somervuo, “Classification of the harmonic structure in
bird vocalization,” in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), May 2004.
[29] Z. Chen and R. C. Maher, “Semi-automatic classification of bird vocalizations using spectral peak tracks,” The Journal of the Acoustical Society of
America, vol. 120, no. 5, pp. 2974–2984, 2006.
[30] P. Jančovič and M. Köküer, “Automatic detection and recognition of tonal
bird sounds in noisy environments,” EURASIP Journal on Advances in
Signal Processing, vol. 2011, no. 1, pp. Article ID 982936, 2011.
[31] C. Lee, S. Hsu, J. Shih, and C.F Chou, “Continuous birdsong recognition
using gaussian mixture modeling of image shape features,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 454 – 464, 2013.
[32] E. D. Chesmore, “Application of time domain signal coding and artificial
neural networks to passive acoustical identification of animals,” Applied
Acoustics, vol. 62, no. 12, pp. 1359 – 1374, 2001.
[33] H. Tyagi, R. M. Hegde, H. A. Murthy, and A. Prabhakar, “Automatic identification of bird calls using spectral ensemble average voiceprints,” in 13th
European Signal Processing Conference (EUSIPCO), Sep 2006.
[34] T. S. Brandes, “Feature vector selection and use with hidden markov models
to identify frequency-modulated bioacoustic signals amidst noise,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, pp.
1173–1180, 2008.
[35] A. L. McIlraith and H. C. Card, “Birdsong recognition using backpropagation and multivariate statistics,” IEEE Trans. Signal Processing, vol. 45,
no. 11, pp. 2740–2748, 1997.
[36] S. A. Selouani, M. Kardouchi, E. Hervet, and D. Roy, “Automatic birdsong recognition based on autoregressive time-delay neural networks,” in
Congress on Computational Intelligence Methods and Applications (ICSC),
Dec 2005.
[37] M. R. W. Dawson, I. Charrier, and C. B. Sturdy, “Using an artificial neural
network to classify black-capped chickadee (poecile atricapillus) call note
types,” The Journal of the Acoustical Society of America, vol. 119, no. 5, pp.
3161–3172, 2006.
[38] P. Somervuo and A. Härmä, “Analyzing bird song syllables on the selforganizing map,” in Workshop on Self-Organizing Maps (WSOM03), 2003.
[39] G. Grigg, A. Taylor, H. Mc Callum, and G. Watson, “Monitoring frog communities: an application of machine learning,” in 8th Innovative Applications
of Artificial Intelligence Conference, Sep 1996, pp. 1564–1569.
[40] J. Placer, C. N. Slobodchikoff, J. Burns, J. Placer, and R. Middleton, “Using
self-organizing maps to recognize acoustic units associated with information content in animal vocalizations,” The Journal of the Acoustical Society
of America, vol. 119, no. 5, pp. 3140–3146, 2006.
[41] E. D. Chesmore and E. Ohya, “Automated identification of field-recorded
songs of four british grasshoppers using bioacoustic signal recognition,”
Bulletin of entomological research, vol. 94, no. 4, pp. 319–330, 2004.
[42] P. J. Clemins and M. T. Johnson, “Generalized perceptual linear prediction
features for animal vocalization analysis,” The Journal of the Acoustical
Society of America, vol. 120, no. 1, pp. 527–534, 2006.
[43] D. K. Mellinger and J. W. Bradbury, “Acoustic measurement of marine mammal sounds in noisy environments,” in International Conference on Underwater Acoustical Measurements: Technologies and Results, Jun 2007, pp.
273–280.
[44] R. Bardeli, D. Wolff, F. Kurth, M. Koch, K.-H. Tauchert, and K.-H. Fromolt,
“Detecting bird sounds in a complex acoustic environment and application
to bioacoustic monitoring,” Pattern Recognition Letters, vol. 31, pp. 1524 –
1534, 2010.
[45] T. F. W. Embleton, “Tutorial on sound propagation outdoors,” The Journal
of the Acoustical Society of America, vol. 100, no. 1, pp. 31–48, 1996.
[46] D. Luther and L. Baptista, “Urban noise and the cultural evolution of bird
songs,” Proceedings of the Royal Society B, vol. 277, no. 1680, pp. 469–473,
2010.
[47] E. Bermudez-Cuamatzin, A. A. Rios-Chelen, D. Gil, and C. M. Garcia, “Experimental evidence for real-time song frequency shift in response to urban
noise in a passerine bird,” Biology Letters, vol. 7, no. 1, pp. 36–38, 2011.
[48] H. Slabbekoorn and A. den Boer-Visser, “Cities change the songs of birds,”
Current Biology, vol. 16, no. 23, pp. 2326–2331, 2006.
[49] J. R. Krebs and D. E. Kroodsma, “Repertoires and geographical variation in
bird song,” Advances in the Study of Behaviour, vol. 11, pp. 143–177, 1980.
[50] G. J. L. Beckers, R. A. Suthers, and C. ten Cate, “Mechanisms of frequency
and amplitude modulation in ring dove song,” The Journal of Experimental
Biology, vol. 206, no. 11, pp. 1833–1843, 2003.
[51] A. S. Gaunt, “A hypothesis concerning the relationship of syringeal structure to vocal abilities,” Auk, vol. 100, no. 4, pp. 853–862, 1983.
[52] S. Nowicki, “Bird acoustics,” in Encyclopedia of Acoustics, M. J. Crocker,
Ed., chapter 150, pp. 1813–1817. John Wiley & Sons, 1997.
[53] C. K. Catchpole and P. J. B. Slater, Bird Song: Biological Themes and Variations, Cambridge University Press, Cambridge, UK, 1995.
[54] T. S. Osiejuk, K. Ratynska, J. P. Cygan, and D. Svein, “Song structure and
repertoire variation in ortolan bunting (Emberiza hortulana l.) from isolated norwegian population,” Annales Zoologici Fennici, vol. 40, pp. 3–16,
2003.
[55] M. Hingee and R. D. Magrath, “Flights of fear: a mechanical wing whistle
sounds the alarm in a flocking bird,” Proceedings of the Royal Society B:
Biological Sciences, vol. 276, no. 1676, pp. 4173–4179, 2009.
[56] J. H. Brackenbury, “Functions of the syrinx and the control of sound production,” in Form and Function in Bird, A. S. King and J. McLelland, Eds.,
chapter 4, pp. 193–220. Academic Press, 1989.
[57] R. T. Buxton and I. L. Jones, “Measuring nocturnal seabird activity and
status using acoustic recording devices: applications for island restoration,”
Journal of Field Ornithology, vol. 83, no. 1, pp. 47–60, 2012.
[58] R. S. Sousa-Lima, T. F. Norris, J. N. Oswald, and D. P. Fernandes, “A review
and inventory of fixed autonomous recorders for passive acoustic monitoring
of marine mammals,” Aquatic Mammals, vol. 39, no. 1, 2013.
[59] D. T. Blumstein, D. J. Mennill, P. Clemins, L. Girod, K. Yao, G. Patricelli,
J. L. Deppe, A. H. Krakauer, C. Clark, K. A. Cortopassi, S. F. Hanser, B. McCowan, A. M. Ali, and A. N. G. Kirschel, “Acoustic monitoring in terrestrial
environments using microphone arrays: applications, technological considerations and prospectus,” Journal of Applied Ecology, vol. 48, no. 3, pp.
758–767, 2011.
[60] D. L. Jones and R. Ratnam, “Blind location and separation of callers in a
natural chorus using a microphone array,” The Journal of the Acoustical
Society of America, vol. 126, no. 2, pp. 895–910, 2009.
[61] K.-H. Frommolt, K.-H. Tauchert, and M. Koch, “Advantages and disadvantages of acoustic monitoring of birds–realistic scenarios for automated
bioacoustic monitoring in a densely populated region,” in Computational
Bioacoustics for Assessing Biodiversity. Proc. of the Internat. Expert Meeting on IT-based Detection of Bioacoustical Patterns. BfN-Skripten. Citeseer,
2008, vol. 234, pp. 83–92.
[62] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press,
1999.
[63] S. Davis and P. Mermelstein, “Comparison of parametric representations
for monosyllabic word recognition in continuously spoken sentences,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp.
357–366, 1980.
[64] N. H. Fletcher, “A class of chaotic bird calls,” The Journal of the Acoustical
Society of America, vol. 108, no. 2, pp. 821–826, 2000.
[65] R. J. McAulay and T. F. Quatieri, “Speech analysis/synthesis based on a
sinusoidal representation,” IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[66] W. M. Hartmann, Signals, Sound, and Sensation, Springer, Woodbury, New
York, 1 edition, 1997.
[67] K. Keller, H. Lauffer, and M. Sinn, “Ordinal analysis of eeg time series,”
in Advanced Methods of Electrophysiological Signal Analysis and Symbol
Grounding, J. Kurths C. Allefeld, P. beim Graben, Ed., chapter 7, pp. 109–
119. Nova Science Publishers, 2008.
[68] G. Ouyang, Z. Ju, and H. Liu, “Mutual information analysis with ordinal
pattern for emg based hand motion recognition,” Intelligent Robotics and
Applications, vol. 7506, pp. 499–506, 2012.
[69] Y. Cao, W-W. Tung, J. B. Gao, V. A. Protopopescu, and L. M. Hively, “Detecting dynamical changes in time series using the permutation entropy,”
Physical Review E, vol. 70, pp. 046217, 2004.
[70] D. Arroyo, P. Chamorro, J. M. Amigó, F. B. Rodríguez, and P Varona, “Event
detection, multimodality and non-stationarity: Ordinal patterns, a tool to
rule them all?,” The European Physical Journal Special Topics, vol. 222, no.
2, pp. 457 – 472, 2013.
[71] B. Ghoraani and S. Krishnan, “Time-frequency matrix feature extraction
and classification of environmental audio signals,” IEEE Transactions on
Speech and Audio Processing, vol. 19, no. 7, pp. 2197–2209, 2011.
[72] K Keller, M Sinn, and J Emonds, “Time series from the ordinal viewpoint,”
Stochastics and Dynamics, vol. 7, no. 2, pp. 247–272, 2007.
[73] C. M. Bishop, Pattern recognition and machine learning, vol. 1, Springer
New York, 2006.
[74] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University
Press, 2000.
[75] F. Schwenker, “Hierarchical support vector machines for multi-class pattern recognition,” in 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000, pp. 561–565.
[76] G. Vaca-Castano, D. Rodriguez, J. Castillo, K. Lu, A. Rios, and F. Bird,
“A framework for bioacoustical species classification in a versatile serviceoriented wireless mesh network,” in 18th European Signal Processing Conference (EUSIPCO), Aug 2010.
[77] T. C. Collier, A. N. G. Kirschel, and C. E. Taylor, “Acoustic localization of
antbirds in a mexican rainforest using a wireless sensor network,” The
Journal of the Acoustical Society of America, vol. 128, no. 1, pp. 182–189,
2010.
[78] I. Potamitis, “Automatic classification of a taxon-rich community recorded
in the wild,” PLoS ONE, vol. 9, no. 5, pp. e96936, 2014.