Audio Fingerprinting
Wes Hatch
MUMT-614
Mar.13, 2003
What is Audio Fingerprinting?
• A small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came
Applications
• Broadcast monitoring
  • playlist generation
  • royalty collection
  • ad verification
• Connected Audio
  • general term for consumer applications
• Other
  • Napster: use of fingerprinting systems to prohibit the transmission of copyrighted materials
  • Finding desired content efficiently in “an overwhelming amount of audio material”
“Benefits”

• Automated search of illegal content on the Internet
  – examines the real audio information rather than just tag information
• For the consumer
  – make the meta-data of songs in a library consistent, allowing for easy organization
  – can guarantee that what is downloaded is actually what it says it is
  – will allow consumers to record signatures of sound and music on small handheld devices
Two principal components
• Compute the fingerprint
• Compare it to a database of previously computed fingerprints
  – A text example: “…in a box. I will not eat them with a fox. I…”
Details to worry about
• Robustness (to noise, distortion)
• Reliability
• Fingerprint size (reduced dimensionality)
• Granularity
• Search speed and scalability
• Computational efficiency
• Resulting features must be informative about the audio content
• Semantic or non-semantic features?
• Hash table or vector representation?
Computing the fingerprint

• Compare to hash functions…?
  – compare computed hash value with that stored in a database
• Drawback
  – need to worry about perceptual similarity and not mathematical similarity
    • PCM audio vs. MP3: both sound alike but are mathematically (i.e. in spectral content) quite different
  – perceptual similarity is not transitive
    • not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
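The drawback above can be made concrete with a small sketch. It assumes a conventional cryptographic hash (SHA-256 from Python's standard library) standing in for "hash functions", and a handful of made-up 16-bit PCM samples; neither comes from the slides. Changing one sample by a single least-significant bit, which would be inaudible, yields a different digest, so an exact-match hash cannot serve as a perceptual fingerprint.

```python
import hashlib
import struct

def pack_pcm(samples):
    """Pack integer samples as little-endian signed 16-bit PCM bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)

# Hypothetical 16-bit samples (illustrative values only).
original = [0, 1200, -3400, 5600, -7800, 9100, -120, 33]

# Same audio with one sample nudged by one least-significant bit.
altered = list(original)
altered[3] += 1

digest_a = hashlib.sha256(pack_pcm(original)).digest()
digest_b = hashlib.sha256(pack_pcm(altered)).digest()

# Count how many of the 256 digest bits differ between the two versions.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(digest_a, digest_b))

print("digests equal:", digest_a == digest_b)
print("differing digest bits:", differing_bits)
```

A fingerprint, by contrast, has to change only slightly when the audio changes only slightly, which is why the later slides compare fingerprints by bit-error rate rather than by equality.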
Techniques (general)
• Any ‘x’ number of seconds may be used to compute the fingerprint
• Audio gets separated into frames
  – Features computed for each frame:
    • Fourier coefficients
    • MFCC, LPC
    • Spectral flatness
    • Sharpness
• “features mapped into a more compact representation by using …HMM, or quantization”
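As a rough illustration of the framing step and of one listed feature, the sketch below splits a signal into overlapping frames and computes spectral flatness per frame. The frame length, hop size, naive DFT, and test signals are all assumptions chosen to keep the example small and dependency-free; they are not parameters from any of the systems discussed.

```python
import cmath
import math

def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames of frame_len samples, advancing by hop samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 1 .. N/2 - 1 (DC and Nyquist skipped)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 + 1e-12)  # small floor avoids log(0)
    return spectrum

def spectral_flatness(frame):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    power = power_spectrum(frame)
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic

# Assumed test signals: a pure tone (peaky spectrum) and an impulse (flat spectrum).
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
impulse = [1.0] + [0.0] * 63

frames = split_into_frames(tone, frame_len=64, hop=32)
tone_flatness = [spectral_flatness(f) for f in frames]
impulse_flatness = spectral_flatness(impulse)

print(len(frames), max(tone_flatness), impulse_flatness)
```

Flatness is close to 0 for the tone and close to 1 for the impulse, which is what makes it usable as a per-frame descriptor of how tonal or noise-like the content is.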
Techniques (Haitsma, Kalker)

• One 32-bit sub-fingerprint every 11.6 ms
  – A block consists of 256 sub-fingerprints
    • Corresponds to a granularity of only 3 seconds
  – Large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time
  – Worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
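The numbers on this slide are related by simple arithmetic, worked through below. The sketch assumes that the 11.6 ms sub-fingerprint interval is the hop between consecutive frames and that the 31/32 overlap is a fraction of the frame length; the variable names are illustrative.

```python
HOP_MS = 11.6        # one sub-fingerprint per hop (from the slide)
OVERLAP = 31 / 32    # fraction of a frame shared with the next frame
BLOCK_SIZE = 256     # sub-fingerprints per block
BITS_PER_SUB = 32    # bits per sub-fingerprint

# Frame length implied by the hop and the overlap: hop = frame * (1 - overlap).
frame_ms = HOP_MS / (1 - OVERLAP)

# Audio spanned by the hops of one block: the "3 seconds" granularity.
block_seconds = BLOCK_SIZE * HOP_MS / 1000

# A query frame boundary is never farther than half a hop from a stored one.
worst_offset_ms = HOP_MS / 2

bits_per_block = BLOCK_SIZE * BITS_PER_SUB

print(f"frame length  ~ {frame_ms:.1f} ms")
print(f"block span    ~ {block_seconds:.2f} s")
print(f"worst offset  = {worst_offset_ms:.1f} ms")
print(f"block size    = {bits_per_block} bits")
```

Under these assumptions each frame is roughly 0.37 s long, which is why consecutive sub-fingerprints change so slowly.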
Techniques (Haitsma, Kalker)

• Data from each frame is sent through a filterbank
  – 33 filters, logarithmically spaced (to correspond roughly to the Bark scale)
    • between 300 and 2000 Hz
  – phase is neglected (perceptual reasons)
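A minimal sketch of how 33 logarithmically spaced bands between 300 and 2000 Hz could be laid out, and how 33 band energies could yield a 32-bit sub-fingerprint. The slides do not state the bit rule; the one used here (a bit is 1 when the energy difference between neighbouring bands increased relative to the previous frame) is an assumption modelled on the published description of this scheme, and the function names are hypothetical.

```python
def band_edges(low_hz=300.0, high_hz=2000.0, bands=33):
    """Return bands + 1 logarithmically spaced edge frequencies in Hz."""
    ratio = (high_hz / low_hz) ** (1.0 / bands)
    return [low_hz * ratio ** i for i in range(bands + 1)]

def sub_fingerprint(prev_energies, curr_energies):
    """Derive one bit per pair of neighbouring bands (assumed rule, see lead-in).

    Bit m is 1 when the energy difference between band m and band m + 1
    is larger in the current frame than in the previous frame.
    """
    bits = 0
    for m in range(len(curr_energies) - 1):
        curr_diff = curr_energies[m] - curr_energies[m + 1]
        prev_diff = prev_energies[m] - prev_energies[m + 1]
        bits = (bits << 1) | (1 if curr_diff - prev_diff > 0 else 0)
    return bits

edges = band_edges()

# Made-up band energies: flat in the previous frame, falling in the current one.
previous = [0.0] * 33
current = [float(33 - m) for m in range(33)]
example = sub_fingerprint(previous, current)

print(len(edges), round(edges[0]), round(edges[-1]))
print(f"{example:032b}")
```

Only the signs of energy differences are kept, so no phase information enters the fingerprint, and 33 bands give exactly 32 bits.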
System overview
Techniques (Burges, Platt)

• Audio is downsampled to 11.025 kHz and split into frames with an overlap of 2
  – A modulated complex lapped transform (MCLT) is then applied to each frame. A 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient
Techniques (Burges, Platt)

• Use prior knowledge to define the form of the feature extractor
• Features computed by a “linear, convolutional” neural network
• Convert signal into a feature vector
  – uses Principal Component Analysis (PCA) to find a set of projections
  – generates a vector of 128 values for every 11.6 ms interval
    • dimensionality-reduction method (i.e. lots of math)
Techniques (Burges, Platt)

• 3 layers of Oriented PCA (OPCA)
  – operates on a frame of 128 values
    • layer 1: generates 10 values for each frame
    • layer 2: takes 42 ‘layer 1 outputs’ and produces 20 values
    • layer 3: takes 40 ‘layer 2 outputs’ and produces 64 values (11K inputs --> 64 outputs)
Searching the Database

• Look for the most similar (not necessarily exact) fingerprint
  – 10,000 5-min. songs → 250 million sub-fingerprints
  – brute force takes in excess of 20 minutes on a very fast PC
    • brute force computes bit-error rate for every possible position in the database
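The brute-force search can be sketched as follows. The database here is random 32-bit values rather than real fingerprints, the query is a stored block with one bit flipped in each sub-fingerprint, and all names are illustrative. The last lines redo the database-size estimate from the slide, assuming one sub-fingerprint every 11.6 ms.

```python
import random

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equal-length blocks of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))

def brute_force_search(database, query):
    """Compare the query against every position; return (best_position, best_ber)."""
    best_pos, best_ber = -1, 1.0
    for pos in range(len(database) - len(query) + 1):
        ber = bit_error_rate(database[pos:pos + len(query)], query)
        if ber < best_ber:
            best_pos, best_ber = pos, ber
    return best_pos, best_ber

# Synthetic stand-in for a fingerprint database (assumption: random values).
rng = random.Random(0)
database = [rng.getrandbits(32) for _ in range(2000)]

# Query: a 256-sub-fingerprint block with one bit flipped in every sub-fingerprint.
TRUE_POS = 700
query = [fp ^ (1 << rng.randrange(32)) for fp in database[TRUE_POS:TRUE_POS + 256]]

position, ber = brute_force_search(database, query)
print(position, ber)

# Size estimate: 10,000 songs of 5 minutes, one sub-fingerprint per 11.6 ms.
estimated_total = 10_000 * 5 * 60 / 0.0116
print(f"{estimated_total:,.0f} sub-fingerprints")
```

Every candidate position costs a full 256-word comparison, so the work grows linearly with database size, which is what makes this approach impractical at 250 million entries.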
Searching the Database

• Make the assumption that at least 1 (of the 256) sub-fingerprints is error-free
  – then, use a hash table (as opposed to a more memory-intensive look-up table)
  – 800,000 times faster than brute force
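A minimal sketch of the hash-table idea, under stated assumptions: the index is a Python dictionary from sub-fingerprint value to database positions, the database is random 32-bit values, and the acceptance threshold of 0.35 is a hypothetical value chosen for the example. Only positions implied by an exact match of some single sub-fingerprint are verified with the full bit-error-rate comparison.

```python
import random
from collections import defaultdict

def build_index(database):
    """Map each sub-fingerprint value to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, fp in enumerate(database):
        index[fp].append(pos)
    return index

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equal-length blocks of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))

def indexed_search(database, index, query, threshold=0.35):
    """Verify only candidate positions; return (start or None, candidates checked).

    threshold is a hypothetical acceptance level for this example.
    """
    checked = set()
    for offset, fp in enumerate(query):
        for hit in index.get(fp, ()):
            start = hit - offset  # where the block would begin if this hit is genuine
            if start < 0 or start + len(query) > len(database) or start in checked:
                continue
            checked.add(start)
            if bit_error_rate(database[start:start + len(query)], query) < threshold:
                return start, len(checked)
    return None, len(checked)

rng = random.Random(1)
database = [rng.getrandbits(32) for _ in range(5000)]

# Noisy query: every 16th sub-fingerprint stays error-free, the rest get 3 bit flips.
TRUE_POS = 1200
query = []
for offset, fp in enumerate(database[TRUE_POS:TRUE_POS + 256]):
    if offset % 16 != 0:
        for bit in rng.sample(range(32), 3):
            fp ^= 1 << bit
    query.append(fp)

index = build_index(database)
start, candidates_checked = indexed_search(database, index, query)
print(start, candidates_checked)
```

The search succeeds as long as one sub-fingerprint in the query survives intact, and the number of full comparisons drops from one per database position to a handful of candidates.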
Results

• False-positive rate of 3.6 × 10^-2 (Haitsma, Kalker)
• On tests with a large (500,000) set of input traces
  – has a “low” false-positive and false-negative rate (Burges, Platt)
  – didn’t test on time compression, expansion
• Can withstand distortions occurring from transmission over mobile phones.