Audio Fingerprinting
Wes Hatch, MUMT-614, Mar. 13, 2003

What is Audio Fingerprinting?
• a small, unknown segment of audio data (it can be as short as a couple of seconds) is used to identify the original audio file from which it came

Applications
• Broadcast monitoring
  – playlist generation
  – royalty collection
  – ad verification
• Connected audio
  – general term for consumer applications
• Other
  – Napster: use of fingerprinting systems to prohibit the transmission of copyrighted material
  – finding desired content efficiently in “an overwhelming amount of audio material”

“Benefits”
• Automated search for illegal content on the Internet
  – examines the actual audio information rather than just the tag information
• For the consumer
  – makes the metadata of the songs in a library consistent, allowing for easy organization
  – can guarantee that what is downloaded is actually what it claims to be
  – will allow the consumer to record signatures of sound and music on small handheld devices

Two principal components
• compute the fingerprint
• compare it to a database of previously computed fingerprints
  – a text example: “…in a box. I will not eat them with a fox. I…”

Details to worry about
• robustness (to noise, distortion)
• reliability
• fingerprint size (reduced dimensionality)
• granularity
• search speed and scalability
• computational efficiency
• resulting features must be informative about the audio content
• semantic or non-semantic features?
• hash table or vector representation?

Computing the fingerprint
• compare to hash functions…?
  – compare the computed hash value with those stored in a database
• drawback
  – need to worry about perceptual similarity rather than mathematical similarity
    • PCM audio vs. MP3: both sound alike, but mathematically (i.e. in spectral content) they are quite different
  – perceptual similarity is not transitive
    • it is not possible to design a system that computes identical mathematical fingerprints for all perceptually similar objects

Techniques (general)
• any ‘x’ number of seconds may be used to compute the fingerprint
• the audio is separated into frames
  – features computed for each frame:
    • Fourier coefficients
    • MFCC, LPC
    • spectral flatness
    • sharpness
  – “features mapped into a more compact representation by using … HMM, or quantization”

Techniques (Haitsma, Kalker)
• one 32-bit sub-fingerprint every 11.6 ms
  – a block consists of 256 sub-fingerprints
    • corresponds to a granularity of only 3 seconds
  – large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time
  – worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those used for the database

Techniques (Haitsma, Kalker)
• data from each frame is sent through a filterbank (see the sketch after this slide)
  – 33 filters, logarithmically spaced (to correspond roughly to the Bark scale)
    • between 300 and 2000 Hz
  – phase is neglected (for perceptual reasons)

System overview
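The extraction stage on the two Haitsma/Kalker slides above can be sketched as follows. The slides give the band count, frequency range, hop size, and overlap, but not the exact frame length, window, or the rule for turning 33 band energies into 32 bits; those details (0.37 s frames implied by the 31/32 overlap, a Hann window, and a bit per adjacent-band energy difference, differenced over time) are filled-in assumptions. This is a minimal illustration, not the authors' reference implementation.

```python
import numpy as np


def band_energies(frame, sr, n_bands=33, f_lo=300.0, f_hi=2000.0):
    """Energy in n_bands logarithmically spaced bands between f_lo and f_hi Hz."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)          # log-spaced band edges
    return np.array([
        spectrum[(freqs >= edges[b]) & (freqs < edges[b + 1])].sum()
        for b in range(n_bands)
    ])


def sub_fingerprints(signal, sr, frame_dur=0.37, hop_dur=0.0116):
    """Yield one 32-bit sub-fingerprint per 11.6 ms hop from a 1-D sample array."""
    frame_len, hop = int(frame_dur * sr), int(hop_dur * sr)
    prev = None
    for start in range(0, len(signal) - frame_len, hop):
        e = band_energies(signal[start:start + frame_len], sr)
        if prev is not None:
            # Assumed bit rule: sign of the adjacent-band energy difference,
            # differenced against the previous frame (33 bands -> 32 bits).
            bits = (np.diff(e) - np.diff(prev)) > 0
            yield int("".join("1" if b else "0" for b in bits), 2)
        prev = e
```

Because the bits depend only on the signs of energy differences, the sub-fingerprints change slowly across heavily overlapping frames, which is what makes the 5.8 ms worst-case misalignment tolerable.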
Techniques (Burges, Platt)
• the audio is downsampled to 11.025 kHz and split into frames with an overlap of 2
  – the MCLT is then applied to each frame; a 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient

Techniques (Burges, Platt)
• use prior knowledge to define the form of the feature extractor
• features computed by a “linear, convolutional” neural network convert the signal into a feature vector
  – uses Principal Components Analysis (PCA) to find a set of projections
  – generates a vector of 128 values for every 11.6 ms interval
    • a dimensionality-reduction method (i.e. lots of math)

Techniques (Burges, Platt)
• 3 layers of Oriented PCA (OPCA)
  – operates on a frame of 128 values
    • layer 1: generates 10 values for each frame
    • layer 2: takes 42 ‘layer 1’ outputs and produces 20 values
    • layer 3: takes 40 ‘layer 2’ outputs and produces 64 values
  – (11K inputs --> 64 outputs)

Searching the Database
• look for the most similar (not necessarily exact) fingerprint
  – 10,000 5-minute songs ≈ 250 million sub-fingerprints
  – brute force takes in excess of 20 minutes on a very fast PC
    • brute force computes the bit-error rate for every possible position in the database

Searching the Database
• make the assumption that at least 1 of the 256 sub-fingerprints is error-free
  – then a hash table can be used (as opposed to a more memory-intensive look-up table)
  – 800,000 times faster (see the sketch after the Results slide)

Results
• false-positive rate of 3.6 × 10^-2 (Haitsma, Kalker)
• on tests with a large (500,000) set of input traces (Burges, Platt)
  – has a “low” false-positive and false-negative rate
  – did not test time compression or expansion
  – can withstand distortions arising from transmission over mobile phones
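To make the search strategy on the Searching the Database slides concrete, here is a minimal sketch of the hash-table lookup followed by a bit-error-rate check over a 256-sub-fingerprint block. The database layout, the helper names (build_index, identify), and the bit-error-rate threshold of 0.35 are illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict

BLOCK = 256           # sub-fingerprints per fingerprint block (~3 s granularity)
BER_THRESHOLD = 0.35  # assumed: max fraction of the 256 * 32 bits allowed to differ


def build_index(songs):
    """songs: {song_id: [32-bit sub-fingerprint, ...]} -> exact-match hash table."""
    index = defaultdict(list)
    for song_id, prints in songs.items():
        for pos, sfp in enumerate(prints):
            index[sfp].append((song_id, pos))
    return index


def bit_error_rate(a, b):
    """Fraction of differing bits between two equal-length sub-fingerprint blocks."""
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return diff / (32 * len(a))


def identify(query, songs, index):
    """query: list of 256 sub-fingerprints. Returns (song_id, offset) or None."""
    for i, sfp in enumerate(query):
        # If this sub-fingerprint is error-free, it appears verbatim in the index.
        for song_id, pos in index.get(sfp, ()):
            start = pos - i
            if start < 0:
                continue
            candidate = songs[song_id][start:start + BLOCK]
            if len(candidate) == BLOCK and bit_error_rate(query, candidate) < BER_THRESHOLD:
                return song_id, start
    return None
```

Because every error-free 32-bit sub-fingerprint is an exact key into the hash table, only a handful of candidate positions ever need a full 256-sub-fingerprint comparison, which is where the reported speed-up over brute force comes from.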