
Learning Word Meanings and
Descriptive Parameter Spaces from Music
Brian Whitman, Deb Roy and Barry Vercoe
MIT Media Lab
Music intelligence
• Structure
• Recommendation
• Genre / Style ID
• Artist ID
• Song similarity
• Synthesis
• Extracting salience from a signal
• Learning as features and regression
[Figure: example classifier separating ROCK/POP from Classical]
Semantic decomposition
• Music models from unsupervised methods find statistically
significant parameters
• Can we identify the optimal semantic attributes for
understanding music?
[Figure: example semantic attributes — Female/Male, Angry/Calm]
“Community metadata”
• Whitman / Lawrence (ICMC 2002)
• Internet-mined description of music
• Embed description as kernel space
• Community-derived meaning
• Time-aware!
Language Processing for IR
• Web page to feature vector
HTML → sentence chunks → term lists, e.g. for the chunk:

"XTC was one of the smartest — and catchiest — British pop bands to emerge from the punk and new wave explosion of the late '70s."

n1 (unigrams): XTC, was, one, of, the, smartest, and, catchiest, British, pop, bands, to, emerge, from, punk, new, wave
n2 (bigrams): XTC was, was one, one of, of the, the smartest, smartest and, and catchiest, catchiest British, British pop, pop bands, bands to, to emerge, emerge from, from the, the punk, punk and, and new
n3 (trigrams): XTC was one, was one of, one of the, of the smartest, the smartest and, smartest and catchiest, and catchiest British, catchiest British pop, British pop bands, pop bands to, bands to emerge, to emerge from, emerge from the, from the punk, the punk and, punk and new, and new wave
np (noun phrases): XTC, catchiest British pop bands, British pop bands, pop bands, punk and new wave explosion
artist: XTC
adj (adjectives): smartest, catchiest, British, new, late
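The n-gram step above can be sketched in a few lines of plain Python (a minimal illustration; the full system also runs a part-of-speech tagger and noun-phrase chunker for the np, artist, and adj lists, which are not reproduced here):

```python
import re

def ngrams(tokens, n):
    """Return the list of n-grams (joined as strings) over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("XTC was one of the smartest and catchiest British pop bands "
            "to emerge from the punk and new wave explosion of the late '70s.")
tokens = re.findall(r"[a-z0-9']+", sentence.lower())

n1 = ngrams(tokens, 1)  # unigrams
n2 = ngrams(tokens, 2)  # bigrams
n3 = ngrams(tokens, 3)  # trigrams
```

Each list then becomes one feature vector per artist, with counts aggregated over all retrieved pages.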
Smoothed TF-IDF
s(f_t, f_d) = \frac{f_t}{f_d}

s(f_t, f_d) = f_t \, e^{-(\log f_d - \mu)^2 / 2\sigma^2}
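A sketch of both weightings, reading f_t as term frequency and f_d as document frequency; the Gaussian center mu and width sigma are tuning choices, not values given in the talk:

```python
import math

def salience_plain(f_t, f_d):
    """Plain TF-IDF-style salience: term frequency over document frequency."""
    return f_t / f_d

def salience_smoothed(f_t, f_d, mu, sigma):
    """Gaussian-smoothed salience: down-weights terms whose log document
    frequency falls far from the chosen center mu."""
    return f_t * math.exp(-((math.log(f_d) - mu) ** 2) / (2 * sigma ** 2))
```

The smoothing suppresses both rare junk terms and ubiquitous stop-word-like terms in one move, rather than letting 1/f_d blow up for very rare terms.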
Query by description (audio)
• “What does loud mean?”
• “Play me something fast with an electronic beat”
• Single-term to frame attachment
Learning QBD
Audio features, artist 0, frames 1–3:  "Electronic" 0.30,  "Loud" 0.30,  "Talented" 2.0
Audio features, artist 1, frames 1–2:  "Electronic" 0.1,   "Loud" 3.23,  "Talented" 0.4
Audio features, artist 3, frames 1–3:  "Electronic" 0,     "Loud" 0.95,  "Talented" 0

(Every frame of an artist's audio inherits that artist's term saliences as its regression targets.)
Learning formalization
• Learn relation between audio and naturally
encountered description
• Can’t trust target class!
– Opinion
– Counterfactuals
– Wrong artist
– Not musical
• 200,000 possible terms (output classes!)
– (For this experiment we limit it to adjectives)
Regularized least-squares
classification (RLSC)
• (Rifkin 2002)
K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right)

\left( K + \frac{I}{C} \right) c_t = y_t

c_t = \left( K + \frac{I}{C} \right)^{-1} y_t
ct = machine for class t
yt = truth vector for class t
C = regularization constant (10)
Time-aware audio features
• MPEG-7-derived state paths (Casey 2001)
• Music as a discrete path through time
• Reg'd to 20 states, 0.1 s frames
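One way to sketch the state-path idea: quantize each 0.1 s frame's feature vector to the nearest of 20 learned states. A toy k-means stands in here for the MPEG-7/HMM decoding of Casey (2001); frame features and dimensions are made up:

```python
import numpy as np

def kmeans_states(frames, n_states=20, iters=25, seed=0):
    """Toy vector quantizer: returns (centroids, discrete state path)."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), n_states, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        path = d.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for k in range(n_states):
            if (path == k).any():
                centroids[k] = frames[path == k].mean(axis=0)
    return centroids, path
```

The resulting integer sequence is the "discrete path through time" that the time-aware features summarize.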
Per-term accuracy
Good terms           Bad terms
Busy       42%       Artistic    0%
Steady     41%       Homeless    0%
Funky      39%       Hungry      0%
Intense    38%       Great       0%
Acoustic   36%       Awful       0%
African    35%       Warped      0%
Melodic    27%       Illegal     0%
Romantic   23%       Cruel       0%
Slow       21%       Notorious   0%
Wild       25%       Good        0%
Young      17%       Okay        0%
• Weighted accuracy (to allow for bias)
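One common reading of "weighted accuracy (to allow for bias)" is to average the accuracy on positive and negative examples separately, so an always-negative classifier scores 50% rather than inheriting the class skew. A sketch of that reading (the talk's exact weighting may differ):

```python
def weighted_accuracy(y_true, y_pred):
    """Mean of per-class accuracies for binary labels (+1 / -1)."""
    pos = [p == t for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p == t for t, p in zip(y_true, y_pred) if t == -1]
    acc_pos = sum(pos) / len(pos) if pos else 0.0
    acc_neg = sum(neg) / len(neg) if neg else 0.0
    return 0.5 * (acc_pos + acc_neg)
```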
The linguistic expert
• Some semantic attachment requires
‘lookups’ to an expert
[Figure: observed descriptions "Dark", "Big", "Light", "Small", and an unknown "?"]
Linguistic expert
• Perception + observed language: "Big", "Light", "Dark", "Small"
• Lookups to linguistic expert: big–small, dark–light
• Allows you to infer new gradations: place "?" along the big–small and dark–light axes
Parameters: synants of “quiet”
“The antonym of every synonym and the synonym of every antonym.”
[Figure: synant set of "quiet" — "soft", "noisy", "thundering", "clangorous", "hard" — reached by chaining synonym and antonym links]
Top descriptive parameters
• All P(a) of terms in anchor synant sets averaged
• P(quiet) = 0.2, P(loud) = 0.4 → P(quiet–loud) = 0.3
• Sorted list gives best grounded parameter map
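The averaging and sorting step as code (the quiet/loud numbers are the slide's own example; everything else is illustrative):

```python
def parameter_score(p, synant_pairs):
    """Average the per-term scores P(a) over each anchor synant pair and
    return the pairs sorted best-first."""
    scored = {f"{a}-{b}": (p.get(a, 0.0) + p.get(b, 0.0)) / 2
              for a, b in synant_pairs}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

p = {"quiet": 0.2, "loud": 0.4}
ranking = parameter_score(p, [("quiet", "loud")])
# average of 0.2 and 0.4 is 0.3, matching the slide
```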
Good parameters              Bad parameters
Big – little         30%     Evil – good               5%
Present – past       29%     Bad – good                0%
Unusual – familiar   28%     Violent – nonviolent      1%
Low – high           27%     Extraordinary – ordinary  0%
Male – female        22%     Cool – warm               7%
Hard – soft          21%     Red – white               6%
Loud – soft          19%     Second – first            4%
Smooth – rough       14%     Full – empty              0%
Vocal – instrumental 10%     Internal – external       0%
Minor – major        10%     Foul – fair               5%
Learning the knobs
• Nonlinear dimension reduction
– Isomap
• Like PCA/NMF/MDS, but:
– Meaning oriented
– Better perceptual distance
– Only feed polar observations as input
• Future data can be quickly semantically classified with
guaranteed expressivity
[Figure: Isomap embedding with quiet–loud and male–female parameter axes]
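A compact numpy-only Isomap (k-nearest-neighbour graph, Floyd–Warshall geodesics, then classical MDS) showing how a single descriptive "knob" can be recovered from polar observations; the data in the usage below is a toy curve, not the talk's audio features:

```python
import numpy as np

def isomap_1d(X, k=3):
    """Tiny Isomap: kNN graph -> geodesic distances -> 1-D classical MDS."""
    n = len(X)
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    # keep only each point's k nearest neighbours (symmetrized)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:
            G[i, j] = G[j, i] = d[i, j]
    # Floyd-Warshall shortest paths approximate geodesic distances
    for m in range(n):
        G = np.minimum(G, G[:, m, None] + G[None, m, :])
    # classical MDS on the geodesic distances, top component only
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    w, v = np.linalg.eigh(B)
    return v[:, -1] * np.sqrt(max(w[-1], 0.0))
```

On points sampled along a curved manifold, the recovered coordinate tracks position along the curve, which is the one-knob behaviour wanted from axes like quiet–loud.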
Parameter understanding
• Some knobs aren’t 1-D intrinsically
• Color spaces & user models!
Future: music acquisition
• Short-term music model: auditory scene to events
• Structural music model: recurring patterns in music streams
• Language of music: relating artists to descriptions (cultural representation)
• Music acceptance models: path of music through a social network
• Grounding sound: "what does loud mean?"
• Semantics of music: "what does rock mean?"
• What makes a song popular?
• Semantic synthesis
Reverse: semantic synthesis
• “What does college rock sound like?”
• Meaning as transition probabilities
[Audio example: "loud rock with electronics"]
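"Meaning as transition probabilities" suggests a sketch like the following: each description nudges the state-transition matrix of the audio model, and synthesis is a random walk through the resulting chain. Everything here (the states, the bias matrices, the blending rule) is a hypothetical illustration, not the talk's system:

```python
import numpy as np

def blend_transitions(base, term_biases, weights):
    """Mix a base transition matrix with per-term bias matrices, then
    renormalize each row back into a probability distribution."""
    T = base.copy()
    for term, w in weights.items():
        T += w * term_biases[term]
    T = np.clip(T, 0.0, None)
    return T / T.sum(axis=1, keepdims=True)

def synthesize_path(T, length=50, seed=0):
    """Random walk: a synthesized state path through the blended chain."""
    rng = np.random.default_rng(seed)
    path = [0]
    for _ in range(length - 1):
        path.append(rng.choice(len(T), p=T[path[-1]]))
    return path
```

The state path would then be rendered back to audio by the same state-to-sound mapping used for analysis.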
What’s next
• Human evaluation
– Inter-rater reliability
– "Can we trust the internet for community meaning?"
• “Meaning recognition” (time)
• Hierarchy learning