Learning Word Meanings and Descriptive Parameter Spaces from Music
Brian Whitman, Deb Roy and Barry Vercoe
MIT Media Lab

Music intelligence
• Applications: structure, recommendation, genre/style ID, artist ID, song similarity, synthesis
• Extracting salience from a signal
• Learning as features and regression (e.g., a classifier separating ROCK/POP from Classical)

Semantic decomposition
• Music models from unsupervised methods find statistically significant parameters
• Can we identify the optimal semantic attributes for understanding music? (e.g., Female/Male, Angry/Calm)

"Community metadata"
• Whitman / Lawrence (ICMC 2002)
• Internet-mined description of music
• Embed description as kernel space
• Community-derived meaning
• Time-aware!

Language Processing for IR
• Web page to feature vector: raw HTML about an artist is reduced to sentence chunks, then to term features (see the extraction sketch below)
• Example sentence: "XTC was one of the smartest — and catchiest — British pop bands to emerge from the punk and new wave explosion of the late '70s."
– n1 (unigrams): XTC, was, one, of, the, smartest, and, catchiest, British, pop, bands, to, emerge, from, punk, new, wave
– n2 (bigrams): XTC was, was one, one of, of the, the smartest, smartest and, and catchiest, catchiest British, British pop, pop bands, bands to, to emerge, emerge from, from the, the punk, punk and, and new
– n3 (trigrams): XTC was one, was one of, one of the, of the smartest, the smartest and, smartest and catchiest, and catchiest British, catchiest British pop, British pop bands, pop bands to, bands to emerge, to emerge from, emerge from the, from the punk, the punk and, punk and new, and new wave
– np (noun phrases): XTC, catchiest British pop bands, British pop bands, pop bands, punk and new wave explosion
– adj (adjectives): smartest, catchiest, British, new, late
– artist: XTC

Smoothed TF-IDF
• Plain TF-IDF salience: s(f_t, f_d) = f_t / f_d
• Gaussian-smoothed salience: s(f_t, f_d) = f_t · exp(−(log(f_d) − μ)² / (2σ²))

Query by description (audio)
• "What does loud mean?"
• "Play me something fast with an electronic beat"
• Single-term to frame attachment

Learning QBD
• Each artist's term saliences are attached to every frame of that artist's audio (see the target-construction sketch below):

  Audio features          "Electronic"   "Loud"   "Talented"
  artist 0, frames 1–3        0.30        0.30       2.0
  artist 1, frames 1–2        0.1         3.23       0.4
  artist 3, frames 1–3        0           0.95       0

Learning formalization
• Learn relation between audio and naturally encountered description
• Can't trust target class!
– Opinion
– Counterfactuals
– Wrong artist
– Not musical
• 200,000 possible terms (output classes!)
– (For this experiment we limit it to adjectives)
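A minimal sketch of the n-gram extraction in the Language Processing for IR slide, in Python. The tokenizer and casing policy here are assumptions, and the np (noun phrase) and adj (adjective) classes would need a POS tagger or chunker that this sketch omits.

```python
import re

def ngrams(text, n):
    """Lowercase, keep letters/digits/apostrophes, return contiguous n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("XTC was one of the smartest and catchiest British pop bands "
            "to emerge from the punk and new wave explosion of the late '70s.")
n1 = ngrams(sentence, 1)   # unigrams
n2 = ngrams(sentence, 2)   # bigrams
n3 = ngrams(sentence, 3)   # trigrams
# np (noun phrases) and adj (adjectives) need a POS tagger/chunker, not shown.
```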
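The smoothed TF-IDF weighting above, transcribed directly into code. The talk does not give μ and σ, so the defaults below are placeholders; they set the center and width, in log document-frequency space, of the band of terms that keep their salience.

```python
import numpy as np

def tfidf(f_t, f_d):
    """Plain TF-IDF from the slide: s(f_t, f_d) = f_t / f_d."""
    return f_t / f_d

def smoothed_tfidf(f_t, f_d, mu=2.0, sigma=1.0):
    """Gaussian-smoothed TF-IDF from the slide:
    s(f_t, f_d) = f_t * exp(-(log(f_d) - mu)^2 / (2 * sigma^2)).
    Terms whose log document frequency sits far from mu lose salience,
    suppressing both ubiquitous and vanishingly rare terms.
    mu and sigma are placeholder values, not values from the talk."""
    return f_t * np.exp(-((np.log(f_d) - mu) ** 2) / (2.0 * sigma ** 2))
```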
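One plausible way to build the training pairs shown in the Learning QBD table: every audio frame of an artist inherits that artist's term-salience vector as its regression target. The container shapes and function name are illustrative assumptions; the weights are the ones from the table.

```python
import numpy as np

VOCAB = ["electronic", "loud", "talented"]

# Per-artist term saliences, using the values from the Learning QBD table.
TERM_WEIGHTS = {
    0: {"electronic": 0.30, "loud": 0.30, "talented": 2.0},
    1: {"electronic": 0.10, "loud": 3.23, "talented": 0.4},
    3: {"electronic": 0.00, "loud": 0.95, "talented": 0.0},
}

def build_targets(frames_by_artist):
    """frames_by_artist: {artist_id: (n_frames, n_dims) array of audio features}.
    Returns stacked features X and a target matrix Y with one column per term;
    every frame inherits its artist's term vector."""
    X, Y = [], []
    for artist, frames in frames_by_artist.items():
        target = [TERM_WEIGHTS[artist][t] for t in VOCAB]
        for frame in frames:
            X.append(frame)
            Y.append(target)
    return np.asarray(X), np.asarray(Y)
```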
Regularized least-squares classification (RLSC)
• (Rifkin 2002)
• Gaussian kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2δ²))
• Training solves (K + I/C) c_t = y_t, i.e. c_t = (K + I/C)⁻¹ y_t (see the numpy sketch below)
– c_t = machine for class t
– y_t = truth vector for class t
– C = regularization constant (10)

Time-aware audio features
• MPEG-7 derived state paths (Casey 2001)
• Music as a discrete path through time
• Reg'd to 20 states (0.1 s resolution)

Per-term accuracy
• Weighted accuracy (to allow for bias)

  Good terms         Bad terms
  Busy       42%     Artistic     0%
  Steady     41%     Homeless     0%
  Funky      39%     Hungry       0%
  Intense    38%     Great        0%
  Acoustic   36%     Awful        0%
  African    35%     Warped       0%
  Melodic    27%     Illegal      0%
  Wild       25%     Cruel        0%
  Romantic   23%     Notorious    0%
  Slow       21%     Good         0%
  Young      17%     Okay         0%

The linguistic expert
• Some semantic attachment requires 'lookups' to an expert
• Perception plus observed language yields graded observations such as "Big", "Small", "Dark", "Light"
• Lookups to the linguistic expert pair the poles: Big–Small, Dark–Light
• This allows you to infer a new gradation: an unseen "?" can be placed along the learned Big–Small or Dark–Light axis

Parameters: synants of "quiet"
• Synant set: "The antonym of every synonym and the synonym of every antonym."
• Example: "soft" is a synonym of "quiet" and "noisy" an antonym; expansion adds "thundering" and "clangorous" (synonyms of antonyms) and "hard" (antonym of a synonym) (see the WordNet sketch below)

Top descriptive parameters
• All P(a) of the terms in each anchor's synant set are averaged
• P(quiet) = 0.2, P(loud) = 0.4, so P(quiet–loud) = 0.3 (see the scoring sketch below)
• The sorted list gives the best grounded parameter map

  Good parameters                Bad parameters
  Big – little            30%    Evil – good                5%
  Present – past          29%    Bad – good                 0%
  Unusual – familiar      28%    Violent – nonviolent       1%
  Low – high              27%    Extraordinary – ordinary   0%
  Male – female           22%    Cool – warm                7%
  Hard – soft             21%    Red – white                6%
  Loud – soft             19%    Second – first             4%
  Smooth – rough          14%    Full – empty               0%
  Vocal – instrumental    10%    Internal – external        0%
  Minor – major           10%    Foul – fair                5%

Learning the knobs
• Nonlinear dimension reduction: Isomap (see the Isomap sketch below)
• Like PCA/NMF/MDS, but:
– Meaning oriented
– Better perceptual distance
– Only polar observations (e.g., Quiet–Loud, Male–Female) are fed in as input
• Future data can be quickly semantically classified with guaranteed expressivity

Parameter understanding
• Some knobs aren't intrinsically 1-D
• Color spaces & user models!

Future: music acquisition
• Short-term music model: auditory scene to events
• Structural music model: recurring patterns in music streams
• Language of music: relating artists to descriptions (cultural representation)
• Music acceptance models: the path of music through a social network
• Grounding sound: "what does loud mean?"
• Semantics of music: "what does rock mean?"
• What makes a song popular?
• Semantic synthesis

Reverse: semantic synthesis
• "What does college rock sound like?"
• Meaning as transition probabilities (e.g., synthesize from "loud rock with electronics")

What's next
• Human evaluation
– Inter-rater reliability
– "Can we trust the internet for community meaning?"
• "Meaning recognition" (time)
• Hierarchy learning
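A numpy sketch of the RLSC recipe above, assuming the Gaussian kernel with width δ and the slide's solve c_t = (K + I/C)⁻¹ y_t.

```python
import numpy as np

def gaussian_kernel(A, B, delta=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * delta^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * delta ** 2))

def rlsc_train(X, Y, delta=1.0, C=10.0):
    """Solve (K + I/C) c_t = y_t for every class column of Y at once.
    X: (n, d) audio features; Y: (n, T) truth vectors, one column per term."""
    K = gaussian_kernel(X, X, delta)
    return np.linalg.solve(K + np.eye(len(X)) / C, Y)   # columns are the c_t

def rlsc_predict(X_train, coeffs, X_new, delta=1.0):
    """f_t(x) = sum_i c_t[i] * K(x, x_i) for each term t."""
    return gaussian_kernel(X_new, X_train, delta) @ coeffs
```

Since (K + I/C) is shared across classes, each additional term costs only one more solve against the same matrix (or a reuse of its factorization), which is what makes a vocabulary of thousands of output classes workable.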
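A sketch of the synant expansion over WordNet through NLTK. The talk does not name its lexicon, so WordNet coverage (and the restriction to adjectives) is an assumption here.

```python
from nltk.corpus import wordnet as wn

def synants(word):
    """Synonyms of `word` plus its synant set: the antonym of every
    synonym and the synonym of every antonym."""
    synonyms, opposites = set(), set()
    for synset in wn.synsets(word, pos=wn.ADJ):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())                            # synonym of word
            opposites.update(a.name() for a in lemma.antonyms())  # antonym of a synonym
    for ant in list(opposites):
        for synset in wn.synsets(ant, pos=wn.ADJ):
            opposites.update(l.name() for l in synset.lemmas())   # synonym of an antonym
    return synonyms, opposites

# synants("quiet") should surface pairings like quiet-noisy and soft-loud.
```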
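The parameter-ranking rule above as code, reproducing the slide's worked example (P(quiet) = 0.2 and P(loud) = 0.4 average to P(quiet–loud) = 0.3). Names are mine; the per-term accuracies would come from the RLSC models.

```python
def parameter_score(per_term_accuracy, pole_a, pole_b):
    """Average P(a) over every grounded term in the two anchor synant sets.
    per_term_accuracy: {term: weighted accuracy of its audio model}
    pole_a, pole_b: synant term sets for the two anchors."""
    terms = [t for t in pole_a | pole_b if t in per_term_accuracy]
    return sum(per_term_accuracy[t] for t in terms) / len(terms)

# Worked example from the slide: P(quiet) = 0.2, P(loud) = 0.4.
score = parameter_score({"quiet": 0.2, "loud": 0.4}, {"quiet"}, {"loud"})
assert abs(score - 0.3) < 1e-9
# Sorting candidate pole pairs by this score yields the table above.
```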
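A minimal Isomap sketch for learning the knobs, using scikit-learn on random placeholder data in place of real polar audio-frame observations; in the talk's setting only frames with strong polar labels (clearly quiet or clearly loud) would be fed in.

```python
import numpy as np
from sklearn.manifold import Isomap

# Placeholder for frames that scored strongly on one pole (clearly "quiet"
# or clearly "loud"); the real input would be the MPEG-7 state-path
# features described earlier.
rng = np.random.default_rng(0)
polar_frames = rng.normal(size=(200, 40))

knob = Isomap(n_neighbors=10, n_components=1)   # one knob: quiet .. loud
positions = knob.fit_transform(polar_frames)    # each frame's point on the knob

# Later frames are placed on the learned knob without refitting:
new_frames = rng.normal(size=(5, 40))
print(knob.transform(new_frames))
```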