Entity Spotting in Informal Text
Meena Nagarajan
with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth
Kno.e.sis, Wright State
IBM Research - Almaden, San Jose CA*
Thursday, October 29, 2009
Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/
• What is the buzz in the online music community?
• Ranking and displaying top X music artists, songs, tracks, albums
• Spotting entities, despamming, sentiment identification, aggregation, top X lists
Spotting music entities in user-generated content in online music forums (MySpace)
Chatter in Online Music Communities
http://knoesis.wright.edu/research/semweb/projects/music/
Goal: Semantic Annotation of artists, tracks, songs, albums (MusicBrainz RDF)
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
Multiple Senses in the Same Domain
• 60 songs with Merry Christmas
• 3600 songs with Yesterday
• 195 releases of American Pie
• 31 artists covering American Pie
"Caught AMERICAN PIE on cable so much fun!"
Annotating UGC, Other Challenges
• Several cultural named entities
  • artifacts of culture, common words in everyday language
LOVED UR MUSIC YESTERDAY!
♥ Just showing some Love to you Madonna you are The Queen to me
Lily your face lights up when you smile!
Annotating UGC, Other Challenges
• Informal text
  • slang, abbreviations, misspellings
  • indifferent approach to grammar
• Context-dependent terms
• Unknown distributions
Our Approach
Spotting and subsequent sense disambiguation of spots
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
Ground Truth Data Set
Our experimental evaluation focuses on user comments from the MySpace pages
of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists
were selected to be popular enough to draw comment but different enough to
provide variety. The entity definitions were taken from the MusicBrainz RDF (see
Figure 1), which also includes some but not all common aliases and misspellings.
• 3 artists: Madonna, Rihanna, Lily Allen
  • Madonna: an artist with an extensive discography as well as a current album and concert tour
  • Rihanna: a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
  • Lily Allen: an independent artist with song titles that include "Smile," "Alright, Still", "Naive", and "Friday Night" who also generates a fair amount of buzz around her personal life not related to music
• 1858 spots (MySpace UGC) using a naive spotter over MusicBrainz artist metadata
• Adjudicate whether a spot is an entity or not (or inconclusive)
We establish a ground truth data set of 1858 entity spots for these three artists (breakdown in Table 3). The data was obtained by crawling each artist's MySpace page comments and identifying all exact string matches of the artist's song titles. Only comments with at least one spot were retained. These spots were then hand tagged by the four authors.

Table 3. Manual scoring agreements on naive entity spotter results.
Artist (spots scored)   Good spots (100% / 75% agreement)   Bad spots (100% / 75% agreement)   Precision (best case for naive spotter)
Rihanna (615)           165 / 18                            351 / 8                            33%
Lily (523)              268 / 42                            100 / 10                           73%
Madonna (720)           138 / 24                            503 / 20                           23%
Experiments and Results
Experiments
1. Lightweight, edit-distance-based entity spotter using all entities from MusicBrainz
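A minimal sketch of what such a spotter might look like, assuming a plain list of entity names drawn from MusicBrainz; the n-gram window sizes and the similarity threshold are illustrative choices, not the exact implementation used in this work:

```python
# A minimal sketch of a lightweight, edit-distance-based spotter. The entity
# list, the n-gram window sizes and the similarity threshold are illustrative
# assumptions, not the exact implementation used in this work.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means an exact match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def naive_spot(comment: str, entity_names, threshold: float = 0.9):
    """Return (ngram, entity) pairs whose similarity to an entity name meets the threshold."""
    tokens = [t.strip(".,!?") for t in comment.split()]
    spots = []
    for n in (1, 2, 3):                          # candidate n-gram lengths
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            for entity in entity_names:
                if similarity(ngram, entity) >= threshold:
                    spots.append((ngram, entity))
    return spots

# Example with a few song titles standing in for MusicBrainz entities
print(naive_spot("Ohh these sour times... rock!", ["Sour Times", "Smile", "Yesterday"]))
```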
Experiments
1. Naive spotter using all entities from all of MusicBrainz
2. "This new Merry Christmas tune is so good!" ... but which one? Disambiguate between the 60+ Merry Christmas entries in MusicBrainz
Experiments
2. Constrain the set of possible entities from MusicBrainz
  - to increase spotting accuracy
  - constrain using cues from the comment to eliminate alternatives
"This new Merry Christmas tune is so good!"
Experiments
3. Eliminate non-music mentions using natural language and domain-specific cues
"Your SMILE rocks!"
Restricted Entity Spotting
2. Restricted Entity Spotting
• Investigating the relationship between the number of entities used and spotting accuracy
• Understand systematic ways of scoping domain models for use in semantic annotation
• Experiments to gauge the benefits of implementing particular constraints in annotator systems
  • harder artist age detector vs. easier gender detector?
2a. Random Restrictions

We restrict the MusicBrainz taxonomy to sets of artists that are factors of 10 smaller (10%, 1%, etc.). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments.

Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best-fit line has been added to indicate the average precision. Note that the figure is in log-log scale.
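As a rough, self-contained illustration of the procedure just described, the sketch below draws random artist subsets (always keeping the target artists), restricts the title dictionary accordingly, and averages spotter precision over the draws; the catalogue, comments and hand labels are toy stand-ins for MusicBrainz and the MySpace ground truth:

```python
# A rough sketch of the random-restriction experiment (toy data, not the real
# MusicBrainz catalogue or MySpace ground truth): for each taxonomy fraction,
# draw random artist subsets that always include the target artists, restrict
# the song-title dictionary to those artists, spot single-token titles by
# exact match, and average precision over the draws.
import random

def spots_in(comment, titles):
    words = [w.strip(".,!?").lower() for w in comment.split()]
    return [w for w in words if w in titles]

def avg_precision(catalogue, targets, comments, good, fraction, draws=200):
    others = [a for a in catalogue if a not in targets]
    k = min(len(others), max(int(len(catalogue) * fraction) - len(targets), 0))
    precisions = []
    for _ in range(draws):
        subset = set(random.sample(others, k)) | set(targets)
        titles = {t.lower() for a in subset for t in catalogue[a]}
        found = [(c, s) for c in comments for s in spots_in(c, titles)]
        if found:
            precisions.append(sum(f in good for f in found) / len(found))
    return sum(precisions) / len(precisions) if precisions else 0.0

catalogue = {"Lily Allen": ["Smile"], "Portishead": ["Sour Times"],
             "Other Artist": ["Yesterday"]}                     # toy catalogue
comments = ["Got your new track Smile. Loved it!", "Keep your SMILE on!"]
good = {("Got your new track Smile. Loved it!", "smile")}       # toy hand labels
print(avg_precision(catalogue, ["Lily Allen"], comments, good, fraction=1.0))
```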
#""$
#""$
#"$
#$
%&'()*''+,
/178,,1
%&'()*''+,-.)/(.012+)314+
5&61,,1-.)/(.012+)314+
/178,,1-.)/(.012+)314+
5&61,,1
!#$
!"#$
!#"$.-.(%'()'&*"'56(&&"#
!"""#$
!"#$"%&'()'&*"'+,-.$'/#0.%1'&02(%(34
!""#$
!"#$
!#$
#$
#"$
Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision
!""#$
!"""#$
Fig. 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artists' MySpace pages
We observe that the curves in Figure 2 conform to a power law, specifically a Zipf distribution. Zipf's law was originally applied to demonstrate the Zipf distribution in the frequency of words in natural language corpora, and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction to 10% of the MusicBrainz RDF results in approximately a 9.8 times improvement in the precision of a naive spotter.
This result is remarkably consistent across all three artists. The R² values of the power-law fits for the three artists are 0.9776, 0.979 and 0.9836, which gives a variation of 0.61% in R² value between spots on the three MySpace pages.
• From all of MusicBrainz (281,890 artists, 6,220,519 tracks) to the songs of one artist (for all three artists)
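The power-law behaviour can be checked with an ordinary least-squares fit in log-log space; a small numpy sketch is shown below, where the (fraction, precision) points are made-up placeholders rather than the measured values:

```python
# Fit precision vs. taxonomy fraction to a power law p = c * f^a by linear
# regression in log-log space, and report R^2. The (fraction, precision)
# points below are made-up placeholders, not the measured values.
import numpy as np

fractions = np.array([1.0, 0.1, 0.01, 0.001, 0.0001])     # share of MusicBrainz used
precisions = np.array([0.003, 0.03, 0.2, 0.6, 0.9])       # placeholder precisions

x, y = np.log10(fractions), np.log10(precisions)
slope, intercept = np.polyfit(x, y, 1)                     # best-fit line in log-log space
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(f"exponent={slope:.2f}, R^2={r2:.4f}")
```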
2b. Real-world Constraints for Restrictions
"Happy 25th Rhi!" (eliminate using artist DOB - metadata in MusicBrainz)
"ur new album dummy is awesome" (eliminate using album release dates - metadata in MusicBrainz)
• Systematic scoping of the RDF
• Question: Do real-world constraints from metadata reduce the size of the entity spot set in a meaningful way?
• Experiments: Constraints derived manually and tested for usefulness
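A minimal sketch of such a metadata-driven restriction, using restriction L from Table 4 (artists or bands under 25 with an album in the past 2 years); the Artist record is an illustrative stand-in for MusicBrainz metadata:

```python
# A minimal sketch of restricting the entity set with real-world metadata
# constraints (artist age, album recency). The Artist record below is an
# illustrative stand-in for MusicBrainz metadata; restriction L ("artists or
# bands under 25 with an album in the past 2 years") is taken from Table 4.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Artist:
    name: str
    birth_year: Optional[int]
    album_years: list          # release years of the artist's albums

def under_25_with_recent_album(a: Artist, now: int = 2009) -> bool:
    """Restriction L: artist or band under 25 with an album in the past 2 years."""
    return (a.birth_year is not None
            and now - a.birth_year < 25
            and any(now - y <= 2 for y in a.album_years))

artists = [
    Artist("Lily Allen", 1985, [2006, 2009]),
    Artist("Madonna", 1958, [1983, 2008]),
]
restricted = [a.name for a in artists if under_25_with_recent_album(a)]
print(restricted)   # only artists satisfying the constraint survive the restriction
```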
Real-world Constraints

We consider three classes of restrictions based on career, age and album.
Example comments: D. "I've been your fan for 25 years!"  M. "Happy 25th!"

Table 4. The efficacy of various sample restrictions.

Key  Count     Restriction
MusicBrainz Restrictions
Z    281,890   All artists in MusicBrainz
Specific Artist Restrictions - Applied to each Artist
A    1         Madonna only
G    1         Lily Allen only
P    1         Rihanna only
Artist Career Length Restrictions - Applied to Madonna
B    22        80's artists with recent (within 1 year) album
C    154       First album 1983
D    1,193     20-30 year career
Recent Album Restrictions - Applied to Madonna
E    6,491     Artists who released an album in the past year
F    10,501    Artists who released an album in the past 5 years
Artist Age Restrictions - Applied to Lily Allen
H    112       Artists born 1985, album in past 2 years
J    284       Artists born in 1985 (or bands founded in 1985)
L    4,780     Artists or bands under 25 with album in past 2 years
M    10,187    Artists or bands under 25 years old
Number of Album Restrictions - Applied to Lily Allen
K    1,530     Only one album, released in the past 2 years
N    19,809    Artists with only one album
Recent Album Restrictions - Applied to Rihanna
Q    83        3 albums exactly, first album last year
R    196       3+ albums, first album last year
S    1,398     First album last year
T    2,653     Artists with 3+ albums, one in the past year
U    6,491     Artists who released an album in the past year
Real-world Constraints
• Applied different constraints to different artists
• Reduce potential entity spot size
• Run naive spotter
• Measure precision
Real-world Constraints
“I heart your new album”
“I love all your 3 albums”
“You are most favorite new pop artist”
!"""#$
!""#$
!"#$
!#$
#$
#"$
%&'()*+,-..)/*./&0%)1*-%*-%23*405&%%&*+-%6+**
*****789!9$*,/):0+0-%;
)A&:.23*8*&2>?@+
&.*2)&+.*8*&2>?@+
&/.0+.+*<5-+)*=0/+.*&2>?@*<&+*
0%*.5)*,&+.*8*3)&/+
*&22*&/.0+.+*<5-*/)2)&+)1*&%*
&2>?@*0%*.5)*,&+.*8*3)&/+
#""$
#""$
#"$
#$
!#$
!"#$
!""#$
!"#$%&%'()'*)+,#)-.'++#"
Rihanna: short career, recent album
releases, 3 album releases etc....
*)%.0/)*B?+0:*C/&0%D*.&A-%-@3*7"!"""8$*,/):0+0-%;
!"""#$
Thursday, October 29, 2009
23
Real-world Constraints
Age restrictions, only one album, last year releases, extensive career etc.
!"#$
!#$
#$
#"$
3%?@1*)8:''1&*'&%(31A*:3*:340*B%A:33%*):3C)*
********D--!9$*8&12()(:3E
#""$
#""$
1%&40*=">)*%&'()')*+(',*%3*
%4567*(3*',1*8%)'*01%&
%&'()')*+,:)1*;(&)'*
%&'()')*+(',*%*
&141%)1*+%)*(3*#<=/
-"./"*01%&*2%&11&
%&'()')*+(',*%3*%4567*(3*',1*8%)'*01%&
%&'()')*+(',*%3*%4567*(3*',1*8%)'*9*01%&)
13'(&1*B6)(2*F&%(3)*'%G:3:70**D"!"""9$*8&12()(:3E
Madonna
Thursday, October 29, 2009
#"$
#$
!#$
!"#$
!""#$
!"""#$
!"""#$
!""#$
!"#$
!#$
#$
#"$
-%>?5*):,''5&*'&%(-52*,-*,-7<*@(7<*A775-*),-B)
***************1C#$*:&5D()(,-6
#""$
#""$
#"$
%-*%7+48*(-*'95*:%)'*';,*<5%&)
%&'()')*4-25&*=0*<5%&)* #$
,72*1,&*+%-2)*75))*
'95-*=0*<5%&)*,726
!#$
%&'()')*+,&-*(-*#./0*
1,&*+%-2)*3,4-252*(-*#./06
%&'()')*;('9*,-7<*,-5*%7+48
5-'(&5*E4)(D*F&%(-G*'%H,-,8<*1"!""C$*:&5D()(,-6
Lily Allen
!"#$
!""#$
!"#$%&%'()'*)+,#)-.'++#"
!""#$
!"#$%&%'()'*)+,#)-.'++#"
!"""#$
!"""#$
24
Takeaways
• Real-world restrictions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution
• Confirms the general effectiveness of limiting domain size regardless of the restriction
• Choosing which constraints to implement is simple - pick whatever is easiest first
  • use metadata from the model to guide you
Non-music Mentions
Disambiguating Non-music References
UGC on Lily Allen’s page about her new track Smile
Got your new album Smile. Loved it!
Keep your SMILE on!
Binary Classification, SVM
Got your new album Smile. Loved it!
Keep your SMILE on!
Table 6. Features used by the SVM learner
(s denotes the spot; + marks basic features, others are advanced; * features apply only to one-word-long spots)

Syntactic features                                                Notation-S
+ POS tag of s                                                    s.POS
  POS tag of one token before s                                   s.POSb
  POS tag of one token after s                                    s.POSa
  Typed dependency between s and sentiment word *                 s.POS-TDsent
  Typed dependency between s and domain-specific term *           s.POS-TDdom
  Boolean typed dependency between s and sentiment *              s.B-TDsent
  Boolean typed dependency between s and domain-specific term *   s.B-TDdom

Word-level features                                               Notation-W
+ Capitalization of spot s                                        s.allCaps
+ Capitalization of first letter of s                             s.firstCaps
+ s in quotes                                                     s.inQuotes

Domain-specific features                                          Notation-D
  Sentiment expression in the same sentence as s                  s.Ssent
  Sentiment expression elsewhere in the comment                   s.Csent
  Domain-related term in the same sentence as s                   s.Sdom
  Domain-related term elsewhere in the comment                    s.Cdom

Training data: 550 good spots, 550 bad spots
Test data: 120 good spots, 229 * 2 bad spots
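A minimal sketch of the binary spot classifier, using only the word-level and domain-specific features from Table 6 (the POS and typed-dependency features are omitted); the feature extraction, lexicons and toy training data are illustrative assumptions, not the authors' exact setup:

```python
# A minimal sketch of the binary spot classifier using only word-level and
# domain-specific features from Table 6. The lexicons and the tiny training
# set are illustrative assumptions, not the authors' exact setup.
from sklearn.svm import SVC

SENTIMENT = {"loved", "love", "awesome", "rocks"}   # toy sentiment lexicon
DOMAIN = {"album", "track", "song", "tune"}         # toy domain lexicon

def features(comment: str, spot: str):
    tokens = comment.lower().replace("!", " ").replace(".", " ").split()
    return [
        int(spot.isupper()),                        # s.allCaps
        int(spot[0].isupper()),                     # s.firstCaps
        int(f'"{spot}"' in comment),                # s.inQuotes
        int(any(t in SENTIMENT for t in tokens)),   # s.Csent: sentiment in the comment
        int(any(t in DOMAIN for t in tokens)),      # s.Cdom: domain term in the comment
    ]

# toy training set: (comment, spot, label) with 1 = real music mention
train = [
    ("Got your new album Smile. Loved it!", "Smile", 1),
    ("Keep your SMILE on!", "SMILE", 0),
]
X = [features(c, s) for c, s, _ in train]
y = [label for _, _, label in train]
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([features("Your SMILE rocks!", "SMILE")]))
```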
Most Useful Combinations
Feature combinations range from precision-intensive to recall-intensive:
• FP best: all features, other combinations (42-91)
• TP next best: word, domain, contextual (POS) (78-50)
• TP best: word, domain, contextual (90-35)
Not all syntactic features are useless for informal text, contrary to general belief.
Naive MB spotter + NLP
• Annotate using naive spotter
  • best case baseline (artist is known)
• Follow with NLP analytics to weed out FPs
  • run on less than the entire input data
• PR tradeoffs: choosing feature combinations depending on end application requirement
[Chart: classification accuracy - precision for Lily Allen, Rihanna and Madonna, and recall, compared against the naive spotter baseline]
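A compact sketch of the two-stage pipeline: a dictionary-based naive spotter followed by a cheap filter that weeds out likely false positives; the filter rule (keep a spot only when a domain term co-occurs in the comment) is an illustrative stand-in for the SVM classifier above:

```python
# A minimal sketch of the two-stage pipeline: a dictionary-based naive spotter
# followed by a cheap filter that weeds out likely false positives. The filter
# rule (keep a spot only if a domain term appears in the comment) is an
# illustrative stand-in for the SVM classifier, not the authors' exact method.
DOMAIN = {"album", "track", "song", "tune", "single"}

def naive_spot(comment: str, titles: set):
    """Stage 1: exact-match spotting of single-token song titles (case-insensitive)."""
    words = [w.strip(".,!?").lower() for w in comment.split()]
    return [w for w in words if w in titles]

def weed_out_false_positives(comment: str, spots):
    """Stage 2: keep spots only when a domain-related term co-occurs in the comment."""
    words = {w.strip(".,!?").lower() for w in comment.split()}
    return spots if words & DOMAIN else []

titles = {"smile"}
for comment in ["Got your new album Smile. Loved it!", "Keep your SMILE on!"]:
    print(comment, "->", weed_out_false_positives(comment, naive_spot(comment, titles)))
```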
Summary
• Real-time large-scale data processing
  • prohibits computationally intensive NLP techniques
• Simple inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance
  • restricting the taxonomy results in proportionally higher precision
• Spot + Disambiguate is a feasible approach for (especially Cultural) NER in Informal Text
Thank You!
• Bing, Yahoo, Google: Meena Nagarajan
• Contact us
  • {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org
• More about this work
  • http://www.almaden.ibm.com/cs/projects/iis/sound/
  • http://knoesis.wright.edu/researchers/meena