Identifying and Explaining Peaks in Music Playing Data - UvA-DARE

Identifying and Explaining Peaks in
Music Playing Data
Bachelor Thesis
Information Studies: Human Centered Multimedia
University of Amsterdam
July 2013
Guido van Bruggen
Supervision
Coordination
Wouter Weerkamp
Anders Bouwer
Manos Tsagkias
1 TABLE OF CONTENTS
2 Abstract
3 Introduction
4 Related Work
5 Data
  5.1 Extraction
  5.2 Evaluation
  5.3 User Classification
  5.4 Dataset
6 Method
  6.1 Peak Detection
  6.2 Sample
  6.3 Manual peak explaining
  6.4 User Activity Retrieval
    6.4.1 Wikipedia view statistics
    6.4.2 Wikipedia edit statistics
    6.4.3 Tweets
  6.5 Automatic Peak Explanation Retrieval
    6.5.1 Most retweeted tweet
    6.5.2 Musicbrainz releases
    6.5.3 Wikipedia edits
7 Results
  7.1 Manual Peak Explanation
  7.2 Examples
    7.2.1 Lupe Fiasco
    7.2.2 Mumford & Sons
  7.3 Correlations
  7.4 Automatic Peak Explanation
    7.4.1 Musicbrainz Releases
    7.4.2 Wikipedia edits
    7.4.3 Most Retweeted Tweet
8 Discussion / Future Work
  8.1 Twitter as a source
  8.2 Real-time
  8.3 Influencing sources
  8.4 Automatically retrieving explanations
  8.5 Other sources
9 Conclusion
10 References
11 Appendix
2 ABSTRACT
Social media platforms allow users to let the world know what is on their mind at every moment.
Twitter users use the hashtag #nowplaying to let others know what music they are listening to.
By retrieving messages (tweets) containing this tag we are able to get an idea of what music the
world is playing right now. We use a dataset containing all these tweets retrieved in a period of
12 days. Using a peak detection method we identify peaks in the number of plays for all artists.
We assume that these peaks are caused by news concerning the artists, such as the release of a new
album or other ‘news’ such as gossip. A manual search of internet sources for a sample of these
peaks gives us an idea of the reasons people start listening to certain artists. Statistics on the
number of related Wikipedia page views, page edits and tweets allow us to check these sources for
similar patterns. Results show that new releases of albums or singles are the most common cause
for people to start listening to certain artists. Correlations calculated between the statistics
from several sources show that there are similar patterns for most cases in the sample. Three
simple methods for automatically retrieving explanations using Wikipedia, Musicbrainz and Twitter
are developed. Although results vary, we show that it is feasible to automatically explain peaks
in the listening behavior of Twitter users. Musicbrainz provided the best explanations for peaks.
3 INTRODUCTION
Since the launch of its microblogging platform in 2006, Twitter has grown very fast. With over
200 million active users in February 2013, it has become one of the most used websites worldwide.
Using only 140 characters, Twitter users send messages (tweets) about any topic and are able to
follow other users. [1] shows that tweets mostly cover very recent stories, and Twitter is
therefore known for bringing news faster than conventional news media. [2] confirms this by
showing that tweets about Michael Jackson’s death appeared more than an hour before the first
conventional news sources started reporting on it.
So-called hashtags are used in Twitter messages to define the subject or the event the tweet is
about. Hashtags are represented by the symbol ‘#’ followed by a word. The hashtag
#elections2012, for example, was used during the presidential elections in the United States in
2012. Although every Twitter user is able to make up and use their own tags, there are many tags
commonly used by the Twitter community. The hashtag #nowplaying (or the abbreviation #np),
is one of the most popular hashtags used by the Twitter community to let other users know what
music someone is listening to at the moment [3].
Because of the almost real-time character of Twitter and the worldwide spread of its users,
tweets containing #nowplaying hashtags can be used to get an idea of what the world is listening
to. This thesis uses a dataset created by monitoring the Twitter stream and saving all tweets
containing these tags during a period of twelve days.
By counting the number of tweets mentioning an artist and the nowplaying hashtag, it is possible
to get a daily amount of ‘plays’ for all artists in the dataset. Comparing these numbers makes it
possible to detect bursts in the listening behavior of Twitter users.
The assumption is that most of these bursts are caused by news concerning music artists, and that
we might be able to explain these bursts with news sources. For example: the release of a new
album of the band Coldplay will probably result in many people listening to that new album.
Furthermore, we have seen popularity and album sales rise as a reaction to the death of music
artists, like Michael Jackson.
Because the internet provides us with many news sources and ways to retrieve these, the task of
automatically finding bursts and related news sources is an interesting one. This thesis takes the
first steps and explores challenges in the task of automatically finding bursts and related news
sources in the ‘nowplaying’ dataset.
First, statistics on the activity of internet users on sources like Wikipedia and Twitter for the
same period as the ‘nowplaying’ dataset are retrieved and checked for similarities. The assumption
is that people, next to listening to an artist, might also react to news by searching the internet
for additional information, and that burst patterns therefore might also appear in other internet
sources providing news and information on music artists. Checking user activity statistics for
similarities makes it possible to verify bursts and gives an idea of which sources might be useful
to automatically explain bursts in the ‘nowplaying’ dataset.
Next, three sources are used in an attempt to automatically retrieve possible explanations for bursts:
Twitter, Wikipedia edits and Musicbrainz, a user-generated database containing information about
music artists and their releases [4].
By looking for peaks in the playing data and manually explaining them using internet sites, we try
to get an idea about the reasons behind the listening behavior of people. Findings derived from
this manual analysis are used to explore the challenges in the task of automatically retrieving
explanations.
4 RELATED WORK
Twitter provides us with a huge source of information about what is happening and popular right
now. Among others, [5] and [6] show that Twitter is a solid source for getting an idea of how the
world feels about certain topics, and how this information can be used to make predictions on
subjects like stock markets and movie revenues. These sentiment analysis techniques provide us
with information on how people feel about certain topics.
Another popular field of research is the early detection of newly emerging popular topics using
social media like Twitter. So-called first story detection algorithms monitor the Twitter stream
and try to cluster messages to find emerging popular stories or events among Twitter users. [7]
provides an algorithm for performing this analysis in a live, streaming context, with new tweets
arriving every moment. Results show that news or gossip covering celebrity deaths, like that of
Michael Jackson, spreads the fastest among Twitter users. Applications of these techniques include
the early detection of earthquakes [8].
These methods are not solely used on information available on Twitter, but also on sites like
Facebook and Wikipedia. [9] uses Wikipedia view statistics for the detection of popularity of
Wikipedia articles and uses this to provide users with popular pages related to their personal
interest.
While these techniques mainly focus on analyzing one source to detect emerging topics, combining
data from different sources could be of use in verifying candidate emerging topics.
[10] combines first story detection algorithms performed on tweets with page view statistics of
related Wikipedia pages to verify and improve the first story detection on Twitter. Results show
that Wikipedia view statistics can be a good source to verify emerging trending topics on Twitter.
The basis of detecting emerging topics lies in the so-called peak detection, which covers the task
of identifying big increases and decreases in data streams. [11] provides a method for detecting
peaks in queries to the MSN Search engine. [12] uses a similar method in recognizing popular
queries made to eBay.
By applying peak detection to mood levels in blogs, [13] finds interesting transitions and tries
to automatically explain peaks with a related event.
5 DATA
The dataset consists of a collection of tweets retrieved by monitoring the Twitter stream from
17 September 2012 until 28 September 2012 for the hashtags #nowplaying and #np. A brief
explanation of how this data was processed to extract song titles and artist names is given here;
[14] describes these methods and the evaluation in more detail.
5.1 EXTRACTION
The second step, after retrieving the tweets containing information on what people are listening
to, is to extract the name of the song and artist mentioned in the tweet. The extraction has been
done using a three-step approach. First, by using basic regular expressions, candidate song titles
and artist names are extracted. Musicbrainz is used to try to match these candidates, and to store
related information when a match is found.
If this first step fails and no match is found, Youtube¹ is used as a source to find possible matches.
The text of a tweet is used as a query to Youtube, and the top 10 results appearing in the Music
category are retrieved. Regular expressions on these results are used to find candidate song and
artist names. Again, by matching these candidates to Musicbrainz, possible candidates are
confirmed and saved. The third step, used when no matches are found, uses a ‘fuzzy’ matching in
which only a matching artist is required and the other most common candidate found in the tweet
text is used as the song title, even when there is no match with Musicbrainz.
5.2 EVALUATION
A manually annotated test set of 200 tweets was used to evaluate the extraction methods. The first
method, using regular expressions only, scores high on precision but low on recall. The second
step, using Youtube results, leads to an increase in recall but a slight drop in precision.
Finally, the fuzzy method shows a further drop in precision but an increase in recall. Using a
combination of these methods in the three-step approach leads to a precision of 0.6871 and a
recall of 0.6175 for song titles, and a precision of 0.7349 and a recall of 0.6100 for artist names.
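As an illustration of how such scores are obtained, the sketch below computes precision and recall from a set of extracted items and a set of manually annotated items. The function name and the set-based representation are assumptions made for this example; [14] describes the actual evaluation.

```python
def precision_recall(extracted, annotated):
    """Compute precision and recall for extracted items against a gold standard.

    Both arguments are sets of hashable items, e.g. (tweet_id, artist) pairs.
    This representation is an assumption made for illustration.
    """
    true_positives = len(extracted & annotated)
    # Precision: which fraction of the extracted items is correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: which fraction of the annotated items was found.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    found = {(1, "Muse"), (2, "Usher"), (3, "Booba"), (4, "Rihana")}
    gold = {(1, "Muse"), (2, "Usher"), (4, "Rihanna")}
    print(precision_recall(found, gold))
```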
1 http://www.youtube.com
5.3 USER CLASSIFICATION
Initial analysis showed that the Twitter accounts of some radio stations were producing noise in
the data by constantly tweeting about the same song or artist. Because most radio accounts show a
higher usage of the #nowplaying hashtag, information on the use of this tag was used as features
for training a random forest classifier. Evaluation based on a manually annotated set of Twitter
accounts shows high scores for precision and recall (P=0.7011, R=0.9839).
5.4 DATASET
This thesis uses a dataset containing the results of the mentioned extraction and classification
methods for a period of twelve days. The total number of retrieved tweets is 6,5 million. Every
tweet in the dataset related to an artist is considered to be one ‘play’ for this artist. By
grouping these tweets by date and counting the number of entries per day, a daily play amount is
calculated for every artist in the dataset.
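The grouping step described above can be sketched as follows. The input format, a sequence of (artist, day) pairs with one pair per tweet, is an assumption made for this example and is not the storage format of the actual dataset.

```python
from collections import defaultdict


def daily_play_counts(plays):
    """Count plays per artist per day.

    `plays` is an iterable of (artist, day) pairs, one pair per tweet.
    This input format is hypothetical and chosen for illustration only.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for artist, day in plays:
        counts[artist][day] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {artist: dict(days) for artist, days in counts.items()}


if __name__ == "__main__":
    tweets = [
        ("Muse", "2012-09-24"),
        ("Muse", "2012-09-24"),
        ("Muse", "2012-09-25"),
        ("Usher", "2012-09-22"),
    ]
    print(daily_play_counts(tweets))
```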
The dataset contains tweets mentioning 112.834 unique artists. Although this seems like a big
amount, there are many unique artists found in only a few tweets. Figure 1 shows the distribution
of unique artists related to the amount of plays found in the dataset.
[Chart ‘Unique Artists and Plays’: number of unique artists (y-axis, 0 to 80000) against amount of plays (x-axis, ticks at 1, 10, 100, 1000 and 10000).]
Figure 1 Division of Unique Artists
6 METHOD
6.1 PEAK DETECTION
Peak detection is a well-studied task. Most approaches use the mean and standard deviation to
calculate a threshold value for peaks. In a live context, with new information constantly
arriving, it is not possible to define a threshold value beforehand. Most approaches therefore
calculate means and standard deviations over a specific time range. [11] used this approach to
find bursts in MSN Search queries, by checking whether values are 1.5-2.0 times higher than the
‘moving average’.
Because the time range of our dataset is only 12 days, an approach based on moving averages is not
necessary, and we therefore calculate one threshold value for every artist in the dataset to
detect peaks. Every value that is larger than this threshold is labeled as a peak. The threshold
value is calculated as follows: threshold = μ + 1.5 · σ, where μ is the mean and σ is the standard
deviation of the daily number of tweets for an artist over the entire dataset.
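A minimal sketch of this thresholding rule is given below. The text does not state whether the population or the sample standard deviation was used; the population variant (`pstdev`) is an assumption here, as are the function name and the example values.

```python
from statistics import mean, pstdev


def detect_peaks(daily_plays, k=1.5):
    """Return the indices of days whose play count exceeds mean + k * std.

    Assumption: the population standard deviation is used; the thesis
    does not specify which variant was applied.
    """
    mu = mean(daily_plays)
    sigma = pstdev(daily_plays)
    threshold = mu + k * sigma
    return [day for day, plays in enumerate(daily_plays) if plays > threshold]


if __name__ == "__main__":
    # Twelve hypothetical daily play counts with one clear burst on the last day.
    plays = [100] * 11 + [500]
    print(detect_peaks(plays))
```

Note that a completely flat series has σ = 0, so no value exceeds the threshold and no peak is reported.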
6.2 SAMPLE
Because we are dealing with a large dataset, containing playing data for 112.843 artists, a small
sample of 20 artists is picked to be able to analyze this data in more detail. These artists are
picked by ranking the dataset on the standard deviation of the daily play amounts. This ranking
gives us a way to find the most interesting cases: artists with a high number of plays and high
variance, having big peaks with possibly interesting explanations.
The example of ‘The Beatles’ provides a motivation for this ranking. Play amounts for this band
show a quite low standard deviation, while the total number of plays is quite high. Because our
peak detection method uses the standard deviation to define a threshold, it finds 2 peaks in the
playing data: September 23 and September 24. While these days do show a small increase in the
number of plays, these peaks are probably not worth investigating, since their amounts are just
slightly higher than those of the other days (Figure 2). This low variance can also be seen in the
scatterplot for ‘The Beatles’ (Figure 3). These observations can be explained by the fact that
this band is no longer active and does not appear in the news, which gives the general public no
reason to suddenly listen to and search for the band.
By ranking on standard deviation we prevent artists whose data show statistics similar to those of
‘The Beatles’ from appearing in our sample, which leaves interesting cases for further analysis.
[Chart ‘The Beatles’: Wikipedia page views (axis 0 to 20000) and plays (axis 0 to 1200) per day in September 2012 (17 to 28).]
Figure 2 Play and Wikipedia data for The Beatles
[Scatter plot ‘The Beatles’: Wikipedia page views (y-axis, 0 to 1,2) against plays (x-axis, 0 to 1,2).]
Figure 3 Scatter Plot showing data for The Beatles
After applying the proposed ranking, we set two restrictions to select artists for the sample.
First, only artists that have one or more peaking days in the dataset are considered. Second,
peaks on the first or last day of the dataset are not considered useful, since it is not possible
to compare the number of plays with the number on the day before, or after, the peak. For such
peaks it is not possible to tell whether they are part of larger trends or not.
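The ranking and the two restrictions can be sketched as follows. This is one possible reading of the selection procedure: an artist qualifies when at least one detected peak falls strictly between the first and last day. The function names, the input format and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def peak_days(series, k=1.5):
    """Indices of values above mean + k * std (population std is assumed)."""
    threshold = mean(series) + k * pstdev(series)
    return [i for i, value in enumerate(series) if value > threshold]


def select_sample(plays_by_artist, size=20):
    """Pick up to `size` artists, ranked by standard deviation of daily plays.

    `plays_by_artist` maps an artist name to a list of daily play counts
    (a hypothetical format). Artists without a peak, or with peaks only on
    the first or last day, are skipped.
    """
    ranked = sorted(
        plays_by_artist,
        key=lambda artist: pstdev(plays_by_artist[artist]),
        reverse=True,
    )
    sample = []
    for artist in ranked:
        series = plays_by_artist[artist]
        last_day = len(series) - 1
        usable = [day for day in peak_days(series) if 0 < day < last_day]
        if usable:
            sample.append(artist)
        if len(sample) == size:
            break
    return sample
```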
Table 1 shows the names, mean and standard deviation of the 20 artists in our sample. The amount
of plays and detected peaks can be found in Table 5.
Table 1 Sample
Artist Name          Mean    Standard Deviation
One Direction        5929    2920
Rihanna              3117    1248
Justin Bieber        4129    677
The Weeknd           1671    635
Wiz Khalifa          2379    539
Taylor Swift         3995    530
Kendrick Lamar       1721    387
Lupe Fiasco          940     381
Kanye West           3443    379
fun.                 645     367
Usher                1970    342
Sean Paul            702     334
Booba                571     316
Big Sean             2328    302
Muse                 952     299
Mumford & Sons       513     296
Chris Brown          5089    285
LMFAO                419     234
Avenged Sevenfold    1492    223
Lana Del Rey         1166    240
6.3 MANUAL PEAK EXPLAINING
For the peaks appearing in the playing data of our sample of 20 artists, online sources were searched
manually for possible explanations. This gives an idea of why peaks are happening and whether
these (possible) explanations can be found using internet sources. Google Search, Wikipedia and
Musicbrainz were searched for events like news facts or new releases that might have been the
reason for increases in plays.
6.4 USER ACTIVITY RETRIEVAL
Daily statistics on the user activity on Wikipedia and Twitter were retrieved to get an idea about
the popularity of music artists every day. By comparing these statistics and by calculating
correlations between them, we are able to get an idea of how similar these patterns are and whether
these statistics might be useful to verify peaks found in the ‘nowplaying’ dataset.
6.4.1 Wikipedia view statistics
Wikipedia provides view statistics for all its pages as dumps², which can be used to get an
indication of the popularity of a Wikipedia article. In order to download these statistics it is
important to first retrieve the correct Wikipedia pages for every artist. This was done using the
Musicbrainz API, which provides URLs for the corresponding Wikipedia pages for many artists.
A website³ offering JSON files for the daily view statistics, extracted from the Wikipedia dumps,
was used to retrieve the view statistics for every artist in our sample.
6.4.2 Wikipedia edit statistics
Because of the large community of Wikipedia editors, Wikipedia seems to be a good source for
topical information about any subject. On 23 July 2011, the day that Amy Winehouse died, a total
of 303 edits were made to her Wikipedia page⁴, showing that (popular) Wikipedia articles are
likely to be up to date and that these edits could provide us with interesting information.
The Wikipedia API provides information on all Wikipedia articles, including a revision history. By
retrieving and counting all edits for one day, we are able to get the daily number of edits made
to Wikipedia pages related to music artists.
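The counting step can be sketched as follows, assuming the revision timestamps have already been retrieved as ISO 8601 strings (for example `2011-07-23T16:05:12Z`). The function name and the example timestamps are hypothetical; the retrieval itself is not shown.

```python
from collections import Counter


def edits_per_day(timestamps):
    """Count revisions per calendar day.

    `timestamps` is an iterable of ISO 8601 strings such as
    '2011-07-23T16:05:12Z'; the first ten characters hold the date.
    """
    return Counter(timestamp[:10] for timestamp in timestamps)


if __name__ == "__main__":
    # Hypothetical revision timestamps for one article.
    revisions = [
        "2011-07-23T16:05:12Z",
        "2011-07-23T18:40:00Z",
        "2011-07-24T00:10:00Z",
    ]
    print(edits_per_day(revisions))
```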
6.4.3 Tweets
The public streaming API provided by Twitter⁵ offers representative samples of around 1% of the
total number of tweets for every day. Every sample contains around 3.5 million tweets. These
archives can be used to get an idea of what was popular among Twitter users on a specific day. The
samples for all twelve days in the dataset were collected. The number of tweets mentioning a
specific music artist reflects the popularity of that artist among Twitter users. To retrieve this
number, the sample archives were queried for artist names using regular expressions on all tweets.
Inspection of the results showed that the archive for September 18 was not complete and contained
only 1 million tweets (Table 6). To still be able to compare this data, the number of matching
tweets was normalized by dividing it by the total number of tweets in the sample for that day.
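A sketch of the matching and normalization steps is shown below. The regular expression is an assumption made for illustration (case-insensitive, with the artist name not embedded in a longer word); the expressions actually used are not given in the text.

```python
import re


def count_mentions(tweets, artist):
    """Count tweets mentioning `artist` as a separate word or phrase.

    The pattern is hypothetical: it matches case-insensitively and rejects
    matches that are part of a longer word (e.g. 'Muse' in 'museum').
    """
    pattern = re.compile(
        r"(?<!\w)" + re.escape(artist) + r"(?!\w)", re.IGNORECASE
    )
    return sum(1 for text in tweets if pattern.search(text))


def normalized_mentions(tweets, artist):
    """Fraction of the day's sample that mentions `artist`."""
    if not tweets:
        return 0.0
    return count_mentions(tweets, artist) / len(tweets)


if __name__ == "__main__":
    sample = ["I love Muse", "museum visit", "MUSE live!", "nothing here"]
    print(normalized_mentions(sample, "Muse"))
```

Dividing by the size of the daily sample makes days with incomplete archives comparable with the other days.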
2 http://dumps.wikimedia.org/other/pagecounts-raw/
3 http://stats.groks.se
4 http://en.wikipedia.org/w/index.php?title=Amy_Winehouse&offset=20110724000000&action=history
5 https://dev.twitter.com/docs/streaming-apis/streams/public
6.5 AUTOMATIC PEAK EXPLANATION RETRIEVAL
6.5.1 Most retweeted tweet
People tweet about what is happening right now; tweets mentioning a music artist might therefore
contain information about news on that artist. The assumption is that the most popular tweet
(measured by its number of retweets) concerning an artist might be useful to describe bursts in
the playing data. The publicly available sample stream of Twitter, also used in the user activity
retrieval, is used to find the most retweeted tweet by searching for tweets mentioning a music
artist on the day of a peak in play amounts.
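The selection step can be sketched as follows. Tweets are represented here as dictionaries with `text` and `retweet_count` keys; this simplified representation and the plain substring match are assumptions made for this example.

```python
def most_retweeted(tweets, artist):
    """Return the tweet mentioning `artist` with the highest retweet count.

    `tweets` is a list of dicts with 'text' and 'retweet_count' keys
    (a simplified, hypothetical representation of the day's sample).
    Returns None when no tweet mentions the artist.
    """
    needle = artist.lower()
    matching = [tweet for tweet in tweets if needle in tweet["text"].lower()]
    if not matching:
        return None
    return max(matching, key=lambda tweet: tweet["retweet_count"])


if __name__ == "__main__":
    day_sample = [
        {"text": "Usher was amazing tonight", "retweet_count": 726},
        {"text": "listening to usher", "retweet_count": 12},
        {"text": "Muse rocks", "retweet_count": 9000},
    ]
    print(most_retweeted(day_sample, "Usher"))
```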
6.5.2 Musicbrainz releases
The release of a new single or album can be a reason for people to start listening to a specific artist.
Finding releases near a peaking day can therefore be a good source for the possible explanation of
a peak. Musicbrainz offers access to their database containing information like releases. Their API
was used to search for releases and to retrieve information like title and release date.
Manual analysis shows that some peaks are caused by early leaks of new albums or singles. Because
the Musicbrainz database contains only official releases, peaks that occur because of early leaks of
albums or singles might not be directly related to releases. Setting a bigger date range to query for
releases can easily solve this problem, since leaks are likely to appear on the internet only a couple
of days before the official release. The date range used in the algorithm was set to 4 days before
the peaking day until 4 days after the day of a peak.
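The date-range filter can be sketched as follows, assuming the releases of an artist have already been retrieved as (title, release date) pairs. The function name and input format are hypothetical; the window of 4 days on either side of the peak follows the text.

```python
from datetime import date, timedelta


def releases_near_peak(releases, peak_day, window_days=4):
    """Return the releases dated within `window_days` of `peak_day`.

    `releases` is a list of (title, release_date) pairs, a hypothetical
    format for data retrieved beforehand. Both window ends are inclusive.
    """
    earliest = peak_day - timedelta(days=window_days)
    latest = peak_day + timedelta(days=window_days)
    return [
        (title, released)
        for title, released in releases
        if earliest <= released <= latest
    ]


if __name__ == "__main__":
    candidates = [
        ("Diamonds", date(2012, 9, 27)),
        ("Hypothetical older single", date(2012, 1, 1)),
    ]
    print(releases_near_peak(candidates, date(2012, 9, 27)))
```

Because the window extends after the peak as well, a peak caused by a leak a few days before the official release date still finds that release.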
6.5.3 Wikipedia edits
Wikipedia edits are made for fluency or factual reasons: fluency edits are made to improve
readability and style, while factual edits are made to improve the content of the page [15]. The
Wikimedia API is used to check whether edits were made near the day of a peak for Wikipedia pages
related to music artists. The API provides several pieces of metadata about these revisions, such
as time, user id, user comment and the increase or decrease in page size. Using this metadata, it
is possible to filter revisions. Besides this metadata, it is also possible to retrieve a snippet
of HTML, showing the previous revision of the page and marking the changes in text (Figure 4).
Figure 4 Edit Made to ‘Avenged Sevenfold’ Wikipedia Article
Based on the manual analysis of our sample, we choose to look for Wikipedia edits made from one
day before until one day after the day of the peak. The assumption is that the revision causing
the biggest increase in page size is probably the most indicative of a news event happening that
day. Therefore the revision with the biggest increase in page size is picked as a possible
explanation for a peak.
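This selection rule can be sketched as follows. Revisions are represented as dictionaries with a `day` and a `size_delta` key; this representation is an assumption made for this example, as is the choice to ignore revisions that do not increase the page size.

```python
from datetime import date, timedelta


def best_revision(revisions, peak_day, window_days=1):
    """Pick the revision with the largest page-size increase near a peak.

    `revisions` is a list of dicts with 'day' (datetime.date) and
    'size_delta' (change in page size) keys, a hypothetical format.
    Assumption: revisions that shrink or do not change the page are
    ignored. Returns None when no candidate exists.
    """
    earliest = peak_day - timedelta(days=window_days)
    latest = peak_day + timedelta(days=window_days)
    candidates = [
        revision
        for revision in revisions
        if earliest <= revision["day"] <= latest and revision["size_delta"] > 0
    ]
    return max(candidates, key=lambda revision: revision["size_delta"], default=None)


if __name__ == "__main__":
    history = [
        {"day": date(2012, 9, 20), "size_delta": 350},
        {"day": date(2012, 9, 21), "size_delta": 40},
        {"day": date(2012, 9, 25), "size_delta": 9000},
    ]
    print(best_revision(history, date(2012, 9, 21)))
```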
7 RESULTS
7.1 MANUAL PEAK EXPLANATION
Manual analysis of the peaks appearing in the sample shows that most peaks can be related to new
albums or singles in the same period (Table 2). Out of all 20 peaks, 13 are related to releases. Of
the remaining 7 cases, 3 peaks are news-related. The remaining 4 cases have no clear explanation.
Table 2 Peak Explanation
Artist Name        Peak Date  Explanation Type  Explanation
One Direction      21-9-2012  Release           Single 'Live While We're Young' leaked on 20-9-2012
Rihanna            27-9-2012  Release           Single 'Diamonds' released on 27-9-2012
Justin Bieber      22-9-2012  News              Bieber's mother talked in TV show about possible abortion of her son
The Weeknd         25-9-2012  Release           Single release "Remember You" of Wiz Khalifa, featuring The Weeknd
Wiz Khalifa        25-9-2012  Release           Single release "Remember You"
Taylor Swift       25-9-2012  Release           Single release 'Begin Again'
Kendrick Lamar     22-9-2012  Unknown           Unknown
Lupe Fiasco        25-9-2012  Release           Album release 'Food & Liquor II: The Great American Rap Album, Part 1'
Kanye West         19-9-2012  Release           Album release 'Cruel Summer', containing songs from Kanye West
fun.               18-9-2012  News              Nominated for MTV Europe Awards
Usher              22-9-2012  Unknown           Unknown
Sean Paul          19-9-2012  Release           Album release 'Tomahawk Technique'
Booba              21-9-2012  Release           Single release 'Caramel'
Big Sean           19-9-2012  Unknown           Unknown
Muse               24-9-2012  Release           Album release 'The 2nd Law', pre-listen online at 24-9
Mumford & Sons     25-9-2012  Release           Album release 'Babel' at 25-9-2012
Chris Brown        23-9-2012  Unknown           Unknown
Lana del Rey       25-9-2012  Release           Single release 'Ride'
LMFAO              20-9-2012  News              Announced split
Avenged Sevenfold  24-9-2012  Release           Single release 'Carry On (Call of Duty: Black Ops II Version)'
7.2 EXAMPLES
The cases of the artists ‘Lupe Fiasco’ and ‘Mumford and Sons’ are used to give an idea of the
findings.
7.2.1 Lupe Fiasco
Figure 5 shows the distribution of statistics on plays, Wikipedia page views and tweets for the
rapper ‘Lupe Fiasco’. The peak detection algorithm found 2 peaks in the playing data for ‘Lupe
Fiasco’: September 25 and 26, which are caused by the release of a new album on September 25.
Although the peak detection algorithm discovered just 2 peaks, we do see another, smaller peak in
all 3 data sources on September 20. When searching for this day, we find several Google Search
results mentioning the leak of the album, which is the probable cause of the peak.
The graph shows similar patterns for the three sources, having the same rises and falls. This
similarity is also reflected in the Pearson correlation coefficients between the sources, which
all exceed 0.90.
[Chart ‘Lupe Fiasco’: plays, Wikipedia page views and tweets as fraction of total (y-axis, 0 to 0,18) per day in September 2012 (17 to 28), with annotations ‘Album Leak’ and ‘Album Release’.]
Figure 5 Data for Lupe Fiasco
7.2.2 Mumford & Sons
Figure 6 shows the statistics retrieved for the band ‘Mumford & Sons’. The peak detection
algorithm found a peak on September 25, which turned out to be the release date of their new
album. The number of tweets mentioning the band shows a pattern similar to that of the number of
plays. The statistics for the number of Wikipedia page views, however, show a peak 2 days earlier,
on September 23. An explanation for this different pattern was not found.
[Chart ‘Mumford & Sons’: plays, Wikipedia page views and tweets as fraction of total (y-axis, 0 to 0,25) per day in September 2012 (17 to 28), with annotation ‘Album Release’.]
Figure 6 Data for Mumford and Sons
7.3 CORRELATIONS
The 4 sources on online user activity are checked for similarities by calculating the Pearson
correlation coefficients between the data for every artist in the sample. This shows us that there
are high correlations (> 0.80) between several sources (Table 3). Only the amount of Wikipedia
edits doesn’t seem to have high correlations with any of the other sources of data.
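For reference, the Pearson correlation coefficient between two daily series can be computed as in the sketch below. The function name is hypothetical, and returning 0.0 when one of the series is constant (where the coefficient is undefined) is a convention chosen for this example, not something stated in the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        # Assumption: report 0.0 for a constant series, where the
        # coefficient is mathematically undefined.
        return 0.0
    return covariance / sqrt(variance_x * variance_y)


if __name__ == "__main__":
    # Hypothetical daily plays and page views that rise and fall together.
    plays = [100, 120, 400, 380, 110]
    views = [900, 1000, 3500, 3000, 950]
    print(round(pearson(plays, views), 3))
```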
Table 3 Pearson Correlation Coefficients Between Sources
Artist             Plays -     Plays -     Plays -  Wiki Views -  Wiki Views -  Wiki Edits -
                   Wiki Views  Wiki Edits  Tweets   Wiki Edits    Tweets        Tweets
One Direction      0,765       -0,078      0,000    -0,311        0,594         0,000
Rihanna            0,869       0,306       0,820    0,288         0,829         0,578
Justin Bieber      0,180       0,013       0,724    0,049         0,029         0,261
The Weeknd         0,986       0,268       0,938    0,264         0,922         0,289
Wiz Khalifa        0,537
Taylor Swift       0,622       0,439       0,818    0,317         0,694         0,147
                   -0,062      0,126       0,130    0,736         0,581         0,439
Lupe Fiasco        0,968       -0,006      0,943    -0,030        0,943         0,015
Kanye West         0,815       0,258       0,612    0,256         0,427         0,053
fun.               0,340       0,005       -0,894   0,106         -0,248        -0,184
Usher              0,449       -0,268      0,514    -0,238        0,912         -0,122
Sean Paul          0,883       -0,298      0,562    -0,065        0,247         -0,607
Booba              0,806       0,406       0,934    0,542         0,587         0,317
Big Sean           0,586       -0,136      0,762    0,229         0,760         0,147
Muse               0,853       0,012       0,888    0,171         0,929         0,038
Mumford & Sons     0,537
Chris Brown        0,209       -0,465      0,086    -0,057        0,749         -0,226
Lana Del Rey       0,815       0,522       0,814    0,499         0,733         0,813
LMFAO              -0,311      -0,099      -0,113   0,628         0,358         0,437
Avenged Sevenfold  0,809       0,102       0,578    0,409         0,861         0,543
Kendrick Lamar     0,903       0,154       0,939    0,775
Correlations between amount of plays, Wikipedia page views and tweets are quite high for several
artists. 7 artists have high scores (> 0.80) on more than one of the correlation coefficients. Four of
these 7 artists (Rihanna, The Weeknd, Lupe Fiasco and Muse) have three high correlation
coefficients, showing that statistics on the amount of plays, tweets and Wikipedia page views tend
to be similar.
When combining the correlations with our manual peak explanation, we find that of the 3 artists
whose peaks are news-related, none have high correlations between any of the retrieved sources.
Of the 4 artists whose peak explanations are unknown, only one case (Usher) shows a high
correlation. Furthermore, 12 of the 13 artists with release-related peaks show at least one high
correlation coefficient, suggesting that releases in particular lead to high and concurrent
increases in plays, tweets and Wikipedia page views.
Although these observations are based on only 20 cases, the high correlations found for several
cases provide first evidence of similarities between the usage of different internet sources,
which might be an interesting feature for the verification of peaks found in the nowplaying dataset.
7.4 AUTOMATIC PEAK EXPLANATION
7.4.1 Musicbrainz Releases
The algorithm used to retrieve new releases of songs and albums using the Musicbrainz API
performs reasonably well. The database seems to be quite complete, containing most official
releases for many artists. The sample of 20 artists showing peaks in listening behavior contains
13 cases in which new releases are likely to be the cause of peaks. The algorithm is able to find
10 of these 13 cases using the Musicbrainz API. The cases in which the algorithm fails to retrieve
releases can be explained by the fact that these are releases of albums or singles made in
cooperation with other artists. Apparently the Musicbrainz database does not hold this
information for every release.
7.4.2 Wikipedia edits
The algorithm used to select relevant Wikipedia edits found a total of 16 edits for the 20 artists in
our sample. Ten of these edits turn out to be relevant and explanatory for peaks, which leaves a
total of 6 edits that were of no relevance. In 4 of the 20 cases in the sample there were no possible
explanatory edits found by the algorithm.
The retrieved edit for the band ‘One Direction’ provides an example of a positive outcome:
“On 20 September 2012, the fist single Live While We're Young and its official video leaked first on the internet
through the site SoundCloud…”
Another example provides a possible explanation for a peak in plays for ‘Justin Bieber’:
“In a September 2012 interview with the TODAY show, Bieber's mother Pattie talked about how everyone around
her tried to push her toward abortion, but refused to abort…”
While these examples show news-related edits, the retrieved edit for ‘Rihanna’ shows that this is
not always the case:
“Originally marketed as a reggae singer, Rihanna's musical genre has changed throughout the course of her career,
which includes pop music, R&B, hip hop…..”
7.4.3 Most Retweeted Tweet
Only 3 of the 20 retrieved tweets contain information relevant for peak explanation (Table 4).
Although the number of relevant selected tweets is low, the method did help in finding a possible
new explanation for one case. The initial manual search for an explanation of the peak for ‘Usher’
on September 22 did not yield a clear reason for the peak. However, the most retweeted tweet of
that day gives an idea of why people started listening to Usher. Apparently he gave a strong
performance at the iHeart Radio Festival in Las Vegas, as mentioned by the most retweeted tweet:
“ @scooterbraun: And that ladies and gentleman is why @UsherRaymondIV is one of the greatest entertainers of
all time!! Wow! #IHEART radio #VEGAS ”
While in this case the most retweeted tweet actually was news-related, in most cases this is not true,
as can be seen in the example of One Direction:
“ @googlefacts: Koutaliaphobia is the fear of spoons. Liam Payne from One Direction says he's scared of spoons. ”
Table 4 Most Retweeted Tweets

| Artist | Tweet | RT Amount | Relevant? |
| --- | --- | --- | --- |
| One Direction | @Real_Liam_Payne: “@googlefacts: Koutaliaphobia is the fear of spoons. Liam Payne from One Direction says he's scared of spoons." | 49817 | N |
| Rihanna | @ElChisteLatino: Katy Perry: Cabello azul. Nicki Minaj: Cabello rosado. Rihanna: Cabello rojo. Lady Gaga: Cabello verde. ¡Los Power Rangers Regresaron! | 4503 | N |
| Justin Bieber | @joejonas: I cry because I love Justin Bieber!!! | 74197 | N |
| The Weeknd | @MixedBoyTatted: If Adele, The Weeknd, Drake, and Frank Ocean made an album together. Everyone would be in their deepest feelings. | 1064 | N |
| Wiz Khalifa | @CelebFactstory: Wiz Khalifa admits to spending at least $10,000 a month on weed. | 5014 | N |
| Taylor Swift | @factsonIove: Taylor Swift's son: Now that's going to be a boy who knows how to treat a girl. | 3758 | N |
| Kendrick Lamar | @Yo__SheBAd_DoE: Kendrick Lamar is the most over looked rapper out right now and when he drop his album ya'll ain't allowed on the banwagon | 380 | N |
| Lupe Fiasco | @FxckShugz: Lupe Fiasco could diss Chief Keef and the lyrics would be so developed that Keef wouldn't even understand how he got dissed. | 7811 | N |
| Kanye West | @GhettoEnglish: "Watchu know bout that Yeezy!?!?!" = Do you listen to Kanye West? | 791 | N |
| fun. | @justinbieber: yes i would really like to go to a prom someday. just be a normal kid and have fun. maybe even get a kiss during the .. | 26304 | N |
| Usher | @scooterbraun: And that ladies and gentleman is why @UsherRaymondIV is one of the greatest entertainers of all time!! Wow! #IHEART radio #VEGAS | 726 | Y |
| Sean Paul | @BeyonceLite: Sean Paul Talks about Beyoncé & the No.1 single 'Baby Boy' http://t.co/DLuNaIll | 13 | N |
| Booba | @HaterzFr: Le clip "Caramel" de Booba porte bien son nom. Bon marché, trop sucré, un peu mou et collant, il glisse mais n'a aucune co | 222 | Y |
| Big Sean | @Drake: "Higher" off Big Sean mixtape is CRAZY. | 11593 | N |
| Muse | @Sr_Colmenero: Muse es TT por su música, Justin Bieber es TT porque tiene 28 millones de seguidoras que se pasan el día escribiendo | 1486 | N |
| Chris Brown | @Realtaeyang: Just covered Chris Brown-Don't judge me. This song defines many meanings to me. http://t.co/utegJXMj | 4838 | N |
| Mumford & Sons | @PigeonJon: Turns out Mumford and Sons are not actually a firm of Removal Men. Frankly livid about this. | 140 | N |
| Lana Del Rey | @LyricalPhrase: "Heaven's in your eyes" - Lana Del Rey | 165 | N |
| LMFAO | @KevinHart4real: Security grabbed her and she yelled out "I HAD TO LICK HIS FACE......I LOVE HIS LITTLE ASS" LMFAO | 2391 | N |
| Avenged Sevenfold | @Metal_Hammer: Avenged Sevenfold have released a new song! Check out 'Carry On' right here, people: http://t.co/R2QitBVl | 276 | Y |
8 DISCUSSION / FUTURE WORK
8.1 TWITTER AS A SOURCE
It is important to keep in mind that people are free to decide whether they let the world know what
music they are playing or keep this private. The information used in this thesis could therefore
be biased, because people might only share their listening behavior for certain music. The counts
used in this thesis therefore only give an indication of what was listened to in September 2012.
Using Twitter as a source also gives possibilities for interesting future research. Because the Twitter
API provides not only the text of a tweet but also features such as user ID and location, it is possible
to retrieve more information, which could lead to new research questions. For example,
retrieving the tweets posted just before and after a tweet containing the ‘nowplaying’ hashtag could
give insight into what people do just before or after listening to a song.
8.2 REAL-TIME
While this thesis uses a dataset of twelve days to analyze different sources on the internet, future
work could focus on retrieving and analyzing this data from various sources in a (near) real-time
context. This poses new challenges, for example in detecting peaks, where time windows would
be needed to define new thresholds as new data arrives. It also offers new opportunities for using
other sources, since the restriction (encountered in this thesis) to sources with archives no
longer applies.
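To make the idea of window-based thresholds concrete, the sketch below recomputes a threshold from the most recent values each time a new count arrives. It is an assumption-laden illustration, not the peak detection method of Section 6.1: the rule "mean plus `k` standard deviations of the last `window` values" and the default parameter values are choices made only for this example.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, Iterator, Tuple


def streaming_peaks(counts: Iterable[float], window: int = 5,
                    k: float = 2.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for every value that exceeds
    mean + k * stdev of the preceding `window` values.

    The threshold is recomputed as each value arrives, so the full
    series does not have to be known in advance.
    """
    history: Deque[float] = deque(maxlen=window)
    for i, value in enumerate(counts):
        if len(history) == window:  # wait until the window is filled
            threshold = mean(history) + k * pstdev(history)
            if value > threshold:
                yield i, value
        history.append(value)


# Daily play counts for Muse, 17-9-2012 to 28-9-2012 (Table 5).
muse = [719, 666, 710, 725, 693, 698, 862, 1626, 1250, 1249, 1163, 1063]
print(list(streaming_peaks(muse)))
```

Because a detected peak enters the window itself, the threshold rises sharply right after a burst. Whether that is desirable, or whether peaks should be excluded from the history, is a design decision that a real-time system would have to make.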
8.3 INFLUENCING SOURCES
Another aspect of this data that might be worth analyzing is the assumption that sources
influence each other and therefore do not show bursts at exactly the same moment. An example
would be a news event that causes people to first look up the Wikipedia page of an artist and
maybe later on, influenced by the Wikipedia page, listen to the artist. New releases might cause
streams to behave in the opposite way: people first listen to a new album and
afterwards use Wikipedia to find more information about the artist. Retrieving hourly data instead
of aggregated daily amounts might be needed to analyze this.
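One way such an analysis could be set up is a lagged correlation: correlate one source with a time-shifted copy of another and see which shift fits best. The following is a minimal sketch under that assumption; the function names and the choice of Pearson correlation are illustrative and not taken from this thesis. The example series are the Mumford & Sons page views (Table 7) and plays (Table 5), where the page views peak on 23-9-2012 and the plays on 25-9-2012.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_correlations(leader: Sequence[float], follower: Sequence[float],
                        max_lag: int) -> Dict[int, float]:
    """Correlate leader[t] with follower[t + lag] for every lag in
    -max_lag..max_lag. Both series must have the same length.

    A positive best lag suggests that `follower` reacts after `leader`;
    a negative one suggests the opposite order.
    """
    n = len(leader)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = leader[: n - lag], follower[lag:]
        else:
            a, b = leader[-lag:], follower[: n + lag]
        result[lag] = pearson(a, b)
    return result


# Mumford & Sons, 17-9-2012 to 28-9-2012 (Tables 7 and 5).
views = [13487, 11466, 13454, 15745, 18287, 16855, 74075, 46883, 43869, 45929, 38566, 32543]
plays = [254, 249, 307, 360, 326, 225, 381, 616, 1107, 1076, 682, 578]
by_lag = lagged_correlations(views, plays, max_lag=3)
best_lag = max(by_lag, key=by_lag.get)
print(best_lag, round(by_lag[best_lag], 2))
```

With only twelve daily values, each lag is estimated from very few points and can only be expressed in whole days, which is why hourly data would be the more suitable input.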
8.4 AUTOMATICALLY RETRIEVING EXPLANATIONS
Although this thesis shows that Musicbrainz can be a solid source for retrieving information about
new releases of albums or singles, the automatic retrieval of possibly explanatory Wikipedia edits
appears to be a bigger challenge. Although choosing the revision with the biggest increase in page
size filters out most (small) fluency edits, not all retrieved edits contain news-related information.
Examples are edits made to the biography or related artists sections. Using other metadata
about revisions might be needed to correctly classify news-related edits. For example, whether an
edit gets reverted by another Wikipedia user might be a good feature for filtering
out spam or vandalism edits. Another possible solution could be to combine multiple edits instead
of choosing only one.
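The page-size heuristic, extended with the suggested revert filter, could look roughly like the sketch below. The `Revision` record and its `reverted` flag are hypothetical: how revert information would be obtained from the revision history is left open here, and the code is an illustration rather than the algorithm used in this thesis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revision:
    """Hypothetical revision record; the field names are assumptions."""
    rev_id: int
    size: int               # page size in bytes after this revision
    reverted: bool = False  # True if a later edit undid this revision


def largest_growth_revision(revisions: List[Revision],
                            skip_reverted: bool = True) -> Optional[Revision]:
    """Return the revision that added the most bytes compared to its
    predecessor, or None if no revision increased the page size.

    `revisions` must be ordered from oldest to newest; the first entry
    only serves as the baseline size.
    """
    best, best_delta = None, 0
    for prev, cur in zip(revisions, revisions[1:]):
        if skip_reverted and cur.reverted:
            continue  # likely spam or vandalism
        delta = cur.size - prev.size
        if delta > best_delta:
            best, best_delta = cur, delta
    return best
```

Returning several revisions ranked by growth, instead of only the largest one, would be a small change and corresponds to the idea of combining multiple edits.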
Another reason why Wikipedia might not be a very good source for retrieving explanations is
its policy that every fact on a page should have a good source. Gossip is therefore
removed until there is solid confirmation of the fact. Such gossip, however, is likely to cause
increases in plays and page views and therefore has the potential of being a good explanation for
observed peaks.
8.5 OTHER SOURCES
Using sources other than Twitter, Musicbrainz and Wikipedia might give new insights into
how people react to news and might provide other possibilities for (automatically) explaining
bursts.
Last.fm or Spotify are possible sources that could provide more playing statistics, which might be
more neutral than tweets written by people willing to share the music they listen to. Querying news
sites or social media might give better ways of explaining peaks with news facts than the Wikipedia
revision history.
9 CONCLUSION
Retrieving Twitter messages containing the hashtag ‘#nowplaying’ gives us an idea of what the
world is listening to right now. Counting the daily number of these tweets and grouping them by
artist provides a daily play count which can be used to detect peaks in popularity. The
assumption that these peaks can be related to new releases or news facts is supported by the manual
analysis of a sample of 20 artists in the dataset: only 3 cases have no clear explanation, while the
remaining 17 cases can be related to news or releases. A new release of a single or an album is the
most common reason for increases in plays: 13 of the 17 peaks can be related to new releases.
Useful sources to find explanations for these peaks include Musicbrainz, Google Search and
Wikipedia revision histories.
The assumption that the ups and downs in the ‘nowplaying’ dataset are also visible in data from
other internet sources seems to be legitimate: calculated correlations between number of plays,
Wikipedia page views and tweets mentioning artist names show that patterns are quite similar.
However, the daily number of Wikipedia edits shows no high correlations with the other sources.
Using the combination of statistics on Wikipedia and Twitter usage might be useful for verifying
peak detection: most peaks show high correlations between different sources, while the peaks with
unknown reason do not.
Using the insights gained by manually finding explanations for peaks, three algorithms were
developed to automatically retrieve explanations for peaks found in the dataset.
Releases can easily be found using the Musicbrainz API, which proves to be a solid source for
finding this information. Wikipedia edits can also be used to explain peaks; however, automatically
choosing possibly explanatory edits poses some challenges. While using the increase in page size as
the only feature is useful in selecting factual edits over fluency edits, it does not seem to be
sufficient for classifying edits as news-related or not.
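Once the releases of an artist have been retrieved, relating them to a peak comes down to a date comparison. The sketch below shows one possible version of that step. It does not call the Musicbrainz API: it assumes the releases are already available as (title, date) pairs, and the seven-day look-back window is an arbitrary value chosen for the example, not a parameter reported in this thesis.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def releases_near_peak(releases: Iterable[Tuple[str, date]], peak_day: date,
                       days_before: int = 7,
                       days_after: int = 0) -> List[Tuple[str, date]]:
    """Keep the (title, release_date) pairs whose date lies within
    [peak_day - days_before, peak_day + days_after], oldest first."""
    start = peak_day - timedelta(days=days_before)
    end = peak_day + timedelta(days=days_after)
    hits = [(title, day) for title, day in releases if start <= day <= end]
    return sorted(hits, key=lambda item: item[1])
```

An empty result for a peak means that no release falls in the window, in which case another source, such as Wikipedia edits or tweets, would have to provide the explanation.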
The assumption that the most retweeted tweet containing the name of an artist might be interesting,
news-related information is not supported by the results: only 3 of the 20 retrieved tweets contain
information related to peaks.
By manually exploring the data in the ‘nowplaying’ dataset and by combining this with data
retrieved from other internet sources, we provide insight into how people react to news about music
artists. Furthermore, by developing three algorithms, we take the first steps in the task of
automatically relating news facts to peaks found in the playing data retrieved from Twitter.
11 APPENDIX
Table 5 Play Amounts (peaks are marked red)

| Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One Direction | 2619 | 2357 | 2642 | 9911 | 10958 | 9188 | 9110 | 5303 | 4884 | 4916 | 4632 | 4624 |
| Rihanna | 2553 | 2761 | 2431 | 2209 | 2682 | 2468 | 2521 | 2162 | 2205 | 6176 | 5062 | 4171 |
| Justin Bieber | 4528 | 4034 | 4032 | 4109 | 3907 | 5809 | 5108 | 3889 | 3595 | 3562 | 3423 | 3549 |
| The Weeknd | 1437 | 1333 | 1287 | 1233 | 1084 | 1107 | 1194 | 1817 | 3286 | 2473 | 2073 | 1726 |
| Wiz Khalifa | 1899 | 2057 | 2151 | 2028 | 2126 | 2027 | 1985 | 2465 | 3864 | 2986 | 2616 | 2344 |
| Taylor Swift | 3969 | 3455 | 3592 | 3606 | 3554 | 4171 | 4191 | 3328 | 5294 | 4569 | 4191 | 4021 |
| Kendrick Lamar | 1490 | 1398 | 1678 | 1646 | 1388 | 2797 | 1533 | 1511 | 1802 | 2211 | 1659 | 1541 |
| Lupe Fiasco | 517 | 502 | 628 | 1184 | 832 | 698 | 643 | 886 | 1657 | 1624 | 1209 | 894 |
| Kanye West | 3731 | 3745 | 4069 | 3835 | 3738 | 3438 | 3275 | 3296 | 3241 | 3173 | 3145 | 2630 |
| fun. | 469 | 1849 | 625 | 642 | 575 | 527 | 500 | 570 | 532 | 471 | 534 | 451 |
| Usher | 2408 | 1771 | 1889 | 1913 | 2037 | 2913 | 1935 | 1820 | 1784 | 1831 | 1743 | 1593 |
| Sean Paul | 639 | 1126 | 1618 | 821 | 695 | 591 | 600 | 432 | 452 | 488 | 462 | 496 |
| Booba | 364 | 239 | 268 | 349 | 1354 | 997 | 815 | 620 | 488 | 476 | 472 | 415 |
| Big Sean | 2727 | 2625 | 2786 | 2628 | 2286 | 2398 | 2127 | 2160 | 2273 | 2151 | 2038 | 1733 |
| Muse | 719 | 666 | 710 | 725 | 693 | 698 | 862 | 1626 | 1250 | 1249 | 1163 | 1063 |
| Mumford & Sons | 254 | 249 | 307 | 360 | 326 | 225 | 381 | 616 | 1107 | 1076 | 682 | 578 |
| Chris Brown | 5183 | 5020 | 5134 | 4775 | 4801 | 5237 | 5758 | 4950 | 4616 | 5077 | 5324 | 5194 |
| Lana Del Rey | 956 | 908 | 919 | 1004 | 998 | 1111 | 1141 | 1204 | 1735 | 1497 | 1333 | 1185 |
| LMFAO | 411 | 340 | 433 | 1185 | 357 | 370 | 362 | 312 | 319 | 321 | 333 | 290 |
| Avenged Sevenfold | 1416 | 1167 | 1219 | 1318 | 1367 | 1444 | 1524 | 1526 | 2033 | 1710 | 1629 | 1555 |
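Table 6 Tweets Mentioning Artist Name

| Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One Direction | 1601 | 461 | 1670 | 2551 | 2240 | 2355 | 2496 | 1743 | 1462 | 1741 | 1565 | 1684 |
| Rihanna | 704 | 176 | 621 | 592 | 657 | 901 | 761 | 855 | 702 | 2173 | 1060 | 782 |
| Justin Bieber | 1387 | 336 | 1208 | 1114 | 1086 | 1396 | 1625 | 1321 | 1049 | 1407 | 1226 | 1213 |
| The Weeknd | 81 | 21 | 88 | 76 | 64 | 70 | 87 | 167 | 196 | 168 | 123 | 102 |
| Wiz Khalifa | 135 | 53 | 196 | 163 | 128 | 173 | 164 | 208 | 290 | 220 | 190 | 170 |
| Taylor Swift | 475 | 122 | 536 | 407 | 372 | 490 | 544 | 425 | 616 | 570 | 434 | 397 |
| Kendrick Lamar | 129 | 69 | 192 | 176 | 132 | 140 | 144 | 162 | 184 | 205 | 169 | 159 |
| Lupe Fiasco | 38 | 11 | 40 | 95 | 38 | 29 | 41 | 64 | 92 | 112 | 68 | 56 |
| Kanye West | 186 | 55 | 260 | 254 | 325 | 197 | 180 | 275 | 173 | 164 | 160 | 138 |
| fun. | 19991 | 5149 | 19987 | 19607 | 19233 | 17360 | 21722 | 20494 | 18568 | 19897 | 19911 | 19073 |
| Usher | 712 | 94 | 293 | 349 | 224 | 294 | 283 | 242 | 231 | 243 | 212 | 245 |
| Sean Paul | 18 | 10 | 37 | 36 | 23 | 19 | 25 | 13 | 28 | 30 | 31 | 24 |
| Booba | 40 | 20 | 55 | 46 | 320 | 155 | 120 | 82 | 58 | 59 | 39 | 57 |
| Big Sean | 221 | 62 | 306 | 219 | 207 | 154 | 159 | 177 | 201 | 192 | 187 | 146 |
| Muse | 479 | 150 | 505 | 504 | 539 | 414 | 472 | 828 | 687 | 874 | 657 | 738 |
| Chris Brown | 647 | 134 | 492 | 475 | 424 | 407 | 420 | 384 | 430 | 530 | 590 | 503 |
| Mumford & Sons | 58 | 8 | 58 | 75 | 82 | 68 | 295 | 278 | 444 | 423 | 248 | 221 |
| Lana Del Rey | 127 | 17 | 100 | 138 | 124 | 110 | 131 | 221 | 222 | 180 | 138 | 143 |
| LMFAO | 1655 | 378 | 1682 | 1478 | 1705 | 1331 | 1646 | 1698 | 1464 | 1498 | 1551 | 1409 |
| Avenged Sevenfold | 53 | 15 | 46 | 40 | 50 | 44 | 42 | 122 | 99 | 64 | 51 | 48 |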
Table 7 Wikipedia Page Views

| Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One Direction | 43542 | 40190 | 38915 | 52565 | 66942 | 66926 | 69991 | 61740 | 56044 | 56684 | 57460 | 58296 |
| Rihanna | 22788 | 20403 | 19714 | 19327 | 20020 | 22175 | 26847 | 25124 | 21460 | 35051 | 31441 | 25774 |
| Justin Bieber | 25382 | 35677 | 43740 | 33437 | 33934 | 35103 | 35688 | 33378 | 31934 | 36907 | 28481 | 26227 |
| The Weeknd | 5772 | 5399 | 5235 | 5406 | 5213 | 4837 | 5423 | 6867 | 10083 | 8252 | 7668 | 7183 |
| Wiz Khalifa | 14312 | 15344 | 13852 | 13415 | 14100 | 11455 | 11857 | 14020 | 16444 | 14164 | 15942 | 16766 |
| Taylor Swift | 23507 | 21995 | 23065 | 20482 | 20380 | 20931 | 24280 | 23126 | 25366 | 24277 | 22599 | 21318 |
| Kendrick Lamar | 9271 | 9862 | 11377 | 10160 | 9472 | 8449 | 8860 | 10258 | 10179 | 11753 | 10450 | 9548 |
| Lupe Fiasco | 6793 | 10948 | 9445 | 17509 | 12520 | 9900 | 9235 | 12480 | 21001 | 20396 | 15577 | 12432 |
| Kanye West | 25727 | 26980 | 27173 | 24738 | 23533 | 20917 | 26120 | 24801 | 21502 | 20264 | 18728 | 16498 |
| fun. | 15369 | 19897 | 15970 | 15645 | 16011 | 23781 | 19548 | 15864 | 14253 | 14605 | 16264 | 14583 |
| Usher | 18724 | 11268 | 9113 | 7965 | 8517 | 9853 | 10212 | 9385 | 8555 | 9869 | 8281 | 7450 |
| Sean Paul | 3020 | 3358 | 3670 | 3216 | 3213 | 3038 | 3176 | 3058 | 2646 | 2772 | 2789 | 2952 |
| Booba | 300 | 258 | 283 | 338 | 509 | 612 | 534 | 549 | 438 | 389 | 397 | 330 |
| Big Sean | 8431 | 9023 | 10264 | 9478 | 8968 | 8086 | 8186 | 8296 | 7802 | 8299 | 8706 | 8325 |
| Muse | 9781 | 9368 | 8946 | 9092 | 10120 | 9129 | 9890 | 17027 | 20812 | 18893 | 16374 | 18684 |
| Chris Brown | 10586 | 9553 | 9627 | 9452 | 9469 | 9132 | 9408 | 9375 | 9944 | 9972 | 12709 | 10397 |
| Mumford & Sons | 13487 | 11466 | 13454 | 15745 | 18287 | 16855 | 74075 | 46883 | 43869 | 45929 | 38566 | 32543 |
| Lana Del Rey | 16900 | 16694 | 17133 | 21064 | 32763 | 22071 | 24944 | 28890 | 37072 | 31574 | 29493 | 26437 |
| LMFAO | 5746 | 5510 | 5050 | 5497 | 29239 | 36749 | 19976 | 19648 | 21878 | 12809 | 10655 | 9335 |
| Avenged Sevenfold | 6303 | 5274 | 5022 | 5306 | 5367 | 5016 | 4741 | 9487 | 10444 | 8402 | 7905 | 6703 |
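Table 8 Wikipedia Page Edits

| Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One Direction | 7 | 1 | 6 | 9 | 2 | 2 | 3 | 0 | 2 | 2 | 7 | 8 |
| Rihanna | 7 | 0 | 3 | 1 | 1 | 6 | 0 | 1 | 3 | 7 | 2 | 1 |
| Justin Bieber | 2 | 0 | 0 | 2 | 2 | 3 | 0 | 0 | 2 | 6 | 1 | 0 |
| The Weeknd | 3 | 0 | 2 | 1 | 1 | 4 | 3 | 2 | 5 | 1 | 0 | 4 |
| Wiz Khalifa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Taylor Swift | 5 | 1 | 1 | 0 | 2 | 0 | 9 | 1 | 5 | 19 | 4 | 18 |
| Kendrick Lamar | 1 | 0 | 6 | 2 | 0 | 0 | 1 | 2 | 6 | 5 | 1 | 2 |
| Lupe Fiasco | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| Kanye West | 0 | 0 | 4 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 1 | 0 |
| fun. | 10 | 3 | 1 | 3 | 0 | 4 | 2 | 0 | 3 | 2 | 1 | 1 |
| Usher | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| Sean Paul | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 2 |
| Booba | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Big Sean | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| Muse | 0 | 0 | 6 | 3 | 2 | 3 | 0 | 0 | 4 | 0 | 9 | 4 |
| Chris Brown | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 4 | 14 | 2 | 1 | 0 |
| Mumford & Sons | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Lana Del Rey | 2 | 0 | 0 | 9 | 1 | 0 | 2 | 18 | 11 | 8 | 3 | 8 |
| LMFAO | 0 | 2 | 1 | 1 | 10 | 3 | 3 | 3 | 1 | 0 | 0 | 0 |
| Avenged Sevenfold | 0 | 4 | 3 | 7 | 12 | 6 | 0 | 15 | 5 | 10 | 5 | 4 |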