Identifying and Explaining Peaks in Music Playing Data

Bachelor Thesis Information Studies: Human Centered Multimedia
University of Amsterdam
July 2013

Guido van Bruggen

Supervision: Wouter Weerkamp, Manos Tsagkias
Coordination: Anders Bouwer

1 TABLE OF CONTENTS
2 Abstract
3 Introduction
4 Related Work
5 Data
  5.1 Extraction
  5.2 Evaluation
  5.3 User Classification
  5.4 Dataset
6 Method
  6.1 Peak Detection
  6.2 Sample
  6.3 Manual peak explaining
  6.4 User Activity Retrieval
    6.4.1 Wikipedia view statistics
    6.4.2 Wikipedia edit statistics
    6.4.3 Tweets
  6.5 Automatic Peak Explanation Retrieval
    6.5.1 Most retweeted tweet
    6.5.2 Musicbrainz releases
    6.5.3 Wikipedia edits
7 Results
  7.1 Manual Peak Explanation
  7.2 Examples
    7.2.1 Lupe Fiasco
    7.2.2 Mumford & Sons
  7.3 Correlations
  7.4 Automatic Peak Explanation
    7.4.1 Musicbrainz Releases
    7.4.2 Wikipedia edits
    7.4.3 Most Retweeted Tweet
8 Discussion / Future Work
  8.1 Twitter as a source
  8.2 Real-time
  8.3 Influencing sources
  8.4 Automatically retrieving explanations
  8.5 Other sources
9 Conclusion
10 References
11 Appendix

2 ABSTRACT
Social media platforms provide users with the possibility to let the world know what's on their mind at every moment. Twitter users use the hashtag #nowplaying to let others know what music they are listening to. By retrieving messages (tweets) containing this tag we are able to get an idea of what music the world is playing right now. We use a dataset containing all these tweets retrieved in a period of 12 days. Using a peak detection method we identify peaks in the number of plays for all artists. We assume that these peaks are caused by news concerning the artists, such as the release of a new album or other 'news' such as gossip. A manual search of internet sources for a sample of these peaks gives us an idea of the reasons people start listening to certain artists.
The statistics on the number of related Wikipedia page views, page edits and tweets allow us to check these sources for similar patterns. Results show that new releases of albums or singles are the most common cause for people to start listening to certain artists. Calculated correlations between the statistics from several sources show that there are similar patterns for most cases in the sample. Three simple methods for automatically retrieving explanations using Wikipedia, Musicbrainz and Twitter are developed. Although results vary, we show that it is feasible to automatically explain peaks in the listening behavior of Twitter users. Musicbrainz provided the best explanations for peaks.

3 INTRODUCTION
Since the launch of its microblogging platform in 2006, Twitter has grown very fast. With over 200 million active users in February 2013 it has become one of the most used websites worldwide. Using only 140 characters, Twitter users send messages (tweets) about any topic and are able to follow other users. [1] shows that tweets mostly cover very recent stories, and Twitter is therefore known for bringing news faster than conventional news media. [2] confirms this by showing that tweets about Michael Jackson's death appeared more than an hour before the first conventional news sources started reporting on it.

So-called hashtags are used in Twitter messages to define the subject or the event the tweet is about. Hashtags are represented by the symbol '#' followed by a word. The hashtag #elections2012, for example, was used during the presidential elections in the United States in 2012. Although every Twitter user is able to make up and use their own tags, there are many tags commonly used by the Twitter community. The hashtag #nowplaying (or the abbreviation #np) is one of the most popular hashtags used by the Twitter community to let other users know what music someone is listening to at the moment [3].
Because of the almost real-time character of Twitter and the worldwide spread of its users, retrieving tweets containing #nowplaying hashtags can be used to get an idea of what the world is listening to. This thesis uses a dataset created by monitoring the Twitter stream and saving all tweets containing these tags during a period of twelve days. By counting the number of tweets mentioning an artist and the nowplaying hashtag, it is possible to get a daily amount of 'plays' for all artists in the dataset. Comparing these numbers makes it possible to detect bursts in the listening behavior of Twitter users.

The assumption is that most of these bursts are caused by news concerning music artists, and that we might be able to explain these bursts with news sources. For example: the release of a new album by the band Coldplay will probably result in many people listening to that new album. Furthermore, we have seen popularity and album sales rise as a reaction to the death of music artists, like Michael Jackson. Because the internet provides us with many news sources and ways to retrieve these, the task of automatically finding bursts and related news sources is an interesting one. This thesis takes the first steps and explores challenges in the task of automatically finding bursts and related news sources in the 'nowplaying' dataset.

First, statistics on the activity of internet users on sources like Wikipedia and Twitter for the same period as the 'nowplaying' dataset are retrieved and checked for similarities. The assumption is that people, next to listening to an artist, might also react to news by searching the internet for additional information, and that burst patterns therefore might also appear in other internet sources providing news and information on music artists.
Checking user activity statistics for similarities makes it possible to verify bursts and gives an idea of which sources might be useful to automatically explain bursts in the 'nowplaying' dataset. Next, three sources are used in an attempt to automatically retrieve possible explanations for bursts: Twitter, Wikipedia edits and Musicbrainz, a user-generated database containing information about music artists and their releases [4].

By looking for peaks in the playing data and manually explaining them using internet sites, we try to get an idea about the reasons behind the listening behavior of people. Findings derived from this manual analysis are used to explore the challenges in the task of automatically retrieving explanations.

4 RELATED WORK
Twitter provides us with a huge source of information about what is happening and popular right now. Among others, [5] and [6] show that Twitter provides a solid source to get an idea of how the world feels about certain topics, and how this information can be used to make predictions on subjects like stock markets and movie revenues. These sentiment analysis techniques provide us with information on how people feel about certain topics.

Another popular field of research is the early detection of newly emerging popular topics using social media like Twitter. So-called first story detection algorithms monitor the Twitter stream and try to cluster messages to find emerging popular stories or events among Twitter users. [7] provides an algorithm for performing this analysis in a live, streaming context, with new tweets arriving every moment. Results show that news or gossip covering celebrity deaths, like that of Michael Jackson, spreads the fastest among Twitter users. Applications of these techniques include the early detection of earthquakes [8]. These methods are not solely used on information available on Twitter, but also on sites like Facebook and Wikipedia.
[9] uses Wikipedia view statistics to detect the popularity of Wikipedia articles and uses this to provide users with popular pages related to their personal interests. While these techniques mainly focus on analyzing one source to detect emerging topics, combining data from different sources could be of use in verifying candidate emerging topics. [10] combines first story detection algorithms performed on tweets with page view statistics of related Wikipedia pages to verify and improve the first story detection on Twitter. Results show that Wikipedia view statistics can be a good source to verify emerging trending topics on Twitter.

The basis of detecting emerging topics lies in so-called peak detection, which covers the task of identifying big increases and decreases in data streams. [11] provides a method for detecting peaks in queries to the MSN Search engine. [12] uses a similar method in recognizing popular queries made to eBay. By applying peak detection to mood levels in blogs, [13] finds interesting transitions and tries to automatically explain peaks with a related event.

5 DATA
The dataset consists of a collection of tweets retrieved by monitoring the Twitter stream from 17 September 2012 until 28 September 2012 for the hashtags #nowplaying and #np. A brief explanation of how this data was processed to extract song titles and artist names is given here; [14] describes these methods and the evaluation in more detail.

5.1 EXTRACTION
The second step, after retrieving the tweets containing information on what people are listening to, is to extract the name of the song and artist mentioned in the tweet. The extraction has been done using a three-step approach. First, by using basic regular expressions, candidate song titles and artist names are extracted. Musicbrainz is used to try to match these candidates, and to store related information when a match is found.
If this first step fails and no match is found, Youtube¹ is used as a source to find possible matches. The text of a tweet is used as a query to Youtube, and the top 10 results appearing in the Music category are retrieved. Regular expressions on these results are used to find candidate song and artist names. Again, by matching these candidates to Musicbrainz, possible candidates are confirmed and saved. The third step, used when no matches are found, uses a 'fuzzy' matching in which only a matching artist is required and the other most common candidate found in the tweet text is used as the song title, even when there is no match with Musicbrainz.

5.2 EVALUATION
A manually annotated test set of 200 tweets was used to evaluate the extraction methods. The first method, using regular expressions only, scores high on precision but low on recall. The second step, using Youtube results, leads to an increase in recall but a slight drop in precision. Finally, the fuzzy method shows a drop in precision but an increase in recall. Using a combination of these methods in the three-step approach leads to a precision of 0.6871 and a recall of 0.6175 for song titles, and a precision of 0.7349 and a recall of 0.6100 for artist names.

¹ http://www.youtube.com

5.3 USER CLASSIFICATION
Initial analysis showed that there were some radio stations whose Twitter accounts were producing noise in the data by constantly tweeting about the same song or artist. Because most radio station accounts show a higher usage of the #nowplaying hashtag, data on the use of this tag was used as features in training a random forest classifier. Evaluation based on a manually annotated set of Twitter accounts shows high scores for precision and recall (P=0.7011, R=0.9839).

5.4 DATASET
This thesis uses a dataset containing the results of the mentioned extraction and classification methods for a period of twelve days. The total amount of retrieved tweets is 6,5 million.
Every tweet in the dataset related to an artist is considered to be one 'play' for this artist. By grouping these tweets by date and counting the entries per day, a daily play amount is calculated for every artist in the dataset. The dataset contains tweets mentioning 112.834 unique artists. Although this seems like a big amount, there are many unique artists found in only a few tweets. Figure 1 shows the distribution of unique artists related to the amount of plays found in the dataset.

Figure 1 Division of Unique Artists (x-axis: Amount of Plays; y-axis: Unique Artists)

6 METHOD
6.1 PEAK DETECTION
Peak detection is a task that has been done several times. Most approaches use the mean and standard deviation to calculate a threshold value for peaks. When doing this in a live context, with new information constantly arriving, it is not possible to define a threshold value beforehand. Most approaches therefore use methods where means and standard deviations are calculated for a specific time range. [11] used this approach to find bursts in MSN Search queries, by checking whether values are 1.5-2.0 times more than the 'moving average'. Because the time range of our dataset is only 12 days, it is not necessary to use such an approach based on moving averages. We therefore calculate one threshold value for every artist in the dataset to detect peaks. Every value that is larger than this threshold is labeled as a peak. The threshold value is calculated as follows:

threshold = μ + 1.5 · σ

where μ is the mean and σ is the standard deviation of the daily play amounts for an artist over the entire dataset.

6.2 SAMPLE
Because we are dealing with a large dataset, containing playing data for 112.843 artists, a small sample of 20 artists is picked to be able to analyze this data in more detail. These artists are picked by ranking the dataset on the standard deviation of the daily play amounts.
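The threshold rule of Section 6.1 and the ranking on standard deviation can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the thesis: the function names are hypothetical, and the use of the population standard deviation (pstdev) is an assumption, since the text does not state which variant was used.

```python
from statistics import mean, pstdev


def peak_threshold(plays, k=1.5):
    """Threshold = mean + k * standard deviation of the daily play amounts.

    Assumption: population standard deviation; the thesis does not specify.
    """
    return mean(plays) + k * pstdev(plays)


def detect_peaks(plays, days, k=1.5):
    """Return the days whose play amount is strictly larger than the threshold."""
    threshold = peak_threshold(plays, k)
    return [day for day, amount in zip(days, plays) if amount > threshold]


def rank_by_std(plays_by_artist):
    """Order artist names by the standard deviation of their daily plays, highest first."""
    return sorted(
        plays_by_artist,
        key=lambda artist: pstdev(plays_by_artist[artist]),
        reverse=True,
    )


# Hypothetical daily play amounts for 17-28 September 2012.
days = list(range(17, 29))
example = {
    "steady artist": [10] * 12,
    "bursty artist": [10] * 8 + [50] + [10] * 3,
}
for name in rank_by_std(example):
    print(name, detect_peaks(example[name], days))
```

With one threshold per artist, a series without any variation yields no peaks at all, because no value can exceed its own mean.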
This ranking gives us a way to find the most interesting cases: artists with a high amount of plays and high variance, having big peaks with possibly interesting explanations. The example of 'The Beatles' provides a motivation for this ranking. Play amounts for this band show a quite low standard deviation, while total amounts of plays are quite high. Because our peak detection method uses the standard deviation to define a threshold, it finds 2 peaks in the playing data: September 23 and September 24. While these days do show a small increase in amount of plays, these peaks are probably not worth investigating, since their amounts are just slightly higher than on the other days (Figure 2). This low variance can also be seen in the scatterplot for 'The Beatles' (Figure 3). These observations can be explained by the fact that this band is no longer active and won't appear in the news, which gives the masses no reason to suddenly listen to or search for the band. By ranking on standard deviation we prevent artists whose data show statistics similar to those of 'The Beatles' from appearing in our sample, which leaves interesting cases for further analysis.

Figure 2 Play and Wikipedia data for The Beatles (x-axis: Day in September 2012; y-axes: Plays, Wikipedia page views)

Figure 3 Scatter Plot showing data for The Beatles (x-axis: Plays; y-axis: Wikipedia page views)

After applying the proposed ranking, we set two restrictions to select artists for the sample. First, only artists that have one or more peaking days in the dataset are considered. Second, peaks on the first or last day of the dataset are not considered useful peaks, since it is not possible to compare the amount of plays with the amount a day before, or after, the peak.
For such peaks it is not possible to tell whether they are part of larger trends or not.

Table 1 shows the names, mean and standard deviation of the 20 artists in our sample. The amount of plays and detected peaks can be found in Table 5.

Table 1 Sample
Artist Name | Mean | Standard Deviation
One Direction | 5929 | 2920
Rihanna | 3117 | 1248
Justin Bieber | 4129 | 677
The Weeknd | 1671 | 635
Wiz Khalifa | 2379 | 539
Taylor Swift | 3995 | 530
Kendrick Lamar | 1721 | 387
Lupe Fiasco | 940 | 381
Kanye West | 3443 | 379
fun. | 645 | 367
Usher | 1970 | 342
Sean Paul | 702 | 334
Booba | 571 | 316
Big Sean | 2328 | 302
Muse | 952 | 299
Mumford & Sons | 513 | 296
Chris Brown | 5089 | 285
LMFAO | 419 | 234
Avenged Sevenfold | 1492 | 223
Lana Del Rey | 1166 | 240

6.3 MANUAL PEAK EXPLAINING
For the peaks appearing in the playing data of our sample of 20 artists, online sources were searched manually for possible explanations. This gives an idea of why peaks are happening and whether these (possible) explanations can be found using internet sources. Google Search, Wikipedia and Musicbrainz were searched for events like news facts or new releases that might have been the reason for increases in plays.

6.4 USER ACTIVITY RETRIEVAL
Daily statistics on the user activity on Wikipedia and Twitter were retrieved to get an idea of the popularity of music artists on each day. By comparing these statistics and by calculating correlations between them, we are able to get an idea of how similar these patterns are and whether these statistics might be useful to verify peaks found in the 'nowplaying' dataset.

6.4.1 Wikipedia view statistics
Wikipedia provides view statistics for all its pages as dumps², which can be used to get an indication of the popularity of a Wikipedia article. In order to download these statistics it is important to first retrieve the correct Wikipedia pages for every artist. This was done using the Musicbrainz API, which provides URLs for the corresponding Wikipedia pages for many artists.
A website³ offering JSON files for the daily view statistics, extracted from the Wikipedia dumps, was used to retrieve the view statistics for every artist in our sample.

6.4.2 Wikipedia edit statistics
Because of the big community that Wikipedia editors are part of, it seems to be a good source for topical information about any subject. On 23 July 2011, the day that Amy Winehouse died, a total of 303 edits was made to her Wikipedia page⁴, showing that (popular) Wikipedia articles are likely to be up to date and that these edits could provide us with interesting information. The Wikipedia API provides information on all Wikipedia articles, including a revision history. By retrieving and counting all edits for one day, we are able to get the daily amount of edits made to Wikipedia pages related to music artists.

6.4.3 Tweets
The public streaming API provided by Twitter⁵ offers representative samples of around 1% of the total amount of tweets for every day. Every sample contains around 3.5 million tweets. These archives can be used to get an idea of what was popular among Twitter users on a specific day. The samples for all twelve days in the dataset were collected. The number of tweets mentioning a specific music artist gives a reflection of the popularity of that artist among Twitter users. To retrieve this number, the sample archives were queried for artist names using regular expressions on all tweets. By looking at the results, we found that the archive for September 18 was not complete and contained only 1 million tweets (Table 6). To still be able to compare this data, the number of matching tweets for each day was normalized by dividing it by the total number of tweets in the archive for that day.
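The normalization described above can be illustrated with a small Python sketch. The function name and the example numbers are hypothetical; only the archive sizes of roughly 3.5 million and 1 million tweets are taken from the text.

```python
def normalize_counts(matches_per_day, totals_per_day):
    """Divide the number of matching tweets by the archive size of the same day.

    Days without matches count as 0. Every day is assumed to have a
    non-empty archive, so no division by zero is handled here.
    """
    return {
        day: matches_per_day.get(day, 0) / total
        for day, total in totals_per_day.items()
    }


# Hypothetical example: the incomplete archive of September 18 holds fewer
# tweets, so raw counts are lower although the share of tweets is the same.
totals = {17: 3_500_000, 18: 1_000_000}
matches = {17: 700, 18: 200}
print(normalize_counts(matches, totals))
```

After normalization both days show the same fraction, which is what makes the incomplete archive comparable to the complete ones.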
² http://dumps.wikimedia.org/other/pagecounts-raw/
³ http://stats.groks.se
⁴ http://en.wikipedia.org/w/index.php?title=Amy_Winehouse&offset=20110724000000&action=history
⁵ https://dev.twitter.com/docs/streaming-apis/streams/public

6.5 AUTOMATIC PEAK EXPLANATION RETRIEVAL
6.5.1 Most retweeted tweet
People tweet about what is happening right now; tweets mentioning a music artist might therefore contain information about news on that artist. The assumption is that the most popular tweet (reflected by retweet amount) concerning an artist might be useful to describe bursts in the playing data. The publicly available sample stream of Twitter, also used in the user activity retrieval, is used to find the most retweeted tweet by searching for tweets mentioning a music artist on the day of a peak in play amounts.

6.5.2 Musicbrainz releases
The release of a new single or album can be a reason for people to start listening to a specific artist. Finding releases near a peaking day can therefore be a good source for the possible explanation of a peak. Musicbrainz offers access to its database containing information like releases. Its API was used to search for releases and to retrieve information like title and release date. Manual analysis shows that some peaks are caused by early leaks of new albums or singles. Because the Musicbrainz database contains only official releases, a peak caused by an early leak occurs before the official release date and would be missed when only the peaking day itself is queried. Setting a bigger date range to query for releases can easily solve this problem, since leaks are likely to appear on the internet only a couple of days before the official release. The date range used in the algorithm was set to 4 days before the peaking day until 4 days after the day of a peak.
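The date range around a peak can be sketched as a simple filter. This is an illustration only: it assumes the releases have already been retrieved and are available as (title, release date) pairs, it does not call the Musicbrainz API, and the function name is hypothetical.

```python
from datetime import date, timedelta


def releases_near_peak(releases, peak_day, window_days=4):
    """Keep releases dated from window_days before until window_days after the peak."""
    first = peak_day - timedelta(days=window_days)
    last = peak_day + timedelta(days=window_days)
    return [(title, day) for title, day in releases if first <= day <= last]


# 'Diamonds' and its date come from Table 2; the second entry is made up.
releases = [
    ("Diamonds", date(2012, 9, 27)),
    ("Hypothetical Older Single", date(2012, 1, 10)),
]
print(releases_near_peak(releases, peak_day=date(2012, 9, 27)))
```

A wider window catches more leaks but also increases the chance of matching an unrelated release, so the value of 4 days is a trade-off rather than a fixed rule.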
6.5.3 Wikipedia edits
Wikipedia edits are made for fluency or factual reasons: fluency edits are made to improve readability and style, while factual edits are made to improve the content of the page [15]. The Wikimedia API is used to check whether edits are made near the day of a peak for Wikipedia pages related to music artists. The API provides several pieces of metadata about these revisions, such as time, user id, user comment and the increase or decrease in page size. Using this metadata, it is possible to filter revisions. Besides this metadata, it is also possible to retrieve a snippet of HTML, showing the previous revision of the page and marking the changes in text (Figure 4).

Figure 4 Edit Made to 'Avenged Sevenfold' Wikipedia Article

Based on the manual analysis of our sample, we choose to look for Wikipedia edits made from one day before until one day after the day of the peak. The assumption is that the revisions causing the biggest increase in page size are probably the most indicative of a news fact happening that day. Therefore the revision with the biggest increase in page size is picked as a possible explanation for a peak.

7 RESULTS
7.1 MANUAL PEAK EXPLANATION
Manual analysis of the peaks appearing in the sample shows that most peaks can be related to new albums or singles in the same period (Table 2). Out of all 20 peaks, 13 are related to releases. Of the remaining 7 cases, 3 peaks are news-related. The remaining 4 cases have no clear explanation.
Table 2 Peak Explanation
Artist Name | Peak Date | Explanation Type | Explanation
One Direction | 21-9-2012 | Release | Single 'Live While We're Young' leaked on 20-9-2012
Rihanna | 27-9-2012 | Release | Single 'Diamonds' released on 27-9-2012
Justin Bieber | 22-9-2012 | News | Bieber's mother talked in TV show about possible abortion of her son
The Weeknd | 25-9-2012 | Release | Single release 'Remember You' by Wiz Khalifa, featuring The Weeknd
Wiz Khalifa | 25-9-2012 | Release | Single release 'Remember You'
Taylor Swift | 25-9-2012 | Release | Single release 'Begin Again'
Kendrick Lamar | 22-9-2012 | Unknown | Unknown
Lupe Fiasco | 25-9-2012 | Release | Album release 'Food & Liquor II: The Great American Rap Album, Part 1'
Kanye West | 19-9-2012 | Release | Album release 'Cruel Summer', containing songs from Kanye West
fun. | 18-9-2012 | News | Nominated for MTV Europe Awards
Usher | 22-9-2012 | Unknown | Unknown
Sean Paul | 19-9-2012 | Release | Album release 'Tomahawk Technique'
Booba | 21-9-2012 | Release | Single release 'Caramel'
Big Sean | 19-9-2012 | Unknown | Unknown
Muse | 24-9-2012 | Release | Album release 'The 2nd Law', available for pre-listening online on 24-9
Mumford & Sons | 25-9-2012 | Release | Album release 'Babel' on 25-9-2012
Chris Brown | 23-9-2012 | Unknown | Unknown
Lana del Rey | 25-9-2012 | Release | Single release 'Ride'
LMFAO | 20-9-2012 | News | Announced split
Avenged Sevenfold | 24-9-2012 | Release | Single release 'Carry On (Call of Duty: Black Ops II Version)'

7.2 EXAMPLES
The cases of the artists 'Lupe Fiasco' and 'Mumford & Sons' are used to give an idea of the findings.

7.2.1 Lupe Fiasco
Figure 5 shows the distribution of plays, Wikipedia page views and tweets over the twelve days for the rapper 'Lupe Fiasco'. The peak detection algorithm found 2 peaks in the playing data for 'Lupe Fiasco': September 25 and 26, which are caused by the release of a new album on September 25. Although the peak detection algorithm discovered just 2 peaks, we do see another smaller peak for all 3 data sources on September 20.
When searching for this day, we find several Google Search results mentioning the leak of the album, which is the probable cause of the peak. The graph shows similar patterns for the three sources, having the same peaks and falls. This similarity is also reflected in the Pearson correlation coefficients between the sources, which all exceed 0.90.

Figure 5 Data for Lupe Fiasco (x-axis: Day in September 2012; y-axis: Fraction of Total; series: Plays, Wikipedia page views, Tweets; annotations: Album Leak, Album Release)

7.2.2 Mumford & Sons
Figure 6 shows the statistics retrieved for the band 'Mumford & Sons'. The peak detection algorithm found a peak on September 25, which turned out to be the release date of their new album. The amount of tweets mentioning the band shows a pattern similar to that of the amount of plays. The statistics for the amount of Wikipedia page views, however, show a peak 2 days in advance, on September 23. An explanation for this different pattern was not found.

Figure 6 Data for Mumford and Sons (x-axis: Day in September 2012; y-axis: Fraction of Total; series: Plays, Wikipedia page views, Tweets; annotation: Album Release)

7.3 CORRELATIONS
The 4 sources on online user activity are checked for similarities by calculating the Pearson correlation coefficients between the data for every artist in the sample. This shows us that there are high correlations (> 0.80) between several sources (Table 3). Only the amount of Wikipedia edits doesn't seem to have high correlations with any of the other sources of data.
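For reference, the Pearson correlation coefficient between two daily series can be computed as in the sketch below. This is a plain textbook implementation with a hypothetical function name and made-up example values, not the code used to produce Table 3.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical daily values: page views follow plays closely.
plays = [900, 950, 1000, 2400, 1800, 1100]
page_views = [4000, 4100, 4300, 9800, 7000, 4600]
print(round(pearson(plays, page_views), 3))
```

Because the coefficient is scale-free, the raw play amounts can be compared directly with page views or normalized tweet fractions without rescaling either series.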
Table 3 Pearson Correlation Coefficients Between Sources
Artist | Plays - Wiki Views | Plays - Wiki Edits | Plays - Tweets | Wiki Views - Wiki Edits | Wiki Views - Tweets | Wiki Edits - Tweets
One Direction | 0,765 | -0,078 | 0,000 | -0,311 | 0,594 | 0,000
Rihanna | 0,869 | 0,306 | 0,820 | 0,288 | 0,829 | 0,578
Justin Bieber | 0,180 | 0,013 | 0,724 | 0,049 | 0,029 | 0,261
The Weeknd | 0,986 | 0,268 | 0,938 | 0,264 | 0,922 | 0,289
Wiz Khalifa | 0,537
Taylor Swift | 0,622 0,439 0,818 0,317 0,694 0,147 -0,062 0,126 0,130 0,736 0,581 0,439
Lupe Fiasco | 0,968 | -0,006 | 0,943 | -0,030 | 0,943 | 0,015
Kanye West | 0,815 | 0,258 | 0,612 | 0,256 | 0,427 | 0,053
fun. | 0,340 | 0,005 | -0,894 | 0,106 | -0,248 | -0,184
Usher | 0,449 | -0,268 | 0,514 | -0,238 | 0,912 | -0,122
Sean Paul | 0,883 | -0,298 | 0,562 | -0,065 | 0,247 | -0,607
Booba | 0,806 | 0,406 | 0,934 | 0,542 | 0,587 | 0,317
Big Sean | 0,586 | -0,136 | 0,762 | 0,229 | 0,760 | 0,147
Muse | 0,853 | 0,012 | 0,888 | 0,171 | 0,929 | 0,038
Mumford & Sons | 0,537
Chris Brown | 0,209 | -0,465 | 0,086 | -0,057 | 0,749 | -0,226
Lana Del Rey | 0,815 | 0,522 | 0,814 | 0,499 | 0,733 | 0,813
LMFAO | -0,311 | -0,099 | -0,113 | 0,628 | 0,358 | 0,437
Avenged Sevenfold | 0,809 | 0,102 | 0,578 | 0,409 | 0,861 | 0,543
Kendrick Lamar | 0,903 0,154 0,939 0,775

Correlations between the amounts of plays, Wikipedia page views and tweets are quite high for several artists. Seven artists have high scores (> 0.80) on more than one of the correlation coefficients. Four of these 7 artists (Rihanna, The Weeknd, Lupe Fiasco and Muse) have three high correlation coefficients, showing that statistics on the amount of plays, tweets and Wikipedia page views tend to be similar.

When combining the correlations with our manual peak explanation, we find that of the 3 artists whose peaks are news-related, none has a high correlation between any of the retrieved sources. Of the 4 artists whose peak explanations are unknown, only one (Usher) shows a high correlation.
Furthermore, 12 of the 13 artists with release-related peaks show at least one high correlation coefficient, suggesting that releases in particular lead to high and concurrent increases in plays, tweets and Wikipedia page views. Although these observations are based on only 20 cases, the high correlations found for several cases provide first evidence of similarities between the usage of different internet sources, which might be an interesting feature in the verification of peaks found in the nowplaying dataset.

7.4 AUTOMATIC PEAK EXPLANATION
7.4.1 Musicbrainz Releases
The algorithm used to retrieve new releases of songs and albums using the Musicbrainz API performs reasonably well. The database seems to be very complete, containing most official releases for many artists. The sample of 20 artists showing peaks in listening behavior contains 13 cases in which new releases are likely to be the cause of peaks. The algorithm is able to find 10 of these 13 cases using the Musicbrainz API. The cases in which the algorithm fails to retrieve releases can be explained by the fact that these are releases of albums or singles made in cooperation with other artists. Apparently the Musicbrainz database does not hold this information for every release.

7.4.2 Wikipedia edits
The algorithm used to select relevant Wikipedia edits found a total of 16 edits for the 20 artists in our sample. Ten of these edits turn out to be relevant and explanatory for peaks, which leaves a total of 6 edits that were of no relevance. In 4 of the 20 cases in the sample the algorithm found no possible explanatory edits.
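The selection rule behind these numbers (Section 6.5.3) can be sketched as follows. The revision records with the fields day, size_delta and comment are hypothetical stand-ins for the revision metadata, and treating only revisions with a positive size change as candidates is an assumption; the sketch does not call the Wikipedia API.

```python
from datetime import date, timedelta


def pick_explanatory_edit(revisions, peak_day, window_days=1):
    """Pick the revision with the biggest page size increase around the peak.

    Returns None when no revision in the window increased the page size.
    """
    first = peak_day - timedelta(days=window_days)
    last = peak_day + timedelta(days=window_days)
    candidates = [
        rev for rev in revisions
        if first <= rev["day"] <= last and rev["size_delta"] > 0
    ]
    return max(candidates, key=lambda rev: rev["size_delta"], default=None)


# Made-up revisions around a peak on 21 September 2012.
revisions = [
    {"day": date(2012, 9, 20), "size_delta": 412, "comment": "single leaked"},
    {"day": date(2012, 9, 21), "size_delta": -35, "comment": "copy edit"},
    {"day": date(2012, 9, 27), "size_delta": 900, "comment": "outside window"},
]
print(pick_explanatory_edit(revisions, peak_day=date(2012, 9, 21)))
```

Under this rule an artist ends up without a candidate edit whenever the page was not edited, or only shortened, in the three days around the peak.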
The retrieved edit for the band ‘One Direction’ provides an example of a positive outcome: “On 20 September 2012, the fist single Live While We're Young and its official video leaked first on the internet through the site SoundCloud…”

Another example provides a possible explanation for a peak in plays for ‘Justin Bieber’: “In a September 2012 interview with the TODAY show, Bieber's mother Pattie talked about how everyone around her tried to push her toward abortion, but refused to abort…”

While these examples show news-related edits, the retrieved edit for ‘Rihanna’ shows that this is not always the case: “Originally marketed as a reggae singer, Rihanna's musical genre has changed throughout the course of her career, which includes pop music, R&B, hip hop…”

7.4.3 Most Retweeted Tweet
Only 3 of the 20 retrieved tweets contain information relevant for peak explanation (Table 4). Although the number of relevant tweets is low, the method did help in finding a possible new explanation for one case. The initial manual search for an explanation of the peak for ‘Usher’ on September 22 did not yield a clear reason for the peak. However, the most retweeted tweet of that day does give an idea of why people started listening to Usher. Apparently he gave a good performance at the iHeart Radio Festival in Las Vegas, as mentioned by the most retweeted tweet: “@scooterbraun: And that ladies and gentleman is why @UsherRaymondIV is one of the greatest entertainers of all time!! Wow! #IHEART radio #VEGAS”

While in this case the most retweeted tweet actually was news-related, in most cases it is not, as can be seen in the example of One Direction: “@googlefacts: Koutaliaphobia is the fear of spoons. Liam Payne from One Direction says he's scared of spoons.”

Table 4 Most Retweeted Tweets
Artist | Tweet | RT Amount | Relevant?
One Direction | @Real_Liam_Payne: “@googlefacts: Koutaliaphobia is the fear of spoons. Liam Payne from One Direction says he's scared of spoons.” | 49817 | N
Rihanna | @ElChisteLatino: Katy Perry: Cabello azul. Nicki Minaj: Cabello rosado. Rihanna: Cabello rojo. Lady Gaga: Cabello verde. ¡Los Power Rangers Regresaron! | 4503 | N
Justin Bieber | @joejonas: I cry because I love Justin Bieber!!! | 74197 | N
The Weeknd | @MixedBoyTatted: If Adele, The Weeknd, Drake, and Frank Ocean made an album together. Everyone would be in their deepest feelings. | 1064 | N
Wiz Khalifa | @CelebFactstory: Wiz Khalifa admits to spending at least $10,000 a month on weed. | 5014 | N
Taylor Swift | @factsonIove: Taylor Swift's son: Now that's going to be a boy who knows how to treat a girl. | 3758 | N
Kendrick Lamar | @Yo__SheBAd_DoE: Kendrick Lamar is the most over looked rapper out right now and when he drop his album ya'll ain't allowed on the banwagon | 380 | N
Lupe Fiasco | @FxckShugz: Lupe Fiasco could diss Chief Keef and the lyrics would be so developed that Keef wouldn't even understand how he got dissed. | 7811 | N
Kanye West | @GhettoEnglish: "Watchu know bout that Yeezy!?!?!" = Do you listen to Kanye West? | 791 | N
fun. | @justinbieber: yes i would really like to go to a prom someday. just be a normal kid and have fun. maybe even get a kiss during the .. | 26304 | N
Usher | @scooterbraun: And that ladies and gentleman is why @UsherRaymondIV is one of the greatest entertainers of all time!! Wow! #IHEART radio #VEGAS | 726 | Y
Sean Paul | @BeyonceLite: Sean Paul Talks about Beyoncé & the No.1 single 'Baby Boy' http://t.co/DLuNaIll | 13 | N
Booba | @HaterzFr: Le clip "Caramel" de Booba porte bien son nom. Bon marché, trop sucré, un peu mou et collant, il glisse mais n'a aucune co | 222 | Y
Big Sean | @Drake: "Higher" off Big Sean mixtape is CRAZY. | 11593 | N
Muse | @Sr_Colmenero: Muse es TT por su música, Justin Bieber es TT porque tiene 28 millones de seguidoras que se pasan el día escribiendo | 1486 | N
Chris Brown | @Realtaeyang: Just covered Chris Brown-Don't judge me. This song defines many meanings to me. http://t.co/utegJXMj | 4838 | N
Mumford & Sons | @PigeonJon: Turns out Mumford and Sons are not actually a firm of Removal Men. Frankly livid about this. | 140 | N
Lana Del Rey | @LyricalPhrase: "Heaven's in your eyes" - Lana Del Rey | 165 | N
LMFAO | @KevinHart4real: Security grabbed her and she yelled out "I HAD TO LICK HIS FACE......I LOVE HIS LITTLE ASS" LMFAO | 2391 | N
Avenged Sevenfold | @Metal_Hammer: Avenged Sevenfold have released a new song! Check out 'Carry On' right here, people: http://t.co/R2QitBVl | 276 | Y

8 DISCUSSION / FUTURE WORK

8.1 TWITTER AS A SOURCE
It is important to keep in mind that people are free to decide whether to let the world know what music they are playing or to keep this private. The information used in this thesis could therefore be biased, because people might only share their listening behavior for certain music. The play counts used in this thesis therefore only give an indication of what was listened to in September 2012. Using Twitter as a source also offers possibilities for interesting future research. Because the Twitter API provides not only the text of a tweet but also features such as user ID and location, it is possible to retrieve more information, which could lead to new research questions. For example, retrieving the tweets posted just before and after a tweet containing the ‘nowplaying’ hashtag could give insight into what people do just before or after listening to a song.

8.2 REAL-TIME
While this thesis uses a dataset of twelve days to analyze different sources on the internet, future work could focus on retrieving and analyzing this data from various sources in a (near) real-time context. This poses new challenges, for example in detecting peaks, where time windows would be needed to define new thresholds as new data arrives.
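The idea of redefining thresholds over a time window can be illustrated with a minimal sketch. It is not the peak detection method of Section 6.1: the class name `RollingPeakDetector`, the window of 7 days and the threshold of mean plus 2 standard deviations are all assumptions chosen for illustration. The example feeds in the daily play counts for Usher from Table 5 one at a time, as a stream would deliver them.

```python
from collections import deque
from statistics import mean, stdev


class RollingPeakDetector:
    """Hypothetical streaming detector (illustration only).

    A new count is flagged as a peak when it exceeds
    mean + k * stdev of the counts in a trailing window.
    """

    def __init__(self, window=7, k=2.0):
        # deque with maxlen drops the oldest count automatically
        self.history = deque(maxlen=window)
        self.k = k

    def update(self, count):
        """Return True if `count` is a peak relative to the trailing window."""
        is_peak = False
        # Require a few observations before a threshold is meaningful
        if len(self.history) >= 3:
            threshold = mean(self.history) + self.k * stdev(self.history)
            is_peak = count > threshold
        # Assumption: peaks stay in the window and raise later thresholds
        self.history.append(count)
        return is_peak


# Daily play counts for Usher, 17-9-2012 to 28-9-2012 (Table 5)
usher_plays = [2408, 1771, 1889, 1913, 2037, 2913,
               1935, 1820, 1784, 1831, 1743, 1593]

detector = RollingPeakDetector(window=7, k=2.0)
peak_days = [day for day, count in enumerate(usher_plays)
             if detector.update(count)]
print(peak_days)  # prints [5], i.e. 22-9-2012
```

With these assumed settings only the sixth value (22-9-2012) is flagged, which matches the peak date discussed for Usher. Keeping flagged values in the window is a deliberate simplification; excluding them would make the detector more sensitive to a second burst shortly after the first.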
A real-time setting also gives new opportunities for using other sources, since the restriction (encountered in this thesis) to sources with archives is no longer relevant.

8.3 INFLUENCING SOURCES
Another interesting aspect of this data that might be worth analyzing is the assumption that sources influence each other and therefore do not show bursts at exactly the same moment. An example would be a news fact that causes people to first look up the Wikipedia page of an artist and only later, influenced by the Wikipedia page, listen to the artist. New releases might cause streams to behave in the opposite way: people first listen to a new album and afterwards use Wikipedia to find more information about the artist. Retrieving hourly data instead of aggregated daily counts might be needed to analyze this.

8.4 AUTOMATICALLY RETRIEVING EXPLANATIONS
Although this thesis shows that Musicbrainz can be a solid source for retrieving information about new releases of albums or singles, the automatic retrieval of possibly explanatory Wikipedia edits appears to be a bigger challenge. Although choosing the revision with the biggest increase in page size filters out most (small) fluency edits, not all retrieved edits contain news-related information. Examples are edits made to the biography or related artists sections. Using other metadata about revisions might be needed to correctly classify news-related edits. For example, whether an edit gets reverted by another Wikipedia user might be a good feature for filtering out spam or vandalism edits. Another possible solution could be to combine multiple edits instead of choosing only one. Another reason why Wikipedia might not be a very good source for retrieving explanations is its policy that every fact on a page should have a good source. Gossip is therefore removed until there is solid confirmation of the fact.
Such gossip, however, is likely to cause increases in plays and page views and therefore has the potential to be a good explanation for observed peaks.

8.5 OTHER SOURCES
Using sources other than Twitter, Musicbrainz and Wikipedia might give new insights into how people react to news and might provide other possibilities for (automatically) explaining bursts. Last.fm and Spotify are possible sources that could provide more playing statistics, which might be more neutral than tweets written by people willing to share the music they listen to. Querying news sites or social media might give better ways of explaining peaks with news facts than the Wikipedia revision history.

9 CONCLUSION
Retrieving Twitter messages containing the hashtag ‘#nowplaying’ gives us an idea of what the world is listening to right now. Counting these tweets per day and grouping them by artist provides a daily play count, which can be used to detect peaks in popularity. The assumption that these peaks can be related to new releases or news facts is supported by the manual analysis of a sample of 20 artists in the dataset: only 3 cases have no clear explanation, while the remaining 17 cases can be related to news or releases. The release of a new single or album is the most common reason for increases in plays: 13 of the 17 peaks can be related to new releases. Useful sources for finding explanations for these peaks include Musicbrainz, Google Search and Wikipedia revision histories. The assumption that the ups and downs in the ‘nowplaying’ dataset are also visible in data from other internet sources seems to be legitimate; calculated correlations between number of plays, Wikipedia page views and tweets mentioning artist names show that the patterns are quite similar. However, the daily number of Wikipedia edits shows no high correlation with the other sources.
Using the combination of statistics on Wikipedia and Twitter usage might be useful for verifying peak detection: most peaks show high correlations between different sources, while the peaks with an unknown reason do not. Using the insights gained by manually finding explanations for peaks, three algorithms were developed to automatically retrieve explanations for peaks found in the dataset. Releases can easily be found using the Musicbrainz API, which proves to be a solid source for this information. Wikipedia edits can also be used to explain peaks; however, automatically choosing possibly explanatory edits poses some challenges. While using the increase in page size as the only feature is useful for selecting factual edits over fluency edits, it does not seem to be sufficient for classifying edits as news-related or not. The assumption that the most retweeted tweet containing the name of an artist contains interesting, news-related information is not supported by the results: only 3 of the 20 retrieved tweets contain information related to peaks.

By manually exploring the data in the ‘nowplaying’ dataset and by combining this with data retrieved from other internet sources, we provide insight into how people react to news on music artists. Furthermore, by developing three algorithms, we take the first steps in the task of automatically relating news facts to peaks found in the playing data retrieved from Twitter.

10 REFERENCES
[1] Kwak, H., Lee, C., Park, H., & Moon, S. (2010, April). What is Twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web (pp. 591-600). ACM.
[2] Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., & Sperling, J. (2009, November). Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 42-51). ACM.
[3] Petrovic, S., Osborne, M., & Lavrenko, V. (2010, June). The Edinburgh Twitter Corpus. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media (pp. 25-26).
[4] Swartz, A. (2002). Musicbrainz: A semantic web service. Intelligent Systems, IEEE, 17(1), 76-77.
[5] Asur, S., & Huberman, B. A. (2010, August). Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on (Vol. 1, pp. 492-499). IEEE.
[6] Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.
[7] Petrović, S., Osborne, M., & Lavrenko, V. (2010, June). Streaming first story detection with application to twitter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 181-189). Association for Computational Linguistics.
[8] Sakaki, T., Okazaki, M., & Matsuo, Y. (2010, April). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web (pp. 851-860). ACM.
[9] Ciglan, M., & Nørvåg, K. (2010, October). WikiPop: personalized event detection system based on Wikipedia page view statistics. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1931-1932). ACM.
[10] Osborne, M., Petrovic, S., McCreadie, R., Macdonald, C., & Ounis, I. (2012). Bieber no more: First story detection using Twitter and Wikipedia. In SIGIR 2012 Workshop on Time-aware Information Access.
[11] Vlachos, M., Meek, C., Vagena, Z., & Gunopulos, D. (2004, June). Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (pp. 131-142). ACM.
[12] Parikh, N., & Sundaresan, N. (2008, August). Scalable and near real-time burst detection from eCommerce queries. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 972-980). ACM.
[13] Balog, K., Mishne, G., & de Rijke, M. (2006, April). Why are they excited?: identifying and explaining spikes in blog mood levels. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations (pp. 207-210). Association for Computational Linguistics.
[14] Tsagkias, M., Weerkamp, W., Meij, E.J., van Bruggen, G., & de Rijke, M. (2013). Music in Our Ears: An Analysis of Collective Music Listening Behavior. To appear.
[15] Bronner, A., & Monz, C. (2012, April). User edits classification using document revision histories. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 356-366). Association for Computational Linguistics.
[16] Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900-901.
[17] Lih, A. (2004). Wikipedia as participatory journalism: Reliable sources? metrics for evaluating collaborative media as a news resource. Nature.
11 APPENDIX

Table 5 Play Amounts (peaks are marked red)
Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012
One Direction | 2619 | 2357 | 2642 | 9911 | 10958 | 9188 | 9110 | 5303 | 4884 | 4916 | 4632 | 4624
Rihanna | 2553 | 2761 | 2431 | 2209 | 2682 | 2468 | 2521 | 2162 | 2205 | 6176 | 5062 | 4171
Justin Bieber | 4528 | 4034 | 4032 | 4109 | 3907 | 5809 | 5108 | 3889 | 3595 | 3562 | 3423 | 3549
The Weeknd | 1437 | 1333 | 1287 | 1233 | 1084 | 1107 | 1194 | 1817 | 3286 | 2473 | 2073 | 1726
Wiz Khalifa | 1899 | 2057 | 2151 | 2028 | 2126 | 2027 | 1985 | 2465 | 3864 | 2986 | 2616 | 2344
Taylor Swift | 3969 | 3455 | 3592 | 3606 | 3554 | 4171 | 4191 | 3328 | 5294 | 4569 | 4191 | 4021
Kendrick Lamar | 1490 | 1398 | 1678 | 1646 | 1388 | 2797 | 1533 | 1511 | 1802 | 2211 | 1659 | 1541
Lupe Fiasco | 517 | 502 | 628 | 1184 | 832 | 698 | 643 | 886 | 1657 | 1624 | 1209 | 894
Kanye West | 3731 | 3745 | 4069 | 3835 | 3738 | 3438 | 3275 | 3296 | 3241 | 3173 | 3145 | 2630
fun. | 469 | 1849 | 625 | 642 | 575 | 527 | 500 | 570 | 532 | 471 | 534 | 451
Usher | 2408 | 1771 | 1889 | 1913 | 2037 | 2913 | 1935 | 1820 | 1784 | 1831 | 1743 | 1593
Sean Paul | 639 | 1126 | 1618 | 821 | 695 | 591 | 600 | 432 | 452 | 488 | 462 | 496
Booba | 364 | 239 | 268 | 349 | 1354 | 997 | 815 | 620 | 488 | 476 | 472 | 415
Big Sean | 2727 | 2625 | 2786 | 2628 | 2286 | 2398 | 2127 | 2160 | 2273 | 2151 | 2038 | 1733
Muse | 719 | 666 | 710 | 725 | 693 | 698 | 862 | 1626 | 1250 | 1249 | 1163 | 1063
Mumford & Sons | 254 | 249 | 307 | 360 | 326 | 225 | 381 | 616 | 1107 | 1076 | 682 | 578
Chris Brown | 5183 | 5020 | 5134 | 4775 | 4801 | 5237 | 5758 | 4950 | 4616 | 5077 | 5324 | 5194
Lana Del Rey | 956 | 908 | 919 | 1004 | 998 | 1111 | 1141 | 1204 | 1735 | 1497 | 1333 | 1185
LMFAO | 411 | 340 | 433 | 1185 | 357 | 370 | 362 | 312 | 319 | 321 | 333 | 290
Avenged Sevenfold | 1416 | 1167 | 1219 | 1318 | 1367 | 1444 | 1524 | 1526 | 2033 | 1710 | 1629 | 1555

Table 6 Tweets Mentioning Artist Name
Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012
One Direction | 1601 | 461 | 1670 | 2551 | 2240 | 2355 | 2496 | 1743 | 1462 | 1741 | 1565 | 1684
Rihanna | 704 | 176 | 621 | 592 | 657 | 901 | 761 | 855 | 702 | 2173 | 1060 | 782
Justin Bieber | 1387 | 336 | 1208 | 1114 | 1086 | 1396 | 1625 | 1321 | 1049 | 1407 | 1226 | 1213
The Weeknd | 81 | 21 | 88 | 76 | 64 | 70 | 87 | 167 | 196 | 168 | 123 | 102
Wiz Khalifa | 135 | 53 | 196 | 163 | 128 | 173 | 164 | 208 | 290 | 220 | 190 | 170
Taylor Swift | 475 | 122 | 536 | 407 | 372 | 490 | 544 | 425 | 616 | 570 | 434 | 397
Kendrick Lamar | 129 | 69 | 192 | 176 | 132 | 140 | 144 | 162 | 184 | 205 | 169 | 159
Lupe Fiasco | 38 | 11 | 40 | 95 | 38 | 29 | 41 | 64 | 92 | 112 | 68 | 56
Kanye West | 186 | 55 | 260 | 254 | 325 | 197 | 180 | 275 | 173 | 164 | 160 | 138
fun. | 19991 | 5149 | 19987 | 19607 | 19233 | 17360 | 21722 | 20494 | 18568 | 19897 | 19911 | 19073
Usher | 712 | 94 | 293 | 349 | 224 | 294 | 283 | 242 | 231 | 243 | 212 | 245
Sean Paul | 18 | 10 | 37 | 36 | 23 | 19 | 25 | 13 | 28 | 30 | 31 | 24
Booba | 40 | 20 | 55 | 46 | 320 | 155 | 120 | 82 | 58 | 59 | 39 | 57
Big Sean | 221 | 62 | 306 | 219 | 207 | 154 | 159 | 177 | 201 | 192 | 187 | 146
Muse | 479 | 150 | 505 | 504 | 539 | 414 | 472 | 828 | 687 | 874 | 657 | 738
Chris Brown | 647 | 134 | 492 | 475 | 424 | 407 | 420 | 384 | 430 | 530 | 590 | 503
Mumford & Sons | 58 | 8 | 58 | 75 | 82 | 68 | 295 | 278 | 444 | 423 | 248 | 221
Lana Del Rey | 127 | 17 | 100 | 138 | 124 | 110 | 131 | 221 | 222 | 180 | 138 | 143
LMFAO | 1655 | 378 | 1682 | 1478 | 1705 | 1331 | 1646 | 1698 | 1464 | 1498 | 1551 | 1409
Avenged Sevenfold | 53 | 15 | 46 | 40 | 50 | 44 | 42 | 122 | 99 | 64 | 51 | 48

Table 7 Wikipedia Page Views
Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012
One Direction | 43542 | 40190 | 38915 | 52565 | 66942 | 66926 | 69991 | 61740 | 56044 | 56684 | 57460 | 58296
Rihanna | 22788 | 20403 | 19714 | 19327 | 20020 | 22175 | 26847 | 25124 | 21460 | 35051 | 31441 | 25774
Justin Bieber | 25382 | 35677 | 43740 | 33437 | 33934 | 35103 | 35688 | 33378 | 31934 | 36907 | 28481 | 26227
The Weeknd | 5772 | 5399 | 5235 | 5406 | 5213 | 4837 | 5423 | 6867 | 10083 | 8252 | 7668 | 7183
Wiz Khalifa | 14312 | 15344 | 13852 | 13415 | 14100 | 11455 | 11857 | 14020 | 16444 | 14164 | 15942 | 16766
Taylor Swift | 23507 | 21995 | 23065 | 20482 | 20380 | 20931 | 24280 | 23126 | 25366 | 24277 | 22599 | 21318
Kendrick Lamar | 9271 | 9862 | 11377 | 10160 | 9472 | 8449 | 8860 | 10258 | 10179 | 11753 | 10450 | 9548
Lupe Fiasco | 6793 | 10948 | 9445 | 17509 | 12520 | 9900 | 9235 | 12480 | 21001 | 20396 | 15577 | 12432
Kanye West | 25727 | 26980 | 27173 | 24738 | 23533 | 20917 | 26120 | 24801 | 21502 | 20264 | 18728 | 16498
fun. | 15369 | 19897 | 15970 | 15645 | 16011 | 23781 | 19548 | 15864 | 14253 | 14605 | 16264 | 14583
Usher | 18724 | 11268 | 9113 | 7965 | 8517 | 9853 | 10212 | 9385 | 8555 | 9869 | 8281 | 7450
Sean Paul | 3020 | 3358 | 3670 | 3216 | 3213 | 3038 | 3176 | 3058 | 2646 | 2772 | 2789 | 2952
Booba | 300 | 258 | 283 | 338 | 509 | 612 | 534 | 549 | 438 | 389 | 397 | 330
Big Sean | 8431 | 9023 | 10264 | 9478 | 8968 | 8086 | 8186 | 8296 | 7802 | 8299 | 8706 | 8325
Muse | 9781 | 9368 | 8946 | 9092 | 10120 | 9129 | 9890 | 17027 | 20812 | 18893 | 16374 | 18684
Chris Brown | 10586 | 9553 | 9627 | 9452 | 9469 | 9132 | 9408 | 9375 | 9944 | 9972 | 12709 | 10397
Mumford & Sons | 13487 | 11466 | 13454 | 15745 | 18287 | 16855 | 74075 | 46883 | 43869 | 45929 | 38566 | 32543
Lana Del Rey | 16900 | 16694 | 17133 | 21064 | 32763 | 22071 | 24944 | 28890 | 37072 | 31574 | 29493 | 26437
LMFAO | 5746 | 5510 | 5050 | 5497 | 29239 | 36749 | 19976 | 19648 | 21878 | 12809 | 10655 | 9335
Avenged Sevenfold | 6303 | 5274 | 5022 | 5306 | 5367 | 5016 | 4741 | 9487 | 10444 | 8402 | 7905 | 6703

Table 8 Wikipedia Page Edits
Artist | 17-9-2012 | 18-9-2012 | 19-9-2012 | 20-9-2012 | 21-9-2012 | 22-9-2012 | 23-9-2012 | 24-9-2012 | 25-9-2012 | 26-9-2012 | 27-9-2012 | 28-9-2012
One Direction | 7 | 1 | 6 | 9 | 2 | 2 | 3 | 0 | 2 | 2 | 7 | 8
Rihanna | 7 | 0 | 3 | 1 | 1 | 6 | 0 | 1 | 3 | 7 | 2 | 1
Justin Bieber | 2 | 0 | 0 | 2 | 2 | 3 | 0 | 0 | 2 | 6 | 1 | 0
The Weeknd | 3 | 0 | 2 | 1 | 1 | 4 | 3 | 2 | 5 | 1 | 0 | 4
Wiz Khalifa | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Taylor Swift | 5 | 1 | 1 | 0 | 2 | 0 | 9 | 1 | 5 | 19 | 4 | 18
Kendrick Lamar | 1 | 0 | 6 | 2 | 0 | 0 | 1 | 2 | 6 | 5 | 1 | 2
Lupe Fiasco | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1
Kanye West | 0 | 0 | 4 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 1 | 0
fun. | 10 | 3 | 1 | 3 | 0 | 4 | 2 | 0 | 3 | 2 | 1 | 1
Usher | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3
Sean Paul | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 2
Booba | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
Big Sean | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1
Muse | 0 | 0 | 6 | 3 | 2 | 3 | 0 | 0 | 4 | 0 | 9 | 4
Chris Brown | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 4 | 14 | 2 | 1 | 0
Mumford & Sons | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Lana Del Rey | 2 | 0 | 0 | 9 | 1 | 0 | 2 | 18 | 11 | 8 | 3 | 8
LMFAO | 0 | 2 | 1 | 1 | 10 | 3 | 3 | 3 | 1 | 0 | 0 | 0
Avenged Sevenfold | 0 | 4 | 3 | 7 | 12 | 6 | 0 | 15 | 5 | 10 | 5 | 4
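The daily series in the appendix tables can be used to recompute correlations between sources for a single artist. The sketch below assumes Pearson's correlation coefficient, which may differ from the measure used in Section 7.3; the function name `pearson` is illustrative. It uses the One Direction rows from Table 5 (plays), Table 7 (page views) and Table 8 (page edits).

```python
from math import sqrt, isnan


def pearson(xs, ys):
    """Pearson correlation of two equally long daily series.

    Returns NaN when either series is constant, since the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# One Direction, 17-9-2012 to 28-9-2012
plays = [2619, 2357, 2642, 9911, 10958, 9188,
         9110, 5303, 4884, 4916, 4632, 4624]        # Table 5
page_views = [43542, 40190, 38915, 52565, 66942, 66926,
              69991, 61740, 56044, 56684, 57460, 58296]  # Table 7
page_edits = [7, 1, 6, 9, 2, 2, 3, 0, 2, 2, 7, 8]    # Table 8

r_views = pearson(plays, page_views)
r_edits = pearson(plays, page_edits)
print(f"plays vs. page views: {r_views:.2f}")
print(f"plays vs. page edits: {r_edits:.2f}")
```

A series that is constant over the whole period, such as the Wikipedia edits for Wiz Khalifa or Mumford & Sons in Table 8, has no defined correlation, which is why the function returns NaN in that case rather than raising an error.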