Your Two Weeks of Fame and Your Grandmother’s∗ James Cook UC Berkeley† [email protected] Atish Das Sarma eBay Research Labs‡ [email protected] Alex Fabrikant Google Research [email protected] Andrew Tomkins Google Research [email protected] April 19, 2012 Abstract much shorter timescales specifically in the context of information spreading on Twitter and similar social Did celebrity last longer in 1929, 1992 or 2009? We networking sites. To the best of our knowledge, this is investigate the phenomenon of fame by mining a col- the first massive scale study of this nature that spans lection of news articles that spans the twentieth cen- over a century of archived data, thereby allowing us tury, and also perform a side study on a collection to track changes across decades. of blog posts from the last 10 years. By analyzing Keywords: culturomics, media, attention modeling, mentions of personal names, we measure each persocial media, time series, historical trends, fame duson’s time in the spotlight, using two simple metration, news archives rics that evaluate, roughly, the duration of a single news story about a person, and the overall duration of public interest in a person. We watched Introduction the distribution evolve from 1895 to 2010, expect- 1 ing to find significantly shortening fame durations, Beginning in the 19th century, long-distance commuper the much popularly bemoaned shortening of socinication transitioned from foot to telegraph on land, ety’s attention spans and quickening of media’s news and from sail to steam to cable by sea. Each new cycles. Instead, we conclusively demonstrate that, form of technology began with a limited number of through many decades of rapid technological and sodedicated routes, then expanded to reach a large fraccietal change, through the appearance of Twitter, tion of the accessible audience, eventually resulting in communication satellites, and the Internet, fame dunear-complete deployment of digital electronic comrations did not decrease, neither for the typical case munication. Each transition represented an opportunor for the extremely famous, with the last statisnity for news to travel faster, break more uniformly, tically significant fame duration decreases coming in and reach a broad audience closer to its time of inthe early 20th century, perhaps from the spread of ception. telegraphy and telephony. Furthermore, while meEven today, the increasing speed of the news cydian fame durations stayed persistently constant, for cle is a common theme in discussions of the socithe most famous of the famous, as measured by either etal implications of technology. Stories break faster, volume or duration of media attention, fame duraare covered in less detail, and news sources quickly tions have actually trended gently upward since the move on to other topics. Online and cable outlets ag1940s, with statistically significant increases on 40gressively search for novelty in order to keep eyeballs year timescales. Similar studies have been done with glued to screens. Popular non-fiction dedicates significant coverage to this trend, which by 2007 prompted ∗ This version supercedes the short version of this paper pubThe Onion 1 , a satirical website, to offer the followlished in the proceedings of WWW 2012. † Work ‡ Work done while interning at Google. done while at Google Research 1 http://www.theonion.com/ 1 ing commentary on cable news provider CNN’s2 offerings [1]: “CNN is widely credited with initiating the acceleration of the modern news cycle with the fall 2006 debut of its spin-off channel CNN:24, which provides a breaking news story, an update on that story, and a news recap all within 24 seconds.” With this speed-up of the news cycle comes an associated concern that, whether or not causality is at play, attention spans are shorter, and consumers are only able to focus for progressively briefer periods on any one news subject. Stories that might previously have occupied several days of popular attention might emerge, run their course, and vanish in a single day. This popular theory is consistent with a suggestion by Herbert Simon [10] that as the world grows rich in information, the attention necessary to process that information becomes a scarce and valuable resource. The speed of the news cycle is a difficult concept to pin down. We focus our study on the most common object of news: the individual. An individual’s fame on a particular day might be thought of as the probability with which a reader picking up a news article at random would see their name. From this idea we develop two notions of the duration of the interval when an individual is in the news. The first is based on fall-off from a peak, and intends to capture the spike around a concrete, narrowly-defined news story. The second looks for period of sustained public interest in an individual, from the time the public first notices that person’s existence until the public loses interest and the name stops appearing in the news. We study the interaction of these two notions of “duration of fame” with the radical shifts in the news cycle we outline above. For this purpose, we employ Google’s public news archive corpus, which contains over sixty million pages covering 250 years, and we perform what we believe to be the first study of the dynamics of fame over such a time period. Data within the archive is heterogeneous in nature, ranging from directly captured digital content to optical character recognition employed against microfilm representations of old newspapers. The crawl is not complete, and we do not have full information about which items are missing. Rather than attempt topic detection and tracking in this error-prone environment, we instead directly employ a recognizer for person names to all content within the corpus; this approach is more robust, and more aligned with our goal of studying fame of individuals. Based on these different notions of periods of refer- ence to a particular person, we develop at each point in time a distribution over the duration of fame of different individuals. Our expectation upon undertaking this study was that in early periods, improvements to communication would cause the distribution of duration of coverage of a particular person to shrink over time. We hypothesized that, through the 20th century, the continued deployment of technology, and the changes to modern journalism resulting from competition to offer more news faster, would result in a continuous shrinking of fame durations, over the course of the century into the present day. Summary of findings. We did indeed observe fame durations shortening somewhat in the early 20th century, in line with our hypothesis regarding accelerating communications. However, from 1940 to 2010, we see quite a different picture. Over the course of 70 years, through a world war, a global depression, a two order of magnitude growth in (available) media volume, and a technological curve moving from party-line telephones to satellites and Twitter, both of our fame duration metrics showed that neither the typical person in the news, i.e. the median fame duration, nor the most famous, i.e. high-volume or longduration outliers, experienced any statistically significant decrease in fame durations. As a matter of fact, the bulk of the distribution, as characterized by median fame durations, stayed constant throughout the entire century-long span of the news study and was also the same through the decade of Blogger posts on which we ran the same experiments. As another heuristic characterization of the bulk of the distribution, both news and Blogger data produced roughly comparable parameters when fitted to a power law: an exponent of around -2.5, although with substantial error bars, suggesting that the fits were mediocre. Furthermore, when we focused our attention on the very famous, by various definitions, all signs pointed to a slow but observable growth in fame durations. From 1940 onward, on the scale of 40-year intervals, we found statistically significant fame duration growth for the “very famous”, defined as either: • people whose fame lasts exceptionally long: 90th and 99th percentiles of fame duration distributions; or • exceptionally highly-discussed people: using dis- 2 http://www.cnn.com/ 2 there is consistently a very substantial volume of articles per day, as well as a wide diversity of publications. For the sake of statistical significance, our study focuses on the years 1895–2011. The news corpus contains a mix of modern articles obtained from the publisher in the original digital form, as well as historical articles scanned from archival microform and OCRed, both by Google and by third parties. For scanned articles, per-article metadata such as titles, issue dates, and boundaries between articles are also derived algorithmically from the OCRed data, rather than manually curated. Our study design was driven by several features that we discovered in this massive corpus. We list them here to explain our study design. Also, data mining for high-level behavioral patterns in a diachronous, heterogeneous, partially-OCRed corpus of this scale is quite new, precedented on this scale perhaps only by [9] (which brands this new area as “culturomics”). But, with the rapid digitization of historical data, we expect such work to boom in the near future. We thus hope that the lessons we have learned about this corpus will also be of independent interest to others examining this corpus and other similar archive corpuses. Figure 1: The volume of news articles by date. tributions among just the top 1000 people or the top 0.1% of people by number of mentions within each year. In the case of taking the 1000 most-oftenmentioned names in each year, the increasing could be explained as follows: as the corpus increases in volume toward later years, a larger number of names appear, representing more draws from the same underlying distribution of fame durations. The quantiles of the distribution of duration for the top 1000 elements will therefore grow over time as the corpus volume increases. On the other hand, our experiments that took the top 0.1% most-often-mentioned names, or the top quantiles of duration, still showed an increasing trend. We therefore conclude that the increasing trend is not completely caused by an increase in corpus volume. To summarize, we find that the most famous figures 2.1 Corpus features, misfeatures, and missteps in today’s news stay in the limelight for longer than their counterparts did in the past. At the same time, 2.1.1 News mentions as a unit of attention however, the average newsworthy person remains in the limelight for essentially the same amount of time Our 116-year study of the news corpus aims to extoday as in the past. tend the rich literature studying topic attention in online social media like Twitter, typically over the span of the last 3–5 years. Needless to say, 100-year2 Working with the news cor- old printed newspapers are an imperfect proxy for the attention of individuals, which has only recently pus become directly observable via online behavior. ImWe perform our main study on a collection of the plicit in the heart of our study is the assumption that more than 60 million news articles in the Google news articles are published to serve an audience, and archive that are both (1) in English, and (2) search- the media makes an effort, even if imperfect, to cater able and readable by Google News users at no cost. In to the audience’s information appetites. We coarsely Section 5, we cross-validate our observations against approximate a unit of attention as one occurrence the corpus of public blog posts on Blogger, which is in a Google News archive article, and we leave open a number of natural extensions to this work, such as described there. The articles of the news corpus span a wide range weighting articles by historical publication subscriber of time, with the relative daily volume of articles over counts, or by size and position on the printed page. the range of the corpus shown in Figure 1. There are a Due to the automated OCR process, not every handful of articles from the late 18th century onward, “item” in the corpus can be reasonably declared a and the article coverage grows rapidly over the course news article. For example, a single photo caption of the 19th century. From the last decade of the 19th might be extracted as an independent article, or a century through the end of the corpus (March 2011), sequence of articles on the same page might be mis3 interpreted as a single article. Rather than weighting each of these corpus items equally when measuring the attention paid to a name, we elected to count multiple mentions of a name within an item separately, so that articles will tend to count more than captions, and there is no harm in mistakenly grouping multiple articles as one. We manually examined (A) a uniform sample of 50 articles from the whole corpus (which, per Fig. 1, contains overwhelmingly articles from the last decade), and (B) a uniform sample of 50 articles from 1900– Figure 2: Articles with recognized personal names 1925. We classified each sample into: • News articles: timely content, formatted as a per decade stand-alone “item”, published without external sponsorship, for the benefit of part of the publiThere are a myriad heuristics to define a computacation’s audience, tionally feasible model of a “single topic” that can • News-like items: non-article text chunks where a be thought to receive and lose the public’s attention. name mention can qualify as that person being But over the course of a century, the changes in so“in the news” — e.g. photo captions or inset ciety, media formatting, subjects of public discourse, quotes, writing styles, and even language itself are substan• Non-news: ads and paid content, sports scores, tial enough that neither sophisticated statistical modrecipes, news website comments miscategorized els trained on plentiful, well-curated training data from modern media nor simple generic approaches as news, etc. The number of items of each type in the two samples like word co-occurrence in titles are guaranteed to work well. Very few patterns connect articles from are given in the following table. 1910 newspapers’ “social” sections (now all but forfull corpus sample 1900–1925 sample gotten) about tea at Mrs. Smith’s, to 1930 articles news articles 31 28 about the arrival of a trans-oceanic liner, to 2009 arnews-like items 3 2 ticles about a viral Youtube video. 16 20 non-news items After trying out general proper noun phrases proWe expect that the similarity in these distributions duced inconclusively noisy results, we decided to foshould result in minimal noise in the cross-temporal cus on occurrences of personal names, detected in comparisons, and leave to future work the task of the text by a proprietary state-of-the-art statistical automatically distinguishing real news stories from recognizer. Personal names have a relatively stable non-news. presence in the media: even with high OCR error rates in old microform, over 1/7th of the articles even 2.1.2 Compensating for coverage in the earliest decades since 1900 contain recognized personal names (see Figure 2). Even once we discard the more sparsely covered 18th But personal names are not without historical and 19th centuries, there is still more than an order of caveats, either. A woman appearing in 2005 stories magnitude difference between article volume in 1895 as “Jane Smith” would be much more likely to be exand 2011. We address these coverage differences by clusively referenced as “Mrs. Smith”, or even “Mrs. downsampling the data down to the same number of John Smith”, in 1915. Also, the English-speaking articles for each month in this range. We address the world was much more Anglo-centric in 1900 than now, nuanced effects of this downsampling on our methodwith much less diversity of names. An informal samology in Section 3.3. ple suggests that most names with non-trivial news presence 100 years ago referred overwhelmingly to a 2.1.3 Evolution of discourse and media — single bearer of that name for the duration of a parwhy names? ticular news topic, but many names are not unique We set out originally to understand changes in the when taken across the duration of the whole corpus public’s attention as measured by news story topics. — for instance, “John Jacob Astor”, appearing in the 1 0.8 0.6 0.4 0.2 0 1900 4 1920 1940 1960 1980 2000 news heavily over several decades (Fig. 3), in reference to a number of distinct relatives. On account of both of these phenomena, among others, we aim to focus on name appearance patterns that are most likely to represent a single news story or contiguous span of public attention involving that person, rather than trying to model the full media “lifetime” of individuals, as we had considered doing at the start of this project. 2.1.4 newsworthy people. We should note that OCR errors are noticeably more frequent on older microfilm, but the reasonable availability of recognizable personal names even in 100-year-old articles, per Fig. 2, suggests that this problem is not dire. A manuallycoded sample of 50 articles with recognized names from the first decade of the 1900s showed only 8 out of 50 articles having incorrectly recognized names (including both OCR errors and non-names mis-tagged as names). OCR errors in data and metadata 2.1.5 We empirically discovered another downfall of studying long-term “media lifetimes” of individuals. In an early experiment, we measured, for each personal name, the 10th and 90th percentiles of the dates of that name’s occurrence in the news. We then looked at the time interval between 10th and 90th percentiles, postulating that a large enough fraction of names are unique among newsworthy individuals that the distribution of these inter-quantile gaps could be a robust measure of media lifetime. After noticing a solid fraction of the dataset showing interquantile gaps on the scale of 10-30 years, we examined a heat map of gap durations, and discovered a regular pattern of gap durations at exact-integer year offsets, which, other than for Santa Claus, Guy Fawkes, and a few other clear exceptions, seemed an improbable phenomenon. This turned out to be an artifact of OCRed metadata. In particular, the culprit was single-digit OCR errors in the scanned article year. While year errors are relatively rare, every long-tail name that occurred in fewer than 10 articles (often within a day or two of each other), and had a mis-OCRed error for one of those occurrences contributed probability mass to integral-number-of-years media lifetimes. As extra evidence, the heat map had a distinct outlier segment of high probability mass for inter-quantile range of exactly 20 years, starting in the 1960s and ending in the 1980s — the digits 6 and 8 being particularly easy to mistake on blurry microfilm. Note that shortterm phenomena are relatively safe from OCR date errors, thanks to the common English convention of written-out month names, and to the low impact of OCR errors in the day number. OCR errors in the article text itself are ubiquitous. Conveniently, the edit distance between two recognizable personal names is rarely very short, so by agreeing to discard any name that occurs only once in the corpus, we are likely to discard virtually all OCR errors as well, with no impact on data on substantially Simultaneity and publishing cycles There are also pitfalls with examining short timelines. In the earliest decades we examine, telegraph was widely available to news publishers, but not fully ubiquitous, with rural papers often reporting news “from the wire” several days after the event. An informal sample seems to suggest that most news by 1900 propagated across the world on the scale of a few days. Also, many publications in the corpus until the last 20 years or so were either published exclusively weekly or, in the case of Sunday newspaper issues, had substantially higher volume once a week, resulting in many otherwise obscure names having multiple news mentions separated by one week — a rather different phenomenon than a person remaining in the daily news for a full week. On account of both of these, we generally disregard news patterns that are shorter than a few days in our study design. 3 Measuring Fame We begin by producing a list of names for each article. To do this, we extract short capitalized phrases from the body text of each article, and keep phrases recognized by an algorithm to be personal names. For every name that appears in the input, we consider that name’s timeline, which is the multiset of dates at which that name appears, including multiple occurrences within an article. We intend the timeline to approximate the frequency with which a person browsing the news at random on a given day would encounter that name. The accuracy of this approximation will depend on the volume of news articles available. In order to avoid the possibility that any trends we detect are caused by variations in this accuracy caused by variations in the volume of the corpus, we randomly choose an approximately equal number of articles to work with from each month. We describe and analyze this process in Section 3.3. 5 In general, our method can be applied to any collection of timelines. In Section 5, we apply it to names extracted from blog posts. 3.1 Finding Periods of Fame Once we have computed a timeline for each name that appears in the corpus, we select a time during which we consider that name to have had its period of fame, using one of the two methods described below. In order to compare the phenomenon of fame at different points in time, we consider the joint distribution of two variables over the set of names: the peak date and the duration of the name’s period of fame. We try the following two methods to compute a peak date and Figure 3: Timelines for “Marilyn Monroe” (top) and duration for each timeline. “John Jacob Astor” (bot). • Spike method. This method intends to capture the spike in public attention surrounding a particular news story. We divide time into one-week “John Jacob Astor”, normalized by article counts. intervals and consider the name’s rate of occur- The spike method identifies as the peak the death rence in each interval. The week with the highest of John Jacob Astor III of the wealthy Astor family, rate is considered to be the peak date, and the with a duration of 38 days (March 8 to February 15, period extends backward and forward in time as 1890). The continuity method identifies instead the long as the rate does not drop below one tenth death of his nephew John Jacob Astor IV, who died its maximum rate. Yang and Leskovec [13] used on the Titanic, with a period of five months [12]. The a similar method in their study of digital media, period begins on March 23, 1912, three weeks before using a time scale of hours where we use weeks. the Titanic sank, and ends August 31. Many of the later occurrences of the name are historical mentions • Continuity method. This method intends to of the sinking of the Titanic. measure the duration of public interest in a person. We define a name’s period of popularity to be the longest span of time within which there is 3.2 Choosing the Set of Names no seven-day period during which it is not men- Basic filtering In all our experiments, to reduce tioned. The peak date falls halfway between the noise, we discard the names which occurred less than beginning and the end of the period. We find, in ten times, or whose fame durations are less than two Section 4, that durations are short compared to days. (In both methods, a name whose fame begins the time span of the study, so using any choice Monday and ends Wednesday is considered to have of peak date between the beginning and end will a duration of two days.) We also remove peaks that produce similar distributions. end in 2011 or later, since these peaks might extend further if our news corpus extended further in the future. To demonstrate the distinction between these two methods, Figure 3 shows the occurrence timeline for Marilyn Monroe. The “continuity method” picks out the bulk of her fame — 1952-02-13 (“A”) through 1961-11-15 (“D”), by which point her appearance in the news was reduced to a fairly low background level. The “spike method” picks out the intense spike in interest surrounding her death, yielding the range 19627-18 (“E”) – 1962-8-29 (“H”). Very often these two methods identify short moments of fame within a much longer context. For example, in Figure 3, we see the timeline for the name Top 1000 by year For each peak type, we repeat our experiment with the set of names restricted in the following way. We counted the total number of times each name appeared in each year (counting repeats within an article). For each year, we produced the set of the 1000 most frequently mentioned names in that year. We took the union of these sets over all years, and ran our experiments using only the names in this set. Note that a name’s peak of popularity 6 name ν is a sequence of independent random variables Xν,t ∼ Binom(fν (t), nt ). Our goal is to ensure that any measurements we take are independent of the values nt . To accomplish this independence of news volume, we randomly sampled news articles so that the ex′ pected number in each month was nmin . Let Xν,t be the number of sampled articles containing name ν. If we were to randomly sample nmin articles ′ without replacement, then we would have Xν,t ∼ Binom(fν (t), nmin ). Notice that the joint distribu′ tion of the random variables Xν,t is unaffected by the article volumes nt . Any further measurement based ′ on the variables Xν,t will therefore also be unrelated to the sequence nt . In practice, instead of sampling exactly nmin articles without replacement, we flipped a biased coin for each of the nt articles at time t, including each article with probability nmin /nt . For a large enough volume of articles, the resulting measurements will be the same. We removed all articles published before 1895, since the months before 1895 had less than our target number nmin of articles. We also removed articles published after the end of the year 2010, to avoid having a month with news articles at the beginning but not the end of the month, but with the same number of sampled articles. As an example of the effect of downsampling, the blue dotted lines in Figure 9 show the 50th, 90th and 99th percentiles of the distribution of fame durations using the continuity method. We see that they increase suddenly in the last ten years, when our coverage of articles surges with the digital age. The red lines show the same measurement after downsampling: the surge no longer appears. need not be the same year in which that name was in the top 1000: so if a name is included in the top-1000 set because it was popular in a certain year, we may yet consider that name’s peak date to be a different year. Top 0.1% by year We consider that filtering to the top 1000 names in each year might introduce the following undesirable bias. Suppose names are assigned peak durations according to some universal distribution, and later years have more names, perhaps because of the increasing volume of news. If a name’s frequency of occurrence is proportional to its duration, then selecting the top 1000 names in each year will tend to produce names with longer durations of fame in years with a greater number of names. With this in mind, we considered one more restriction on the set of names. In each year y, we considered the total number of distinct names ny mentioned in that year. We then collected the top ny /1000 names in each year y. We ran our experiments using only the names in the union of those sets. As with the top-1000 filtering, a name’s peak date will not necessarily be the same year for which it was in the top 0.1% of names. 3.3 Sampling for Uniform Coverage The spike and continuity methods for identifying periods of fame may be affected by the volume of articles available in our corpus. For example, suppose a name’s timeline is generated stochastically, with every article between February 1 and March 31 containing the name with a 1% probability. If the corpus contains 10000 articles in every week, then both the spike and continuity methods will probably decide that the article’s duration is two months. However, if the corpus contains less than 100 articles in each week, then the durations will tend to be short, since there will be many weeks during which the name is not mentioned. We propose a model for this effect. Each name ν has a “true” timeline which assigns to each day t a probability fν (t) ∈ [0, 1] that an article on that day will mention ν.3 For each day, there is a total number of articles nt ; we have no knowledge of the relation between nt and ν, except that there is some lower bound nt > nmin for all t within some reasonable range of time. Then we suppose the timeline for 3.4 Graphing the Distributions We graph the joint distribution of peak dates and durations in two different ways. We consider the set of names which peak in successive five-year periods. Among each set of names, we graph the 50th, 90th and 99th percentile durations of fame. These appear as darker lines in the graphs; for example, the top of Fig. 6 shows the distribution for the spike method. The lighter solid red lines show the same three quantiles for shorter three-month periods. For comparison, the dashed light blue lines show the same results if the article sampling described in Sec. 3.3 is not performed (and articles before 1895 and after 2010 are not removed). Fig. 9 shows the same set of lines using the continuity method. All the later figures are 3 In fact, articles could mention the name multiple times, but in the limit of a large number of articles, this will not affect our analysis. 7 10 1 Estimating Power Law Exponents We test the hypothesis that the tail of the distribution of fame durations follows a power law. For a given five-year period, we collect all names which peak in that period, and consider 20% of the names with the longest fame durations – that is, we set dmin to be the 80th percentile of durations, and consider durations d > dmin . Among those 20%, we compute a maximum likelihood estimate of the power law exponent α̂, predicting that the probability of a duration d > dmin is p(d) ∝ dα̂ . Clauset et al [3] show that the P maximum likelihood estimate α̂ is given by n α̂ = 1+( i=1 ln(di /dmin )). We include a line on each plot of cumulative distributions of fame durations, of slope α̂ + 1 on the log-log graph because we plot cumulative distributions rather than density functions. The α̂ values we measure are discussed in the following sections, and summarized in Figure 4 for the news corpus and Figure 5 for the blog corpus. 3.6 50,90,99 not downsampled 3-month groups downsampled 3-month groups downsampled 5-year groups 100 190419161928194019521964197619882000 Peak of fame Cumulative fraction of names 3.5 Duration 50% 90% 99% (days) produced in the same way, except they do not include the non-sampled full distributions. The second type of graph focuses on one five-year period at a time. The bottom of Fig. 6 shows a cumulative plot showing the number of names with duration greater than that shown on the x-axis. This is plotted for many five-year periods. The graphs of measurements using the spike method look more like step functions because that method measures durations in seven-day increments, whereas the longeststretch method can yield any number of days. (Recall that peaks that last less than two days are removed.) 0.1 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 0.01 10 100 1000 Duration of fame (days) Figure 6: Fame durations measured using the spike method, plotted as the 50th, 90th and 99th percentiles over time (top) and for specific five-year periods (bottom). The bottom graph also includes a line showing the max-likelihood power law exponent for the years 2005-9. (The slope on the graph is one plus the exponent from Fig. 4, since we graph the cumulative distribution function.) To illustrate the effect of sampling for uniform article volume, the first graph includes measurements taken before sampling; see Sec. 3.3. Section 3.4 describes the format of the graphs in detail. Statistical Measurements We used bootstrapping to estimate the uncertainty in the four statistics we measured: the 50th, 90th and 99th percentile durations and of the best-fit power law exponents. For selected five-year periods, we sampled |S| names with replacement from the set S of names that peaked in that period of time. For each statistic, we repeated this process 25000 times, and reported the range of numbers within which 99% of our samples fell. The results are presented in Figures 4 (for the news corpus) and 5 (for the blog corpus). in each case plot the distribution of duration as it changes over time. Figures 6 and 9 show the evolution of the distribution of fame durations for the full set of names in the corpus (after the basic filtering described in Section 3.2) using the spike and continuity methods, respectively. (Section 3.4 describes the format of the graphs in detail.) Median durations For the entire period we studied, the median fame duration did not decrease, as we had expected, but rather remained completely con4 Results: News Corpus stant at exactly 7 days, for both the spike and the We measure periods of popularity using the spike continuity peak measurement methods. For the spike and continuity methods described in Section 3, and method alone, this would not have been surprising. 8 Fame duration quantiles (months) Fame duration quantiles (months) 50 90 99 3-month groups 5-year groups 10 1 190419161928194019521964197619882000 Peak of fame 1 1 (5-year buckets) 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 Cumulative fraction of names Cumulative fraction of names 10 190419161928194019521964197619882000 Peak of fame 1 0.1 50 90 99 3-month groups 5-year groups 0.01 0.1 (5-year buckets) 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 0.01 10 100 1000 Duration of fame (days) 10 100 1000 Duration of fame (days) Figure 7: Fame durations, restricting to the union of Figure 8: Fame durations, restricting to the union of the 1000 most-mentioned names in every year, using the 0.1% most-mentioned names in every year, meathe spike method to identify periods of fame. sured using the spike method. Peaks measured by the spike method are discretized two lines in the timelines of Figures 6 and 9, and to multiples of weeks, so a perennial median of 7 the columns “90 %ile” and “99 %ile” of the first days just shows that multi-week durations have never and fourth blocks of Figure 4. been common. On the other hand, the continuity • “Volume outliers” – the names which appear method freely admits fame durations in increments of the most frequently in the news, by being ei1 day, with only 1-day-long peaks filtered out. Yet, ther in the top 1000 most frequent names in some the median has remained at exactly 7 days for all year, or, separately, names in the top 0.1%, as the years studied, and, per the full-corpus “50th perper Section 3.2. The graphs for these subsets of centile” measurements, shown in blue in Figure 4, for names are shown in Figures 7 and 8 for the spike all decades where we’ve tried bootstrapping, 99% of method, and Figures 10 and 11 for the contibootstrapped samples also matched the 7-day meanuity method, and the statistical measurements surement exactly (for the continuity method and, less appear in blocks 2, 3, 5 and 6 of Figure 4. surprisingly, for the spike method). This gives strong statistical significance to the claim that 7 days is inFrom the 1900’s to the 1940’s, the fame durations deed a very robust measurement of typical fame duin both categories of outliers do tend to decrease, with ration, which has not varied in a century. the decreases across that time interval statistically signicantly lower-bounded by 1-2 weeks via 99% bootThe most famous We next consider specially the strapping intervals. Heuristically, this seems consisfame durations of the most famous names, in two tent with our original hypothesis that accelerating correlated, but distinct senses of “most famous”: communications shorten fame durations: 1-2 weeks • “Duration outliers” — people whose fame lasts is a reasonable delay to be incurred by sheer commumuch longer than typical, as measured by nications delay before the omnipresence of telegraphy the 90th and 99th percentiles of fame durations and telephony. We note with curiosity that this efwithin each year. These correspond to the top fect applies only to the highly-famous outliers rather 9 Fame duration quantiles (months) 100 50 90 99 3-month groups 5-year groups 10 1 1 Cumulative fraction of names Duration 50% 90% 99% (days) 190419161928194019521964197619882000 Peak of fame 50,90,99 not downsampled 3-month groups 50,90,99 downsampled 3-month groups 100 50,90,99 downsampled 5-year groups 10 0.1 (5-year buckets) 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 0.01 10 100 1000 Duration of fame (days) 190419161928194019521964197619882000 Peak of fame Cumulative fraction of names 1 0.1 Figure 10: Fame durations, restricting to the union of the 1000 most-mentioned names in every year, measured using the continuity method. (5-year buckets) 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 than the typical fame durations. We posit that this is perhaps due to median fame durations being typically attributable to people with only geographically local0.01 ized fame, which does not get affected by long com10 100 1000 munication delays. We leave to further work a more Duration of fame (days) nuanced study to test these hypotheses around localFigure 9: Fame durations measured using the con- ity and communication delays affecting news spread tinuity method, plotted as the 50th, 90th and 99th in the early 20th century. After the 1940’s, on the other hand, we see no such percentiles over time (top), and for specific five-year decrease. On the contrary, the durations of fame for periods (bottom). To illustrate the effect of sampling, both the duration outliers and the volume outliers rethe first graph includes measurements taken before verse the trend, and actually begin to slowly increase. sampling; see Section 3.3. Section 3.4 describes the Using the bootstrapping method, per Section 3.6, we format of the graphs in detail. get the results marked in red in Figure 4: in almost all of the outlier studies4 , we see that the increase in durations is statistically significant over 40-year gaps for both categories of fame outliers. For example, the median fame duration according to continuity peaks 4 7 out of the 8 outlier studies show statistically significant increases between the 1940’s and the 1980’s, and between the 1960’s and the 2000’s. The sole exception is the 90th percentile of the spike method. Given that the bootstrap values in that experiment, discretized to whole weeks, range between 3 and 4 weeks, we don’t consider it surprising that the increases there were not measured to be significant by 99% bootstrap intervals. 10 Fame duration quantiles (months) 100 -2.32 for spike peaks. In Figures 9 and 6 we show the actual distributions, and, for reference, comparisons with the power-law fit for the 2005-2009 data (a straight line on these log-log plots). Furthermore, the continuity peaks fits also support the above observation of slowly-growing long-tail fame durations from 1940 onward. That is, powerlaw exponents from 1940 onward slowly move toward zero, with statistically significant changes when compared at 40-year intervals. The fluctuations and the error bars for both methods are rather noticeable, though, suggesting that power laws make for only a mediocre fit to this data. 50 90 99 3-month groups 5-year groups 10 1 190419161928194019521964197619882000 Peak of fame Cumulative fraction of names 1 0.1 (5-year buckets) 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 power law slope 2005-9 5 0.01 10 100 1000 Duration of fame (days) Figure 11: Fame durations, restricting to the union of the 0.1% most-mentioned names in every year, measured using the continuity method. for the top 1000 names (50th percentile column of the fifth block) appears as “27 (25 .. 29)” in the period 1945-9 and “52 (49 .. 56)” for the period 1985-9: with 99% confidence, the median duration was less than 29 days in the former period, but greater than 49 days in the latter. We also ran experiments for names that have outlier durations within the subset of names with outlier volumes. The same general trends were seen there as with the above outlier studies, but, with a far shallower pool of data, the bootstrapping-based error bars were generally large enough to not paint a convincing, statistically significant picture. Power law fits The column titled “power law exponent” in Figure 4 shows the maximum likelihood estimates of the power law exponents for various fiveyear-long peak periods. We focus on the first and fourth blocks, which show the estimates for the full set of names for the spike method and the continuity method respectively. For both peak methods, the fitted power law exponents remain in fairly small ranges — between -2.77 and -2.45 for continuity peaks, and between -2.63 and 11 Results: Blog Posts We also ran our experiments on a second set of data consisting of public English-language blog posts from the Blogger service. We began by sampling so that the number of blog posts in each month in our data set was equal to the number of news articles we sampled in each month, as per Sec. 3.3. The cumulative graphs of fame duration from six experiments are shown in Fig. 12. We combine the two methods for identifying periods of fame with three sets of names described in Section 3.2. The respective distributions from the news corpus are superimposed for comparison. The graphs of fame duration measured using the continuity method are much smoother for the blog corpus than for the news corpus. This happens because whereas we only know which day each news article was written, we know the time of day each blog entry was posted. The continuity-method graphs (bottom of Figure 12) had a distinctive rounded cap which surprised us at first. We believe it is caused by the following effect. Peaks with only two mentions in them are fairly common, and have a simple distinctive distribution that is the difference between two sample dates conditioned on being less than a week apart. Since two dates that are longer than one week apart cannot constitute a longest-stretch peak, the portion of the graph with durations longer than one week does not include any names from this two-sample distribution, and so it looks different. Our estimates of power-law exponents only consider the longest 20% of durations, so they ignore this part of the graph. The estimates we computed for the power-law exponents of the duration distributions for blog data 7 are shown in Figure 5, and can be compared to the figures for news articles in Figure 4. The medians for both blogs and news for both methods are remarkably the same, with no statistically significant differences. The power law fits are also quite similar, although they show enough variation to produce statistically significant differences. Qualitatively, we take these as evidence that the fame distributions in news and blogs are coarsely similar, and that it is not unreasonable to consider these results as casting some light on more fundamental aspects of human attention to and interest in celebrities, rather than just on the quirks of the news business. We do leave open the question of accounting for the occasionally significant distinctions between outlier results for blogs, as compared to news, especially for outlier-volume continuity peaks. Acknowledgements The authors would like to thank Zoran Dimitrijevic and the Google News Archive team for their help with the data; Danny Wyatt, Ed Chi, and Rachel Schutt for statistical advice; and the anonymous reviewers for helpful suggestions. References [1] Media landscape redefined by 24-second news cycle. The Onion, 2007-06-01. http://www.theonion.com/articles/medialandscape-redefined-by-24second-newscycle,2213/. [2] E. Adar and L. A. Adamic. Tracking information epidemics in blogspace. WI 2005, pages 207–214. [3] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009. 6 Related Work Michel et al. [9] study a massive corpus of digitized content in an attempt to study cultural trends. The corpus they study is even larger than ours in terms of both volume and temporal extension. Leetaru [7] presents evidence that sentiment analysis of news articles from the past decade could have been used to predict the revolutions in Tunisia, Egypt and Libya. Our spike method for identifying periods of fame is motivated in part by the work of Yang and Lescovec [13] on identifying patterns of temporal variation on the web. Szabo and Huberman [11] also consider temporal patterns, in their case regarding consumption of particular content items. Kleinberg studies other approaches to identification of bursts [6]. Numerous works have studied the propagation of topics through online media. Leskovec et al. [8] develop techniques for tracking short “memes” as they propagate through online media, as a means to understanding the news cycle. Adar and Adamic [2], and Gruhl et al. [5] consider propagation of information across blogs. [4] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. WWW 2004, pages 482–490. [5] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. WWW 2004, pages 491–501. [6] J. Kleinberg. Bursty and hierarchical structure in streams. KDD 2002, pages 91–101. [7] K. Leetaru. Culturomics 2.0: Forecasting largescale human behavior using global news media tone in time and space. First Monday, 16(9-5), 2011. [8] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. KDD 2009, pages 497–506. [9] J-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011. Finally, a range of tools and systems provide access to personalized news information; see Gabrilovich et [10] H. A. Simon. Designing organizations for an al [4] and the references therein for pointers. information-rich world. 1971. 12 [11] G. Szabo and B. A. Huberman. Predicting the popularity of online content. Commun. ACM, 53:80–88, August 2010. [12] Wikipedia. Astor family — Wikipedia, the free encyclopedia, 2011. [Online; accessed 10August-2011]. [13] J. Yang and J. Leskovec. Patterns of temporal variation in online media. WSDM 2011, pages 177–186. 13 method spike spike spike spike spike spike spike spike spike spike spike spike spike spike spike spike spike spike continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity continuity filtering all all all all all all top 1000 top 1000 top 1000 top 1000 top 1000 top 1000 top 0.1% top 0.1% top 0.1% top 0.1% top 0.1% top 0.1% all all all all all all top 1000 top 1000 top 1000 top 1000 top 1000 top 1000 top 0.1% top 0.1% top 0.1% top 0.1% top 0.1% top 0.1% period 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 1905-9 1925-9 1945-9 1965-9 1985-9 2005-9 50th %ile (days) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 21 (21 .. 21) 21 (14 .. 21) 21 (14 .. 21) 21 (21 .. 21) 21 (21 .. 28) 35 (28 .. 35) 35 (28 .. 42) 28 (21 .. 35) 21 (21 .. 28) 28 (21 .. 35) 35 (28 .. 35) 35 (35 .. 42) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 7 (7 .. 7) 24 (23 .. 26) 22 (21 .. 24) 27 (25 .. 29) 34 (32 .. 35) 52 (49 .. 56) 87 (80 .. 91) 66 (59 .. 79) 53 (47 .. 61) 57 (52 .. 66) 69 (61 .. 79) 85 (78 .. 94) 113 (107 .. 119) 90th %ile (days) 28 (28 .. 28) 28 (28 .. 28) 21 (21 .. 28) 21 (21 .. 28) 21 (21 .. 28) 28 (28 .. 28) 63 (56 .. 70) 49 (46 .. 56) 49 (42 .. 49) 56 (49 .. 63) 63 (56 .. 78) 99 (84 .. 119) 122 (91 .. 155) 63 (56 .. 82) 56 (49 .. 67) 70 (63 .. 99) 90 (70 .. 113) 119 (99 .. 140) 20 (19 .. 21) 18 (17 .. 19) 16 (15 .. 16) 17 (16 .. 18) 18 (17 .. 18) 21 (20 .. 21) 69 (62 .. 76) 58 (53 .. 66) 66 (57 .. 80) 92 (81 .. 104) 135 (118 .. 147) 229 (211 .. 250) 146 (126 .. 176) 125 (104 .. 161) 150 (123 .. 194) 168 (143 .. 214) 187 (158 .. 216) 271 (246 .. 306) 99th %ile (days) 91 (78 .. 106) 65 (63 .. 78) 56 (49 .. 63) 63 (56 .. 70) 70 (63 .. 78) 84 (78 .. 91) 155 (133 .. 192) 91 (78 .. 113) 91 (70 .. 130) 119 (99 .. 164) 161 (121 .. 366) 309 (224 .. 439) 289 (161 .. 381) 145 (91 .. 218) 133 (84 .. 161) 162 (119 .. 494) 327 (140 .. 443) 338 (263 .. 557) 70 (64 .. 79) 64 (56 .. 71) 53 (49 .. 58) 66 (58 .. 75) 77 (71 .. 83) 101 (96 .. 108) 166 (136 .. 229) 176 (131 .. 338) 211 (169 .. 332) 262 (203 .. 622) 312 (231 .. 739) 649 (532 .. 752) 968 (209 .. 4287) 476 (258 .. 2498) 419 (218 .. 1089) 713 (261 .. 874) 732 (276 .. 892) 681 (614 .. 874) power law exponent -2.45 (-2.55 .. -2.21) -2.63 (-2.74 .. -2.33) -2.44 (-2.50 .. -2.38) -2.37 (-2.44 .. -2.31) -2.32 (-2.36 .. -2.27) -2.48 (-2.53 .. -2.43) -2.75 (-3.15 .. -2.56) -3.22 (-3.74 .. -2.99) -3.33 (-3.73 .. -2.89) -2.90 (-3.54 .. -2.65) -2.85 (-3.19 .. -2.57) -2.64 (-2.96 .. -2.44) -2.82 (-3.96 .. -2.36) -3.49 (-4.82 .. -2.92) -3.35 (-4.32 .. -2.78) -2.90 (-3.77 .. -2.47) -2.66 (-3.13 .. -2.35) -2.76 (-3.10 .. -2.44) -2.67 (-2.76 .. -2.59) -2.64 (-2.72 .. -2.53) -2.74 (-2.82 .. -2.66) -2.58 (-2.69 .. -2.52) -2.48 (-2.56 .. -2.44) -2.43 (-2.46 .. -2.40) -3.01 (-3.35 .. -2.70) -3.01 (-3.39 .. -2.67) -2.92 (-3.32 .. -2.59) -2.75 (-3.11 .. -2.48) -3.20 (-3.62 .. -2.83) -2.97 (-3.32 .. -2.75) -3.29 (-5.20 .. -2.24) -2.67 (-3.72 .. -2.20) -3.19 (-4.26 .. -2.52) -3.01 (-4.01 .. -2.45) -3.40 (-4.30 .. -2.80) -3.16 (-3.59 .. -2.85) Figure 4: Percentiles and best-fit power-law exponents for five-year periods of the news corpus. Each entry shows the estimate based on the corpus, and the 99% boostrap interval in parentheses, as described in Section 3.6. Results discussed in section 4. 14 method spike spike spike spike spike spike continuity continuity continuity continuity continuity continuity filtering all all top 1000 top 1000 top 0.1% top 0.1% all all top 1000 top 1000 top 0.1% top 0.1% period 2000-4 2005-9 2000-4 2005-9 2000-4 2005-9 2000-4 2005-9 2000-4 2005-9 2000-4 2005-9 50th %ile (days) 7 (7 .. 7) 7 (7 .. 7) 21 (14 .. 21) 14 (14 .. 21) 39 (28 .. 56) 28 (25 .. 35) 7 (7 .. 7) 6 (6 .. 7) 20 (18 .. 21) 21 (20 .. 22) 102 (89 .. 123) 83 (70 .. 93) 90th %ile (days) 35 (28 .. 35) 28 (21 .. 28) 56 (49 .. 63) 49 (42 .. 54) 189 (106 .. 305) 88 (74 .. 102) 22 (20 .. 23) 18 (17 .. 19) 71 (59 .. 83) 59 (53 .. 73) 372 (236 .. 768) 302 (193 .. 617) 99th %ile (days) 123 (84 .. 189) 75 (63 .. 84) 265 (148 .. 479) 109 (91 .. 151) 717 (286 .. 840) 213 (113 .. 1674) 114 (95 .. 160) 80 (66 .. 93) 387 (237 .. 819) 408 (211 .. 1057) 2010 (768 .. 2238) 2083 (954 .. 2991) power law exponent -2.37 (-2.52 .. -2.23) -2.34 (-2.76 .. -2.27) -2.51 (-2.83 .. -2.18) -2.74 (-3.03 .. -2.41) -2.26 (-3.05 .. -1.85) -3.29 (-5.40 .. -2.23) -2.38 (-2.49 .. -2.28) -2.62 (-2.72 .. -2.53) -2.32 (-2.54 .. -2.12) -2.37 (-2.62 .. -2.18) -2.24 (-3.15 .. -1.86) -2.12 (-2.75 .. -1.79) Figure 5: Percentiles and best-fit power-law exponents for five-year periods of the blog corpus. Each entry shows the estimate based on the corpus, and the 99% boostrap interval in parentheses, as described in Section 3.6. Results discussed in Section 5. 0.01 10 100 All names: longest-stretch peaks Cumulative fraction of names 1 blogs 2000-4 blogs 2005_9 news 1965-9 news 2005-9 fit blogs 2005-9 0.1 0.01 10 100 blogs 2000-4 blogs 2005-9 news 1965-9 news 2005-9 fit blogs 2005-9 0.1 0.01 1000 Duration of fame (days) 1000 Duration of fame (days) Yearly top 0.1%: max-rate peaks Cumulative fraction of names 0.1 1 10 100 Duration of fame (days) 1 blogs 2000-4 blogs 2005-9 news 1965-9 news 2005-9 fit blogs 2005-9 0.1 0.01 10 100 1 blogs 2000-4 blogs 2005-9 news 1965-9 news 2005-9 fit blogs 2005-9 0.1 0.01 1000 1000 Duration of fame (days) 10 Yearly top 0.1%: longest-stretch peaks Cumulative fraction of names Yearly top 1000: max-rate peaks Cumulative fraction of names blogs 2000-4 blogs 2005-9 news 1965-9 news 2005-9 fit blogs 2005-9 Yearly top 1000: longest-stretch peaks Cumulative fraction of names All names: max-rate peaks Cumulative fraction of names 1 100 1000 Duration of fame (days) 1 blogs 2000-4 blogs 2005-9 news 1965-9 news 2005-9 fit blogs 2005-9 0.1 0.01 10 100 1000 Duration of fame (days) Figure 12: Cumulative duration-of-fame graphs for the blog corpus. The graphs at the top show the spike method results (for all names, top 1000, and top 0.1%), and those at the bottom show the continuity method results. 15
© Copyright 2026 Paperzz