SkyServer Traffic Report (4/2001 – 6/2006)

<author, place, date,>

Introduction

SkyServer is a popular eScience web site (150M+ web hits) that offers both HTTP and SQL access to a multiterabyte astronomy archive. Over the last five years Ani Thakar (with help from Alex Szalay, Tanu Malik, xxx, and Jim Gray) has been recording the SkyServer [SkyServer] and CasJobs [CasJobs] activity logs. The collector running at JHU harvests the logs every few hours from across the Internet using a web services interface offered by each of the SkyServer and CasJobs servers. These logs are aggregated in a database and are publicly accessible [SkyServerLogs]. A current summary of the total SkyServer activity is published every few hours at http://skyserver.sdss.org/log/en/traffic/

[Chart: Web and SQL Traffic. Series: Web Hits, SQL Queries, Expon. (Web Hits). Y-axis: Hits (log scale, 1,000 to 10,000,000). X-axis: Month (1 to 61).]
Figure 1: Growth in web hits and ad-hoc SQL queries.

Thus far, there has been some informal analysis of the web logs by Jim Gray, and more substantive analyses have been done by Australia, Livermore, SF State [?][?]. In addition, Tanu Malik has done some classification of the structure of the SQL queries [?]. However, there has been little work mining the traffic trends from the logs. In this traffic report we analyze the long-term SkyServer usage patterns. We split our discussion into two sections: Web Hits and SQL Queries. In both sections we compare the amount of traffic coming from bot and mortal users, since we expect their behavior on the site to differ greatly (a detailed discussion of how we identified bots and produced the figures in this report can be found here [?]).

Web Hits

Aggregate Traffic

For the past five years the total number of web hits has nearly doubled by the end of each successive year.
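The doubling claim can be checked with a log-linear least-squares fit. The following is a minimal sketch, not the procedure used to produce the figures in this report: the function name `fit_annual_growth` and the input series are assumptions made for illustration, and the series is synthetic rather than taken from the SkyServer logs. It fits log10(hits) against the month number and converts the slope into an annual growth factor and a doubling time.

```python
import math


def fit_annual_growth(monthly_hits):
    """Fit log10(hits) = a + b * month by ordinary least squares.

    Returns (annual_factor, doubling_months), where annual_factor is the
    multiplicative growth over 12 months implied by the slope b.
    Assumes every monthly count is positive and the fitted slope is non-zero.
    """
    months = list(range(1, len(monthly_hits) + 1))
    logs = [math.log10(h) for h in monthly_hits]
    n = len(logs)
    mean_x = sum(months) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, logs))
    slope = sxy / sxx                      # change in log10(hits) per month
    annual_factor = 10 ** (12 * slope)     # growth factor over one year
    doubling_months = math.log10(2) / slope
    return annual_factor, doubling_months


# Synthetic series (hypothetical, NOT SkyServer data): hits that double
# exactly once every 12 months, over 61 months.
synthetic_hits = [10_000 * 2 ** (m / 12) for m in range(61)]
factor, doubling = fit_annual_growth(synthetic_hits)
print(f"annual factor: {factor:.2f}, doubling time: {doubling:.1f} months")
```

A straight line on a logarithmic hits axis corresponds to a constant slope in this fit, so an annual factor near 2 is what "doubling each year" means numerically.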
One unusual feature, however, is a spike in the May 2006 logs, where total hits exceeded twelve million: the highest monthly count ever reported by SkyServer and over four million more hits than the preceding month. To find out where on the site these hits were directed, we broke down the traffic by data release product (Figure 1.1). Additionally, as described in the Introduction, we show the bot-to-mortal traffic ratio for each data release, highlighted by lighter and darker shades of the same color.

[Chart: Web Traffic. Series: Users and Bots for each of EDR and DR1 through DR6. Y-axis: Hits (0 to 14,000,000). X-axis: Month (1 to 61).]
Figure 2: Web hits by data product (EDR, ..., DR5) and, within a product, by "program" (bot) or human (user) traffic.

Our analysis discovered several key findings:
(1) SkyServer web hits have nearly doubled each year.
(2) DR1 product usage still dominates recent monthly traffic.
(3) Bots contribute a significant percentage of the overall SQL traffic and consume nearly half of the total bytes transferred out of the web server.
(4) The number of SQL queries has been declining in the last eight months.
(5) In the past six months the number of CasJobs queries has had a much more noticeable impact on overall SQL traffic.
(6) Data release traffic follows a common pattern: small early mortal traffic, then a burst of bot traffic soon after, followed by a delay of about two to three months before mortal users take over the majority of traffic. This pattern seems to cycle with each new data release.

Figure 2 shows the web traffic trend per month, which clearly follows an exponential growth curve, as exhibited by the doubling of hits each year. For the first thirty-one months of service, EDR mortal users dominate the overall traffic. It is not until month thirty-three (December 2003) that we notice a considerable spike in traffic as well as a takeover by EDR bots and DR2 mortals. [todo: Say something about the user/bot ratio. Did it change?] What is interesting here is that DR1, in both bot and mortal user traffic, never took over EDR's market share nor matched DR2's traffic boom in any of the previous months, despite its earlier release. Users may have been waiting for the winter break to experiment with the SQL system and then jumped onboard the newer DR2 release, subsuming DR1's potential traffic share. Traffic dips after winter break but spikes again in months thirty-eight to thirty-nine (May and June 2004), the beginning of summer break. DR2 mortal users still dominate the traffic, but there is also a considerable increase in DR2 bot hits, reflecting an overall delay between data releases and the corresponding bot hits.

Surprisingly, from month forty-one (July 2004) on, DR1 traffic begins to dominate. The graph shows DR1 bot traffic rising dramatically first, followed by increased mortal user traffic in the following months. This makes sense, since these bots include search engines crawling and indexing the SkyServer web pages, making the product content more accessible to mortal users. DR1 mortal user hits, even in recent months, continue to dominate the overall web traffic. The later data releases (DR3, DR4, DR5) follow growth patterns similar to the EDR, DR1, and DR2 cases, where later releases eventually take over (with DR1 as the exception). In May 2006, when the highest number of hits was recorded, most mortal and bot hits were concentrated on the DR5 and DR1 product pages, which could have been predicted from the earlier months' traffic and the data release cycle trends.

Figure 1.2 displays the monthly web and SQL traffic hits on a logarithmic (semi-log) plot. The web traffic follows a straight regression line, which on a logarithmic scale indicates exponential growth (the regression function approximates 2.05x growth per year). The web traffic graph differs greatly from the SQL one, which fluctuates with more significant rises and drops. From month thirty-four to month fifty-three (11/2004 to 5/2005) the SQL traffic oscillates in the 110,000 to 1,000,000 hits range before reaching a peak at month fifty-five (6/2005) with 2.5 million queries. As further explained in the SQL analysis section of this paper, we feel the missing April and June 2006 logs, as well as a better delineation among bot, mortal, and CasJobs user traffic trends, may fill in some of the gaps at the end of this graph.

Users

One of the primary goals of our work was to produce data sets that would help us characterize user behavior. But to really analyze behavior, we need to understand who our users are. To do this, we gathered the WHOIS information for each of the IP addresses. We conjectured that bot behavior would not be very interesting to study, so we removed the IPs of known bots from our analysis. Figure 1.2 displays the top twenty-five organizations (minus ISPs) that have accumulated the highest number of web hits over the last five years.
    Organization                                  Hits
 1  Johns Hopkins University                  56830465
 2  NASA                                       3228040
 3  San Francisco State University             2063041
 4  National Research Council Canada           1920785
 5  University of Chicago                      1514275
 6  The Johns Hopkins Medical Institutions     1273951
 7  University of Illinois, CCSO               1263807
 8  Fermilab                                   1022036
 9  University of Pittsburgh                    994129
10  University of Arizona                       831775
11  Harvard-Smithsonian Ctr for Astrophysics    482632
12  Princeton University                        427972
13  California Institute of Technology          423787
14  Carnegie Mellon University                  349665
15  University of Washington                    346140
16  University of Puget Sound                   309799
17  Yale University                             288809
18  Orange County Public Schools                283185
19  Texas Tech University                       276215
20  National Radio Astronomy Observatory        266591
21  University of Michigan                      251237
22  Swarthmore College                          213402
23  City College of San Francisco               189724
24  Los Angeles County Office of Education      186448
25  Cornell University                          151934
Figure 3

The results here should not be too surprising. Johns Hopkins University, one of the main contributors to the site, clearly dominates the chart. NASA comes in a distant second, followed by a list of universities, observatories, and K-12 public school districts. Overall, universities dominate the list, holding seventeen of the top twenty-five spots.

Bandwidth

Another statistic we wished to learn was how bandwidth (the number of bytes transferred to and from the web server) has evolved over time. Unfortunately, the logs only began recording this information in February 2005, which may be the primary reason why the bytes traffic appears to fluctuate (see Figure 1.3). This fluctuation was not expected, since monthly web hits increase from year to year, which should correspondingly boost total bytes out. Total bytes in remains very small relative to bytes out, showing that most users are downloading rather than uploading content to the SkyServer.
A very interesting statistic in Figure 1.3 is how much of the bytes-out bandwidth bots consume. In months four, six, eight, nine, and ten (5/2005, 7/2005, 9/2005, 10/2005, 11/2005), bots consume nearly half of the total web traffic bandwidth. This supports our claim in Aggregate Traffic that bots (most likely search engines) were very active in crawling DR1 content, since DR1 bot traffic also rose dramatically in the same months.

Figure 1.3

Website Tree Rollup

One of the main goals we had when analyzing the weblogs was finding site hotspots, that is, which parts of the webpage tree hierarchy users were visiting the most. This information would help us greatly in understanding our customer needs so that we can improve the website tree layout and overall user experience. The easiest way of doing this is to simply return the user webpage requests with the highest hits. Although this strategy suffices for finding the most visited pages, it does not provide any summary information about each level of the website hierarchy. We want to know the aggregate hits at each level of the tree so we can answer questions like "How many more hits were accumulated in ~/en/dr1/ than ~/sp/dr3/?" or "Which second level hierarchy received the most hits?" This means we needed to recursively sum up the aggregate hits from the bottom of the tree up to the root node, so that each parent directory holds the total number of hits in all of its child pages. To do this, we parsed all the user web requests, splitting each page link by its slashes to retrieve the level names, and then extracted the page name, suffix, query, and fragment from the remaining part of the weblink string using regular expressions. We stored this information in a new table with ten level columns (SQL Server 2005 does not support more than ten columns in GROUP BY ... WITH ROLLUP expressions) to hold the hierarchy names, and included some additional columns to hold page metadata like the query, fragment, and suffix (see Figure ?). We then queried the table to aggregate the hit counts with GROUP BY level_1, ..., level_10 WITH ROLLUP, which adds a subtotal row for each prefix of the level columns (with the deeper levels set to NULL), therefore counting the total hits at each level of the website tree.

[Figure: Site Hierarchy Rollup. Paths shown include \, Registry, registry\RegistryAdmin.asmx, dr1\, dr2\, dr3\, dr4\, edr\, en\, astro\, collab\, dr1\en\, dr2\en\, dr3\en, edr\jp\, en\tools\, dr1\en\tools\, dr2\en\tools\, dr3\en\tools\, dr4\en\tools\, dr1\en\tools\search\, dr2\en\tools\search\, dr3\en\tools\search\, astro\en\tools\search\, dr1\en\tools\chart, dr1\en\tools\search\x_rect.asp, dr3\en\tools\search\x_sql.asp, ImgCutoutDR4, ImgCutoutDR4\getjpeg.aspx.]
Figure xxx…. yyyyy

OR

One of the first questions we had regarding the SkyServer traffic was: which parts of the site are users visiting the most? The least? Specifically, we wished to gather the traffic hits at each level of the website hierarchy tree, where each directory represents a separate successive level and page documents represent leaves in the tree. Each parent node should hold the total hit value for all of its child directories and page leaves. To build this structure, we first teased apart the distinct webpage URLs in the logs by parsing out their directory, webpage, query, and fragment attributes. We placed each of these values in a separate column (in order from root directories to leaf pages), each representing a successive level of the tree. To accumulate the number of hits at each level we simply ran a GROUP BY level_1, level_2, … level_n WITH ROLLUP, which adds a subtotal row for each prefix of the level columns (with the deeper levels set to NULL).

Figure 1.4

Figure 1.4 presents some of the highest-hit directories and pages of the site.
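The roll-up described above can also be sketched outside the database. The following is an illustrative Python analogue, not the SQL that produced the report's figures. It assumes forward-slash URL paths and treats a final path segment containing a dot as a page rather than a directory; the function name `rollup_hits` and the sample request paths are hypothetical. Each request adds one hit to the root and to every ancestor directory, which gives the same per-level subtotals that GROUP BY ... WITH ROLLUP returns.

```python
from collections import Counter


def rollup_hits(paths):
    """Return a Counter mapping each directory prefix and page to its hit count.

    `paths` is an iterable of request paths, one entry per logged hit.
    The query string and fragment are dropped before the path is split.
    """
    totals = Counter()
    for path in paths:
        clean = path.split("?", 1)[0].split("#", 1)[0]
        parts = [p for p in clean.split("/") if p]
        totals["/"] += 1  # the root holds the grand total
        for depth in range(1, len(parts) + 1):
            # Assumption: a last segment with a dot (e.g. x_sql.asp) is a page.
            is_page = depth == len(parts) and "." in parts[-1]
            key = "/" + "/".join(parts[:depth]) + ("" if is_page else "/")
            totals[key] += 1
    return totals


# Hypothetical requests, not taken from the SkyServer logs.
sample = [
    "/dr1/en/tools/search/x_sql.asp?cmd=select",
    "/dr1/en/tools/search/x_rect.asp",
    "/dr1/en/",
    "/dr2/en/tools/",
    "/",
]
for key, hits in sorted(rollup_hits(sample).items()):
    print(f"{hits:3d}  {key}")
```

Unlike the ten-column table, this sketch places no limit on tree depth, at the cost of holding every prefix in memory.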
In Figure 1.4, the DR1 product receives the highest number of hits, followed by DR2, DR3, and DR4 (in order of the data releases, except for EDR). Additionally, most users running search queries go to DR1 and DR2, while visitors viewing images go to the more recent DR4 product.

Page Views

Building an HTML page (a page view) may involve multiple HTTP requests (hits). For example, the SkyServer home-page view requires more than 21 HTTP request-hits. These hits may correspond to individual images, banners, or frames on a webpage. Since each HTTP request is currently recorded as a separate hit in the weblogs, we developed a simple classifier to scour these user commands for the page views. To build this filter, we needed to formally define what a page view means. We decided to use a traditional interpretation, which defines a page view as simply an HTML document download [?]. To find the weblog records with such commands, we first filtered for HTTP GET, HEAD, POST, and PUT requests, omitting OPTIONS, which just returns HTTP information. Second, for an HTML document download to succeed, the HTTP reply must not return an error code, so we filtered the previous results for hits with status codes between 200 and 299. Finally, we searched those results for commands requesting the default directory ('/?') or a '.asp', '.aspx', '.htm', or '.html' page. This left us with ~77 million records, which we are currently taking as the total page view count.

Unfortunately, many SkyServer portals invoke '.aspx' commands several times from the same page, so our filter double-counts certain page views, since we treat each '.aspx' command as a separate view. The fix would be to inspect or parse each source file in the SkyServer web directory to find and subtract out these additional commands, but due to time constraints we felt it wiser to focus our attention on other web analysis tasks.
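The three filtering steps above can be written as a single predicate. This is a minimal sketch under stated assumptions, not the classifier used for the ~77 million figure: the function name `is_page_view`, its argument layout, and the sample requests are hypothetical, and the "default directory" rule is interpreted here as a path ending in a slash.

```python
PAGE_METHODS = {"GET", "HEAD", "POST", "PUT"}       # OPTIONS is excluded
PAGE_SUFFIXES = (".asp", ".aspx", ".htm", ".html")  # HTML-producing pages


def is_page_view(method, status, url):
    """Return True if one logged hit counts as a page view.

    method: HTTP method string, e.g. "GET"
    status: integer HTTP status code of the reply
    url:    requested path, optionally with a query string or fragment
    """
    # Step 1: keep only the methods that can return a document.
    if method.upper() not in PAGE_METHODS:
        return False
    # Step 2: keep only successful replies (status 200 through 299).
    if not 200 <= status <= 299:
        return False
    # Step 3: keep default-directory requests and HTML-style pages.
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return path.endswith("/") or path.endswith(PAGE_SUFFIXES)


# Hypothetical log records, not taken from the SkyServer logs.
records = [
    ("GET", 200, "/dr1/en/tools/search/x_sql.asp?cmd=select"),
    ("GET", 200, "/dr1/en/images/banner.gif"),
    ("GET", 404, "/dr4/en/default.aspx"),
    ("OPTIONS", 200, "/"),
    ("GET", 200, "/dr4/en/"),
]
page_views = sum(is_page_view(*r) for r in records)
print(page_views, "page views out of", len(records), "hits")
```

As noted in the text, a predicate of this kind still counts every '.aspx' request, so pages that issue several such requests are over-counted.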
[Chart: Web Traffic (Bytes). Series: bytes in, bytes out, bytes in (bots), bytes out (bots). Y-axis: Bytes (0 to 3.5E+11). X-axis: Month (2/2005 - 6/2006).]

Sessions [todo]

SQL Queries

Aggregate Traffic

We analyzed the SQL query traffic trend in the same fashion as the web traffic hits. Note that these logs cover the timespan of December 2003 to June 2006; unlike the weblogs, they do not date back to the initial SkyServer release in 2001, since SQL queries were not recorded until the end of 2002. Unlike a page view, which may involve multiple HTTP requests, each SQL or web-service request generates a single response, so each hit in Figure 2.1 corresponds to a new SQL query request.

[Chart: SQL Traffic. Series: bot and non-bot ("- bots") traffic for each of DR1 through DR5. Y-axis: Hits (0 to 3,000,000). X-axis: Months (12/2003 - 3/2006).]
Figure 2.1

For the first year (2003-2004) the majority of SQL queries came from DR1 bot users. Following this period, a significant spike of over one million net queries occurs in months seventeen to eighteen (April and May 2004), with a majority coming from DR1 bot users. Although the corresponding spike in the web traffic graph (see Figure 1.1) does not show a dramatic increase in DR1 bot users, there is a considerable increase in DR2 mortal and EDR bot web hits. This may imply a tradeoff in which DR1 bots that month focused more on issuing SQL queries than on crawling additional webpage content. In the next month the traffic dips slightly, but interestingly DR2 bot and mortal user hits shoot up and split a large share of the traffic, wiping DR1 bot users from the picture.
In month nineteen DR2 bots dominate DR2 mortal users, extending the trend seen in the web traffic graph, in which bots catch up a few months after some mortal user traffic appears. For the following months DR1 mortal user traffic resurges, most likely for the same reasons mentioned in the web traffic analysis (DR1 search engine crawlers making content more accessible).

[Chart: SQL Traffic. Series: bot and non-bot ("- bots") traffic for each of DR1 through DR5. Y-axis: Traffic Share (Percentage), 0% to 100%. X-axis: Month.]
Figure 2.2

[Chart: CasJobs Traffic. Series: no cas, w / cas, all. Y-axis: Hits (log scale, 1,000 to 10,000,000). X-axis: Month.]

DR3 takes over the traffic share from month twenty-five to thirty-one, with a huge spike in months twenty-nine and thirty (April - May 2005) surpassing 2.5 million queries, the highest number of queries ever recorded in the logs for a single month's traffic. Not too surprisingly, most of these queries came from bot users, since DR3 mortal user traffic had taken over in the preceding months, again reinforcing the bot delay theme. However, we did not expect DR3 bot traffic to reach such an extreme peak, especially considering that in the next month it can barely be seen even in the stacked area chart (Figure 2.2). Oddly, in that same month, both mortal and bot user traffic drop significantly to 250,000 queries, over a 90% reduction in hits. Traffic then fluctuates between DR3 and DR1 mortal user traffic until month thirty-six, when DR4 mortal and bot user traffic spike and split the majority of the query share.

There are several important patterns to gather from these figures and analysis:
(1) A small group of mortal users hits a new data release, which causes bot traffic to spike shortly afterward, followed by a long takeover by mortal users.
(2) Excluding the first data release, there appears to be a delay of about two to three months before a data release hits prime time and captures the majority of the SQL query traffic. For example, in month fifteen a small group of early adopters experiments with the new DR2 release, but it is not until month seventeen that DR2 dominates the overall monthly traffic. The same theme applies to DR3 (early: month twenty-four, majority: month twenty-six) and DR4 (early: month thirty-five, majority: month thirty-seven).

Using this pattern, we suspect that DR5 traffic should capture close to a majority of the SQL traffic in months forty-four to forty-five (since Figure 2.2 presents a small area for DR5 mortal traffic users in month forty-two), assuming DR5 bot traffic increases in the first month (forty-four or forty-five), which should push DR5 mortal traffic up the month after (bots include search crawlers, which should make the content more accessible to users). If this pattern does in fact roughly hold true, then we should expect DR5 traffic to boom in the July - August 2006 timeframe.

Future Work

- Segmenting bots (spider, downloader, casjobs, etc.) and users (students, tourists, scientists)
- Page view
- Analyzing sessions / behavior, changed over time, introduction of CasJobs
- SQL query categorization
- Anything else?

Conclusion