SkyServer Traffic Report (4/2001 – 6/2006)
<author, place, date,>
Introduction
SkyServer is a popular eScience web site (150M+ web
hits) that offers both HTTP and SQL access to a multiterabyte astronomy archive. Over the last five years
Ani Thakar (with help from Alex Szalay, Tanu Malik,
xxx, and Jim Gray) has been recording the SkyServer
[SkyServer] and CasJobs [CasJobs] activity logs. The
collector running at JHU harvests the logs every few
hours from across the Internet using a web services
interface offered by each of the SkyServers and
CasJobs servers.
These logs are aggregated in a
database and are publicly accessible [SkyServerLogs].
A current summary of the total SkyServer activity is
published every few hours at
http://skyserver.sdss.org/log/en/traffic/
[Chart: Web and SQL Traffic. y-axis: Hits (log scale, 1,000 to 10,000,000); x-axis: Month (1 to 61); series: Web Hits, SQL Queries, Expon. (Web Hits).]
Figure 1: Growth in web hits and ad-hoc SQL queries.
Thus far, there has been some informal analysis of the
web logs by Jim Gray, and more substantive analyses
have been done by Australia, Livermore, and SF
State [?][?]. In addition, Tanu Malik has done some
classification of the structure of the SQL queries [?].
However, there has been little work mining traffic
trends from the logs. In this traffic report we analyze
the long-term SkyServer usage patterns. We split
our discussion into two sections: Web Hits and SQL
Queries. In both sections we compare the amount of
traffic coming from bots and from mortal users, since we
expect their behavior on the site to differ greatly (a
detailed discussion of how we identified bots and
produced the figures in this report can be found in
[?]).
Web Hits
Aggregate Traffic
For the past five years the total number of web hits has
nearly doubled each year.
One unusual spike stands out in the
May 2006 logs, where the total number of hits exceeded
twelve million, the highest ever reported by SkyServer
and over four million more than the preceding month's
count. To find out where on the site these hits were
directed, we broke down the traffic by data release
product (Figure 1.1). Additionally, as described in the
Introduction, we also show the bot-to-mortal traffic ratio
for each data release, highlighted by lighter and
darker shades of the same color.
[Chart: Web Traffic. y-axis: Hits (0 to 14,000,000); x-axis: Month (1 to 61); series: Users and Bots for each of EDR and DR1 through DR6.]
Figure 2: Web hits by data product (EDR, ..., DR5) and within a product by
“program” (bot) or human (user) traffic.
Our analysis produced several key
findings: (1) SkyServer web hits have
nearly doubled each year. (2) DR1
product usage still dominates recent
monthly traffic. (3) Bots contribute a
significant percentage of the overall
SQL traffic and consume nearly
half of the total bytes transferred out of
the web server. (4) The number of SQL
queries has been declining in the last
eight months. (5) In the past six
months the number of CasJobs queries
has had a much more noticeable impact
on overall SQL traffic. (6) Data release
traffic follows a common pattern: small
early mortal traffic, then a burst of bot
traffic soon after, followed by a delay
of about two to three months before
mortal users take over the majority of
traffic. This pattern seems to repeat with
each new data release.
Figure 2 shows the web traffic trend per month, which
clearly follows an exponential growth curve, as exhibited
by the doubling of hits each year. For the first thirty-one
months of service EDR mortal users dominate the
overall traffic. It is not until month thirty-three
(December 2003) that we notice a considerable spike in
traffic as well as a takeover by EDR bots and DR2
mortals.
Say something about the user/bot ratio. Did it change?
What is interesting here is that DR1, in both bot and
mortal user traffic, never took over EDR's market share
nor matched DR2's traffic boom in any of the previous
months, despite its earlier release. Users may have been
waiting to play around with the SQL system over the
winter break and jumped on board the newer DR2
release, subsuming DR1's potential traffic share.
Traffic dips after winter break but spikes again in
months thirty-eight to thirty-nine (May and June 2004),
the beginning of summer break. DR2 mortal users still
dominate the traffic, but there is also a considerable
increase in DR2 bot hits, reflecting an overall
theme of delay between data releases and the
corresponding bot hits.
Surprisingly, from month forty-one (July 2004) on,
DR1 traffic begins to dominate. Looking at the graph,
one can see that DR1 bot traffic first rises dramatically
and is followed by increased mortal user traffic
in the following months. This makes sense,
since these bots include search engines crawling and
indexing the SkyServer web pages, making the product
content more accessible to mortal users. DR1 mortal
user hits, even in recent months, continue to dominate
the overall web traffic. The later data releases (DR3,
DR4, DR5) follow growth patterns similar to the
EDR, DR1, and DR2 cases, where later releases eventually
take over (with DR1 as the exception). In May
2006, when the highest number of hits was recorded,
most mortal and bot user hits were
concentrated on the DR5 and DR1 product pages,
which could be predicted from the
earlier months' traffic and data release cycle trends.
The total number of web hits has nearly doubled at the
end of each year. Figure 1.2 displays the monthly web
and SQL traffic hits on a logarithmic plot. The web traffic
follows a linear trend, which on a logarithmic scale
corresponds to exponential growth (the regression function
approximates 2.05x growth per year). The web
traffic graph differs greatly from the SQL one, which
fluctuates with more significant rises and drops.
From month thirty-four to month fifty-three (11/2004
to 5/2005) the SQL traffic oscillates quite a bit in
the 110,000 to 1,000,000 hits range before reaching a
peak at month fifty-five (6/2005) with 2.5 million
queries. As further explained in the SQL analysis
section of this paper, we feel the missing April and
June 2006 logs, as well as a better delineation among
bot, mortal, and CasJobs user traffic trends, may fill in
some of the gaps at the end of this graph.
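As a rough illustration of how a straight-line fit on a logarithmic scale translates into a yearly growth factor, the sketch below fits log10(hits) against the month index by ordinary least squares and converts the slope into a per-year multiplier and a doubling time. The function name `annual_growth_factor` and the synthetic series are illustrative assumptions; they are not the report's data or the tool used to produce its regression.

```python
import math

def annual_growth_factor(monthly_hits):
    """Least-squares fit of log10(hits) vs. month index; returns growth per 12 months."""
    n = len(monthly_hits)
    xs = list(range(1, n + 1))
    ys = [math.log10(h) for h in monthly_hits]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # change in log10(hits) per month
    return 10 ** (12 * slope)

# Synthetic series (NOT SkyServer data): exact 2.05x yearly growth over 62 months.
monthly_factor = 2.05 ** (1 / 12)
synthetic_hits = [10_000 * monthly_factor ** m for m in range(62)]

factor = annual_growth_factor(synthetic_hits)
doubling_months = 12 * math.log(2) / math.log(factor)
print(f"growth per year: {factor:.2f}x")
print(f"doubling time: {doubling_months:.1f} months")
```

A yearly factor slightly above 2 corresponds to a doubling time slightly under twelve months, which is consistent with the "nearly doubled each year" observation.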
Users
One of the primary goals of our work was to produce data
sets that would help us characterize user behavior. But in
order to really analyze behavior, we need to understand
who our users are. To do this, we gathered the WHOIS
information for each of the IP addresses. We conjectured that bot
behavior would not be very interesting to study, so we
removed the IPs of known bots from our analysis. Figure 1.2
displays the top twenty-five organizations (excluding ISPs)
that have accumulated the highest number of web hits
over the last five years.
Rank  Organization                              Hits
1     Johns Hopkins University                  56830465
2     NASA                                      3228040
3     San Francisco State University            2063041
4     National Research Council Canada          1920785
5     University of Chicago                     1514275
6     The Johns Hopkins Medical Institutions    1273951
7     University of Illinois, CCSO              1263807
8     Fermilab                                  1022036
9     University of Pittsburgh                  994129
10    University of Arizona                     831775
11    Harvard-Smithsonian Ctr for Astrophysics  482632
12    Princeton University                      427972
13    California Institute of Technology        423787
14    Carnegie Mellon University                349665
15    University of Washington                  346140
16    University of Puget Sound                 309799
17    Yale University                           288809
18    Orange County Public Schools              283185
19    Texas Tech University                     276215
20    National Radio Astronomy Observatory      266591
21    University of Michigan                    251237
22    Swarthmore College                        213402
23    City College of San Francisco             189724
24    Los Angeles County Office of Education    186448
25    Cornell University                        151934
Figure 3
The results here should not be too surprising. Johns
Hopkins University, one of the main contributors to the
site, clearly dominates the chart. NASA comes in a distant
second, followed by a list of universities,
observatories, and K-12 public school districts. Overall,
universities dominate the list, holding seventeen of the top
twenty-five organization spots.
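The ranking step described in this section can be sketched as follows, assuming the WHOIS lookups have already been reduced to an IP-to-organization mapping. The function `top_organizations`, its parameters, and all of the sample addresses and names are hypothetical and only illustrate the filtering (drop known bots, drop ISPs) followed by a sum per organization.

```python
from collections import Counter

def top_organizations(hits_by_ip, org_by_ip, bot_ips, isp_orgs, n=25):
    """Rank organizations by total web hits, skipping known bots and ISPs."""
    totals = Counter()
    for ip, hits in hits_by_ip.items():
        if ip in bot_ips:
            continue  # known bots are excluded from the user analysis
        org = org_by_ip.get(ip)
        if org is None or org in isp_orgs:
            continue  # unresolved addresses and ISPs are left out
        totals[org] += hits
    return totals.most_common(n)

# Hypothetical sample data (documentation-reserved IP ranges, made-up names).
hits_by_ip = {
    "192.0.2.1": 120,
    "192.0.2.2": 30,
    "198.51.100.7": 500,
    "203.0.113.9": 80,
    "203.0.113.10": 45,
}
org_by_ip = {
    "192.0.2.1": "Example University",
    "192.0.2.2": "Example University",
    "198.51.100.7": "Example Crawler Inc",
    "203.0.113.9": "Example ISP",
    "203.0.113.10": "Example Observatory",
}
bot_ips = {"198.51.100.7"}
isp_orgs = {"Example ISP"}

ranking = top_organizations(hits_by_ip, org_by_ip, bot_ips, isp_orgs)
print(ranking)
```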
Bandwidth
Another statistic we wished to learn was how
bandwidth (the number of bytes transferred to and from
the web server) has evolved over time. Unfortunately, the logs
only began recording this information in February
2005, which may be the primary reason why the byte
traffic appears to fluctuate (see Figure 1.3). This
was not expected, since the number of web hits grows
from year to year, which should correspondingly boost
total bytes out. One can see that total bytes in remains
very small relative to bytes out, showing that most
users are primarily downloading content from, rather
than uploading content to, the SkyServer. A very interesting statistic
that Figure 1.3 presents is how much of the outbound
bandwidth bots consume. In months four, six, eight, nine, and ten
(5/2005, 7/2005, 9/2005, 10/2005, 11/2005) bots
consume nearly half of the total web traffic bandwidth.
This data supports our claim in Aggregate
Traffic that bots (most likely search engines) were very
active in crawling DR1 content, since DR1 bot
traffic also rose dramatically in the same
time frame.
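A minimal sketch of the bot-share calculation behind the "nearly half" observation is shown below. It assumes that the "bytes out" series is the total for all clients and that "bytes out bots" is the subset attributed to bots; the month keys, byte counts, and the 0.45 cut-off for "nearly half" are made-up values for illustration only.

```python
def bot_share_of_bytes_out(bytes_out_total, bytes_out_bots):
    """Fraction of outbound bytes attributed to bots for each month."""
    shares = {}
    for month, total in bytes_out_total.items():
        bots = bytes_out_bots.get(month, 0)
        shares[month] = bots / total if total else 0.0
    return shares

# Hypothetical monthly totals in bytes (not taken from the SkyServer logs).
bytes_out_total = {"2005-05": 200_000_000_000, "2005-06": 150_000_000_000}
bytes_out_bots = {"2005-05": 96_000_000_000, "2005-06": 30_000_000_000}

shares = bot_share_of_bytes_out(bytes_out_total, bytes_out_bots)
# Months in which bots account for roughly half of the outbound bandwidth.
heavy_months = [m for m, s in sorted(shares.items()) if s >= 0.45]
print(shares)
print(heavy_months)
```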
Figure 1.3
Website Tree Rollup
One of the main goals we had when analyzing the
weblogs was finding site hotspots, that is, which parts of the
webpage tree hierarchy users were visiting the most.
This information would help us greatly in
understanding our customers' needs so that we can
improve the website tree layout and overall user
experience.
[Chart: Site Hierarchy Rollup. Labels (hit counts not recoverable): \, Registry, registry\RegistryAdmin.asmx, dr1\, dr1\en\, dr2\, dr3\, en\, dr4\, dr2\en\, dr3\en, astro\, dr2\en\tools\, edr\, dr3\en\tools\, en\tools\, dr4\en\tools\, dr1\en\tools\search\, dr2\en\tools\search\, ImgCutoutDR4, ImgCutoutDR4\getjpeg.aspx, edr\jp\, dr3\en\tools\search\, astro\en\tools\search\, dr1\en\tools\search\x_rect.asp, dr3\en\tools\search\x_sql.asp, dr1\en\tools\chart, collab\.]
Figure xxx…. yyyyy
The easiest way of doing this is to simply return the user
webpage requests with the highest hits. Although
this strategy suffices for finding the most visited
web pages, it does not provide us with any
summary information about each level of the
website hierarchy. We want to know the
aggregate hits at each level of the tree so we
can answer questions like “How many more
hits were accumulated in ~/en/dr1/ than in ~/sp/dr3/?”
or “Which second-level hierarchy received the
most hits?” This means we needed to
recursively sum the aggregate hits from the
bottom of the tree up to the root node so that each
parent directory holds the total number of hits in
all of its child pages. To do this, we parsed
all the user web requests, splitting each
page link by its slashes to retrieve the level
names, and then extracted the page name, suffix,
query, and fragment from the remaining part of
the weblink string using regular expressions. We
stored this information in a new table with ten
level columns (SQL Server 2005 does not
support more than ten columns in GROUP BY ...
WITH ROLLUP expressions) to hold the
hierarchy names, and included some additional
columns to hold page metadata like the query, fragment,
and suffix (see Figure ?). We then queried the table to
aggregate the hit counts for each ten-level grouping
(GROUP BY level_1 ... level_10 WITH ROLLUP) to
produce all the combinations of NULL values in the
level columns, thereby counting the total hits at each
level of the website tree.
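The parsing step can be sketched roughly as below: a regular expression separates the path from the query and fragment, the path is split on slashes into level names, and the result is padded to ten level columns. The names `URL_RE` and `parse_request`, the dictionary layout, and the example query string are assumptions made for illustration; the actual table schema and expressions used for the report are not given in the text.

```python
import re

MAX_LEVELS = 10  # the text stores ten level columns

# Separate the path from an optional "?query" and "#fragment".
URL_RE = re.compile(
    r"^(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?$"
)

def parse_request(url):
    """Split a site-relative URL into directory levels and page metadata."""
    match = URL_RE.match(url)
    segments = [s for s in match.group("path").split("/") if s]
    page = suffix = None
    # Treat a final segment containing a dot as the page document.
    if segments and "." in segments[-1]:
        page, _, suffix = segments.pop().rpartition(".")
    # Pad with None (NULL) or truncate so every row has exactly ten levels.
    levels = (segments + [None] * MAX_LEVELS)[:MAX_LEVELS]
    return {
        "levels": levels,
        "page": page,
        "suffix": suffix,
        "query": match.group("query"),
        "fragment": match.group("fragment"),
    }

# Hypothetical request; the query string and fragment are made up.
row = parse_request("/dr1/en/tools/search/x_sql.asp?cmd=select#top")
print(row)
```

Directories deeper than ten levels are truncated in this sketch, which mirrors the ten-column limit mentioned above.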
OR
One of the first questions we had regarding the
SkyServer traffic was which parts of the site users
visit the most, and which the least. Specifically, we wished to
gather the traffic hits at each level of the website
hierarchy tree, where each directory represents a
separate successive level and page documents represent
leaves in the tree. Each parent node should hold the total hit
count for all of its child directories and page leaves.
To build this structure, we first teased apart the distinct
webpage URLs in the logs by parsing out their
directory, webpage, query, and fragment attributes. We
placed each of these values in a separate column (in
order from root directories to leaf pages), each
representing a successive level of the tree. To
accumulate the number of hits at each level we simply
ran a GROUP BY level_1, level_2, … level_n WITH
ROLLUP to count every level combination of NULL
values.
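To make the effect of the rollup concrete, the following sketch computes the same kind of per-level totals in plain Python by adding each page's hits to every prefix of its path, from the root down to the page itself. It is an illustration of what the rollup produces, not the SQL Server query used for the report; the function name `rollup_hits` and the hit counts are hypothetical.

```python
from collections import Counter

def rollup_hits(hits_by_path):
    """Sum hits at every directory prefix, from the root down to each leaf."""
    totals = Counter()
    for path, hits in hits_by_path.items():
        levels = [p for p in path.split("/") if p]
        # depth 0 is the root "/", depth len(levels) is the page itself
        for depth in range(len(levels) + 1):
            totals["/" + "/".join(levels[:depth])] += hits
    return totals

# Hypothetical per-page hit counts.
hits_by_path = {
    "/dr1/en/tools/search/x_rect.asp": 40,
    "/dr1/en/tools/chart/chart.asp": 10,
    "/dr3/en/tools/search/x_sql.asp": 25,
}

totals = rollup_hits(hits_by_path)
for prefix in ("/", "/dr1", "/dr1/en/tools", "/dr1/en/tools/search", "/dr3"):
    print(prefix, totals[prefix])
```

With these totals, a question such as "which second-level hierarchy received the most hits?" reduces to comparing the entries whose prefix has two components.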
Figure 1.4
Figure 1.4 presents some of the most frequently hit
directories and pages of the site. Here we notice that
the DR1 product receives the highest number of hits,
followed by DR2, DR3, and DR4 (in order of the data
releases, except for EDR). Additionally, most users
running search queries go to DR1 and DR2, while
visitors viewing images go to the more recent DR4
product.
Page Views
Building an HTML page (a page view) may
involve multiple HTTP requests (hits). For example, the
SkyServer home-page view requires more than 21 HTTP
requests. These hits may correspond to individual
images, banners, or frames on a webpage. Since each
HTTP request is currently recorded as a separate hit in the
weblogs, we developed a simple classifier to scour these
user commands for the page views.
In order to build this filter, we needed to formally define
what exactly a page view means. We decided to use a
traditional interpretation, which defines a page view as
simply an HTML document download [?]. To find the
weblog records with such commands, we first filtered
for HTTP GET, HEAD, POST, and PUT requests,
omitting OPTIONS, which just returns HTTP information.
Second, in order for one to download an HTML
document successfully, the HTTP reply message should
not return an error code, so we filtered the previous results
for hits with status codes between 200 and 299. Finally,
we searched the results of the last step for commands
requesting the default directory (‘/?’) or an ‘.asp’,
‘.aspx’, ‘.htm’, or ‘.html’ page. The final result left us with
~77 million records, which we are currently taking as the
total page view count. Unfortunately, there are many
SkyServer portals which invoke ‘.aspx’ commands many
times from the same page, meaning our filter double-counts
certain page views since we treat each ‘.aspx’
command as a separate view. The fix here would be to
inspect or parse each source file in the SkyServer
web directory to find and subtract out these additional
commands, but due to time constraints we felt it would be
wiser to focus our attention on other web analysis
tasks.
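The three filtering steps can be sketched as a single predicate over a log record. This is only an approximation under stated assumptions: the record layout (method, status, path), the function name `is_page_view`, and the reading of the "default directory" test as "path ends with a slash" are all illustrative choices, and the sample records are invented.

```python
PAGE_METHODS = {"GET", "HEAD", "POST", "PUT"}  # OPTIONS is deliberately excluded
PAGE_SUFFIXES = (".asp", ".aspx", ".htm", ".html")

def is_page_view(method, status, path):
    """Apply the three filters: request method, 2xx status, page-like target."""
    if method.upper() not in PAGE_METHODS:
        return False
    if not 200 <= status <= 299:
        return False
    # Ignore any query string or fragment when testing the target.
    target = path.split("?", 1)[0].split("#", 1)[0].lower()
    # Assumption: a request for a default directory ends with "/".
    return target.endswith("/") or target.endswith(PAGE_SUFFIXES)

# Invented log records: (method, status, path).
records = [
    ("GET", 200, "/dr1/en/tools/search/x_sql.asp?cmd=select"),
    ("GET", 200, "/dr1/en/images/banner.gif"),
    ("GET", 404, "/dr1/en/missing.htm"),
    ("OPTIONS", 200, "/dr1/en/"),
    ("POST", 200, "/ImgCutoutDR4/getjpeg.aspx"),
    ("GET", 200, "/en/"),
]

page_views = sum(is_page_view(*record) for record in records)
print(page_views)
```

As the text notes, a filter of this kind still counts every ‘.aspx’ request separately, so pages that issue several such requests are over-counted.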
Sessions
[todo]
SQL Queries
Unlike an HTML page view, which may involve multiple
HTTP requests, each SQL or web-service request
generates a single response.
Aggregate Traffic
[Chart: Web Traffic (Bytes). y-axis: Bytes (0 to 3.5E+11); x-axis: Month (2/2005 - 6/2006), months 1 to 17; series: bytes in, bytes in bots, bytes out, bytes out bots.]
We analyzed the SQL query traffic trend in
the same fashion as the web traffic hits. One
should note that these logs cover the time span of
December 2003 to June 2006, which
does not date back to the initial SkyServer
release in 2001 (as the weblogs do),
since SQL queries were not recorded until
the end of 2002. Additionally, each hit in Figure
2.1 corresponds to a new SQL query request.
[Chart: SQL Traffic. y-axis: Hits (0 to 3,000,000); x-axis: Months (12/2003 - 3/2006), ticks 1 to 43; series: "bots" and "- bots" for each of dr1 through dr5.]
Figure 2.1
For the first year (2003-2004) the majority of SQL
queries came from DR1 bot users. Following this
period a significant spike of over one million total
queries occurs in months seventeen
to eighteen (April and May 2004), with a majority
coming from DR1 bot
users. Although the corresponding spike in the web
traffic graph (see Figure 1.1) does not show a dramatic
increase in DR1 bot users, there is a considerable
increase in DR2 mortal and EDR bot user web hits.
This may imply a tradeoff where DR1 bots that month
focused more on issuing SQL queries than on
crawling additional webpage content. In the next
month the traffic dips slightly, but interestingly DR2
bot and mortal user hits shoot up and split a large
share of the traffic, wiping DR1 bot users from the
picture. In month nineteen DR2 bots overtake DR2
mortal users, extending the trend in the web traffic
graph where, after some mortal user traffic appears, bots
catch up a few months later. In the following months
DR1 mortal user traffic revives, most likely for the
same reasons mentioned in the web traffic analysis (DR1 search
engine crawlers making content more accessible). DR3
takes over the traffic share from month twenty-five to
thirty-one, with a huge spike in months twenty-nine
and thirty (April – May 2005) surpassing 2.5 million
queries, the highest number of queries ever recorded
in the logs for a single month. Not too
surprisingly, most of these queries came from bot users,
since DR3 mortal user traffic had taken over in the
months before, again reinforcing the bot delay theme. However,
we did not expect DR3 bot user traffic to reach such an
extreme peak, especially considering that in the next month
it can barely be seen even in the stacked area chart
(Figure 2.2). Oddly, in that same month, both mortal
and bot user traffic drop significantly to 250,000
queries, a reduction of over 90% in hits. Traffic then
fluctuates between DR3 and DR1 mortal user traffic
until month thirty-six, when DR4 mortal and bot user
traffic spike and split the majority of the query share.
[Chart: SQL Traffic, Traffic Share (Percentage). y-axis: 0% to 100%; x-axis: Month; series: "bots" and "- bots" for each of dr1 through dr5.]
Figure 2.2
[Chart: CasJobs Traffic. y-axis: log scale, 1,000 to 10,000,000; x-axis: Month; series: no cas, w / cas, all.]
There are several important patterns to gather from
these figures and analysis:
(1) A small group of mortal users hits a new data
release, which causes bot traffic to spike shortly afterward,
followed by a long takeover by mortal users.
(2) Excluding the first data release, there
appears to be a delay of about two to three months
before a data release hits prime time and captures the
majority of the SQL query traffic. For example, in
month fifteen a small group of early adopters
experiments with the new DR2 release, but it is not until
month seventeen that DR2 dominates the overall
monthly traffic. The same theme applies to DR3 (early:
month twenty-four, majority: month twenty-six) and
DR4 (early: month thirty-five, majority: month
thirty-seven). Using this pattern we can expect that DR5
traffic should capture close to a majority of the SQL
traffic in months forty-four to forty-five (since Figure
2.2 presents a small area for DR5 mortal traffic in
month forty-two), assuming DR5 bot traffic increases
in the first month (forty-four or forty-five), which
should push DR5 mortal traffic up a month later (bots
include search crawlers, which should make the content
more accessible to users). If this pattern does in fact
roughly hold true, then we should expect DR5 traffic to
boom in the July - August 2006 time frame.
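One way to measure the release-adoption delay discussed in this section is sketched below: for a given release, find the first month with any queries (the "early" month) and the first month in which that release has the largest share of queries. Treating "majority" as "largest single share" is an assumption of this sketch, as are the function name `adoption_months` and the invented monthly counts.

```python
def adoption_months(monthly_queries, release):
    """Return (early_month, majority_month) for a release, counting months from 1."""
    early = majority = None
    for month, counts in enumerate(monthly_queries, start=1):
        count = counts.get(release, 0)
        if early is None and count > 0:
            early = month  # first month with any traffic for this release
        if majority is None and count > 0 and count == max(counts.values()):
            majority = month  # first month in which the release leads
    return early, majority

# Invented query counts per release for four consecutive months.
monthly_queries = [
    {"dr1": 900},
    {"dr1": 800, "dr2": 50},
    {"dr1": 700, "dr2": 300},
    {"dr1": 400, "dr2": 600},
]

early, majority = adoption_months(monthly_queries, "dr2")
delay = majority - early
print(early, majority, delay)
```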
Future Work
Segmenting bots (spider, downloader, casjobs, etc.)
and users (students, tourists, scientists)
Page view
Analyzing sessions / behavior, changed over time,
introduction of CasJobs
SQL query categorization
Anything else?
Conclusion