Abstract
Click stream data represents a rich source of information for understanding web
site activity, from browsing patterns to purchasing decisions. Standard tools produce
hundreds of reports that are not particularly useful. The problem with fixed web
reports, and report-based analysis in general, is that reports only answer specific, predefined questions that are insufficient for today’s highly competitive
and rapidly changing businesses. To overcome this problem, we have developed a click stream analysis tool called eBizInsights. eBizInsights consists of a
web log parser, a click stream warehouse, a reporting engine, and a rich visual
interactive workspace. Using the workspace, analysts perform ad hoc analysis,
discover patterns, and identify correlations that are impossible to find using fixed
reports.
1. Introduction
One of the most exciting trends of our time is the explosive growth of the
Internet, web sites, and e-commerce. This trend has manifested itself with the
creation of millions of web sites, the widespread emergence of e-commerce
sites, the birth of a huge number of “dot coms”, and the merging of brick and
mortar and cyberspace business models.
As web technology has matured, web sites have progressed from static “information distribution” sites to hugely complex portals providing rich, highly
customized services. For many companies e-commerce web sites have become
a significant, if not the primary, distribution channel for a company’s products
and services. With this transition from a traditional brick and mortar to a new
self-service economy, web sites are becoming a focal point for customer interactions. Customer relationships must be formed, personalized, and managed
through this new medium.
Although the technology, channel, and medium have changed, the underlying business problems have not. “How do I identify new prospects? Create
offers that excite and satisfy them? Retain existing customers and sell them
more? Understand what is selling? Maximize return on advertising expenditures? Manage profit by product and product line?” What is new is that these
business functions are now being conducted via the Internet instead of using
traditional means. Translated into e-business the key questions become: “How
do we drive profitable traffic to our site? Which banner ads were most successful? How do we target customers in Germany? How do I sell more products
during evening hours when my site has spare capacity? What is the traffic flow
through my site? Which sections of my site are customers bypassing? How
long do customers linger? How do I increase site stickiness?”
Perhaps the most significant difference between the old brick and mortar
and the new e-business is richer instrumentation. In a brick and mortar department store, Sears for example, we might know that a customer purchased a tie
in the Men’s Department and, by matching credit card transactions, perhaps also
purchased a Craftsman portable drill in the tools department. In the equivalent
e-business scenario, we would know where the customer came from, when he
entered our store, what he searched for, which other departments he visited,
what other tools he put in his shopping basket, and if he abandoned the purchase because the checkout was too complicated.
The e-business challenge is then how best to exploit this exceedingly rich
new information source. The first problem, capturing site activity, is conceptually simple but practically difficult. There are various strategies for instrumenting sites, but the most common source of site activity data is web server log
files. These contain a record of every page request, but processing them
quickly becomes non-trivial. Transaction volumes may be huge, and sites may
be geographically dispersed for reliability, often include multiple components
such as ad, content, and personalization engines, and are routinely distributed
across server farms. The next problem, performing effective analyses,
involves collecting and coalescing this dispersed information set into a coherent schema, correlating site activity with related attributes, and linking it with
enterprise information.
1.1 First Generation: Log Reporting
First generation web site reporting tools focus on data collection and reporting. These tools typically include a scalable parsing engine for web server
log files, or perhaps a sniffer that captures site activity by monitoring TCP/IP
packets. They aggregate site statistics, either raw clicks or assembled sessions,
and store them in a fast database. The engineering challenge that these tools
address is how to collect, process, store, and report on high-volume web site
activity. When sites were first deployed, the servers, platforms, and web infrastructure were much less standardized than today. It was a huge problem just to
cope with all the different web log formats and assemble the information into
a coherent schema.
The data collection problem, although still complex, has been somewhat
mitigated as standards bodies have formalized web log formats and
packaged solutions have become available. The next generation of web server
suites will further solve the back end problem by automatically collecting web
statistics and depositing them into an associated database.
The “reporting engine” for first generation tools would produce literally
hundreds of fixed reports showing site activity by time of day, URL, browser,
page errors, etc. Initially, customers were thrilled: the reports answered
simple questions like “What’s happening on my site?” “How many hits did I
have yesterday?” Soon, however, problems emerged. It is exceedingly difficult
to make delicate business decisions using just raw reports. Fixed reports:
1 Capture one point in time and are quickly outdated;
2 Miss important trends and outliers that are summarized away in the data;
3 Are not effective for multi-dimensional problems;
4 Require significant people and IT resources to be customized;
5 Provide no ad hoc interactivity;
6 Lack drill-down to access details.
This problem with reports is well known within the data mining community
and is called assumptive-based analysis. The reports assume what is reported
upon and deliver fixed answers to pre-specified questions. Inevitably, the questions are not quite right and the particular case of interest is slightly but significantly different from what is reported upon. This leads to a “report explosion”
where the vendors provide hundreds of reports hoping to answer all important
questions using a “shotgun” approach. What goes wrong with this approach is
that it becomes increasingly difficult to find the report that answers any particular question.
A final problem with fixed reports is that it is impossible to correlate different variables. A report-based correlation analysis involves printing several
reports and manually trying to infer relationships. The net result is that the
first generation tools were exciting initially, but insufficient for all but the most
simple sites.
1.2 Second Generation: Visual Discovery
Analysis problems are dynamic, interactive, and iterative. Addressing them
involves correlating trends, making comparisons, and performing richer analyses. These tasks are impossible with fixed reports.
To bridge this gap we have built a shrink-wrapped e-business performance
analysis system that includes:
1 Visual workspace and analysis engine that supports ad hoc discovery,
correlation, and analysis.
2 Workflow to guide a user through the analysis process.
3 Parallelized W3C web log parser for IIS, Netscape, and Apache web
servers.
4 Session and hit-based click stream schema and data store for web site
activity (Kimball and Merz, 2000).
5 Static HTML reports that can be scheduled, browsed, and emailed to
decision makers.
6 Result sets that can be exported based upon analysis insights and acted
upon to convert analysis insights into business value.
The remainder of this paper describes our approach to e-business performance analysis in more detail and its embodiment in a software tool called
eBizInsights.
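As a rough illustration of the log parsing component listed above, the following Python sketch reads entries in the W3C Extended Log File Format, in which a #Fields directive names the columns of the entries that follow. It is a single-threaded sketch and not the eBizInsights parser; the function name parse_w3c_log and the sample entries are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per hit from W3C Extended Log File Format lines."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives start with '#'; "#Fields:" names the columns that follow.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries rather than guess at their layout
        yield dict(zip(fields, values))


# Illustrative sample only; these entries are invented, not real site data.
SAMPLE = """\
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes
2000-03-01 18:02:11 10.0.0.7 GET /index.html 200 5120
2000-03-01 18:02:40 10.0.0.7 GET /cgi-bin/showBasket.html 200 2048
2000-03-01 18:03:05 10.0.0.9 GET /missing.html 404 312
"""

hits = list(parse_w3c_log(SAMPLE.splitlines()))
print(len(hits), hits[0]["cs-uri-stem"])
```

Because each log file carries its own #Fields directive, a parser of this shape can process files independently, which is one plausible way to parallelize the work across a server farm's logs.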
2. eBizInsights Architecture
The eBizInsights architecture, as shown in Figure 1, consists of four significant components: an ISAPI filter that attaches to the web server for sessionization, a server, a data warehouse, and analysis clients.
The ISAPI filter, described below, generates session numbers for HTTP requests that allow the sequence of pages followed by a given client to be tracked.
The server parses IIS log files, shown on the left, and deposits the results in
MS SQL Server 7.0 tables. After parsing is complete, the server-reporting
module creates dozens of static HTML reports that are distributed and published throughout the enterprise. Clients running our visual workspace connect directly to the SQL Server 7.0 database for click stream data access. The
SQL Server 7.0 warehouse may be hosted on the machine running the eBizInsights
server or on its own machine for increased scalability.
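To make the role of the session numbers concrete, the sketch below groups hits by session number and orders them by time to recover the sequence of pages each client followed. It assumes every hit already carries a session number, as the ISAPI filter is described as providing; the tuple layout, the function name assemble_sessions, and the sample hits are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (session number, ISO-8601 timestamp, requested URL): a hypothetical layout.
Hit = Tuple[str, str, str]


def assemble_sessions(hits: List[Hit]) -> Dict[str, List[str]]:
    """Group hits by session number and return each session's page sequence."""
    by_session: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for session_id, timestamp, url in hits:
        by_session[session_id].append((timestamp, url))
    # ISO-8601 timestamps in one format sort chronologically as plain strings.
    return {sid: [url for _, url in sorted(pages)]
            for sid, pages in by_session.items()}


# Invented sample hits, deliberately out of order.
HITS = [
    ("s2", "2000-03-01T18:05:00", "/index.html"),
    ("s1", "2000-03-01T18:02:40", "/cgi-bin/showBasket.html"),
    ("s1", "2000-03-01T18:02:11", "/index.html"),
]

paths = assemble_sessions(HITS)
print(paths["s1"])
```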
Figure 1. eBizInsights architecture consists of an ISAPI filter, a server, a SQL Server 7.0 click
stream data store, and analysis clients.
3. eBizInsights Workspace
To overcome the inherent problems with report-based analysis we have developed an interactive workspace for visual discovery, analysis, and correlation (see Figure 2). There are three significant parts to the workspace: a tree-structured workflow control along the left, interactive bar charts down the middle, and an analysis pane with three tabs labelled Overview, Details, and
Paths. Other system components include a color legend, navigation controls,
and selection control.
3.1 Tree-Structured Workflow Control
Clicking on any entry in the workflow control automatically populates the
bar charts and sets the tabbed pane to the most useful data and display for
the particular analysis. In its initial configuration there are seven high-level
categories in the workflow:
1 Top 20s provides a quick view of the top page, path, promotion, visitor, and visit statistics.
2 General Activity reports on site activity by time period: hours,
days, weeks, months, or years.
3 Hosted Ads shows click-through rates and other advertising effectiveness attributes for sites displaying banner ads.
4 Promotions shows which sites, URLs, search engines, etc., are most
effective at driving traffic to the site.
5 Site Effectiveness investigates site-oriented statistics such as
errors, paths through the site, entry and exit pages, page stickiness, etc.
6 Visitors organizes e-commerce site visitors into Browsers, Abandoners, and Buyers through a configuration option. It then supports visitor
analysis via organization, country, type, etc.
7 My Insights is a special tab for users to save custom analyses and,
in the future, an integration point for analyses based on additional data
sources.
Figure 2. eBizInsights workspace. Some of the entries in the tree workflow control are
expanded to show sub-items.
3.2 Focus, Correlation, and Time Bar Charts
The workspace contains three bar charts named Focus, Correlation,
and Time. All function similarly. Users set the statistic in the bar chart either
by clicking on an entry in the tree or by manipulating the bar chart selection
control at the bottom. The bar charts both show results and serve as an input
environment for user selections. See Section 4.3. Expanding the bar charts
shows that they are richly parameterized and can be oriented, zoomed, panned,
labeled, and sorted. See Figure 3.
Figure 3. Bar charts may be expanded, zoomed, panned, labeled, oriented, and sorted.
3.3 Overview, Details, and Paths Tabs
The Overview tab, as illustrated in Figure 2, shows correlations between
the focus and time bar charts using a 3D Multiscape (Eick, 2000). A Multiscape, our implementation of a landscape, provides a broad overview of the
information and is particularly powerful for showing time-oriented data. The
Multiscape toolbar, shown in Figure 5, provides navigation controls for zooming, rotating, and symbol choice, with several fixed but useful viewpoints.
Clicking on any viewpoint causes the control to animate smoothly to the new
position, thereby helping the user maintain context.
Figure 4. Details view is a color-coded, scrollable text window that smoothly transitions
between text and graphics to increase view scalability. Upper left: full scale; lower right: completely smashed, where each thin bar represents a text field.
The Details tab, as shown in Figure 4, is an interactive viewer for displaying text called Data Sheet (Eick, 2000). At full scale Data Sheet displays
information using a standard color-coded text font. Each column in the display is sortable, selectable, and searchable. To increase scalability, as the user
zooms out Data Sheet uses progressively smaller fonts and eventually switches
to thin bars. This process, called smashing, is illustrated in Figure 4.
Figure 5. Top: Multiscape navigation toolbar provides a rich interface for manipulating the
display. Bottom: Selection and exclusion toolbar.
Path analysis is a technique for understanding how customers navigate through
a site, identifying trouble spots, and optimizing site layout. Our path view,
shown in Figure 7, helps support this by showing flow among pages. There
are columns for Date, Promotion, Referrer, Entry URL, Exit URL, and Visitor Type. Within each column filled circles are sized to encode the number of
visitors with each attribute. The circles in the Date column are all about the
same size, indicating that traffic was steady for the period. Lines between the
columns show visitors having the combination of attributes.
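The quantities behind such a path view can be sketched as simple counts: one count per attribute value (circle size) and one per pair of values in neighboring columns (line weight). The sketch below is an assumption about how such counts might be derived, not a description of the eBizInsights implementation; in particular, linking only adjacent columns, the field names, and the sample visits are illustrative choices.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Column order follows the path view described in the text; names are hypothetical.
COLUMNS = ["date", "promotion", "referrer", "entry_url", "exit_url", "visitor_type"]

Node = Tuple[str, str]  # (column, value)


def path_view_counts(visits: List[Dict[str, str]]):
    """Count visits per attribute value and per pair of adjacent-column values."""
    nodes: Counter = Counter()  # Node -> visits; would drive circle size
    links: Counter = Counter()  # (Node, Node) -> visits; would drive line weight
    for visit in visits:
        for col in COLUMNS:
            nodes[(col, visit[col])] += 1
        for left, right in zip(COLUMNS, COLUMNS[1:]):
            links[((left, visit[left]), (right, visit[right]))] += 1
    return nodes, links


# Invented sample visits.
VISITS = [
    {"date": "Mon", "promotion": "Catalog", "referrer": "none",
     "entry_url": "index.html", "exit_url": "cgi-bin/showBasket.html",
     "visitor_type": "Abandoner"},
    {"date": "Mon", "promotion": "Catalog", "referrer": "none",
     "entry_url": "index.html", "exit_url": "index.html",
     "visitor_type": "Browser"},
    {"date": "Tue", "promotion": "Yahoo", "referrer": "yahoo.com",
     "entry_url": "head.html", "exit_url": "cgi-bin/showBasket.html",
     "visitor_type": "Abandoner"},
]

nodes, links = path_view_counts(VISITS)
print(nodes[("promotion", "Catalog")])
```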
In contrast to the random pattern followed by browsers navigating around the World
Wide Web (Huberman et al., 1997), e-commerce sites, just like stores,
are designed for efficient traffic flows. In well-designed stores customers can
find their product easily and check out efficiently. As with a department store,
another challenge for web site designers is to provide customers with impulse
purchase opportunities on the way out, and then to attract profitable repeat customers back to the site using ads and other promotions.
3.4 Measures
eBizInsights provides several different analysis measures that are set using the selection box at the top of the workspace. The most common measure is visits, the number of unique site visitors, but other choices include hits,
bytes transferred, and hits with errors. Hits and bytes transferred are useful for
understanding raw site activity since they correlate to machine performance.
Visits are useful for understanding site activity from an advertising or promotional point of view.
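A minimal sketch of how these four measures might be computed from parsed hit records follows. The record layout is hypothetical, visits are approximated as distinct session numbers, and treating HTTP status codes of 400 and above as errors is an assumption rather than something the text specifies.

```python
from typing import Dict, List


def measures(hits: List[Dict]) -> Dict[str, int]:
    """Compute the four analysis measures over a list of hit records."""
    return {
        # Assumption: one visit per distinct session number.
        "visits": len({h["session_id"] for h in hits}),
        "hits": len(hits),
        "bytes": sum(h["bytes"] for h in hits),
        # Assumption: HTTP status >= 400 counts as a hit with an error.
        "error_hits": sum(1 for h in hits if h["status"] >= 400),
    }


# Invented sample records.
SAMPLE_HITS = [
    {"session_id": "s1", "bytes": 5120, "status": 200},
    {"session_id": "s1", "bytes": 2048, "status": 200},
    {"session_id": "s2", "bytes": 312, "status": 404},
]

totals = measures(SAMPLE_HITS)
print(totals)
```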
3.5 Dynamic Graphics Operations
Dynamic graphics is the capability of the visual metaphors to change in
response to mouse input. This dynamic capability is difficult to illustrate in a
static medium such as paper, but is extremely powerful for interactive visual
discovery.
There are several different classes of interactive operations:
1 Brushing or touching any graphical object with the mouse reveals details. For example, brushing the date bar chart reveals the number of
visitors on any particular day.
2 Viewport manipulation via pan, zoom, and orientation controls makes
it easy to see patterns that might otherwise be obscured. As illustrated
in Figure 5 (top), Multiscape’s navigation control provides a rich environment for manipulating a 3D object.
3 Data layering via color shows additional information. Tying color to
referring site, as in Figure 6, shows that most visitors came from catalog
promotions.
4 Sorting and ordering are powerful sensemaking operations that all of the
visual metaphors support.
5 Selecting to identify subsets is useful to see how a subpopulation relates
to the whole. For example, do the Buyers take different paths through
a site than Abandoners?
6 Excluding to focus on subpopulations is a key technique for identifying
micro trends and other patterns that are obscured by averages.
7 Interactively changing the correlation variable facilitates ad hoc analysis.
4. eBizInsights in Action
This section illustrates how eBizInsights can be used to analyze promotional effectiveness. Static reports show how many visitors each promotion
attracted to the web site. The deeper analysis task, however, involves figuring
out which promotions stimulated the most buyers, when, and what path they
took to complete their purchase. The data in this example is real and comes
from the e-commerce site for a large catalog retailer.
Figure 6 shows the initial display for a promotional study, with color tied to
promotion type. The large blue bar in the focus bar chart shows that the vast
majority of the visitors came via Catalog, the company’s own catalog. The
correlation bar chart shows that most visitors were Browsers, visitors who
did not buy anything, a small fraction were Abandoners, visitors who put
items in their shopping basket but did not complete the checkout process, and
finally the Buyers. The time dimension bar chart shows visitors by day for
the one week snapshot under study.
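The three visitor types defined above can be expressed as a small classification rule. The sketch below assumes the configuration amounts to two sets of URLs, one marking basket activity and one marking a completed purchase; both sets, the purchase URL in particular, and the function name classify_visit are hypothetical.

```python
from typing import Iterable, Set

# Hypothetical configuration: which pages indicate basket use and purchase.
BASKET_PAGES = {"/cgi-bin/showBasket.html"}
PURCHASE_PAGES = {"/cgi-bin/orderComplete.html"}  # invented URL for illustration


def classify_visit(pages: Iterable[str],
                   basket_pages: Set[str] = BASKET_PAGES,
                   purchase_pages: Set[str] = PURCHASE_PAGES) -> str:
    """Label a visit as Buyer, Abandoner, or Browser from the pages it touched."""
    visited = set(pages)
    if visited & purchase_pages:
        return "Buyer"      # completed the checkout process
    if visited & basket_pages:
        return "Abandoner"  # used the basket but never completed checkout
    return "Browser"        # did not buy anything and never used the basket


print(classify_visit(["/index.html", "/cgi-bin/showBasket.html"]))
```

Checking for a purchase first matters: a buyer also visits the basket pages, so the order of the tests decides the label.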
Figure 6. Promotion study shows that most visitors come from catalog ads.
The upper left bar chart in Figure 8 shows that after selecting and excluding visitors who come from the catalog, the four top-producing banner ads
were on Yahoo, FreeShop, DoubleClick, and AOL. Proceeding further
by changing the correlation bar chart to show visits by country, selecting and
excluding the .coms, .nets, and .edus, the upper right bar chart in Figure 8
shows that Canada and Italy are the two foreign countries that have sent the
most visitors. By switching the time dimension to show hits by hour, the lower
left graph in Figure 8 shows that visitors followed a two-hump daily pattern.
Activity is lowest during the early morning hours, peaks around dinnertime,
decreases slightly, and peaks again at 2 am. This pattern suggests that many
visitors are shopping from work in the early afternoons and very late into the
early morning. Alternatively, the surge of late evening traffic may be foreign
shoppers in other time zones. The lower right chart shows a detailed listing of
the Abandoners.
Figure 7. Path analysis for Abandoners who exited on a selected page.
Figure 8. Promotional analysis. Upper left: banner ad effectiveness; upper right: visitors by
foreign country; lower left: visitors by hour; lower right: detail for abandoners.
4.1 Path Analysis
A business goal is to understand why some shoppers abandoned the buying process. Focusing in on the Entry URL and Exit URL columns, the most
common entry points for Abandoners are index.html, head.html, and
vnav.html. The most common exit points are cgi-bin/showBasket.html
and cgi-bin/finit.asp. This suggests that customers are abandoning
because they do not like what’s in their basket, or perhaps because show basket is too slow, or because the search routine is too complex. Perhaps the
interface to remove shopping basket items needs to be simplified.
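The entry and exit point tallies used in this kind of analysis reduce to counting the first and last page of each session. The sketch below shows that computation under the assumption that each abandoner session is available as an ordered list of pages; the sample sessions are invented and only reuse the page names mentioned above.

```python
from collections import Counter
from typing import List, Tuple


def entry_exit_counts(sessions: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count how often each page is the first or the last page of a session."""
    entries = Counter(pages[0] for pages in sessions if pages)
    exits = Counter(pages[-1] for pages in sessions if pages)
    return entries, exits


# Invented abandoner sessions, each an ordered list of pages.
ABANDONER_SESSIONS = [
    ["index.html", "cgi-bin/showBasket.html"],
    ["head.html", "vnav.html", "cgi-bin/showBasket.html"],
    ["index.html", "cgi-bin/finit.asp"],
]

entries, exits = entry_exit_counts(ABANDONER_SESSIONS)
print(entries.most_common(1), exits.most_common(1))
```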
4.2 Taking Action: Exporting Results
It may be possible to convert the abandoners into buyers. Detail data may
be exported to Microsoft Excel using Writeback. Using Excel and Microsoft
Office, we can contact and perhaps offer the abandoners an incentive to purchase.
4.3 Navigation via Selection
An important concept in e-business analysis involves segmenting customers
into different categories and understanding behavior patterns within each segment. For example, do the browsing patterns of customers entering a site coming from Yahoo differ from those who have bookmarked the site? How are
Buyers different from Abandoners? Do evening visitors linger longer than
daytime visitors?
We have developed a simple and intuitive model for selecting, navigating,
and comparing different subpopulations. As illustrated in Figure 9 (left) a user
has selected the three busiest hours representing 2,280 visits and excluded the
rest (right) to focus on the busy hours. It is possible to select and focus in on
arbitrary subpopulations using our selection controls, Figure 5 (bottom right).
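The select-then-exclude step can be sketched as two small operations on a list of visits: pick the values of interest, then drop everything else so later analysis sees only the chosen subpopulation. The record layout, function names, and sample data below are hypothetical, and the sketch says nothing about how the workspace implements selection internally.

```python
from collections import Counter
from typing import Dict, List, Set


def busiest_hours(visits: List[Dict], n: int = 3) -> Set[int]:
    """Select the n hours of the day with the most visits."""
    counts = Counter(v["hour"] for v in visits)
    return {hour for hour, _ in counts.most_common(n)}


def exclude_unselected(visits: List[Dict], selected: Set[int]) -> List[Dict]:
    """Keep only visits in the selected hours, excluding the rest."""
    return [v for v in visits if v["hour"] in selected]


# Invented sample: 4 visits at 19:00, 3 at 20:00, 2 at 02:00, 1 at 09:00.
VISITS_BY_HOUR = [{"hour": h} for h in [19] * 4 + [20] * 3 + [2] * 2 + [9]]

selected = busiest_hours(VISITS_BY_HOUR)
focused = exclude_unselected(VISITS_BY_HOUR, selected)
print(sorted(selected), len(focused))
```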
5. Discussion
eBizInsights has been used in three significant ways:
1 Understanding visitor demographics, i.e., who is visiting the site, when,
what are they doing, etc.
2 Analyzing promotional activity to understand which promotions drive
profitable traffic to the site.
3 Improving site effectiveness to improve the customer experience, avoid
abandoners, increase site stickiness, and provide up-selling opportunities.
The analysis approach for each of these applications is essentially the same.
Reports provide high level trends. Successively relating the trends to different attributes using the eBizInsights workspace leads to insights. This process
involves seeing patterns, formulating hypotheses, sub-setting to confirm the
trends, and successively drilling down to individual transactions. Converting
the insights into value involves taking action. The action might be to export
visitor lists to an operational system for target emails, tune the web site personalization engine, simplify images on web pages that download slowly and
cause abandonment, or redesign overly complex web sites. Our analysis approach as embodied in eBizInsights is both ad hoc and iterative.
Figure 9. Top: selecting the three hours with the highest visitor traffic. Bottom: excluding
the non-selected hours for a focused analysis on the busy hours.
5.1 Related Work
There are numerous commercial systems that produce web reports that describe site activity. Ours, to our knowledge, is the first to use an interactive
workspace for site analysis. Much is known about paths and web browsing
patterns (Huberman et al., 1997) and (Lawrence and Giles, 1999).
A significant fraction of the previous research has focused on helping browsers
navigate through sites more easily (Wexelblat and Maes, 1999). Other authors
have mined web data to determine common browsing patterns (Borges and
Levene, 1999) and (Cooley et al., 1999). Our approach, improving the web site
to make navigation easier, is closer to that of (Spiliopoulou et al., 1999). Related work on web site visualization includes (Minar and Donath, 1999).
In work clearly related to our backend and data store, Kimball and Merz
provide an overview of web data and click stream data warehousing (Kimball
and Merz, 2000).
6. Conclusion and Summary
There are several contributions embodied in eBizInsights:
1 Parser for web log files that extracts both sessions and hits.
2 Click stream data warehouse including analysis cubes.
3 Dozens of management reports that can be distributed throughout the
enterprise.
4 Visual Workspace that provides a rich, iterative, ad hoc environment for
web site analysis.
5 Workflow to structure web site analysis sessions.
6 Paths, promotions, errors, and other web site specific analyses.
7 Specific visual metaphors that support and facilitate click stream analysis
(e.g. Figure 7).
8 A rich environment that helps users convert insights and decisions into
action that creates value.
These capabilities working together provide a powerful environment for
web site analysis.
Acknowledgments
Many talented engineers on the Visual Insights staff have contributed to this
project including Tim Barg, Rich Ely, Bill Hammond, Bill Hull, John Luers,
Jon Martin, John Pyrce, Kurt Rivard, Carla Schanstra, Bill Swanson, Michael
Tatelman, and Daryl Whitmore.
Notes
1. IIS: Microsoft Internet Information Server, Microsoft’s web server.
2. ISAPI: Internet Information Server API.
3. In some images the Correlation barchart has been minimized.
4. All names and identifiers have been changed for privacy.
References
Borges, J. and Levene, M. (1999). Data mining of user navigation patterns. In Proceedings 1999
KDD Workshop on Web Mining, San Diego, California. Springer-Verlag. In Press.
Cooley, R., Tan, P.-N., and Srivastava, J. (1999). Websift: the web site information filter system. In Proceedings 1999 KDD Workshop on Web Mining, San Diego, California. Springer-Verlag. In Press.
Eick, S. G. (2000). Visual discovery and analysis. IEEE Transactions on Visualization
and Computer Graphics, 6(1):44–59.
Huberman, B., Pirolli, P., Pitkow, J., and Lukose, R. (1997). Strong regularities in world wide
web surfing. Science, 280:95–97.
Kimball, R. and Merz, R. (2000). The Data Webhouse Toolkit. John Wiley & Sons, Inc., New
York, New York.
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740):107–109.
Minar, N. and Donath, J. (1999). Visualizing crowds at a web site. In CHI ’99 Late-breaking
Papers. ACM Press.
Spiliopoulou, M., Pohle, C., and Faulstich, L. (1999). Improving the effectiveness of a web site
with web usage mining. In Proceedings 1999 KDD Workshop on Web Mining, San Diego,
California. Springer-Verlag. In Press.
Wexelblat, A. and Maes, P. (1999). Footprints: History-rich tools for information foraging. In
CHI ’99 Conference Proceedings. ACM Press.