"!$#&%('(#*),+.-0/2143$5 687:9<;*=?>A@<BA9<7 CED&FG9IH"@JBLKNM OQPSRUT D&;&V2=?BXWZY?;*> []\N=?^`_ T ;&7aF4[ OQbSb cd=<e*[]^<\f7g>:>h[JHi@Ijlk bSR k$m?npofqAr R s<t u]vJwAxyt zp{]|f} t ~<z4t EGzI uN Abstract Click stream data represents a rich source of information for understanding web site activity, browsing patterns to purchasing decisions. Standard tools produce hundreds of reports that are not particularly useful. The problem with fixed web reports, and report-based analysis in general, is that reports only answer specific, predefined questions that are insufficient for today’s highly competitive and rapidly changing businesses. To overcome this problem, we have developed a click stream analysis tool called eBizInsights. eBizInsights consists of a web log parser, a click stream warehouse, a reporting engine, and a rich visual interactive workspace. Using the workspace, analysts perform ad hoc analysis, discover patterns, and identify correlations that are impossible to find using fixed reports. "UXy One of the most exciting trends of our time is the explosive growth of the Internet, web sites, and e-commerce. This trend has manifested itself with the creation of millions of web sites, the widespread emergence of e-commerce sites, the birth of a huge number of “dot coms”, and the merging of brick and mortar and cyberspace business models. As web technology has matured, web sites have progressed from static “information distribution” sites to hugely complex portals providing rich, highly customized services. For many companies e-commerce web sites have become a significant, if not the primary distribution channel for a company’s products and services. With this transition from a traditional brick and mortar to a new self-service economy, web sites are becoming a focal point for customer interactions. Customer relationships must be formed, personalized, and managed through this new medium. 

Although the technology, channel, and medium have changed, the underlying business problems have not. “How do I identify new prospects? Create offers that excite and satisfy them? Retain existing customers and sell them more? Understand what is selling? Maximize return on advertising expenditures? Manage profit by product and product line?” What is new is that these business functions are now being conducted via the Internet instead of using traditional means. Translated into e-business, the key questions become: “How do we drive profitable traffic to our site? Which banner ads were most successful? How do we target customers in Germany? How do I sell more products during evening hours when my site has spare capacity? What is the traffic flow through my site? Which sections of my site are customers bypassing? How long do customers linger? How do I increase site stickiness?”

Perhaps the most significant difference between the old brick and mortar business and the new e-business is richer instrumentation. In a brick and mortar department store, Sears for example, we might know that a customer purchased a tie in the Men’s Department and, by matching credit card transactions, perhaps also purchased a Craftsman portable drill in the tools department. In the equivalent e-business scenario, we would know where the customer came from, when he entered our store, what he searched for, which other departments he visited, what other tools he put in his shopping basket, and whether he abandoned the purchase because the checkout was too complicated. The e-business challenge is then how best to exploit this exceedingly rich new information source.

The first problem, capturing site activity, is conceptually simple but practically difficult. There are various strategies for instrumenting sites, but the most common source of site activity data is web server log files. These contain a record of every page request, but processing these files quickly becomes non-trivial.
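
To make the log-file source concrete, the following is a minimal sketch of reading page requests from a log in the W3C Extended Log File Format, where a `#Fields:` directive names the columns of the entries that follow. The sample log lines, the chosen fields, and the function name `parse_w3c_log` are illustrative assumptions; this is not the eBizInsights parser.

```python
from typing import Dict, Iterable, List

# Hypothetical sample log; addresses, URLs, and timestamps are made up.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2000-03-01 18:02:11 10.0.0.1 GET /index.html 200
2000-03-01 18:02:15 10.0.0.1 GET /cgi-bin/showBasket.html 200
2000-03-01 18:03:40 10.0.0.2 GET /missing.html 404
"""


def parse_w3c_log(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per request, keyed by the names in the #Fields directive."""
    fields: List[str] = []
    records: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line: only #Fields changes how entries are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Skip malformed entries whose column count does not match.
        if fields and len(values) == len(fields):
            records.append(dict(zip(fields, values)))
    return records


records = parse_w3c_log(SAMPLE_LOG.splitlines())
errors = [r for r in records if r["sc-status"].startswith(("4", "5"))]
print(len(records), len(errors))  # 3 1
```

Even this toy version has to handle directives and malformed lines, which hints at why parsing at production volumes and across server farms is harder than it first appears.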
The transaction volumes may be huge, and sites may be geographically dispersed for reliability, likely include multiple components such as ad, content, and personalization engines, and are routinely distributed across server farms. The next level of problem, performing effective analyses, involves collecting and coalescing this dispersed information set into a coherent schema, correlating site activity with related attributes, and linking it with enterprise information.

First generation web site reporting tools focus on data collection and reporting. These tools typically include a scalable parsing engine for web server log files, or perhaps a sniffer that captures site activity by monitoring TCP/IP packets. They aggregate site statistics, either raw clicks or assembled sessions, and store them in a fast database. The engineering challenge that these tools address is how to collect, process, store, and report on high-volume web site activity.

When sites were first deployed, the servers, platforms, and web infrastructure were much less standardized than today. It was a huge problem just to cope with all the different web log formats and assemble the information into a coherent schema. The data collection problem, although still complex, has been somewhat mitigated as standards bodies have formalized web log formats and packaged solutions have become available. The next generation of web server suites will further address the back end problem by automatically collecting web statistics and depositing them into an associated database.

The “reporting engine” for first generation tools would produce literally hundreds of fixed reports showing site activity by time of day, URL, browser, page errors, etc. Initially, customers were thrilled: the reports answered simple questions like “What’s happening on my site?” and “How many hits did I have yesterday?” Soon, however, problems emerged.
It is exceedingly difficult to make delicate business decisions using just raw reports. Fixed reports:
1. Capture one point in time and are quickly outdated;
2. Miss important trends and outliers that are hidden when the data is summarized;
3. Are not effective for multi-dimensional problems;
4. Require significant people and IT resources to be customized;
5. Provide no ad hoc interactivity;
6. Lack drill-down to access details.

This problem with reports is well known within the data mining community and is called assumptive-based analysis. The reports assume what is reported upon and deliver fixed answers to pre-specified questions. Inevitably, the questions are not quite right and the particular case of interest is slightly but significantly different from what is reported upon. This leads to a “report explosion” where the vendors provide hundreds of reports, hoping to answer all important questions using a “shotgun” approach. What goes wrong with this approach is that it becomes increasingly difficult to find the report that answers any particular question.

A final problem with fixed reports is that it is impossible to correlate between different variables. A report-based correlation analysis involves printing several reports and manually trying to infer relationships. The net result is that the first generation tools were exciting initially, but insufficient for all but the simplest sites.

Analysis problems are dynamic, interactive, and iterative. Addressing them involves correlating trends, making comparisons, and performing richer analyses. These tasks are impossible with fixed reports. To bridge this gap we have built a shrink-wrapped e-business performance analysis system that includes:
1. Visual workspace and analysis engine that supports ad hoc discovery, correlation, and analysis.
2. Workflow to guide a user through the analysis process.
3. Parallelized W3C web log parser for IIS, Netscape, and Apache web servers.
4. Session and hit-based click stream schema and data store for web site activity (Kimball and Merz, 2000).
5. Static HTML reports that can be scheduled, browsed, and emailed to decision makers.
6. Result sets that can be exported based upon analysis insights and acted upon to convert those insights into business value.

The remainder of this paper describes our approach to e-business performance analysis in more detail and its embodiment in a software tool called eBizInsights.

The eBizInsights architecture, as shown in Figure 1, consists of four significant components: an ISAPI filter that attaches to the web server for sessionization, a server, a data warehouse, and analysis clients. The ISAPI filter, described below, generates session numbers for HTTP requests that allow the sequence of pages followed by a given client to be tracked. The server parses IIS log files, shown on the left, and deposits the results in MS SQL Server 7.0 tables. After parsing is complete, the server-reporting module creates dozens of static HTML reports that are distributed and published throughout the enterprise. Clients running our visual workspace connect directly to the SQL Server 7.0 database for click stream data access. The SQL Server 7.0 warehouse may be hosted on the machine running the eBizInsights server or on its own machine for increased scalability.

Figure 1: eBizInsights architecture consists of an ISAPI filter, a server, a SQL Server 7.0 click stream data store, and analysis clients.

To overcome the inherent problems with report-based analysis we have developed an interactive workspace for visual discovery, analysis, and correlation (see Figure 2). There are three significant parts to the workspace: a tree-structured workflow control along the left, interactive bar charts down the middle, and an analysis pane with three tabs labelled Overview, Details, and Paths.
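
As an aside on the sessionization step in the architecture: eBizInsights assigns session numbers at the web server through its ISAPI filter. When no such filter is available, a common fallback is to group hits by client and split on inactivity. The sketch below shows that fallback heuristic, not the ISAPI approach; the 30-minute timeout, the use of client IP as identity, and the sample hits are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Hit = Tuple[str, datetime, str]  # (client IP, timestamp, requested URL)


def sessionize(hits: List[Hit],
               timeout: timedelta = timedelta(minutes=30)) -> List[List[Hit]]:
    """Group hits into sessions per client, splitting after `timeout` of inactivity."""
    sessions: List[List[Hit]] = []
    open_session: Dict[str, List[Hit]] = {}  # client IP -> most recent session
    for hit in sorted(hits, key=lambda h: h[1]):
        ip, ts, _ = hit
        current = open_session.get(ip)
        # Start a new session for a new client or after a long gap.
        if current is None or ts - current[-1][1] > timeout:
            current = []
            sessions.append(current)
            open_session[ip] = current
        current.append(hit)
    return sessions


# Hypothetical hits from two clients.
HITS: List[Hit] = [
    ("10.0.0.1", datetime(2000, 3, 1, 18, 0), "/index.html"),
    ("10.0.0.2", datetime(2000, 3, 1, 18, 1), "/index.html"),
    ("10.0.0.1", datetime(2000, 3, 1, 18, 5), "/vnav.html"),
    ("10.0.0.1", datetime(2000, 3, 1, 19, 0), "/index.html"),
]
sessions = sessionize(HITS)
print([len(s) for s in sessions])  # [2, 1, 1]
```

The heuristic misattributes hits when many clients share one IP address, which is one reason a server-side session number is attractive.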
Other system components include a color legend, navigation controls, and selection control.

Clicking on any entry in the workflow control automatically populates the bar charts and sets the tabbed pane to the most useful data and display for the particular analysis. In its initial configuration there are seven high-level categories in the workflow:
1. Top 20s provides a quick view of the top page, path, promotion, visitor, and visit statistics.
2. General Activity reports on site activity by time period, either hours, days, weeks, months, or years.
3. Hosted Ads shows click-through rates and other advertising effectiveness attributes for sites displaying banner ads.
4. Promotions shows which sites, URLs, search engines, etc., are most effective at driving traffic to the site.
5. Site Effectiveness investigates site-oriented statistics such as errors, paths through the site, entry and exit pages, page stickiness, etc.
6. Visitors organizes e-commerce site visitors into Browsers, Abandoners, and Buyers through a configuration option. It then supports visitor analysis via organization, country, type, etc.
7. My Insights is a special tab for users to save custom analyses and, in the future, an integration point for analyses based on additional data sources.

Figure 2: eBizInsights workspace. Some of the entries in the tree workflow control are expanded to show sub-items.

The workspace contains three bar charts named Focus, Correlation, and Time. All function similarly. Users set the statistic in the bar chart either by clicking on an entry in the tree or by manipulating the bar chart selection control at the bottom. The bar charts both show results and serve as an input environment for user selections. See Section 4.3. Expanding the bar charts shows that they are richly parameterized and can be oriented, zoomed, panned, labeled, and sorted. See Figure 3.

Figure 3: Bar charts may be expanded, zoomed, panned, labeled, oriented, and sorted.
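
The Visitors category above sorts visits into Browsers, Abandoners, and Buyers through a configuration option. The paper does not spell out that configuration, so the following is only a sketch under one assumption: that the option names which URLs mark a shopping basket and which mark a completed purchase. The purchase URL and the function name `classify_visit` are hypothetical.

```python
from typing import Iterable, Set

# Assumed configuration: which pages signal basket use and purchase completion.
BASKET_URLS: Set[str] = {"/cgi-bin/showBasket.html"}
PURCHASE_URLS: Set[str] = {"/cgi-bin/orderComplete.html"}  # hypothetical URL


def classify_visit(urls: Iterable[str],
                   basket_urls: Set[str] = BASKET_URLS,
                   purchase_urls: Set[str] = PURCHASE_URLS) -> str:
    """Label a visit by the furthest step in the purchase process it reached."""
    visited = set(urls)
    if visited & purchase_urls:
        return "Buyer"      # completed the checkout
    if visited & basket_urls:
        return "Abandoner"  # used the basket but never completed checkout
    return "Browser"        # never put anything in the basket


print(classify_visit(["/index.html", "/vnav.html"]))               # Browser
print(classify_visit(["/index.html", "/cgi-bin/showBasket.html"]))  # Abandoner
```

Checking for a purchase first matters, because a Buyer's visit normally includes the basket page as well.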

The Overview tab, as illustrated in Figure 2, shows correlations between the focus and time bar charts using a 3D Multiscape (Eick, 2000). A Multiscape, our implementation of a landscape, provides a broad overview of the information and is particularly powerful for showing time-oriented data. The Multiscape toolbar, shown in Figure 5, provides navigation controls for zooming, rotating, and symbol choice, with several fixed but useful viewpoints. Clicking on any viewpoint causes the control to animate smoothly to the new position, thereby helping the user maintain context.

Figure 4: Details view is a color-coded, scrollable text window that smoothly transitions between text and graphics to increase view scalability. Upper left: full scale. Lower right: completely smashed, where each thin bar represents a text field.

The Details tab, as shown in Figure 4, is an interactive viewer for displaying text called Data Sheet (Eick, 2000). At full scale Data Sheet displays information using a standard color-coded text font. Each column in the display is sortable, selectable, and searchable. To increase scalability, as the user zooms out Data Sheet uses progressively smaller fonts and eventually switches to thin bars. This process, called smashing, is illustrated in Figure 4.

Figure 5: Top: Multiscape navigation toolbar provides a rich interface for manipulating the display. Bottom: Selection and exclusion toolbar.

Path analysis is a technique for understanding how customers navigate through a site, identifying trouble spots, and optimizing site layout. Our path view, shown in Figure 7, supports this by showing flow among pages. There are columns for Date, Promotion, Referrer, Entry URL, Exit URL, and Visitor Type. Within each column, filled circles are sized to encode the number of visitors with each attribute. The circles in the Date column are all about the same size, indicating that traffic was steady for the period.
Lines between the columns show visitors having the combination of attributes.

In contrast to the random patterns followed by browsers navigating the World Wide Web (Huberman et al., 1997), e-commerce sites, just like stores, are designed for efficient traffic flows. In well-designed stores customers can find their product easily and check out efficiently. As with a department store, another challenge for web site designers is to provide customers with impulse purchase opportunities on the way out. A further challenge is to attract profitable repeat customers back to the site using ads and other promotions.

eBizInsights provides several different analysis measures that are set using the selection box at the top of the workspace. The most common measure is visits, the number of unique site visitors, but other choices include hits, bytes transferred, and hits with errors. Hits and bytes transferred are useful for understanding raw site activity since they correlate with machine performance. Visits are useful for understanding site activity from an advertising or promotional point of view.

Dynamic graphics is the capability of the visual metaphors to change in response to mouse input. This dynamic capability is difficult to illustrate in a static medium such as paper, but is extremely powerful for interactive visual discovery. There are several different classes of interactive operations:
1. Brushing or touching any graphical object with the mouse reveals details. For example, brushing the date bar chart reveals the number of visitors on any particular day.
2. Viewport manipulation via pan, zoom, and orientation controls makes it easy to see patterns that might otherwise be obscured. As illustrated in Figure 5 (top), Multiscape’s navigation control provides a rich environment for manipulating a 3D object.
3. Data layering via color shows additional information. Tying color to referring site, as in Figure 6, shows that most visitors came from catalog promotions.
4. Sorting and ordering are powerful sensemaking operations that all of the visual metaphors support.
5. Selecting to identify subsets is useful to see how a subpopulation relates to the whole. For example, do the Buyers take different paths through a site than Abandoners?
6. Excluding to focus on subpopulations is a key technique for identifying micro trends and other patterns that are obscured by averages.
7. Interactively changing the correlation variable facilitates ad hoc analysis.

This section illustrates how eBizInsights can be used to analyze promotional effectiveness. Static reports show how many visitors each promotion attracted to the web site. The deeper analysis task, however, involves figuring out which promotions stimulated the most buyers, when, and what path they took to complete their purchase. The data in this example is real and comes from the e-commerce site for a large catalog retailer.

Figure 6 shows the initial display for a promotional study, with color tied to promotion type. The large blue bar in the focus bar chart shows that the vast majority of the visitors came via Catalog, the company’s own catalog. The correlation bar chart shows that most visitors were Browsers, visitors who did not buy anything; a small fraction were Abandoners, visitors who put items in their shopping basket but did not complete the checkout process; and finally the Buyers. The time dimension bar chart shows visitors by day for the one-week snapshot under study.

Figure 6: Promotion study shows that most visitors come from catalog ads.

The upper left bar chart in Figure 8 shows that, after selecting and excluding the visitors who came from the catalog, the four top-producing banner ads were on Yahoo, FreeShop, DoubleClick, and AOL.
Proceeding further by changing the correlation bar chart to show visits by country, and selecting and excluding the .coms, .nets, and .edus, the upper right bar chart in Figure 8 shows that Canada and Italy are the two foreign countries that have sent the most visitors. By switching the time dimension to show hits by hour, the lower left graph in Figure 8 shows that visitors followed a two-hump daily pattern. Activity is lowest during the early morning hours, peaks around dinnertime, decreases slightly, and peaks again at 2am. This pattern suggests that many visitors are shopping from work in the early afternoons and very late into the early morning. Alternatively, the surge of late evening traffic may be foreign shoppers in other time zones. The lower right chart shows a detailed listing of the Abandoners.

Figure 7: Path analysis for Abandoners who exited on a particular selected page.

Figure 8: Promotional analysis. Upper left: banner ad effectiveness. Upper right: visitors by foreign country. Lower left: visitors by hour. Lower right: detail for abandoners.

A business goal is to understand why some shoppers abandoned the buying process. Focusing in on the Entry URL and Exit URL columns, the most common entry points for Abandoners are index.html, head.html, and vnav.html. The most common exit points are cgi-bin/showBasket.html and cgi-bin/finit.asp. This suggests that customers are abandoning because they do not like what’s in their basket, or perhaps because showing the basket is too slow, or because the search routine is too complex. Perhaps the interface to remove shopping basket items needs to be simplified.

It may be possible to convert the abandoners into buyers. Detail data may be exported to Microsoft Excel using Writeback. Using Excel and Microsoft Office, we can contact the abandoners and perhaps offer them an incentive to purchase.
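
The entry and exit page tallies behind this kind of path analysis can be sketched in a few lines. The code below is an illustration only: the session records are made-up sample data that reuse the page names from the text, and `entry_exit_counts` is a hypothetical helper, not part of eBizInsights.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical sessions: (visitor type, ordered list of pages requested).
SESSIONS: List[Tuple[str, List[str]]] = [
    ("Abandoner", ["index.html", "vnav.html", "cgi-bin/showBasket.html"]),
    ("Abandoner", ["index.html", "cgi-bin/showBasket.html"]),
    ("Abandoner", ["head.html", "cgi-bin/finit.asp"]),
    ("Buyer", ["index.html", "cgi-bin/showBasket.html", "cgi-bin/finit.asp"]),
    ("Browser", ["vnav.html"]),
]


def entry_exit_counts(sessions: List[Tuple[str, List[str]]],
                      visitor_type: str) -> Tuple[Counter, Counter]:
    """Count first and last pages over all sessions of one visitor type."""
    entries: Counter = Counter()
    exits: Counter = Counter()
    for vtype, pages in sessions:
        if vtype != visitor_type or not pages:
            continue
        entries[pages[0]] += 1   # first page requested in the session
        exits[pages[-1]] += 1    # last page requested before leaving
    return entries, exits


entries, exits = entry_exit_counts(SESSIONS, "Abandoner")
print(entries.most_common(1))  # [('index.html', 2)]
print(exits.most_common(1))    # [('cgi-bin/showBasket.html', 2)]
```

Running the same tally for Buyers and comparing the two exit distributions is one way to check whether a page is a genuine trouble spot or simply the normal end of a visit.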

An important concept in e-business analysis involves segmenting customers into different categories and understanding behavior patterns within each segment. For example, do the browsing patterns of customers entering a site from Yahoo differ from those who have bookmarked the site? How are Buyers different from Abandoners? Do evening visitors linger longer than daytime visitors? We have developed a simple and intuitive model for selecting, navigating, and comparing different subpopulations. As illustrated in Figure 9, a user has selected the three busiest hours, representing 2,280 visits (top), and excluded the rest (bottom) to focus on the busy hours. It is possible to select and focus in on arbitrary subpopulations using our selection controls, shown in Figure 5 (bottom).

eBizInsights has been used in three significant ways:
1. Understanding visitor demographics, i.e., who is visiting the site, when, what they are doing, etc.
2. Analyzing promotional activity to understand which promotions drive profitable traffic to the site.
3. Improving site effectiveness to improve the customer experience, reduce abandonment, increase site stickiness, and provide up-selling opportunities.

The analysis approach for each of these applications is essentially the same. Reports provide high-level trends. Successively relating the trends to different attributes using the eBizInsights workspace leads to insights. This process involves seeing patterns, formulating hypotheses, sub-setting to confirm the trends, and successively drilling down to individual transactions. Converting the insights into value involves taking action. The action might be to export visitor lists to an operational system for targeted emails, tune the web site personalization engine, simplify images on web pages that download slowly and cause abandonment, or redesign overly complex web sites. Our analysis approach as embodied in eBizInsights is both ad hoc and iterative.
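
The select-and-exclude model for subpopulations, as in the busiest-hours example, can be approximated as filtering a set of visits. This is a minimal sketch with made-up visit data; the function names are hypothetical and do not describe the workspace's internals.

```python
from collections import Counter
from typing import List, Set, Tuple

Visit = Tuple[int, int]  # (visit id, hour of day 0-23)


def busiest_hours(visits: List[Visit], n: int = 3) -> Set[int]:
    """Return the n hours with the most visits."""
    counts = Counter(hour for _, hour in visits)
    return {hour for hour, _ in counts.most_common(n)}


def keep_selected(visits: List[Visit], hours: Set[int]) -> List[Visit]:
    """Focus on the selection by excluding everything outside it."""
    return [v for v in visits if v[1] in hours]


def exclude_selected(visits: List[Visit], hours: Set[int]) -> List[Visit]:
    """Drop the selection to study what remains."""
    return [v for v in visits if v[1] not in hours]


# Hypothetical data: hour 18 has 4 visits, hour 2 has 3, hour 19 has 2, hour 9 has 1.
VISITS: List[Visit] = [(1, 18), (2, 18), (3, 18), (4, 18), (5, 2), (6, 2),
                       (7, 2), (8, 19), (9, 19), (10, 9)]

top = busiest_hours(VISITS)
focused = keep_selected(VISITS, top)
rest = exclude_selected(VISITS, top)
print(sorted(top), len(focused), rest)  # [2, 18, 19] 9 [(10, 9)]
```

Because both operations return ordinary lists of visits, they can be chained, which mirrors the successive sub-setting and drilling down described above.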

Figure 9: Top: selecting the three hours with the highest visitor traffic. Bottom: excluding the non-selected hours for a focused analysis on the busy hours.

There are numerous commercial systems that produce web reports describing site activity. Ours, to our knowledge, is the first to use an interactive workspace for site analysis. Much is known about paths and web browsing patterns (Huberman et al., 1997; Lawrence and Giles, 1999). A significant fraction of the previous research has focused on helping browsers navigate through sites more easily (Wexelblat and Maes, 1999). Other authors have mined web data to determine common browsing patterns (Borges and Levene, 1999; Cooley et al., 1999). Our approach, improving the web site to make navigation easier, is closer to that of Spiliopoulou et al. (1999). Related work on web site visualization includes Minar and Donath (1999). In work clearly related to our back end and data store, Kimball and Merz provide an overview of web data and click stream data warehousing (Kimball and Merz, 2000).

There are several contributions embodied in eBizInsights:
1. Parser for web log files that extracts both sessions and hits.
2. Click stream data warehouse including analysis cubes.
3. Dozens of management reports that can be distributed throughout the enterprise.
4. Visual workspace that provides a rich, iterative, ad hoc environment for web site analysis.
5. Workflow to structure web site analysis sessions.
6. Paths, promotions, errors, and other web site specific analyses.
7. Specific visual metaphors that support and facilitate click stream analysis (e.g., Figure 7).
8. A rich environment that helps users convert insights and decisions into action that creates value.

These capabilities working together provide a powerful environment for web site analysis.

Acknowledgments

Many talented engineers on the Visual Insights staff have contributed to this project, including Tim Barg, Rich Ely, Bill Hammond, Bill Hull, John Luers, Jon Martin, John Pyrce, Kurt Rivard, Carla Schanstra, Bill Swanson, Michael Tatelman, and Daryl Whitmore.

Notes

1. Microsoft Internet Information Server, Microsoft’s web server.
2. Internet Information Server API.
3. In some images the Correlation bar chart has been minimized.
4. All names and identifiers have been changed for privacy.

References

Borges, J. and Levene, M. (1999). Data mining of user navigation patterns. In Proceedings 1999 KDD Workshop on Web Mining, San Diego, California. Springer-Verlag. In Press.

Cooley, R., Tan, P.-N., and Srivastava, J. (1999). Websift: the web site information filter system. In Proceedings 1999 KDD Workshop on Web Mining, San Diego, California. Springer-Verlag. In Press.

Eick, S. G. (2000). Visual discovery and analysis. IEEE Transactions on Visualization and Computer Graphics, 6(1):44–59.

Huberman, B., Pirolli, P., Pitkow, J., and Lukose, R. (1997). Strong regularities in world wide web surfing. Science, 280:95–97.

Kimball, R. and Merz, R. (2000). The Data Webhouse Toolkit. John Wiley & Sons, Inc., New York, New York.

Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740):107–109.

Minar, N. and Donath, J. (1999). Visualizing crowds at a web site. In CHI ’99 Late-breaking Papers. ACM Press.

Spiliopoulou, M., Pohle, C., and Faulstich, L. (1999). Improving the effectiveness of a web site with web usage mining. In Proceedings 1999 KDD Workshop on Web Mining, San Diego, California. Springer-Verlag. In Press.

Wexelblat, A. and Maes, P. (1999). Footprints: History-rich tools for information foraging. In CHI ’99 Conference Proceedings. ACM Press.