The Scent of a Site: A System for Analyzing and Predicting Information Scent, Usage, and Usability of a Web Site

Ed H. Chi, Peter Pirolli, James Pitkow
Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, CA 94304
{echi,pirolli,pitkow}@parc.xerox.com

ABSTRACT
Designers and researchers of users’ interactions with the World Wide Web need tools that permit the rapid exploration of hypotheses about complex interactions of user goals, user behaviors, and Web site designs. We present an architecture and system for the analysis and prediction of user behavior and Web site usability. The system integrates research on human information foraging theory, a reference model of information visualization, and Web data-mining techniques. The system also incorporates new methods of Web site visualization (Dome Tree, Usage-Based Layouts), a new predictive modeling technique for Web site use (Web User Flow by Information Scent, WUFIS), and new Web usability metrics.

Keywords
Information foraging, information scent, World Wide Web, usability, information visualization, data mining, longest repeated subsequences, Dome Tree, Usage-Based Layout.

INTRODUCTION
The World Wide Web is a complex information ecology consisting of several hundred million Web pages and over a hundred million users. Each day these users generate over a billion clicks through the myriad of accessible Web sites. Naturally, Web site designers and content providers seek to understand the information needs and activities of their users and to understand the impact of their designs. Given the magnitude of user interaction data, there exists a need for more efficient and automated methods to (a) analyze the goals and behaviors of Web site visitors, and (b) analyze and predict Web site usability. Simpler, more effective, and more efficient toolkits need to be developed to explore and refine predictive models, user analysis techniques, and Web site usability metrics.
Here we present an architecture and system for exploratory data analysis and predictive modeling of Web site use. The architecture and system integrate research on human information foraging theory [6], a reference model of information visualization [3], and Web data-mining techniques [9]. The system also incorporates new methods of Web site visualization, a new predictive modeling technique for Web site use, and new Web usability metrics. The system is currently being developed for researchers interested in modeling users within a site and investigating Web site usability; however, the ultimate goal is to evolve the system so that it can be effectively employed by practicing Web site designers and content providers.

WEB SITE ANALYSIS AND PREDICTION
Most Web sites record visitor interaction data in some form. Since the inception of the Web, a variety of tools have been developed to extract information from usage data. Although the degree of reliability varies widely based upon the different heuristics used, metrics like the number of unique users, number of page visits, reading times, session lengths, and user paths are commonly computed. While some tools have evolved into products¹, most Web log file analysis consists of simple descriptive statistics, providing little insight into the users and use of Web sites.

A newly emerging approach is to employ software agents as surrogate users that traverse a Web site and derive various usability metrics. WebCriteria SiteProfile uses a browsing agent to navigate a Web site using a modified GOMS model and record download times and other information. The data are integrated into metrics that assess: (a) the load times associated with pages on the site, and (b) the accessibility of content (ease of finding content). The accessibility metric is based upon the hyperlink structure of the site and the amount of content. An analysis of the actual content is not performed.
Current approaches to Web site analysis are aimed at Webmasters who are interested in exploring questions about the current design of a Web site and the current set of users. However, Webmasters are also interested in predicting the usability of alternative designs of their Web sites. They also seek to answer these same questions for new kinds of (hypothetical) users who have slightly different interests than the current users. Our work aims to develop predictive models capable of simulating hypothetical users and alternative Web site designs. Using these models, we also seek to develop means for the automatic calculation of usability metrics.

¹ For instance, Accrue Insight (http://www.accrue.com), Astra SiteManager (http://www.merc-int.com), and WebCriteria SiteProfile (http://www.webcriteria.com).

Our research on new analysis models, predictive models, and usability metrics contributes to the development of tools for the practicing Web site designer interested in exploring "what-if" Web site designs.

Our system was developed to answer questions beyond those answered by basic descriptive statistics. Specifically, we sought to answer questions concerning the entire Web site, specific pages, and the users:

• Overall site. What is the overall current traffic flow? What are the actual and predicted surfing traffic routes (e.g., branching patterns, pass-through points)? How does the site measure on ease of access (finding information) and cost?

• Given page. Where do the visitors come from (i.e., what routes do they follow)? Where do they actually go? What other pages are related?

• Users. What are the interests of the visitors (real or simulated) to this page? Where do we think they should go given their interests? Do actual usage data match these predictions, and why? What is the cost (e.g., in terms of download time) of surfing for these visitors?

INFORMATION FORAGING AT WEB SITES
Information foraging theory [6] has been developed as a way of explaining human information-seeking and sense-making behavior. Here we use the theoretical notion of information scent developed in this theory [5,6] as the basis for several analysis techniques, metrics, and predictive modeling. We also employ a data mining technique involving the identification of longest repeated subsequences (LRS, [9]) to extract the surfing patterns of users foraging for information on the Web. This fusion of methods provides a novel way of capturing user information goals, the affordances of Web sites, and user behavior.

Information Goals and Information Scent
On the Web, users typically forage for information by navigating from page to page along Web links. The content of pages associated with these links is usually presented to the user by some snippets of text or graphics. Foragers use these proximal cues (snippets; graphics) to assess the distal content (the page at the other end of the link).² Information scent is the imperfect, subjective perception of the value, cost, or access path of information sources obtained from proximal cues, such as Web links or icons representing the content sources.

In the current system, we have developed a variety of approximations for analyzing and predicting information scent. These techniques are based on psychological models [6], which are closely related to standard information retrieval techniques, and on Web data mining techniques based on the analysis of Content, Usage, and hyperlink Topology (CUT, [3,10]). For more details, see [1].

² Furnas referred to such intermediate information as “residue” [4].

Reverse Scent Flow to Identify Information Need
A well-traveled path may indicate a group of users who have very similar information goals and are guided by the scent of the environment. Therefore, given a path, we would like to know the information goal expressed by that path. We have developed a new algorithm called Inferring User Need by Information Scent (IUNIS) that uses the Scent Flow model in reverse to determine users’ information goals [1]. Such goals can be described by a sorted list of weighted keywords, which can be skimmed by an analyst to estimate and understand the goals of users traversing a particular path.
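The published IUNIS algorithm [1] is more involved, but the core idea of producing a sorted keyword list from a path can be sketched with a simple TF.IDF-style weighting, assuming we have the text of each page on the path and of the site as a whole. The function name and weighting here are illustrative, not the system's actual API.

```python
import math
from collections import Counter

def infer_goal_keywords(path_texts, corpus_texts, top_k=5):
    """Weight each word by its frequency on the path times its rarity
    across the site, and return the top keywords in sorted order."""
    # document frequency of each word over the whole site
    df = Counter()
    for text in corpus_texts:
        df.update(set(text.lower().split()))
    n_docs = len(corpus_texts)
    # term frequency of each word over the pages on the path
    tf = Counter()
    for text in path_texts:
        tf.update(text.lower().split())
    weights = {w: c * math.log((1 + n_docs) / (1 + df[w]))
               for w, c in tf.items()}
    return [w for w, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that appear on the path but are common across the whole site (e.g., navigation boilerplate) get low weight, so the most diagnostic words float to the top of the list.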

Mining Web Site Foraging Patterns
Pitkow and Pirolli [9] systematically investigated the utility of a Web-mining technique that extracts significant surfing paths by the identification of longest repeating subsequences (LRS). They found that the LRS technique serves to reduce the complexity of the surfing path model required to represent a set of raw surfing data, while maintaining an accurate profile of future usage patterns. In essence, the LRS technique extracts surfing paths that are likely to re-occur and ignores noise in the usage data. We use the LRS data mining technique to identify significant surfing paths in real and simulated data.

Overview of the Analysis Approach
Our assumption is that, for the purposes of many analyses, users have some information goal, and their surfing patterns through the site are guided by information scent. Given this framing assumption, we have developed techniques for answering a variety of Web site usability questions. First, for a particular pattern of surfing, we seek to infer the associated information goal. Second, given an information goal, some pages as starting points, and the information scent associated with all the links emanating from all the pages, we attempt to predict the expected surfing patterns, and thereby simulate Web site usage. Finally, we develop metrics concerning the overall goodness of the information scent that leads users to goal content (cf. [11]). Using these methods, we analyze the quality of Web links in providing good proximal scent that leads users to the distal content that they seek.

ARCHITECTURE
The architecture of the system is based on the Information Visualization Reference Model [3]. Figure 1 shows the architecture of the system using the Data State Model and the associated operators. The figure summarizes the data states and the operators defined by the system components, which we describe in detail below.

In this figure, circles represent the data states, while edges represent operators. There are four major data state stages: Value, Analytical Abstraction, Visualization Abstraction, and View. There are three major operator types: Data, Visualization, and Visual Mapping Transformations. The right side of the figure depicts these stages and types.


At the conceptual level, Figure 1 shows an important feature of the architecture: the actual observed usage data can be seamlessly replaced by simulated usage data, without disturbing other parts of the system. By pushing the observed or the simulated surfing data through the system, we obtain visualizations of actual or simulated usage. By providing this capability, users of the system can quickly test hypothetical cases against actual usage in a real-time, iterative manner, thus supporting detailed investigation into a site’s usability.

SYSTEM FOR WEB SCENT VISUALIZATION
Using the reference model, we constructed a system for visualizing and analyzing a site’s information scent, user trails, and usability. In the next sections, we describe the system components, followed by a series of cases illustrating the utility of the system.

Web Site and Observed Data
To develop and test the system, we used data collected at www.xerox.com on May 18th, 1998. Although slightly dated, the data set has been explored for a variety of other purposes [8,9] and was chosen to enable cross-study comparisons and validation. The snapshot consists of roughly 15,000 pages and the associated Content, Usage, and Topology (CUT) data. Content and topology data were extracted from the actual Web site using the techniques outlined in [7]. Usage data were extracted from the Extended Common Log Format access logs using the Timeout-Cookie method [8] to identify individual paths of contiguous surfing of Web pages by individual users.

Simulated Data
For the simulated data we developed a new technique called Web User Flow by Information Scent (WUFIS) [1]. Conceptually, the simulation models an arbitrary number of agents traversing the links and content of a Web site. The agents have information goals that are represented by strings of content words such as “Xerox scanning products.” For each simulated agent, at each page visit, the model assesses the information scent associated with linked pages. The information scent is computed by comparing the agents’ information goals against the pages’ contents. This computation is a variation of the computational cognitive model for information scent developed in [6]. The information scent used by the simulation may be the distal scent of the actual linked content, or the proximal scent of the linked pages as represented by a text snippet or icon. For the cases examined in this paper, we used simulations based on the distal information scent, but, as we shall illustrate, this turns out to be a fruitful way of identifying problems with the way pages present proximal information scent.
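As a rough stand-in for that scent computation (the actual model in [6] is considerably richer), one can score a candidate link by the cosine similarity between bag-of-words vectors of the agent's goal and the linked page's text. This sketch is illustrative only; the function name is ours.

```python
import math
from collections import Counter

def scent_score(goal, page_text):
    """Cosine similarity between bag-of-words vectors of the goal
    and the (distal or proximal) page text; 0.0 means no overlap."""
    g = Counter(goal.lower().split())
    p = Counter(page_text.lower().split())
    dot = sum(g[w] * p[w] for w in g)
    norm = (math.sqrt(sum(v * v for v in g.values())) *
            math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0
```

A goal like “Xerox scanning products” would then score a scanner product page well above an investor-relations page, which is the behavior the simulation needs from any scent function.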

Usability Metrics
We are developing metrics to assess the quality of scent at a Web site in leading users to the information they are seeking, and the cost of finding such information. One of these metrics involves (a) the specification of a user information goal (e.g., “Xerox products”), (b) the specification of one or more hypothetical starting pages (e.g., the Xerox home page), and (c) one or more target pages (e.g., a Xerox product index). Using the WUFIS simulator, agents traverse the Web site, making navigation decisions based on the information scent associated with the links on each page. The navigation decisions are stochastic, such that more agents traverse higher-scent links, but some agents traverse lower-scent links [1]. The simulation assumes that agents either stop at the target page when it is found or, failing to find it, surf up to some fixed amount of effort. We then assess the proportion of simulated agents that find the target page.
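A highly simplified sketch of this metric, assuming the site is given as an adjacency list with precomputed per-link scent scores (the real WUFIS model [1] computes scent flow over the full CUT representation; all names here are illustrative):

```python
import random

def simulate_agents(links, scent, start, target,
                    n_agents=1000, max_steps=10, seed=42):
    """Stochastic agents pick outgoing links with probability
    proportional to scent; return the proportion reaching `target`
    within an effort budget of `max_steps` page visits."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_agents):
        page = start
        for _ in range(max_steps):
            if page == target:
                successes += 1
                break
            out = links.get(page, [])
            if not out:
                break  # dead end: this agent gives up
            weights = [scent.get((page, nxt), 0.0) for nxt in out]
            if sum(weights) == 0:
                page = rng.choice(out)          # no scent: pick uniformly
            else:
                page = rng.choices(out, weights=weights)[0]
    return successes / n_agents
```

Because choices are scent-proportional rather than greedy, most agents follow the strongest trail while a minority explores weaker links, matching the stochastic behavior described above.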

Network Representations of CUT
CUT graphs and various derivatives are readily extractable from most Web sites and the corresponding usage logs. In this representation, nodes in the graph correspond to Web pages, and weighted directed edges correspond to the strength of association between any pair of nodes. For the analyses in this paper, we extracted the following graphs:

• content similarity graph [7] represents the similarity between Web pages as determined by the textual content of the pages. The edge values provide an approximate measure of the topical relevance of one page to another.

• usage graph [7] represents the proportion of surfers that traverse the hyperlinks between pages. The edge values reflect how users “voted with their clicks” in finding relevant information.

• co-citation graph [10] reflects the frequency with which two nodes were linked to by the same page. The edge values provide an indication of the authoritative relevance of pages to one another.

[Figure 1: Data State Model for Web Scent Visualization. The figure shows the Value, Analytical Abstraction, Visualization Abstraction, and View data states and the Data, Visualization, and Visual Mapping Transformation operators, connecting the Web site and access log through usage and LRS path extraction, the simulator, and usage-based layout to the Disk Tree, Dome Tree, Web Trails, and detail window views.]

Spreading Activation Assessments of Scent
We use a spreading activation algorithm [7] on the various graphs to compute relevance or scent over a Web site. Conceptually, spreading activation pumps a metric called activation through one or more of the graphs. Activation flows from a set of source nodes through the edges in the graph. The amount of activation flow among nodes is modulated by the edge strengths. In this model, source nodes correspond to Web pages for which we want to identify related pages. After a few iterations (subject to the selection of the appropriate spreading activation parameters), the activation levels settle into a stable state. The final activation vector over nodes defines the degree of relevance of a set of pages to the source pages.
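The exact parameterization follows [7]; a minimal sketch of the iteration, assuming edge strengths are normalized by each node's total outflow and sources are re-pumped at every step (the function and parameter names are ours, not the system's):

```python
def spreading_activation(nodes, edges, sources, decay=0.5, iterations=50):
    """Pump activation from `sources` through a weighted directed graph.

    nodes: iterable of node ids; edges: dict mapping (u, v) -> strength.
    Returns the settled activation vector as a dict over nodes.
    """
    # total outgoing strength per node, used to normalize flow
    out_total = {}
    for (u, _v), w in edges.items():
        out_total[u] = out_total.get(u, 0.0) + w
    act = {n: (1.0 if n in sources else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 if n in sources else 0.0) for n in nodes}
        for (u, v), w in edges.items():
            # activation flowing along u -> v, attenuated by `decay`
            nxt[v] += decay * act[u] * w / out_total[u]
        act = nxt
    return act
```

With decay below 1 the iteration converges, and the resulting vector ranks every page by its relevance to the source pages, exactly the role the activation vector plays above.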

Surfer Patterns Identified by LRS
A longest repeating subsequence (LRS) is a sequence of items where (1) subsequence means a set of consecutive items, (2) repeated means the subsequence occurs more than some threshold T, where T typically equals one, and (3) longest means that, although a subsequence may be part of another repeated subsequence, there is at least one occurrence of this subsequence where it is the longest repeating subsequence.

To help illustrate, suppose we have the case where a site contains the pages A, B, C, and D, where A contains a hyperlink to B, and B contains hyperlinks to both C and D. As shown in Figure 2, if users repeatedly navigate from A to B, but only one user clicks through to C and only one user clicks through to D (as in Case 1), the LRS is AB. If, however, more than one user clicks through from B to D (as in Case 2), then both AB and ABD are LRS. In this event, AB is an LRS since on at least one other occasion, AB was not followed by D. In Case 3, both ABC and ABD are LRS since both occur more than once and are the longest subsequences. Note that AB is not an LRS since it is never the longest repeating subsequence, as is also true in Case 4 for the LRS ABD.
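The four cases above can be reproduced with a brute-force sketch of the LRS definition. The production technique in [9] is far more efficient; this quadratic version only serves to make the definition concrete: a contiguous subsequence is repeated if it occurs more than T times, and it is an LRS if at least one of its occurrences is not contained in some longer repeated subsequence's occurrence.

```python
from collections import Counter

def extract_lrs(sessions, threshold=1):
    """Extract longest repeated subsequences from surfing sessions
    (each session is a list of page ids)."""
    # count every contiguous subsequence across all sessions
    counts = Counter()
    for s in sessions:
        n = len(s)
        for i in range(n):
            for j in range(i + 1, n + 1):
                counts[tuple(s[i:j])] += 1
    repeated = {sub for sub, c in counts.items() if c > threshold}

    lrs = set()
    for s in sessions:
        n = len(s)
        for i in range(n):
            for j in range(i + 1, n + 1):
                sub = tuple(s[i:j])
                if sub not in repeated:
                    continue
                # this occurrence is "covered" if some longer repeated
                # subsequence occurrence strictly contains it
                covered = any(
                    tuple(s[a:b]) in repeated
                    for a in range(0, i + 1)
                    for b in range(j, n + 1)
                    if (a, b) != (i, j)
                )
                if not covered:
                    lrs.add(sub)
    return lrs
```

Running this on sessions corresponding to the four cases of Figure 2 yields AB for Case 1, AB and ABD for Case 2, ABC and ABD for Case 3, and ABD alone for Case 4.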

Dome Tree Visualization
Chi et al. [2] developed a visualization called the Disk Tree to map large Web sites. See the right side of Figure 3 for an example. At the center of the Disk Tree is the root node, and successive levels of the tree are mapped to new rings expanding from the center. The amount of space given to each sub-tree is proportional to the number of leaf nodes it contains.

One of the limitations of this approach is that overlaying user paths on top of the Disk Tree occludes the underlying structure of the Web site, removing important visual data from the analyst’s view. With our current focus on the flow of users through Web sites, we designed a new technique called the Dome Tree. In a Dome Tree, only 3/4 of the disk is used, and at each successive level, the disk is extruded along the Z dimension. The rationale behind using extrusion is to expand the structure to 3D so that we can embed user paths in 3D rather than on the surface of the Disk Tree. By using only 3/4 of the disk, we can peer into the Dome through the opening like a door, without being occluded by the object itself. While this provided a useful layout, we sought to further minimize the impact of path crossings inherent in visualizing Web trails.
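The paper does not give the layout equations, but the geometry just described (rings over 3/4 of a disk, extruded along Z per tree level) might be sketched as follows; the constants and function name are invented for illustration and are not the system's actual parameters.

```python
import math

def dome_tree_position(depth, angle_fraction, ring_step=1.0, z_step=0.5):
    """Map a node at tree `depth`, placed at `angle_fraction` (0..1)
    along the 3/4 arc, to (x, y, z) coordinates on the Dome Tree."""
    theta = angle_fraction * 1.5 * math.pi  # use only 3/4 of the full circle
    r = depth * ring_step                   # successive levels form wider rings
    return (r * math.cos(theta), r * math.sin(theta), depth * z_step)
```

The missing quarter of each ring leaves the "door" through which an analyst can look into the dome, and the per-level Z offset gives the 3D volume in which user paths are embedded.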

Figure 3: Dome Tree with Usage-Based Layout (left) shows that links (shown in yellow) are laid along significant paths (shown by an orange arrow), eliminating crossings. In comparison, the traditional Disk Tree approach (right) has many crossing yellow links (shown in the enclosed orange box). White arrows point to the current document being examined (investor.html).

Longest Repeating Subsequences (LRS): Case 1: AB; Case 2: AB, ABD; Case 3: ABC, ABD; Case 4: ABD.

Figure 2. Examples illustrating the formation of longest repeating subsequences (LRS). Thick-lined arrows indicate more than one traversal, whereas thin-lined arrows indicate only one traversal. For each case, the resulting LRS are listed.


Usage-Based Layout
To provide a visualization of Web paths with fewer path crossings, we developed new layout methods called Usage-Based Layout (UBL). UBL algorithms determine hierarchical relationships by various popularity metrics derived from users’ paths and usage data. These methods represent a departure from traditional graph layout methods that rely exclusively upon the traversal of structural relationships.

Applied to the Web, UBL can also identify user paths between two pages even though no explicit hyperlink exists between the two pages. We call this link induction. Link induction finds usage paths that arise from the use of history buttons, search result pages, other dynamic pages, and so forth, which cannot be obtained by crawling the site.

To determine the hierarchical relationships between documents, we conduct a priority-based traversal based on usage data. Starting from the root node, its children are determined by looking at the existing hyperlink structure as well as the inducted links. Instead of using a simple queue as in a breadth-first traversal algorithm, we use a priority queue, where the most-used page is chosen as the next node to expand. The expanded children are sorted in increasing usage order and then inserted into the queue along with their usage data. We then proceed to the next highest-used page, which is at the top of the queue.
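A sketch of this traversal, assuming the site is given as an adjacency list (hyperlinks plus inducted links) and a per-page usage count; the heap-based priority queue and the names here are our illustration, not the system's implementation:

```python
import heapq

def usage_based_layout(root, links, usage):
    """Build a tree (child -> parent) by priority-based traversal:
    the most-used frontier page is always expanded next, so heavily
    used pages claim their children early."""
    parent = {root: None}
    heap = [(-usage.get(root, 0), root)]  # negate counts for a max-heap
    while heap:
        _, page = heapq.heappop(heap)
        for child in sorted(links.get(page, []), key=lambda c: usage.get(c, 0)):
            if child not in parent:       # first claimer becomes the parent
                parent[child] = page
                heapq.heappush(heap, (-usage.get(child, 0), child))
    return parent
```

When a page is reachable from several parents, it ends up attached under whichever candidate parent is expanded first, i.e., the more heavily used branch, which is what lays significant paths out contiguously.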

Figure 3 displays LRS user trails using the Dome Tree with UBL as compared with the Disk Tree. The green structure is the map of the Web site, and the yellow and blue lines represent user trails. This example demonstrates that we are able to reduce trail crossings by using UBL.

By using a mouse-brushing technique, we highlight each node and show its URL and frequency of usage as the mouse moves over the documents on the Dome Tree (left of Figure 3). An orange ball highlights the current document of interest. The user is then allowed to pick a particular document to bring up additional details on it.

Web Trails
One of the details shown is the extracted Web Trails that are made by the users. All paths that lead into this document are called History Trails, which are shown in blue. All paths that spread out from this document are Future Trails, which are shown in yellow.

A dialog box also pops up, containing trail information related to this document (see Figure 4 for an example). The dialog shows the history and future portion of each path, along with its length and frequency. A scrollbar on the right enables the user to scan through this list. The bottom of the dialog box shows the documents that are on these paths, with their frequency of access, size, and URL.

Clicking on a path or a portion of a path narrows the list of documents to just the documents on that particular path. In this way, we enable analysts to drill down to specific paths of interest. Selecting a path also highlights it in red in the Dome Tree visualization.

Clicking on the Reverse Scent button in the dialog box dynamically computes, using IUNIS, a set of keywords that describes the information needs expressed by that path. The list is shown to the user in sorted order, with the most diagnostic words at the top.

We also compute and show an estimated download time for a user traversing this path using a modem. The estimate is derived from the total bytes of the files on the path. Analysts can therefore quickly judge the cost of traversing this path and make appropriate judgments about the path’s usability.
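The paper does not state the exact formula; a plausible sketch, assuming a 56 kbps modem and simply dividing the total bits on the path by the line rate (ignoring latency and protocol overhead):

```python
def estimated_download_seconds(page_sizes_bytes, modem_bps=56_000):
    """Estimate the time to fetch every file on a path over a modem:
    total bits on the path divided by the modem's bit rate."""
    total_bits = 8 * sum(page_sizes_bytes)
    return total_bits / modem_bps
```

Even this crude estimate lets an analyst compare paths: a path totaling 70 KB costs roughly ten seconds for a modem user, whereas a path twice that size costs twenty.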

Scent Visualization
The user can choose to show several kinds of scent related to the selected document, including spreading activation based on the Content Similarity, Co-citation, and Usage graphs, and WUFIS-computed scent flow. The system dynamically computes these scent assessments for each document and shows the result using red bar lines on the Dome Tree. The taller the red bar, the higher the scent.

By visually comparing the documents that lie on user trails with the computed scent, we can see whether users are finding the information that they need. This gives us direct visual evidence of the goodness of the design of the Web site. If the paths and the scents match, then users are navigating the Web site with success. If the paths and the scents mismatch, then it is possible that users are not finding the information because the Web site design gives inappropriate scents.

In practice, we have found Spreading Activation based on Content Similarity and Scent Flow computed by WUFIS to be very useful. Therefore, we have included this information as a column in the bottom portion of the dialog box. A mark of "C" means Content Spreading Activation predicted its relevance, and a mark of "S" means Scent Flow predicted its relevance.

Overview
This section described our system for the analysis and visualization of information scent, user surfing, and Web usability. The interactions between the different components enable analysts to mine both the actual and predicted usage data of a large Web site. Looking at the architecture depicted in Figure 1, one important data flow through the components is:

Log + Web site → LRS + Graph → Hierarchy → Dome Tree.

This path uses the Usage-Based Layout to compute a Dome Tree, which visualizes the whole site, with room to accommodate the Web Trails. Another data flow is:

Log → LRS Paths → Web Trails → Embed on Dome Tree,

which computes the appropriate trails that are to be embedded on the Dome Tree. In the next section, we will show the tool in action and present a number of case studies.


CASE STUDIES
Earlier in this paper, we presented questions that might be posed about surfers and Web sites. In this section, we illustrate the system through various analysis scenarios of the Xerox Web site³. Specifically, we will attempt to answer the following questions:

1. What pages act as multi-way branching points for user traversals? Do users branch on these pages? What pages behave as pass-through points?

2. For a page, what are the well-traveled paths? Do users find the desired information on these paths?

3. For well-traveled paths, what is the users’ information goal? How can this information goal be extracted?

4. What are the predicted useful information destinations, given a specific information need? Does actual usage conform to these predictions? Why, or why not?

Page Types
Some pages act as indices, serving as way points in navigation patterns. Other pages act as conduits in a set of serially organized pages. Given these and other page types, the question arises, "How are users actually surfing these pages?" One may posit the design principle that effective way points should be kept around for good navigational scent. Once identified, ineffective way points can be redesigned, integrated with other content, or removed.

Figure 4 reveals a multi-way branching point where a few history paths lead into the branching point and result in a few well-traveled future paths. Upon drill-down, we discover that the branching page leads to several important destination pages, including the shareholder information page, the 1998 Xerox Fact Book, and a financial document-ordering page. While the page is relatively under-utilized (~60 accesses/day), our analysis shows it to be a very effective local sitemap. Within a few clicks, users are able to access the desired content.

Figure 5 shows an example of a pass-through point where UBL has laid out the pages in path-priority order. In traversing this path, some users leave the serial organization of the pages to find a related page (yellow path going to the red Content Spreading Activation page, bottom right). Users then backtrack to continue surfing the serial links. From this inspection, we conclude that while it is a fairly well designed pass-through point, the page could potentially be improved to incorporate the related content directly. The tradeoff may be between coherence of the pages and navigational effort.

3 Since the system is built for displays exceeding the resolution of paper, we have placed a copy of the figures for inspection online at: http://www.parc.xerox.com/uir/pubs/chi2000-scent.

Well-Traveled Paths
Currently, most Web site visualizations focus on the identification of high-usage areas. Our system identifies well-traveled paths by using a combination of two methods. First, the LRS computation reduces the number and complexity of user paths into manageable chunks. Second, embedding the paths onto the Dome Tree facilitates the visual extraction of well-traveled paths. We do not consider these methods perfect; rather, they permit investigations that are otherwise difficult to attempt.
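As a rough illustration of the first method, the following sketch extracts repeated subpaths in the spirit of the longest repeated subsequence (LRS) computation. This is a simplified reading, not the published algorithm: here a subpath is kept if it occurs at least twice and at least one of its occurrences cannot be extended to a longer repeated subpath.

```python
from collections import Counter

def longest_repeated_subsequences(paths):
    """Simplified LRS extraction over a list of page-visit paths.

    Counts every contiguous subpath, then keeps those of length >= 2
    that repeat and that are maximal at some occurrence (neither the
    left nor the right extension at that occurrence also repeats).
    """
    counts = Counter()
    for path in paths:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                counts[tuple(path[i:j])] += 1

    def repeated(s):
        return counts.get(s, 0) >= 2

    result = set()
    for path in paths:
        n = len(path)
        for i in range(n):
            for j in range(i + 1, n + 1):
                s = tuple(path[i:j])
                if len(s) < 2 or not repeated(s):
                    continue
                extends_left = i > 0 and repeated(tuple(path[i - 1:j]))
                extends_right = j < n and repeated(tuple(path[i:j + 1]))
                if not extends_left and not extends_right:
                    result.add(s)  # maximal at this occurrence
    return sorted(result)

# Invented paths: "a b" repeats three times, "a b c" twice.
lrs = longest_repeated_subsequences([list("abcx"), list("abcy"), list("abz")])
print(lrs)
```

Note that both ('a', 'b') and ('a', 'b', 'c') survive: the occurrence of "a b" inside "a b z" is not covered by the longer repeat, which is the distinguishing property of LRS over naive maximal-substring mining.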

The left-hand image of Figure 6 illustrates the well-traveled paths related to a specific Web page (the TextBridge Pro 98 product page). As evidenced by the myriad of yellow future paths, related information is laid out across many different areas of the Web site, suggesting a possible redesign to bring more cohesion into the site. One interesting well-worn path is the serial pattern on the left (long arching yellow and blue path) that corresponds to the product tutorial pages. The right-hand image of Figure 6 shows the well-traveled paths extending from the Pagis product page. The zigzagging paths near the page indicate surfing between popular sibling pages. Many users travel the software demo tour, and this is made explicit by the large blue path radiating upwards.

Figure 4: Multi-way Branching Point (investor/sitemap.htm), shown enclosed by orange lines, and WebPath detail dialog box (orange box shows the inferred user information need keywords, which are reinvestment, stock, brochure, dividend, and shareholder).

Figure 5: Pass-through Point in a series of pages (marked by orange arrows; the current page, indicated by the white arrow, is annualreport/1997/market.htm).

In both of these examples, the red bar marks throughout the Dome Tree indicate the pages related to TextBridge and Pagis as computed by the Scent model. The correspondence of predicted related content to actual user paths suggests that the related content is not only reachable, but also well traveled by users. Visually, the yellow user paths that connect the red bars extending from the related pages reveal this correlation.

Identifying Information Need
Since well-traveled paths indicate items that compete well against other items for users' attention, it is important to find out, given a well-traveled path, what information need the users have expressed in that path. The bottom of Figure 4 shows the information need of a well-traveled path as computed by the reverse scent algorithm. The example is taken from a path related to investor/sitemap.htm. The top keywords computed by the reverse scent algorithm are reinvestment, stock, brochure, dividend, and shareholder. These keywords represent the goal of the users that traverse the path from the Shareinfo to the Orderdoc Web pages.

Figure 7 (corresponding to Figure 5) shows a more specific information need for the highly traversed path that starts at the employment recruitment page and winds through the 1997 Annual Report. In this case, some of the top keywords are reexamine, employment, socially, and morals, suggesting that potential Xerox employees are investigating the attitudes and culture of Xerox as expressed in the Annual Report. Another possible interpretation is that researchers are examining the correspondence between Xerox's employment policy and its social/moral position. In Figure 8, a large number of paths relate to how to upgrade previous versions of TextBridge 96. A representative path shows top keywords as TextBridge, upgrade, OCR, Pro, bundled, software, windows, and resellers.

These examples suggest that we are able to automatically identify the information goals of users by first discovering the well-traveled paths and then computing the informative keywords using the Scent Flow model. They also demonstrate that the Scent Model is not only good at predicting future user surfing behavior given a starting page, but also good at determining the information needs of a set of users given their paths through a site.
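The flavor of this keyword inference can be loosely illustrated with a TF-IDF-style sketch. This is only a hypothetical analogue of the reverse scent algorithm (which operates on the scent network, not on raw term counts); the page contents and function name below are made up:

```python
import math
from collections import Counter

def infer_path_keywords(path, page_words, all_pages, top_k=5):
    """Score words seen along a path, discounting words common site-wide.

    Words frequent on the path's pages but rare elsewhere on the site
    score highest, serving as a proxy for the information need the path
    expresses. A word appearing on every page gets an IDF of zero.
    """
    site_df = Counter()
    for page in all_pages:
        site_df.update(set(page_words[page]))
    scores = Counter()
    for page in path:
        for word, tf in Counter(page_words[page]).items():
            idf = math.log(len(all_pages) / site_df[word])
            scores[word] += tf * idf
    return [w for w, _ in scores.most_common(top_k)]

# Invented page contents: "xerox" appears everywhere, so it is discounted.
page_words = {
    "shareinfo": ["xerox", "stock", "dividend", "shareholder"],
    "orderdoc": ["xerox", "brochure", "dividend"],
    "products": ["xerox", "scanner"],
}
kw = infer_path_keywords(["shareinfo", "orderdoc"], page_words,
                         list(page_words), top_k=3)
print(kw)
```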

Predicted Destinations Based on Scent
One analysis centers on the differences between the WUFIS Scent Flow Model and actual user behaviors. We seek to answer the question, "Where in the Web site does the Scent Flow model differ from observed data, and why?"

For example, 100 hypothetical users interested in information related to the Pagis product were simulated to flow through the Web site from two different entry points. Figure 9 shows the result of these two simulations, where actual user paths are encoded with yellow lines and the frequency of visits by the simulated users is encoded by the height of red bars. In the left-hand image, we placed users at scansoft/pagis/index.html and watched the users surf to various points in the Web site, including pages relating to a tour of the software, release notes, and software registration pages. The correspondence of the yellow trails to the red pages reveals a match between the flow of real and simulated users. The right-hand image of Figure 9 displays the result of simulating users from products.html. It is immediately clear from the picture that many pages containing information relating to "Pagis" are found by the simulation, but real users are not finding these pages. Upon careful examination, we discovered that while the "Pagis" scent is contained near products.html, the scent is buried in layers of graphics and text. The example shows that products.html does not adequately provide access to information relating to "Pagis".

Figure 6: Well-traveled paths related to scansoft/tbpro98win/index.htm (left) and scansoft/pagis/index.htm (right), where major traffic routes are marked by orange lines.

Figure 7: from annualreport/1997/market.htm (Figure 5)

Figure 8: from xis/tbpro96win/index.htm
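This style of simulation can be sketched as a simple flow computation: at each step, the users at a page divide among its outgoing links in proportion to each link's scent (its predicted relevance to the information need). The sketch below is a minimal stand-in under made-up scent values and link structure, not the WUFIS implementation:

```python
def simulate_flow(links, scent, entry, n_users=100, steps=3):
    """Flow a fixed population of simulated users through a link graph.

    `links` maps each page to its outgoing links; `scent` gives each
    page's (invented) relevance to the information need. Users at a page
    split across its links proportionally to the links' scent. Returns
    accumulated visit counts per page.
    """
    visits = {p: 0.0 for p in links}
    at = {p: 0.0 for p in links}
    at[entry] = float(n_users)
    visits[entry] = float(n_users)
    for _ in range(steps):
        nxt = {p: 0.0 for p in links}
        for page, users in at.items():
            outs = links[page]
            total = sum(scent[o] for o in outs)
            if users == 0 or not outs or total == 0:
                continue  # dead end: users stop here
            for o in outs:
                flow = users * scent[o] / total
                nxt[o] += flow
                visits[o] += flow
        at = nxt
    return visits

# Invented entry page and scent values for a "Pagis"-related need.
links = {
    "pagis": ["tour", "release-notes", "register"],
    "tour": [], "release-notes": [], "register": [],
}
scent = {"pagis": 0, "tour": 6, "release-notes": 3, "register": 1}
v = simulate_flow(links, scent, "pagis", n_users=100, steps=1)
print(round(v["tour"]))
```

Comparing the per-page visit counts produced by such a simulation against the observed access counts is what surfaces mismatches of the kind described above, where scent exists but real users fail to follow it.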

Many limitations of the current system remain for future work. Although we have ameliorated some of the visual clutter problems associated with visualizing Web sites and user paths, there is clearly much room for improvement. Techniques such as animation might aid in showing and comparing Web Trails. Another way to improve the current visualization of Web Trails is to fade colors out as we move into the history or future portion of a path. To do this, we would first have to compute the aggregate path flow down each section over all paths.

CONCLUSION
Within the last few years, we have seen explosive growth in Web usability as a field. Given its infancy, it is not surprising that there are so few tools to assist Web analysts. We presented a Scent Flow model for predicting and analyzing Web site usability. The analysis and visualization system presented in this paper is aimed at improving the design of Web sites, and at improving our understanding of how users forage for information in the vast ecology of the Web.

Acknowledgement
This research was supported in part by Office of Naval Research grant No. N00014-96-C-0097 to Peter Pirolli and Stuart Card.

REFERENCES
1. Chi, E.H., Pirolli, P., and Pitkow, J. (1999). Using Information Scent to Model User Information Needs and Actions on the Web. (submitted).

2. Chi, E.H., Pitkow, J., Mackinlay, J., Pirolli, P., Gossweiler, R., and Card, S. (1998). Visualizing the Evolution of Web Ecologies. Proceedings of the Conference on Human Factors in Computing Systems, CHI '98 (pp. 400-407), Los Angeles, CA.

3. Chi, E.H. and Riedl, J.T. (1998). An operator interaction framework for visualization systems. Proceedings of the IEEE Information Visualization Symposium (pp. 63-70).

4. Furnas, G.W. (1997). Effective view navigation. Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (pp. 367-374), Atlanta, GA.

5. Pirolli, P. (1997). Computational models of information scent-following in a very large browsable text collection. Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (pp. 3-10), Atlanta, GA.

6. Pirolli, P. and Card, S.K. (in press). Information foraging. Psychological Review.

7. Pirolli, P., Pitkow, J., and Rao, R. (1996). Silk from a sow's ear: Extracting usable structures from the Web. Proceedings of the Conference on Human Factors in Computing Systems, CHI '96, Vancouver, Canada.

8. Pirolli, P. and Pitkow, J.E. (1999). Distributions of surfers' paths through the World Wide Web: Empirical characterization. World Wide Web, 1, 1-17.

9. Pitkow, J. and Pirolli, P. (1999, in press). Mining longest repeated subsequences to predict World Wide Web surfing. Proceedings of the USENIX Symposium on Internet Technologies and Systems.

10. Pitkow, J. and Pirolli, P. (1997). Life, death, and lawfulness on the electronic frontier. Proceedings of the Conference on Human Factors in Computing Systems, CHI '97 (pp. 383-390).

11. Spool, J.M., Scanlon, T., Snyder, C., and Schroeder, W. (1998). Measuring Website usability. Proceedings of the Conference on Human Factors in Computing Systems, CHI '98 (pp. 390), Los Angeles, CA.

Figure 9: Given an information need related to "Pagis", the Scent Flow simulation results in a good match at scansoft/pagis/index.html (left, good match points indicated by orange arrows), but a poor match from products.html (right, bad match points indicated by purple arrows).