
Information Retrieval on the Web

MEI KOBAYASHI and KOICHI TAKEDA

IBM Research

In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected numbers of users, hosts, and Web sites. Although numerical figures vary, overall trends cited by the sources are consistent and point to exponential growth in the past and in the coming decade. Hence it is not surprising that about 85% of Internet users surveyed claim to use search engines and search services to find specific information. The same surveys show, however, that users are not satisfied with the performance of the current generation of search engines; the slow retrieval speed, communication delays, and poor quality of retrieved results (e.g., noise and broken links) are commonly cited problems. We discuss the development of new techniques targeted to resolve some of the problems associated with Web-based information retrieval, and speculate on future trends.

Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra—Eigenvalues and eigenvectors (direct and iterative methods); Singular value decomposition; Sparse, structured and very large systems (direct and iterative methods); G.1.1 [Numerical Analysis]: Interpolation; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; Retrieval models; Search process; H.m [Information Systems]: Miscellaneous

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Clustering, indexing, information retrieval, Internet, knowledge management, search engine, World Wide Web

1. INTRODUCTION

We review some notable studies on the growth of the Internet and on technologies useful for information search and retrieval on the Web. Writing about the Web is a challenging task for several reasons, of which we mention three. First, its dynamic nature guarantees that at least some portions of any manuscript on the subject will be out-of-date before it reaches the intended audience, particularly URLs that are referenced. Second, a comprehensive coverage of all of the important topics is impossible, because so many new ideas are constantly being proposed and are either quickly accepted into the Internet mainstream or rejected. Finally, as with any review paper, there is a strong bias

Authors’ address: Tokyo Research Laboratory, IBM Research, 1623–14 Shimotsuruma, Yamato-shi, Kanagawa-ken, 242–8502, Japan; email: [email protected]; [email protected].

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 2001 ACM 0360-0300/00/0600–0144 $5.00

ACM Computing Surveys, Vol. 32, No. 2, June 2000


in presenting topics closely related to the authors’ background, and giving only cursory treatment to those of which they are relatively ignorant. In an attempt to compensate for oversights and biases, references to relevant works that describe or review concepts in depth will be given whenever possible. This being said, we begin with references to several excellent books that cover a variety of topics in information management and retrieval. They include Information Retrieval and Hypertext [Agosti and Smeaton 1996]; Modern Information Retrieval [Baeza-Yates and Ribeiro-Neto 1999]; Text Retrieval and Filtering: Analytic Models of Performance [Losee 1998]; Natural Language Information Retrieval [Strzalkowski 1999]; and Managing Gigabytes [Witten et al. 1994]. Some older, classic texts, which are slightly outdated, include Information Retrieval [Frakes and Baeza-Yates 1992]; Information Storage and Retrieval [Korfhage 1997]; Intelligent Multimedia Information Retrieval [Maybury 1997]; Introduction to Modern Information Retrieval [Salton and McGill 1983]; and Readings in Information Retrieval [Jones and Willett 1997].

Additional references are to special journal issues on search engines on the Internet [Scientific American 1997]; digital libraries [CACM 1998]; digital libraries, representation and retrieval [IEEE 1996b]; the next generation of graphical user interfaces (GUIs) [CACM 1994]; Internet technologies [CACM 1994; IEEE 1999]; and knowledge discovery [CACM 1999]. Some notable survey papers are those by Chakrabarti and Rajagopalan [1997]; Faloutsos and Oard [1995]; Feldman [1998]; Gudivada et al. [1997]; Leighton and Srivastava [1997]; Lawrence and Giles [1998b; 1999b]; and Raghavan [1997]. Extensive, up-to-date coverage of topics in Web-based information retrieval and knowledge management can be found in the proceedings of several conferences, such as the International World Wide Web Conferences [WWW Conferences 2000] and the Association for Computing Machinery’s Special Interest Group on Computer-Human Interaction [ACM SIGCHI] and Special Interest Group on Information Retrieval [ACM SIGIR] conferences <acm.org>. Lists of papers and Web pages that review and compare Web search tools are maintained at several sites, including Boutell’s World Wide Web FAQ <boutell.com/faq/>; Hamline University’s <web.hamline.edu/administration/libraries/search/comparisons.html>; Kuhn’s pages (in German) <gwdg.de/hkuhn1/pagesuch.html#vl>; Maire’s pages (in French) <imaginet.fr/ime/search.htm>; Princeton University’s <cs.princeton.edu/html/search.html>; U.C. Berkeley’s <sunsite.berkeley.edu/help/searchdetails.html>; and Yahoo!’s pages on search engines <yahoo.com/computers_and_internet/internet/world_wide_web>. The historical development of information retrieval is documented in a number of sources: Baeza-Yates and Ribeiro-Neto [1999]; Cleverdon [1970]; Faloutsos and Oard [1995]; Salton [1970]; and van Rijsbergen [1979]. Historical accounts of the Web and Web search technologies are given in Berners-Lee et al. [1994] and Schatz [1997].

This paper is organized as follows. In the remainder of this section, we discuss and point to references on ratings of search engines and their features, the growth of information available on the Internet, and the growth in users. In the second section we present tools for Web-based information retrieval. These include classical retrieval tools (which can be used as is or with enhancements specifically geared for Web-based applications), as well as a new generation of tools which have developed alongside the Internet. Challenges that must be overcome in developing and refining new and existing technologies for the Web environment are discussed. In the concluding section, we speculate on future directions in research related to Web-based information retrieval which may prove to be fruitful.

CONTENTS

1. Introduction
   1.1 Ratings of Search Engines and their Features
   1.2 Growth of the Internet and the Web
   1.3 Evaluation of Search Engines
2. Tools for Web-Based Retrieval and Ranking
   2.1 Indexing
   2.2 Clustering
   2.3 User Interfaces
   2.4 Ranking Algorithms for Web-Based Searches
3. Future Directions
   3.1 Intelligent and Adaptive Web Services
   3.2 Information Retrieval for Internet Shopping
   3.3 Multimedia Retrieval
   3.4 Conclusions

1.1 Ratings of Search Engines and their Features

About 85% of Web users surveyed claim to be using search engines or some kind of search tool to find specific information of interest. The list of publicly accessible search engines has grown enormously in the past few years (see, e.g., <blueangels.net>), and there are now lists of top-ranked query terms available online (see, e.g., <searchterms.com>). Since advertising revenue for search and portal sites is strongly linked to the volume of access by the public, increasing hits (i.e., demand for a site) is an extremely serious business issue. Undoubtedly, this financial incentive is serving as one of the major impetuses for the tremendous amount of research on Web-based information retrieval.

One of the keys to becoming a popular and successful search engine lies in the development of new algorithms specifically designed for fast and accurate retrieval of valuable information. Other features that make a search or portal site highly competitive are unusually attractive interfaces, free email addresses, and free access time [Chandrasekaran 1998]. Quite often, these advantages last at most a few weeks, since competitors keep track of new developments (see, e.g., <portalhub.com> or <traffik.com>, which give updates and comparisons on portals). And sometimes success can lead to unexpected consequences:

“Lycos, one of the biggest and most popular search engines, is legendary for its unavailability during work hours.” [Webster and Paul 1996]

There are many publicly available search engines, but users are not necessarily satisfied with the different formats for inputting queries, speeds of retrieval, presentation formats of the retrieval results, and quality of retrieved information [Lawrence and Giles 1998b]. In particular, speed (i.e., search engine search and retrieval time plus communication delays) has consistently been cited as “the most commonly experienced problem with the Web” in the biannual WWW surveys conducted at the Graphics, Visualization, and Usability Center of the Georgia Institute of Technology.¹ In the past three surveys, conducted over a period of a year and a half, 63% to 66% of Web users were dissatisfied with the speed of retrieval and communication delays, and the problem appears to be growing worse. Even though 48% of the respondents in the April 1998 survey had upgraded modems in the past year, 53% of the respondents left a Web site while searching for product information because of “slow access.” “Broken links” registered as the second most frequent problem in the same survey. Other studies also cite the number one and number two reasons for dissatisfaction as “slow access” and “the inability to find relevant information,” respectively [Huberman and Lukose 1997; Huberman et al. 1998]. In this paper we elaborate on some of the causes of these problems and outline some promising new approaches being developed to resolve them.

It is important to remember that problems related to speed and access time may not be resolved by considering Web-based information access and retrieval as an isolated scientific problem. An August 1998 survey by Alexa Internet <alexa.com/company/inthenews/webfacts.html> indicates that 90% of all Web traffic is spread over 100,000 different hosts, with 50% of all Web traffic headed towards the top 900 most popular sites. Effective means of managing uneven concentration of information packets on the Internet will be needed in addition to the development of fast access and retrieval algorithms.

¹ GVU’s user survey (available at <gvu.gatech.edu/user_surveys/>) is one of the more reliable sources on user data. Its reports have been endorsed by the World Wide Web Consortium (W3C) and INRIA.

The volume of information on search engines has exploded in the past year. Some valuable resources are cited below. The University of California at Berkeley has extensive Web pages on “how to choose the search tools you need” <lib.berkeley.edu/teachinglib/guides/internet/toolstables.html>. In addition to general advice on conducting searches on the Internet, the pages compare features of several popular search engines, such as size, case sensitivity, ability to search for phrases and proper names, use of Boolean logic terms, ability to require or exclude specified terms, inclusion of multilingual features, inclusion of special feature buttons (e.g., “more like this,” “top 10 most frequently visited sites on the subject,” and “refine”), and exclusion of pages updated prior to a user-specified date. The search engines compared include Alta Vista <altavista.com>; HotBot <hotbot.com>; Lycos Pro Power Search <lycos.com>; Excite <excite.com>; Yahoo! <yahoo.com>; Infoseek <infoseek.com>; Disinformation <disinfo.com>; and Northern Light <nlsearch.com>.

The work of Lidsky and Kwon [1997] is an opinionated but informative resource on search engines. It describes 36 different search engines and rates them on specific details of their search capabilities. For instance, in one study, searches are divided into five categories: (1) simple searches; (2) custom searches; (3) directory searches; (4) current news searches; and (5) Web content. The five categories of search are evaluated in terms of power and ease of use. Variations in ratings sometimes differ substantially for a given search engine. Similarly, query tests are conducted according to five criteria: (1) simple queries; (2) customized queries; (3) news queries; (4) duplicate elimination; and (5) dead link elimination. Once again, variations in the ratings sometimes differ substantially for a given search engine. In addition to ratings, the authors give charts on search indexes and directories associated with twelve of the search engines, and rate them in terms of specific features for complex searches and content. The data indicate that as the number of people using the Internet and Web has grown, user types have diversified and search engine providers have begun to target more specific types of users and queries with specialized and tailored search tools.

Web Search Engine Watch <searchenginewatch.com/webmasters/features.html> posts extensive data and ratings of popular search engines according to features such as size, pages crawled per day, freshness, and depth. Some other useful online sources are home pages on search engines by Gray <mit.edu/people/mkgray/net>; Information Today <infotoday.com/searcher/jun/story2.htm>; the Kansas City Public Library <kcpl.lib.mo.us/search/srchengines.htm>; Koch <ub2.lu.se/desire/radar/lit-about-search-services.html>; Northwestern University Library <library.nwu.edu/resources/internet/search/evaluate.html>; and the notes of Search Engine Showdown <imt.net/notes/search/index.html>. Data on international use of the Web and Internet is posted at the NUA Internet Survey home page <nua.ie/surveys>.

A note of caution: in digesting the data in the paragraphs above and below, bear in mind that published data on the Internet and the Web are very difficult to measure and verify. GVU offers a solid piece of advice on the matter:

“We suggest that those interested in these (i.e., Internet/WWW statistics and demographics) statistics should consult several sources; these numbers can be difficult to measure and results may vary between different sources.” [GVU’s WWW user survey]

Although details of data from different popular sources vary, overall trends are fairly consistently documented. We present some survey results from some of these sources below.

1.2 Growth of the Internet and the Web

Schatz [1997] of the National Center for Supercomputing Applications (NCSA) estimates that the number of Internet users increased from 1 million to 25 million in the five years leading up to January of 1997. Strategy Alley [1998] gives a number of statistics on Internet users: Matrix Information and Directory Services (MIDS), an Internet measurement organization, estimated there were 57 million users on the consumer Internet worldwide in April of 1998, and that the number would increase to 377 million by 2000; Morgan Stanley gives the estimate of 150 million in 2000; and Killen and Associates give the estimate as 250 million in 2000. Nua’s surveys <nua.ie/surveys> estimate the figure as 201 million worldwide in September of 1999, and more specifically by region: 1.72 million in Africa; 33.61 million in the Asia/Pacific region; 47.15 million in Europe; 0.88 million in the Middle East; 112.4 million in Canada and the U.S.; and 5.29 million in Latin America. Most data and projections support continued tremendous growth (mostly exponential) in Internet users, although precise numerical values differ.

Most data on the amount of information on the Internet (i.e., volume, number of publicly accessible Web pages and hosts) show tremendous growth, and the sizes and numbers appear to be growing at an exponential rate. Lynch has documented the explosive growth of Internet hosts; the number of hosts has been roughly doubling every year. For example, he estimates that it was 1.3 million in January of 1993, 2.2 million in January of 1994, 4.9 million in January of 1995, and 9.5 million in January of 1996. His last data point is 12.9 million in July of 1996 [Lynch 1997]. Strategy Alley [1998] cites similar figures: “Since 1982, the number of hosts has doubled every year.” And an article by the editors of IEEE Internet Computing magazine states that exponential growth of Internet hosts was observed in separate studies by several experts [IEEE 1998a], such as Mark Lottor of Network Wizards <nw.com>; Mirjam Kühne of the RIPE Network Control Center <ripe.net>, for a period of over ten years; Samarada Weerahandi of Bellcore, on his home page on Internet hosts <ripe.net>, for a period of over five years in Europe; and John Quarterman of Matrix Information and Directory Services <mids.org>.
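As a quick sanity check on the roughly-doubling claim, the implied annual growth factor can be computed from Lynch's first and last data points (1.3 million hosts in January 1993, 12.9 million in July 1996, about 3.5 years apart); the variable names below are ours, for illustration:

```python
# Implied annual growth factor of Internet hosts from Lynch's figures.
hosts_start = 1.3e6   # January 1993
hosts_end = 12.9e6    # July 1996
years = 3.5           # elapsed time between the two estimates

growth_factor = (hosts_end / hosts_start) ** (1 / years)
print(f"implied annual growth factor: {growth_factor:.2f}")  # ~1.93
```

A factor of about 1.93 per year is indeed consistent with the "doubling every year" characterization.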

The number of publicly accessible pages is also growing at an aggressive pace. Smith [1973] estimates that in January of 1997 there were 80 million public Web pages, and that the number would subsequently double annually. Bharat and Broder [1998] estimated that in November of 1997 the total number of Web pages was over 200 million. If both of these estimates for the number of Web pages are correct, then the rate of increase is higher than Smith’s prediction, i.e., it would be more than double per year. In a separate estimate [Monier 1998], the chief technical officer of AltaVista estimated that the volume of publicly accessible information on the Web had grown from 50 million pages on 100,000 sites in 1995 to 100 to 150 million pages on 600,000 sites in June of 1997. Lawrence and Giles summarize Web statistics published by others: 80 million pages in January of 1997 by the Internet Archive [Cunningham 1997], 75 million pages in September of 1997 by Forrester Research Inc. [Guglielmo 1997], Monier’s estimate (mentioned above), and 175 million pages in December 1997 by Wired Digital. They then conducted their own experiments to estimate the size of the Web and concluded that:

“it appears that existing estimates significantly underestimate the size of the Web.” [Lawrence and Giles 1998b]

Follow-up studies by Lawrence and Giles [1999a] estimate that the number of publicly indexable pages on the Web at that time was about 800 million pages (with a total of 6 terabytes of text data) on about 3 million servers (Lawrence’s homepage: <neci.nec.com/lawrence/papers.html>). On August 31, 1998, Alexa Internet announced its estimate of 3 terabytes, or 3 million megabytes, for the amount of information on the Web, with 20 million Web content areas; a content area is defined as the top-level pages of sites, individual home pages, and significant subsections of corporate Web sites. Furthermore, they estimate a doubling of volume every eight months.

Given the enormous volume of Web pages in existence, it comes as no surprise that Internet users are increasingly using search engines and search services to find specific information. According to Brin and Page, the World Wide Web Worm (homepages: <cs.colorado.edu/wwww> and <guano.cs.colorado.edu/wwww>) claims to have handled an average of 1,500 queries a day in April 1994, and AltaVista claims to have handled 20 million queries in November 1997. They believe that

“it is likely that top search engines will handle hundreds of millions (of queries) per day by the year 2000.” [Brin and Page 1998]

The results of GVU’s April 1998 WWW user survey indicate that about 86% of people now find a useful Web site through search engines, and 85% find them through hyperlinks in other Web pages; people now use search engines as much as surfing the Web to find information.

1.3 Evaluation of Search Engines

Several different measures have been proposed to quantitatively measure the performance of classical information retrieval systems (see, e.g., Losee [1998]; Manning and Schütze [1999]), most of which can be straightforwardly extended to evaluate Web search engines. However, Web users may have a tendency to favor some performance issues more strongly than traditional users of information retrieval systems. For example, interactive response times appear to be at the top of the list of important issues for Web users (see Section 1.1), as does the number of valuable sites listed in the first page of retrieved results (i.e., ranked in the top 8, 10, or 12), so that the scroll-down or next-page button does not have to be invoked to view the most valuable results.

Some traditional measures of information retrieval system performance are recognized in modified form by Web users. For example, a basic model from traditional retrieval systems recognizes a three-way trade-off between the speed of information retrieval, precision, and recall (which is illustrated in Figure 1). This trade-off becomes increasingly difficult to balance as the number of documents and users of a database escalates. In the context of information retrieval, precision is defined as the ratio of relevant retrieved documents to the total number of retrieved documents:

    precision = (number of relevant, retrieved documents) / (number of retrieved documents),

and recall is defined as the proportion of relevant documents that are retrieved:

    recall = (number of relevant, retrieved documents) / (total number of relevant documents).
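The two definitions translate directly into code. Below is a minimal sketch (the function names and sample data are ours, for illustration) that computes both measures given the set of relevant documents and the list of retrieved documents:

```python
def precision(relevant: set, retrieved: list) -> float:
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(relevant & set(retrieved)) / len(retrieved)

def recall(relevant: set, retrieved: list) -> float:
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)

# Hypothetical example: 4 relevant pages; 5 retrieved, 3 of them relevant.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d1", "d2", "d5", "d3", "d6"]
print(precision(relevant_docs, retrieved_docs))  # 3/5 = 0.6
print(recall(relevant_docs, retrieved_docs))     # 3/4 = 0.75
```

The trade-off described above shows up here directly: retrieving more documents can only raise recall, but tends to lower precision.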

Figure 1. Three-way trade-off in search engine performance: (1) speed of retrieval, (2) precision, and (3) recall.


Most Web users who utilize search engines are not so much interested in the traditional measure of precision as in the precision of the results displayed in the first page of the list of retrieved documents, before a “scroll” or “next page” command is used. Since there is little hope of actually measuring the recall rate for each Web search engine query and retrieval job—and in many cases there may be too many relevant pages—a Web user would tend to be more concerned about retrieving and being able to identify only very highly valuable pages. Kleinberg [1998] recognizes the importance of finding the most information-rich, or authority, pages. Hub pages, i.e., pages that have links to many authority pages, are also recognized as being very valuable. A Web user might substitute recall with a modified version in which the recall is computed with respect to the set of hub and authority pages retrieved in the top 10 or 20 ranked documents (rather than all related pages). Details of an algorithm for retrieving authorities and hubs by Kleinberg [1998] are given in Section 2.4 of this paper.
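The first-page notion of precision described above is commonly formalized as precision at k: precision computed over only the top k ranked results. A minimal sketch (the function name and sample data are ours, for illustration):

```python
def precision_at_k(relevant: set, ranked_results: list, k: int = 10) -> float:
    """Precision over only the top-k ranked results, mirroring
    what a user sees before scrolling or paging."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    return len(relevant & set(top_k)) / len(top_k)

# Hypothetical ranked result list; p1..p4 are the relevant pages.
ranked = ["p1", "p2", "p9", "p3", "p10", "p11", "p4", "p12", "p13", "p14",
          "p15", "p16", "p17", "p18", "p19", "p20"]
relevant = {"p1", "p2", "p3", "p4"}
print(precision_at_k(relevant, ranked, k=10))  # 4/10 = 0.4
print(precision_at_k(relevant, ranked, k=5))   # 3/5 = 0.6
```

The same restriction to the top k results could be applied to the modified recall measure sketched in the text, with the relevant set replaced by the set of hub and authority pages.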

Hearst [1999] notes that the user interface, i.e., the quality of human-computer interaction, should be taken into account when evaluating an information retrieval system. Nielsen [1993] advocates the use of qualitative (rather than quantitative) measures to evaluate information retrieval systems. In particular, user satisfaction with the system interface, as well as satisfaction with retrieved results as a whole (rather than statistical measures), is suggested. Westera [1996] suggests some query formats for benchmarking search engines, such as: single keyword search; plural search capability; phrase search; Boolean search (with proper noun); and complex Boolean. In the next section we discuss some of the differences and similarities in classical and Internet-based search, access, and retrieval of information.

Hawking et al. [1999] discuss Text REtrieval Conference (TREC) evaluation studies, run by the U.S. National Institute of Standards and Technology (NIST) <trec.nist.gov>, of six search engines. In particular, they examine answers to questions such as “Can link information result in better rankings?” and “Do longer queries result in better answers?”

2. TOOLS FOR WEB-BASED RETRIEVAL AND RANKING

Classical retrieval and ranking algorithms developed for isolated (and sometimes static) databases are not necessarily suitable for Internet applications. Two of the major differences between classical and Web-based retrieval and ranking problems, and challenges in developing solutions, are the number of simultaneous users of popular search engines and the number of documents that can be accessed and ranked. More specifically, the number of simultaneous users of a search engine at a given moment cannot be predicted beforehand and may overload a system. And the number of publicly accessible documents on the Internet exceeds the numbers associated with classical databases by several orders of magnitude. Furthermore, the number of Internet search engine providers, Web users, and Web pages is growing at a tremendous pace, with each average page occupying more memory space and containing different types of multimedia information such as images, graphics, audio, and video.

There are other properties besides the number of users and size that set classical and Web-based retrieval problems apart. If we consider the set of all Web pages as a gigantic database, this set is very different from a classical database with elements that can be organized, stored, and indexed in a manner that facilitates fast and accurate retrieval using a well-defined format for input queries. In Web-based retrieval, determining which pages are valuable enough to index, weight, or cluster, and carrying out these tasks efficiently while maintaining a reasonable degree of accuracy given the ephemeral nature of the Web, is an enormous challenge. Further complicating the problem is the set of appropriate input queries; the best format for inputting queries is not fixed or known. In this section we examine indexing, clustering, and ranking algorithms for documents available on the Web, and user interfaces for prototype IR systems for the Web.

2.1 Indexing

The American Heritage Dictionary (1976) defines index as follows:

(in·dex) 1. Anything that serves to guide, point out, or otherwise facilitate reference, as: a. An alphabetized listing of names, places, and subjects included in a printed work that gives for each item the page on which it may be found. b. A series of notches cut into the edge of a book for easy access to chapters or other divisions. c. Any table, file, or catalogue.

Although the term is used in the same spirit in the context of retrieval and ranking, it has a specific meaning. Some definitions proposed by experts are: “The most important of the tools for information retrieval is the index—a collection of terms with pointers to places where information about documents can be found” [Manber 1999]; “indexing is building a data structure that will allow quick searching of the text” [Baeza-Yates 1999]; “the act of assigning index terms to documents, which are the objects to be retrieved” [Korfhage 1997]; and “An index term is a (document) word whose semantics helps in remembering the document’s main themes” [Baeza-Yates and Ribeiro-Neto 1999]. Four approaches to indexing documents on the Web are (1) human or manual indexing; (2) automatic indexing; (3) intelligent or agent-based indexing; and (4) metadata, RDF, and annotation-based indexing. The first two appear in many classical texts, while the latter two are relatively new and promising areas of study. We first give an overview of Web-based indexing, then describe or give references to the various approaches.
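Manber's definition (a collection of terms with pointers to the places where documents containing them can be found) is essentially the classical inverted index. A minimal sketch, with data structures and sample documents of our own choosing:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Tiny hypothetical collection.
docs = {
    "d1": "information retrieval on the web",
    "d2": "web search engines",
    "d3": "classical information retrieval",
}
index = build_inverted_index(docs)
print(sorted(search(index, "information retrieval")))  # ['d1', 'd3']
print(sorted(search(index, "web")))                    # ['d1', 'd2']
```

Real Web-scale indexes add term weighting, positional information, and compression on top of this structure, but the term-to-postings mapping is the common core.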

Indexing Web pages to facilitate retrieval is a much more complex and challenging problem than the corresponding one associated with classical databases. The enormous number of existing Web pages, and their rapid increase and frequent updating, makes straightforward indexing, whether by human or computer-assisted means, a seemingly impossible, Sisyphean task. Indeed, most experts agree that, at a given moment, a significant portion of the Web is not recorded by the indexer of any search engine. Lawrence and Giles estimated that, in April 1997, the lower bound on indexable Web pages was 320 million, and that a given individual search engine will have indexed between 3% and 34% of the possible total [Lawrence and Giles 1998b]. They also estimated that the extent of overlap among the top six search engines is small and that their collective coverage was only around 60%; the six search engines are HotBot, AltaVista, Northern Light, Excite, Infoseek, and Lycos. A follow-up study for the period February 2–28, 1999, involving the top 11 search engines (the six above plus Snap <snap.com>; Microsoft <msn.com>; Google <google.com>; Yahoo!; and Euroseek <euroseek.com>) indicates that we are losing the indexing race: a far smaller proportion of the Web is now indexed, with no engine covering more than 16% of the Web. Indexing appears to have become more important than ever, since 83% of sites contained commercial content and 6% contained scientific or educational content [Lawrence and Giles 1999a].

Bharat and Broder estimated in November 1997 that the numbers of pages indexed by HotBot, AltaVista, Excite, and Infoseek were 77 million, 100 million, 32 million, and 17 million, respectively. Furthermore, they believe that the union of these pages is around 160 million pages, i.e., about 80% of the 200 million total accessible pages they believe
existed at that time. Their studies indicate that there is little overlap in the indexing coverage; more specifically, less than 1.4% (i.e., 2.2 million) of the 160 million indexed pages were covered by all four of the search engines. Melee's Indexing Coverage Analysis (MICA) Reports <melee.com/mica/index.html> provides a weekly update on indexing coverage and quality for a few select search engines that claim to index "at least one fifth of the Web." Other studies on estimating the extent of Web pages that have been indexed by popular search engines include Baldonado and Winograd [1997]; Hernandez [1996]; Hernandez and Stolfo [1995]; Hylton [1996]; Monge and Elkan [1998]; Selberg and Etzioni [1995a]; and Silberschatz et al. [1995].

In addition to the sheer volume of documents to be processed, indexers must take into account other complex issues: for example, Web pages are not constructed in a fixed format; the textual data is riddled with an unusually high percentage of typos; the contents usually contain nontextual multimedia data; and updates to the pages are made at different rates. For instance, preliminary studies documented in Navarro [1998] indicate that on the average site 1 in 200 common words and 1 in 3 foreign surnames are misspelled. Brake [1997] estimates that the average page of text remains unchanged on the Web for about 75 days, and Kahle estimates that 40% of the Web changes every month. Multiple copies of identical or near-identical pages are abundant; for example, FAQ postings,² mirror sites, and old and updated versions of news and newspaper sites. Broder et al. [1997] and Shivakumar and García-Molina [1998] estimate that 30% of Web pages are duplicates or near-duplicates.
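The scale of this duplication is one reason near-duplicate detection became an early research focus. A minimal sketch of one common approach, in the spirit of the shingling technique of Broder et al. [1997], compares sets of overlapping word k-grams ("shingles") via their Jaccard similarity; the function names and the threshold are illustrative, not taken from the cited work:

```python
def shingles(text, k=4):
    """Return the set of overlapping k-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_a, doc_b, k=4):
    """Jaccard similarity of the two documents' shingle sets."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not (a or b):
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)

# Pages whose resemblance exceeds a chosen threshold (e.g., 0.9)
# can be treated as near-duplicates and collapsed in the index.
```

In practice, full shingle sets are too large to store for the whole Web, so the published method samples a small fixed-size "sketch" of each set; the pairwise comparison above conveys only the underlying similarity measure.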

Tools for removing redundant URLs, or URLs of nearly and perfectly identical sites, have been investigated by Baldonado and Winograd [1997]; Hernandez [1996]; Hernandez and Stolfo [1995]; Hylton [1996]; Monge and Elkan [1998]; Selberg and Etzioni [1995a]; and Silberschatz et al. [1995].

Henzinger et al. [1999] suggested a method for evaluating the quality of pages in a search engine's index. In the past, the volume of pages indexed was used as the primary measurement of Web page indexers. Henzinger et al. suggest that the quality of the pages in a search engine's index should also be considered, especially since it has become clear that no search engine can index all documents on the Web, and there is very little overlap between the indexed pages of major search engines. The idea of Henzinger's method is to evaluate the quality of a Web page according to its indegree (an evaluation measure based on how many other pages point to the Web page under consideration [Carriere and Kazman 1997]) and its PageRank (an evaluation measure based on how many other pages point to the Web page under consideration, as well as the value of the pages pointing to it [Brin and Page 1998; Cho et al. 1998]).
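The two measures differ in whether inlinks are weighted: indegree simply counts them, while PageRank also propagates the value of the linking pages. A minimal power-iteration sketch, simplified from Brin and Page [1998] (the tiny link graph at the bottom is invented for illustration):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to;
    every page in the graph must appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" is pointed to by both "a" and "b", so it ranks highest
```

Under this scheme the ranks form a probability distribution (they sum to one), and a page linked from highly ranked pages scores above one with the same indegree from obscure pages.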

The development of effective indexing tools to aid in filtering is another major class of problems associated with Web-based search and retrieval. Removal of spurious information is a particularly challenging problem, since a popular information site (e.g., newsgroup discussions, FAQ postings) will have little value to users with no interest in the topic. Filtering to block pornographic materials from children or to censor culturally offensive materials is another important area for research and business development. One of the promising new approaches is the use of metadata, i.e., summaries of Web page or site content placed in the page to aid automatic indexers.

² FAQs, or frequently asked questions, are essays on topics on a wide range of interests, with pointers and references. For an extensive list of FAQs, see <cis.ohio-state.edu/hypertext/faq/usenet/faq-list.html> and <faq.org>.

2.1.1 Classical Methods. Manual indexing is currently used by several commercial, Web-based search engines, e.g., Galaxy <galaxy.einet.net>; GNN: Whole Internet Catalog <e1c.gnn.com/gnn/wic/index.html>; Infomine <lib-www.ucr.edu>; KidsClick! <sunsite.berkeley.edu/kidsclick!/>; LookSmart <looksmart.com>; Subject Tree <bubl.bath.ac.uk/bubl/cattree.html>; Web Developer's Virtual Library <stars.com>; World-Wide Web Virtual Library Series Subject Catalog <w3.org/hypertext/datasources/bysubject/overview.html>; and Yahoo!. The practice is unlikely to remain as successful over the next few years: as the volume of information available over the Internet increases at an ever greater pace, manual indexing is likely to become obsolete over the long term. Another major drawback of manual indexing is the lack of consistency among different professional indexers; as few as 20% of the terms to be indexed may be handled in the same manner by different individuals [Korfhage 1997, p. 107], and there is noticeable inconsistency even by a given individual [Borko 1979; Cooper 1969; Jacoby and Slamecka 1962; Macskassys et al. 1998; Preschel 1972; Salton 1969].

Though not perfect, compared to most automatic indexers, human indexing is currently the most accurate, because experts on popular subjects organize and compile the directories and indexes in a way which (they believe) facilitates the search process. Notable references on conventional indexing methods, including automatic indexers, are Part IV of Soergel [1985]; Jones and Willett [1977]; van Rijsbergen [1977]; and Witten et al. [1994, Chap. 3]. Technological advances are expected to narrow the gap in indexing quality between human- and machine-generated indexes. In the future, human indexing will only be applied to relatively small and static (or near-static) or highly specialized databases, e.g., internal corporate Web pages.

2.1.2 Crawlers/Robots. Scientists have recently been investigating the use of intelligent agents for performing specific tasks, such as indexing on the Web [AI Magazine 1997; Baeza-Yates and Ribeiro-Neto 1999]. There is some ambiguity concerning proper terminology to describe these agents. They are most commonly referred to as crawlers, but are also known as ants, automatic indexers, bots, spiders, Web robots (Web robot FAQ <info.webcrawler.com/mak/projects/robots/faq.html>), and worms. It appears that some of the terms were proposed by the inventors of a specific tool, and their subsequent use spread to more general applications of the same genre.

Many search engines rely on automatically generated indices, either by themselves or in combination with other technologies, e.g., Aliweb <nexor.co.uk/public/aliweb/aliweb.html>; AltaVista; Excite; Harvest <harvest.transarc.com>; HotBot; Infoseek; Lycos; Magellan <magellan.com>; MerzScope <merzcom.com>; Northern Light; Smart Spider <engsoftware.com>; Webcrawler <webcrawler.com>; and World Wide Web Worm. Although most of Yahoo!'s entries are indexed by humans or acquired through submissions, it uses a robot to a limited extent to look for new announcements. Examples of highly specialized crawlers include Argos <argos.evansville.edu> for Web sites on the ancient and medieval worlds; CACTVS Chemistry Spider <schiele.organik.uni-erlangen.de/cactvs/spider.html> for chemical databases; MathSearch <maths.usyd.edu.au:8000/mathsearch.html> for English mathematics and statistics documents; NEC-MeshExplorer <netplaza.biglobe.or.jp/keyword.html> for the NETPLAZA search service owned by the NEC Corporation; and Social Science Information Gateway (SOSIG) <scout.cs.wisc.edu/scout/mirrors/sosig> for resources in the social sciences. Crawlers that index documents in limited environments include LookSmart <looksmart.com> for a 300,000-site database of rated and reviewed sites; Robbie
the Robot, funded by DARPA for education and training purposes; and UCSD Crawl <www.mib.org/ucsdcrawl> for UCSD pages. More extensive lists of intelligent agents are available on The Web Robots Page <info.webcrawler.com/mak/projects/robots/active/html/type.html> and on Washington State University's robot pages <wsulibs.wsu.edu/general/robots.htm>.

To date, there are three major problems associated with the use of robots: (1) some people fear that these agents are too invasive; (2) robots can overload system servers and cause systems to be virtually frozen; and (3) some sites are updated at least several times per day, e.g., approximately every 20 minutes by CNN <cnn.com> and Bloomberg <bloomberg.com>, and every few hours by many newspaper sites [Carl 1995] (article home page <info.webcrawler.com/mak/projects/robots/threat-or-treat.html>); [Koster 1995]. Some Web sites deliberately keep out spiders; for example, the New York Times <nytimes.com> requires users to pay and fill out a registration form; CNN used to exclude search spiders to prevent distortion of data on the number of users who visit the site; and the online catalogue of the British Library <portico.bl.uk> only allows access to users who have filled out an online query form [Brake 1997]. System managers of these sites must keep up with new spider and robot technologies in order to develop their own tools to protect their sites from new types of agents that intentionally or unintentionally could cause mayhem.

As a working compromise, Koster has proposed a robots exclusion standard ("A standard for robots exclusion," ver. 1: <info.webcrawler.com/mak/projects/robots/exclusion.html>; ver. 2: <info.webcrawler.com/mak/projects/robots/norobot.html>), which advocates blocking certain types of searches to relieve overload problems. He has also proposed guidelines for robot design ("Guidelines for robot writers" (1993) <info.webcrawler.com/mak/projects/robots/guidelines.html>). It is important to
note that robots are not always the root cause of network overload; sometimes human user overload causes problems, which is what happened at the CNN site just after the announcement of the O.J. Simpson trial verdict [Carl 1995]. Use of the exclusion standard is strictly voluntary, so Web masters have no guarantee that robots will not be able to enter computer systems and create havoc. Arguments in support of the exclusion standard and discussion of its effectiveness are given in Carl [1995] and Koster [1996].
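The exclusion standard is simple enough that support for it is now built into standard libraries; a short sketch of how a polite crawler might consult a site's rules using Python's `urllib.robotparser` (the rules, crawler name, and URLs below are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# Rules a site might serve at /robots.txt (illustrative only)
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))         # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
```

A well-behaved robot checks `can_fetch` before every request; as the text notes, however, nothing in the protocol itself forces a crawler to do so.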

2.1.3 Metadata, RDF, and Annotations.

"What is metadata? The Macquarie dictionary defines the prefix 'meta-' as meaning 'among,' 'together with,' 'after' or 'behind.' That suggests the idea of a 'fellow traveller': that metadata is not fully fledged data, but it is a kind of fellow-traveller with data, supporting it from the sidelines. My definition is that 'an element of metadata describes an information resource or helps provide access to an information resource.'" [Cathro 1997]

In the context of Web pages on the Internet, the term "metadata" usually refers to an invisible file attached to a Web page that facilitates the collection of information by automatic indexers; the file is invisible in the sense that it has no effect on the visual appearance of the page when viewed using a standard Web browser.

The World Wide Web (W3) Consortium <w3.org> has compiled a list of resources on information and standardization proposals for metadata (W3 metadata page <w3.org/metadata>). A number of metadata standards have been proposed for Web pages. Among them, two well-publicized, solid efforts are the Dublin Core Metadata standard (home page <purl.oclc.org/metadata/dublin_core>) and the Warwick framework (article home page <dlib.org/dlib/july96/lagoze/07lagoze.html>) [Lagoze 1996]. The Dublin Core is a 15-element metadata set proposed to facilitate fast and accurate information retrieval on the Internet. The elements are title; creator; subject; description;
publisher; contributors; date; resource type; format; resource identifier; source; language; relation; coverage; and rights. The group has also developed methods for incorporating the metadata into a Web page file. Other resources on metadata include Chapter 6 of Baeza-Yates and Ribeiro-Neto [1999] and Marchionini [1999]. If the general public adopts and increases use of a simple metadata standard (such as the Dublin Core), the precision of information retrieved by search engines is expected to improve substantially. However, widespread adoption of a standard by international users is doubtful.
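In HTML, Dublin Core elements are conventionally embedded as `<meta>` tags with names such as `DC.title`, invisible to the browser's rendering. A sketch of how an automatic indexer might harvest them with Python's standard HTML parser; the sample page and class name are invented for illustration:

```python
from html.parser import HTMLParser

class DublinCoreHarvester(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> elements from a page."""
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = a.get("name", "")
            if name.startswith("DC."):
                self.metadata[name[3:]] = a.get("content", "")

page = """<html><head>
<meta name="DC.title" content="Information Retrieval on the Web">
<meta name="DC.creator" content="M. Kobayashi">
<meta name="DC.language" content="en">
</head><body>Visible page text goes here.</body></html>"""

harvester = DublinCoreHarvester()
harvester.feed(page)
print(harvester.metadata["title"])  # Information Retrieval on the Web
```

An indexer that trusts such fields can index a page without analyzing its body at all, which is precisely why the unfair-player problem discussed below matters.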

One of the major drawbacks of the simplest type of metadata for labeling HTML documents, called metatags, is that they can only be used to describe the contents of the document to which they are attached, so managing collections of documents (e.g., directories or sets of documents on similar topics) may be tedious when updates to the entire collection are made. Since a single command cannot be used to update the entire collection at once, documents must be updated one-by-one. Another problem arises when documents from two or more different collections are merged to form a new collection: inconsistent use of metatags may lead to confusion, since a metatag might be used in different collections with entirely different meanings. To resolve these issues, the W3 Consortium proposed in May 1999 that the Resource Description Framework (RDF) be used as the metadata coding scheme for Web documents (W3 Consortium RDF home page <w3.org/rdf>). An interesting associated development is IBM's XCentral <ibm.com/developer/xml>, the first search engine that indexes XML and RDF elements.

Metadata places the responsibility of aiding indexers on the Web page author, which is reasonable if the author is a responsible person wishing to advertise the presence of a page to increase legitimate traffic to a site. Unfortunately, not all Web page authors are fair players. Many unfair players maintain sites that increase advertising revenue when the number of visitors is very high, or that charge a fee per visit for access to pornographic, violent, and culturally offensive materials. These sites can attract a large volume of visitors by attaching metadata with many popular keywords. Development of reliable filtering services for parents concerned about their children's surfing venues is a serious and challenging problem.

Spamming, i.e., excessive, repeated use of key words or "hidden" text purposely inserted into a Web page to promote retrieval by search engines, is related to, but separate from, the unethical or deceptive use of metadata. Spamming is a new phenomenon that appeared with the introduction of search engines, automatic indexers, and filters on the Web [Flynn 1996; Liberatore 1997]. Its primary intent is to outsmart these automated software systems for a variety of purposes; spamming has been used as an advertising tool by entrepreneurs, cult recruiters, egocentric Web page authors wanting attention, and technically well-versed, but unbalanced, individuals who have the same sort of warped mentality as inventors of computer viruses. A famous example of hidden text spamming is the embedding of words in a black background by the Heaven's Gate cult, a technique known as font color spamming [Liberatore 1997]. Although the cult no longer exists, its home page is archived at the sunspot.net site <sunspot.net/news/special/heavensgatesite>. We note that the term spamming has a broader meaning, related to receiving an excessive amount of email or information. An excellent, broad overview of the subject is given in Cranor and LaMacchia [1998]. In our context, the specialized terms spam-indexing, spam-dexing, or keyword spamming are more precise.

Another tool related to metadata is annotation. Unlike metadata, which is created and attached to Web documents by the author for the specific purpose of
aiding indexing, annotations include a much broader class of data to be attached to a Web document [Nagao and Hasida 1998; Nagao et al. 1999]. Three examples of the most common annotations are linguistic annotation, commentary (created by persons other than the author), and multimedia annotation. Linguistic annotation is being used for automatic summarization and content-based retrieval. Commentary annotation is used to annotate nontextual multimedia data, such as image and sound data, plus some supplementary information. Multimedia annotation generally refers to text data that describes the contents of video data (which may be downloadable from the Web page). An interesting example of annotation is the attachment of comments on Web documents by people other than the document author. In addition to aiding indexing and retrieval, this kind of annotation may be helpful for evaluating documents.

Despite the promise that metadata and annotation could facilitate fast and accurate search and retrieval, a recent study for the period February 2–28, 1999 indicates that metatags are only used on 34% of homepages, and only 0.3% of sites use the Dublin Core metadata standard [Lawrence and Giles 1999a]. Unless a new trend towards the use of metadata and annotations develops, their usefulness in information retrieval may be limited to very large, closed databases owned by large corporations, public institutions, and governments that choose to use them.

2.2 Clustering

Grouping similar documents together to expedite information retrieval is known as clustering [Anick and Vaithyanathan 1997; Rasmussen 1992; Sneath and Sokal 1973; Willett 1988]. During the information retrieval and ranking process, two classes of similarity measures must be considered: the similarity of a document and a query, and the similarity of two documents in a database. The similarity of two documents is important for identifying groups of documents in a database that can be retrieved and processed together for a given type of user input query.
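Both classes of similarity are typically computed the same way in the vector space model (Section 2.4): as the cosine of the angle between term vectors. A minimal sketch, using raw term frequencies and a toy three-document collection invented for illustration:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

docs = ["web search engines index pages",
        "search engines rank retrieved pages",
        "clustering groups similar documents"]
vectors = [Counter(d.split()) for d in docs]

query = Counter("search engines".split())
scores = [cosine(query, v) for v in vectors]
# The first two documents mention both query terms and score highest;
# the third shares no terms with the query and scores zero.
```

The same `cosine` function serves document-to-document comparison, which is the basis for the clustering criteria discussed next; production systems replace raw term frequencies with the term weights described in Section 2.4.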

Several important points should be considered in the development and implementation of algorithms for clustering documents in very large databases. These include identifying relevant attributes of documents and determining appropriate weights for each attribute; selecting an appropriate clustering method and similarity measure; estimating limitations on computational and memory resources; evaluating the reliability and speed of the retrieved results; facilitating changes or updates in the database, taking into account the rate and extent of the changes; and selecting an appropriate search algorithm for retrieval and ranking. This final point is of particularly great concern for Web-based searches.

There are two main categories of clustering: hierarchical and nonhierarchical. Hierarchical methods show greater promise for enhancing Internet search and retrieval systems. Although details of the clustering algorithms used by major search engines are not publicly available, some general approaches are known. For instance, Digital Equipment Corporation's Web search engine AltaVista is based on clustering. Anick and Vaithyanathan [1997] explore how to combine results from latent semantic indexing (see Section 2.4) and analysis of phrases for context-based information retrieval on the Web.

Zamir et al. [1997] developed three clustering methods for Web documents. In the word-intersection clustering method, words that are shared by documents are used to produce clusters. The method runs in O(n²) time and produces good results for Web documents. A second method, phrase-intersection clustering, runs in O(n log n) time and is at least two orders of magnitude faster than methods that produce comparable clusters. A third method, called suffix tree clustering, is detailed in Zamir and Etzioni [1998].
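The idea behind word-intersection clustering can be illustrated with a deliberately naive O(n²) sketch: link every pair of documents whose shared-word overlap exceeds a threshold, then take the connected components as clusters. This is not Zamir et al.'s actual algorithm, and the threshold and helper names are invented:

```python
def word_overlap_clusters(docs, threshold=0.3):
    """Naive word-intersection clustering: link documents whose Jaccard
    word overlap exceeds threshold; clusters are connected components."""
    words = [set(d.lower().split()) for d in docs]
    n = len(docs)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):                     # the O(n^2) pairwise pass
        for j in range(i + 1, n):
            overlap = len(words[i] & words[j]) / len(words[i] | words[j])
            if overlap > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

The pairwise pass is exactly what makes the quadratic method slow on Web-scale collections, which motivates the O(n log n) phrase-intersection and suffix-tree alternatives cited above.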

Modha and Spangler [2000] developed a clustering method for hypertext documents which uses words contained in the document, out-links from the document, and in-links to the document. Clustering is based on six information nuggets, which the authors dubbed summary, breakthrough, review, keywords, citation, and reference. The first two are derived from the words in the document, the next two from the out-links, and the last two from the in-links.

Several new approaches to clustering documents in data mining applications have recently been developed. Since these methods were specifically designed for processing very large data sets, they may be applicable, with some modifications, to Web-based information retrieval systems. Examples of some of these techniques are given in Agrawal et al. [1998]; Dhillon and Modha [1999; 2000]; Ester et al. [1995a; 1995b; 1995c]; Fisher [1995]; Guha et al. [1998]; Ng and Han [1994]; and Zhang et al. [1996]. For very large databases, appropriate parallel algorithms can speed up computation [Omiecinski and Scheuermann 1990].

Finally, we note that clustering is just one of several ways of organizing documents to facilitate retrieval from large databases. Some alternative methods are discussed in Frakes and Baeza-Yates [1992]. Specific examples of methods designed specifically to facilitate Web-based information retrieval are evaluation of the significance, reliability, and topics covered in a set of Web pages based on analysis of the hyperlink structures connecting the pages (see Section 2.4), and identification of cyber communities with expertise in given subjects based on user access frequency and surfing patterns.

2.3 User Interfaces

Currently, most Web search engines are text-based. They display results from input queries as long lists of pointers, sometimes with and sometimes without summaries of retrieved pages. Future commercial systems are likely to take advantage of small, powerful computers, and will probably have a variety of mechanisms for querying nontextual data (e.g., hand-drawn sketches, textures and colors, and speech) and better user interfaces to enable users to visually manipulate retrieved information [Card et al. 1999; Hearst 1997; Maybury and Wahlster 1998; Rao et al. 1993; Tufte 1983]. Hearst [1999] surveys visualization interfaces for information retrieval systems, with particular emphasis on Web-based systems. A sampling of some exploratory works being conducted in this area is described below. These interfaces and their display systems, which are known under several different names (e.g., dynamic querying, information outlining, visual information seeking), are being developed at universities, government and private research labs, and small venture companies worldwide.

2.3.1 Metasearch Navigators. A very simple tool developed to exploit the best features of many search engines is the metasearch navigator. These navigators allow simultaneous search of a set of other navigators. Two of the most extensive are Search.com <search.com>, which can utilize the power of over 250 search engines, and INFOMINE <lib-www.ucr.edu/enbinfo.html>, which utilizes over 90. Advanced metasearch navigators have a single input interface that sends queries to all (or only user-selected) search engines, eliminates duplicates, and then combines and ranks the results returned from the different search engines. Some fairly simple examples available on the Web are 2ask <web.gazeta.pl/miki/search/2ask-anim.html>; ALL-IN-ONE <albany.net/allinone>; EZ-Find at The River <theriver.com/theRiver/explore/ezfind.html>; IBM InfoMarket Service <infomkt.ibm.com>; Inference Find <inference.com/infind>; Internet Sleuth <intbc.com/sleuth>; MetaCrawler <metacrawler.cs.washington.edu:8080>; and SavvySearch <cs.colostate.edu/dreiling/smartform.html> and <guaraldi.cs.colostate.edu:2000> [Howe and Dreilinger 1997].
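The combine-and-rank step can be as simple as deduplicating URLs and summing inverted ranks across engines. A toy sketch of this idea; the fusion rule, function name, and engine result lists are invented, not taken from any of the systems cited above:

```python
def merge_results(result_lists):
    """Fuse ranked URL lists from several engines: deduplicate URLs and
    score each by summing 1/rank over the engines that returned it."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "d.com", "a.com"]
merged = merge_results([engine_a, engine_b])
# b.com (ranks 2 and 1) and a.com (ranks 1 and 3) lead the fused list
```

Real metasearch systems must also normalize URL variants before deduplicating and cope with engines that report incomparable relevance scores, which is why rank-based fusion rules like this one are attractive.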

2.3.2 Web-Based Information Outlining/Visualization. Visualization tools specifically designed to help users understand websites (e.g., their directory structures, types of information available) are being developed by many private and public research centers [Nielsen 1997]. Overviews of some of these tools are given in Ahlberg and Shneiderman [1994]; Beaudoin et al. [1996]; Bederson and Hollan [1994]; Gloor and Dynes [1998]; Lamping et al. [1995]; Liechti et al. [1998]; Maarek et al. [1997]; Munzner and Burchard [1995]; Robertson et al. [1991]; and Tetranet Software Inc. [1998] <tetranetsoftware.com>. Below we first present some examples of interfaces designed to facilitate general information retrieval systems; we then present some that were specifically designed to aid retrieval on the Web.

Shneiderman [1994] introduced the term dynamic queries to describe interactive user control of visual query parameters that generate a rapid, updated, animated visual display of database search results. Some applications of the dynamic query concept are systems that allow real estate brokers and their clients to locate homes based on price, number of bedrooms, distance from work, etc. [Williamson and Shneiderman 1992]; locate geographical regions with cancer rates above the national average [Plaisant 1994]; allow dynamic querying of a chemistry table [Ahlberg and Shneiderman 1997]; and enable users to explore UNIX directories through dynamic queries [Liao et al. 1992]. Visual presentation of query components; visual presentation of results; rapid, incremental, and reversible actions; selection by pointing (not typing); and immediate and continuous feedback are features of these systems. Most graphics hardware systems in the mid-1990s were still too weak to provide adequate real-time interaction, but faster algorithms and advances in hardware should increase system speed in the future.

Williams [1984] developed a user interface for information retrieval systems to "aid users in formulating a query." The system, RABBIT III, supports interactive refinement of queries by allowing users to critique retrieved results with labels such as "require" and "prohibit." Williams claims that this system is particularly helpful to naïve users "with only a vague idea of what they want and therefore need to be guided in the formulation/reformulation of their queries . . . (or) who have limited knowledge of a given database or who must deal with a multitude of databases."

Hearst [1995] and Hearst and Pederson [1996] developed a visualization system for displaying information about a document and its contents, e.g., its length, the frequency of term sets, and the distribution of term sets within the document and relative to each other. The system, called TileBars, displays information about a document in the form of a two-dimensional rectangular bar with even-sized tiles lying next to each other in an orderly fashion. Each tile represents some feature of the document; the information is encoded as a number whose magnitude is represented in grayscale.

Cutting et al. [1993] developed a system called Scatter/Gather to allow users to cluster documents interactively, browse the results, select a subset of the clusters, and cluster this subset of documents. This process allows users to iteratively refine their search. BEAD [Chalmers and Chitson 1992]; Galaxy of News [Rennison 1994]; and ThemeScapes [Wise et al. 1995] are some of the other systems that show graphical displays of clustering results.

Baldonado [1997] and Baldonado and Winograd [1997] developed an interface for exploring information on the Web across heterogeneous sources, e.g., search services such as AltaVista, bibliographic search services such as Dialog, a map search service, and a video search service. The system, called SenseMaker, can "bundle" (i.e., cluster) similar types of retrieved data according to user-specified "bundling criteria" (the
criteria must be selected from a fixed menu provided by SenseMaker). Examples of available bundling criteria for a URL type include "(1) bundling results whose URLs refer to the same site; (2) bundling results whose URLs refer to the same collection at a site; and (3) not bundling at all." The system allows users to select from several criteria to view retrieved results, e.g., according to the URL, and also allows users to select from several criteria on how duplicates in retrieved information will be eliminated. Efficient detection and elimination of duplicate database records and duplicate retrievals by search engines, which are very similar but not necessarily identical, have been investigated extensively by many scientists, e.g., Hernandez [1996]; Hernandez and Stolfo [1995]; Hylton [1996]; Monge and Elkan [1998]; and Silberschatz et al. [1995].

Card et al. [1996] developed two 3D virtual interface tools, WebBook and WebForager, for browsing and recording Web pages. Kobayashi et al. [1999] developed a system to compare how the relevance ranking of documents differs when queries are changed. The parallel ranking system can be used in a variety of applications, e.g., query refinement and understanding the contents of a database from different perspectives (each query represents a different user perspective). Manber et al. [1997] developed WebGlimpse, a tool for simultaneous searching and browsing of Web pages, which is based on the Glimpse search engine.

Morohashi et al. [1995] and Takeda and Nomiyama [1997] developed a system that uses new technologies to organize and display, in an easily discernible form, a massive set of data. The system, called "information outlining," extracts and analyzes a variety of features of the data set and interactively visualizes these features through corresponding multiple, graphical viewers. Interaction with multiple viewers facilitates reducing candidate results, profiling information, and discovering new facts. Sakairi [1999] developed a site map for visualizing a Web site's structure and keywords.

2.3.3 Acoustical Interfaces. Web-based IR contributes to the acceleration of studies on, and development of, more user-friendly, nonvisual, input-output interfaces. Some examples of research projects are given in a special journal issue on the topic "the next generation graphics user interfaces (GUIs)" [CACM 1993]. An article in Business Week [1977] discusses user preference for speech-based interfaces, i.e., spoken input (which relies on speech recognition technologies) and spoken output (which relies on text-to-speech and speech synthesis technologies).

One response to this preference, by Asakawa [1996], is a method to enable the visually impaired to access and use the Web interactively, even when Japanese and English appear on a page (IBM Homepage on Systems for the Disabled <trl.ibm.co.jp/projects/s7260/sysde.htm>). The basic idea is to identify different languages (e.g., English, Japanese) and different text types (e.g., title and section headers, regular text, hot buttons) and then assign persons with easily distinguishable voices (e.g., male, female) to read each of the different types of text. More recently, the method has been extended to enable the visually impaired to access tables in HTML [Oogane and Asakawa 1998].

Another solution, developed by Raman [1996], is a system that enables visually impaired users to surf the Web interactively. The system, called Emacspeak, is much more sophisticated than screen readers. It reveals the structure of a document (e.g., tables or calendars) in addition to reading the text aloud.

A third acoustic-based approach for Web browsing is being investigated by Mereu and Kazman [1996]. They examined how sound environments can be used for navigation and found that sighted users prefer musical environments to enhance conventional means of navigation, while the visually impaired prefer the use of tones. The components of all of the systems described above can be modified for more general systems (i.e., not necessarily for the visually impaired) which require an audio/speech-based interface.

2.4 Ranking Algorithms for Web-Based Searches

A variety of techniques have been developed for ranking retrieved documents for a given input query. In this section we give references to some classical techniques that can be modified for use by Web search engines [Baeza-Yates and Ribeiro-Neto 1999; Berry and Browne 1999; Frakes and Baeza-Yates 1992]. Techniques developed specifically for the Web are also presented.

Detailed information regarding ranking algorithms used by major search engines is not publicly available; however, it seems that most use term weighting, or variations thereof, or vector space models [Baeza-Yates and Ribeiro-Neto 1999]. In vector space models, each document (in the database under consideration) is modeled by a vector, each coordinate of which represents an attribute of the document [Salton 1971]. Ideally, only those attributes that can help to distinguish documents are incorporated in the attribute space. In a Boolean model, each coordinate of the vector is zero (when the corresponding attribute is absent) or unity (when the corresponding attribute is present). Many refinements of the Boolean model exist. The most commonly used are term-weighting models, which take into account the frequency of appearance of an attribute (e.g., keyword) or location of appearance (e.g., keyword in the title, section header, or abstract). In the simplest retrieval and ranking systems, each query is also modeled by a vector in the same manner as the documents. The ranking of a document with respect to a query is determined by its "distance" to the query vector. A frequently used yardstick is the angle defined by a query and document vector.³ Ranking in this way requires computing the angle defined by the query and each document vector, which is impractical for very large databases.
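The term-weighting and angle-based ranking described above can be sketched as follows; the toy corpus, the raw term-frequency weights, and the helper names are illustrative, not from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Angle-based similarity: dot product over the product of l2-norms."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def term_weight_vector(text):
    """Simplest term-weighting model: raw term frequency per keyword."""
    return dict(Counter(text.lower().split()))

docs = {
    "d1": "web search engines index web pages",
    "d2": "latent semantic indexing reduces dimension",
    "d3": "search engines rank pages for a query",
}
query = term_weight_vector("ranking web pages")
vectors = {d: term_weight_vector(t) for d, t in docs.items()}

# Rank documents by decreasing cosine similarity to the query vector.
ranked = sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True)
```

Note that the cost of scoring every document against the query this way grows linearly with the collection, which is exactly why the dimension-reduction techniques discussed next matter for Web-scale databases.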

One of the more widely used vector space model-based algorithms for reducing the dimension of the document ranking problem is latent semantic indexing (LSI) [Deerwester et al. 1990]. LSI reduces the retrieval and ranking problem to one of significantly lower dimension, so that retrieval from very large databases can be performed in real time. Although a variety of algorithms based on document vector models for clustering to expedite retrieval and ranking have been proposed, LSI is one of the few that successfully takes into account synonymy and polysemy. Synonymy refers to the existence of equivalent or similar terms that can be used to express an idea or object in most languages, and polysemy refers to the fact that some words have multiple, unrelated meanings. Failure to account for synonymy leads to many small, disjoint clusters, some of which should actually be clustered together, while failure to account for polysemy can lead to clustering together of unrelated documents.

In LSI, documents are modeled by vectors in the same way as in Salton's vector space model. We represent the relationship between the attributes and documents by an m-by-n (rectangular) matrix A, with ij-th entry a_ij, i.e.,

A = [a_ij].

The column vectors of A represent the documents in the database. Next, we compute the singular value decomposition (SVD) of A, then construct a modified matrix A_k from the k largest singular values s_i, i = 1, 2, ..., k, and their corresponding vectors, i.e.,

A_k = U_k S_k V_k^T.

S_k is a diagonal matrix with monotonically decreasing diagonal elements s_i. U_k and V_k are matrices whose columns are the left and right singular vectors corresponding to the k largest singular values of A.⁴

³ The angle between two vectors is determined by computing the dot product and dividing by the product of the l2-norms of the vectors.

Processing the query takes place in two steps: projection followed by matching. In the projection step, input queries are mapped to pseudodocuments in the reduced query-document space by the matrix U_k, then weighted by the corresponding singular values s_i from the reduced-rank singular matrix S_k. The process can be described mathematically as

q → q̂ = q^T U_k S_k^{-1},

where q represents the original query vector; q̂ the pseudodocument; q^T the transpose of q; and (·)^{-1} the inverse operator. In the second step, similarities between the pseudodocument q̂ and documents in the reduced term-document space V_k^T are computed using any one of many similarity measures, such as the angles defined by each document and query vector; see Anderberg [1973] or Salton [1989]. Notable reviews of linear algebra techniques, including LSI and its applications to information retrieval, are Berry et al. [1995] and Letsche and Berry [1997].
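The LSI pipeline just described (build A, truncate the SVD, project the query, match in the reduced space) can be sketched with NumPy; the 4-by-4 term-document matrix and the choice k = 2 are illustrative only.

```python
import numpy as np

# Toy m-by-n term-document matrix A; rows = terms, columns = documents.
# Entries are illustrative term frequencies, not data from the paper.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

k = 2  # retained rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk_t = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Reduced-rank representation A_k = U_k S_k V_k^T.
Ak = Uk @ Sk @ Vk_t

# Projection step: q_hat = q^T U_k S_k^{-1}.
q = np.array([1.0, 0.0, 0.0, 1.0])   # query in term space
q_hat = q @ Uk @ np.linalg.inv(Sk)

# Matching step: compare q_hat with document coordinates (columns of V_k^T)
# using the cosine (angle) similarity measure.
doc_coords = Vk_t.T  # one row per document
sims = doc_coords @ q_hat / (
    np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_hat)
)
ranking = np.argsort(-sims)  # documents in decreasing similarity
```

The matching step now works in k dimensions rather than m, which is the source of LSI's speedup on very large databases.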

Statistical approaches used in natural language modeling and IR can probably be extended for use by Web search engines. These approaches are reviewed in Crestani et al. [1998] and Manning and Schutze [1999].

Several scientists have proposed information retrieval algorithms based on analysis of hyperlink structures for use on the Web [Botafogo et al. 1992; Carriere and Kazman 1997; Chakrabarti et al. 1988; Chakrabarti et al. 1998; Frisse 1988; Kleinberg 1998; Pirolli et al. 1996; Rivlin et al. 1994].

A simple means to measure the quality of a Web page, proposed by Carriere and Kazman [1997], is to count the number of pages with pointers to the page; it is used in the WebQuery system and the Rankdex search engine (rankdex.gari.com). Google, which currently indexes about 85 million Web pages, is another search engine that uses link information. Its rankings are based, in part, on the number of other pages with pointers to the page. This policy seems to slightly favor educational and government sites over commercial ones. In November 1999, Northern Light introduced a new ranking system, which is also based, in part, on link data (Search Engine Briefs, searchenginewatch.com/sereport/99/11-briefs.html).
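The in-link count behind WebQuery-style quality scoring can be sketched over a hypothetical link graph; the page names and crawl output here are invented for illustration.

```python
from collections import Counter

# Hypothetical crawl output: (source_page, target_page) hyperlink pairs.
links = [
    ("a.html", "c.html"), ("b.html", "c.html"),
    ("c.html", "d.html"), ("b.html", "d.html"), ("d.html", "c.html"),
]

# Quality score of a page = number of distinct pages pointing to it.
in_links = Counter()
for src, dst in set(links):  # set() drops duplicate source-target pairs
    in_links[dst] += 1

def rank_by_inlinks(pages):
    """Order retrieved pages by decreasing in-link count."""
    return sorted(pages, key=lambda p: in_links[p], reverse=True)
```

A query engine would apply `rank_by_inlinks` to the pages matching a query, which reproduces the observed bias toward heavily cited (often educational and government) sites.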

Hyperlink structures are used to rank retrieved pages, and can also be used for clustering relevant pages on different topics. This concept of coreferencing as a means of discovering so-called "communities" of good works was originally introduced in non-Internet-based studies on cocitation by Small [1973] and White and McCain [1989].

Kleinberg [1998] developed an algorithm to find the several most information-rich, or authority, pages for a query. The algorithm also finds hub pages, i.e., pages with links to many authority pages, and labels the two types of retrieved pages appropriately.
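A minimal sketch of the mutually reinforcing hub/authority iteration underlying Kleinberg's algorithm, run on a hypothetical four-page link graph (the adjacency matrix is invented for illustration):

```python
import numpy as np

# Hypothetical adjacency matrix L: L[i, j] = 1 if page i links to page j.
L = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Iterate: authority a = L^T h, hub h = L a, normalizing each round.
a = np.ones(4)
h = np.ones(4)
for _ in range(50):
    a = L.T @ h
    h = L @ a
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

authority_page = int(np.argmax(a))  # pointed to by the best hubs
hub_page = int(np.argmax(h))        # points to the best authorities
```

In this toy graph page 2, which three other pages point to, emerges as the top authority, while page 0, which links to the strongest authorities, emerges as the top hub; the two score vectors converge to principal singular vectors of L.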

3. FUTURE DIRECTIONS

In this section we present some promising and imaginative research endeavors that are likely to make an impact on Web use in some form or variation in the future, among them knowledge management [IEEE 1998b].

⁴ For details on implementation of the SVD algorithm, see Demmel [1997]; Golub and Van Loan [1996]; and Parlett [1998].


3.1 Intelligent and Adaptive Web Services

As mentioned earlier, research and development of intelligent agents (also known as bots, robots, and aglets) for performing specific tasks on the Web has become very active [Finin et al. 1998; IEEE 1996a]. These agents can tackle problems including finding and filtering information; customizing information; and automating completion of simple tasks [Gilbert 1997]. The agents "gather information or perform some other service without (the user's) immediate presence and on some regular schedule" (whatis.com home page, whatis.com/intellig.htm). The BotSpot home page (botspot.com) summarizes and points to some historical information as well as current work on intelligent agents. The Proceedings of the Association for Computing Machinery (ACM), see Section 5.1 for the URL; the Conferences on Information and Knowledge Management (CIKM); and the American Association for Artificial Intelligence Workshops (www.aaai.org) are valuable information sources. The Proceedings of the Practical Applications of Intelligent Agents and Multi-Agents (PAAM) conference series (demon.co.uk/ar/paam96 and demon.co.uk/ar/paam97) gives a nice overview of application areas. The home page of the IBM Intelligent Agent Center of Competence (IACC) (networking.ibm.com/iag/iaghome.html) describes some of the company's commercial agent products and technologies for the Web.

Adaptive Web services is one interesting area in intelligent Web robot research, including, e.g., Ahoy! The Homepage Finder, which performs dynamic reference sifting [Shakes et al. 1997]; Adaptive Web Sites, which "automatically improve their organization and presentation based on user access data" [Etzioni and Weld 1995; Perkowitz and Etzioni 1999]; Perkowitz's home page (info.cs.vt.edu); and the Adaptive Web Page Recommendation Service [Balabanovic 1997; Balabanovic and Shoham 1998; Balabanovic et al. 1995].

Discussion and ratings of some of these and other robots are available at several Web sites, e.g., Felt and Scales (wsulibs.wsu.edu/general/robots.htm) and Mitchell [1998].

Some scientists have studied prototype metasearchers, i.e., services that combine the power of several search engines to search a broader range of pages (since any given search engine covers less than 16% of the Web) [Gravano 1997; Lawrence and Giles 1998a; Selberg and Etzioni 1995a; 1995b]. Some of the better known metasearch engines include MetaCrawler, SavvySearch, and InfoSeek Express. After a query is issued, metasearchers work in three main steps: first, they evaluate which search engines are likely to yield valuable, fruitful responses to the query; next, they submit the query to the search engines with high ratings; and finally, they merge the results retrieved by the different search engines used in the previous step. Since different search engines use different ranking algorithms, which may not be publicly available, ranking of merged results may be a very difficult task.

Scientists have investigated a number of approaches to overcome this problem. In one system, a result merging condition is used by a metasearcher to decide how much data will be retrieved from each of the search engine results, so that the top objects can be extracted from the search engines without examining the entire contents of each candidate object [Gravano 1997]. Inquirus downloads and analyzes individual documents to take into account factors such as query term context, identification of dead pages and links, and identification of duplicate (and near-duplicate) pages [Lawrence and Giles 1998a]. Document ranking is based on the downloaded document itself, instead of on rankings from individual search engines.
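The three metasearch steps (rate engines, dispatch the query, merge results) might be sketched as follows; the engine objects, rating functions, and score-merging rule are all hypothetical simplifications, not any particular system's method.

```python
def metasearch(query, engines, top_k=5):
    """Sketch of a metasearcher: select promising engines, dispatch the
    query, then merge the ranked results, deduplicating URLs."""
    # Step 1: keep engines rated as likely to answer this query well.
    selected = [e for e in engines if e["rating"](query) > 0.5]
    # Step 2: submit the query to each selected engine.
    hits = []
    for e in selected:
        for rank, url in enumerate(e["search"](query)):
            # Normalize each engine's ranking to a comparable score.
            hits.append((url, 1.0 / (rank + 1)))
    # Step 3: merge by summing normalized scores per URL.
    merged = {}
    for url, score in hits:
        merged[url] = merged.get(url, 0.0) + score
    return sorted(merged, key=merged.get, reverse=True)[:top_k]

# Two hypothetical engines returning ranked URL lists.
engines = [
    {"rating": lambda q: 0.9, "search": lambda q: ["u1", "u2", "u3"]},
    {"rating": lambda q: 0.8, "search": lambda q: ["u2", "u4"]},
]
```

The reciprocal-rank scoring above is one simple way to merge rankings when the engines' internal scores are not published; systems like Inquirus instead rescore the downloaded documents directly.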

3.2 Information Retrieval for Internet Shopping

An intriguing application of Web robot technology is in simulation and prediction of pricing strategies for sales over the Internet. The 1999 Christmas and holiday season marked the first time that shopping online was no longer a prediction; "Online sales increased by 300 percent and the number of orders increased by 270 percent" compared to the previous year [Clark 2000]. To underscore the point, Time magazine selected Jeff Bezos, founder of Amazon.com, as 1999 Person of the Year. Exponential growth is predicted in online shopping. Charts that illustrate projected growth in Internet-generated revenue, Internet-related consumer spending, Web advertising revenue, etc. from the present to 2002, 2003, and 2005 are given in Nua's survey pages (see Section 1.2 for the URL).

Robots to help consumers shop, or shopbots, have become commonplace in e-commerce sites and general-purpose Web portals. Shopbot technology has taken enormous strides since its initial introduction in 1995 by Andersen Consulting. This first bot, known as BargainFinder, helped consumers find the lowest-priced CDs. Many current shopbots are capable of a host of other tasks in addition to comparing prices, such as comparing product features, user reviews, delivery options, and warranty information. Clark [2000] reviews the state of the art in bot technology and presents some predictions for the future by experts in the field. For example, Kephart, manager of IBM's Agents and Emergent Phenomena Group, predicts that "shopping bots may soon be able to negotiate and otherwise work with vendor bots, interacting via ontologies and distributed technologies... bots would then become 'economic actors making decisions'"; and Guttman, chief technology officer for Frictionless Commerce (frictionless.com), notes that Frictionless's bot engine is used by some famous portals, including Lycos, and mentions that his company's technology will be used in a retailer bot that will "negotiate trade-offs between product price, performance, and delivery times with shopbots on the basis of customer preferences." Price comparison robots and their possible roles in Internet merchant price wars in the future are discussed in Kephart et al. [1998a; 1998b].
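A minimal sketch of the price-comparison core of a shopbot, over hypothetical vendor listings; a real shopbot would fetch and parse merchant pages and weigh many more attributes (features, reviews, warranties) than the two shown here.

```python
# Hypothetical vendor listings for one product; names and prices invented.
listings = [
    {"vendor": "shopA", "price": 14.99, "delivery_days": 5},
    {"vendor": "shopB", "price": 15.49, "delivery_days": 2},
    {"vendor": "shopC", "price": 13.99, "delivery_days": 9},
]

def best_offer(listings, max_delivery_days=None):
    """Compare prices, optionally trading price off against delivery time,
    in the spirit of negotiating product/price/delivery trade-offs."""
    ok = [l for l in listings
          if max_delivery_days is None or l["delivery_days"] <= max_delivery_days]
    return min(ok, key=lambda l: l["price"])
```

Without a delivery constraint the cheapest vendor wins; tightening `max_delivery_days` shifts the choice, which is the simplest form of the trade-off negotiation quoted above.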

The auction site is another successful technological offshoot of the Internet shopping business [Cohen 2000; Ferguson 2000]. Two of the more famous general online auction sites are priceline.com (priceline.com) and eBay (ebay.com). Priceline.com pioneered and patented its business concept, i.e., online bidding [Walker et al. 1997]. Patents related to that of priceline.com include those owned by ADT Automotive, Inc. [Berent et al. 1998]; Walker Asset Management [Walker et al. 1996]; and two individuals [Barzilai and Davidson 1997].

3.3 Multimedia Retrieval

IR from multimedia databases is a multidisciplinary research area, which includes topics from a very diverse range, such as analysis of text, image and video, speech, and nonspeech audio; graphics; animation; artificial intelligence; human-computer interaction; and multimedia computing [Faloutsos 1996; Faloutsos and Lin 1995; Maybury 1997; Schauble 1997]. Recently, several commercial systems that integrate search capabilities across multiple databases containing heterogeneous, multimedia data have become available. Examples include PLS (pls.com); Lexis-Nexis (lexis-nexis.com); DIALOG (dialog.com); and Verity (verity.com). In this section we point to some recent developments in the field, but the discussion is by no means comprehensive.

Query and retrieval of images is one of the more established fields of research involving multimedia databases [IEEE ICIP: Proceedings of the IEEE International Conference on Image Processing; IEEE ICASSP: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; and IFIP 1992]. So much work by so many has been conducted on this topic that a comprehensive review is beyond the scope of this paper, but some selected work in this area follows: search and retrieval from large image archives [Castelli et al. 1998]; pictorial queries by image similarity [Soffer and Samet]; image queries using Gabor wavelet features [Manjunath and Ma 1996]; fast, multiresolution image queries using Haar wavelet transform coefficients [Jacobs et al. 1995]; acquisition, storage, indexing, and retrieval of map images [Samet and Soffer 1986]; real-time fingerprint matching from a very large database [Ratha et al. 1992]; querying and retrieval using partially decoded JPEG data and keys [Schneier and Abdel-Mottaleb 1996]; and retrieval of faces from a database [Bach et al. 1993; Wu and Narasimhalu 1994].

Finding documents that have images of interest is a much more sophisticated problem. Two well-known portals with a search interface for a database of images are the Yahoo! Image Surfer (isurf.yahoo.com) and the AltaVista PhotoFinder (image.altavista.com). Like Yahoo!'s text-based search engine, the Image Surfer home pages are organized into categories. For a text-based query, a maximum of six thumbnails of the top-ranked retrieved images are displayed at a time, along with their titles. If more than six are retrieved, then links to subsequent pages with lower relevance rankings appear at the bottom of the page. The number of entries in the database seems to be small; we attempted to retrieve photos of some famous movie stars and came up with none (for Brad Pitt) or few retrievals (for Gwyneth Paltrow), some of which were outdated or unrelated links. The input interface to PhotoFinder looks very much like the interface for AltaVista's text-based search engine. For a text-based query, a maximum of twelve thumbnails of retrieved images are displayed at a time. Only the name of the image file is displayed, e.g., image.jpg. To read the description of an image (if it is given), the mouse must point to the corresponding thumbnail. The number of retrievals for PhotoFinder was huge (4232 for Brad Pitt and 119 for Gwyneth Paltrow), but there was a considerable amount of noise after the first page of retrievals and there were many redundancies. Other search engines with an option for searching for images in their advanced search page are Lycos, HotBot, and AltaVista. All did somewhat better than PhotoFinder in retrieving many images of Brad Pitt and Gwyneth Paltrow; most of the thumbnails were relevant for the first several pages (each page contained 10 thumbnails).

NEC's Inquirus is an image search engine that uses results from several search engines. It analyzes the text accompanying images to determine relevance for ranking, and downloads the actual images to create thumbnails that are displayed to the user [Lawrence and Giles 1999c].

Query and retrieval of images in a video frame or frames is a research area closely related to retrieval of still images from a very large image database [Bolle et al. 1998]. We mention a few examples to illustrate the potentially wide scope of applications: content-based video indexing and retrieval [Smoliar and Zhang 1994]; the Query-by-Image-Content (QBIC) system, which helps users find still images in large image and video databases on the basis of color, shape, texture, and sketches [Flickner et al. 1997; Niblack 1993]; the Information Navigation System (INS) for multimedia data, a system for archiving and searching huge volumes of video data via Web browsers [Nomiyama et al. 1997]; VisualSEEk, a tool for searching, browsing, and retrieving images, which allows users to query for images using the visual properties of regions and their spatial layout [Smith and Chang 1997a; 1996]; compressed domain image manipulation and feature extraction for compressed domain image and video indexing and searching [Chang 1995; Zhong and Chang 1997]; a method for extracting visual events from relatively long videos using objects (rather than keywords), with specific applications to sports events [Iwai et al. 2000; Kurokawa et al. 1999]; retrieval and semantic interpretation of video contents based on objects and their behavior [Echigo et al. 2000]; shape-based retrieval and its application to identity checks on fish [Schatz 1997]; and searching for images and videos on the Web [Smith and Chang 1997b].

Multilingual communication on the Web [Miyahara et al. 2000] and cross-language document retrieval is a timely research topic being investigated by many [Ballesteros and Croft 1998; Eichmann et al. 1998; Pirkola 1998]. An introduction to the subject is given in Oard [1997b], and some surveys are found in CLIR [1999] (Cross-Language Information Retrieval Project, clis.umd.edu/dlrg); Oard [1997a] (glue.umd.edu/oard/research.html); and Oard and Dorr [1996]. Several search engines now feature multilingual search, e.g., the Open Text Web Index (index.opentext.net) searches in four languages (English, Japanese, Spanish, and Portuguese). A number of commercial Japanese-to-English and English-to-Japanese Web translation software products have been developed by leading Japanese companies (in Japanese, bekkoame.ne.jp/oto3). A typical example, which has a trial version for downloading, is a product called Honyaku no Oosama (ibm.co.jp/software/internet/king/index.html), or Internet King of Translation [Watanabe and Takeda 1998].

Other interesting research topics and applications in multimedia IR are speech-based IR for digital libraries [Oard 1997c] and retrieval of songs from a database when a user hums the first few bars of a tune [Kageyama and Takashima 1994]. The melody retrieval technology has been incorporated as an interface in a karaoke machine.

3.4 Conclusions

Potentially lucrative application of Internet-based IR is a widely studied and hotly debated topic. Some pessimists believe that current rates of increase in the use of the Internet and in the number of Web sites and hosts are not sustainable, so that research and business opportunities in the area will decline. They cite statistics such as the April 1998 GVU WWW survey, which states that the use of better equipment (e.g., upgrades in modems by 48% of people using the Web) has not resolved the problem of slow access, and an August 1998 survey by Alexa Internet stating that 90% of all Web traffic is spread over 100,000 different hosts, with 50% of all Web traffic headed towards the top 900 most popular sites. In short, the pessimists maintain that an effective means of managing the highly uneven concentration of information packets on the Internet is not immediately available, nor will it be in the near future. Furthermore, they note that the exponential increase in Web sites and information on the Web is contributing to the second most commonly cited problem, that is, users not being able to find the information they seek in a simple and timely manner.

The vast majority of publications, however, support a very optimistic view. The visions and research projects of many talented scientists point towards finding concrete solutions and building more efficient and user-friendly tools. For example, McKnight and Boroumand [2000] maintain that flat-rate Internet retail pricing (currently the predominant pricing model in the U.S.) may be one of the major culprits in the traffic-congestion problem, and they note that other pricing models are being proposed by researchers. It is likely that the better proposals will be seriously considered by the business community and governments to avoid the continuation of the current solution, i.e., overprovisioning of bandwidth.

ACKNOWLEDGMENTS

The authors acknowledge helpful conversations with Stuart McDonald of alphaWorks and our colleagues at IBM Research. Our manuscript has benefitted greatly from the extensive and well-documented list of suggestions and corrections from the reviewers of the first draft. We appreciate their generosity, patience, and thoughtfulness.

REFERENCES

ASSOCIATION FOR COMPUTING MACHINERY. 2000. SIGCHI: Special Interest Group on Computer-Human Interaction. Home page: www.acm.org/sigchi/

ASSOCIATION FOR COMPUTING MACHINERY. 2000. SIGIR: Special Interest Group on Information Retrieval. Home page: www.acm.org/sigir/

AGOSTI, M. AND SMEATON, A. 1996. Information Retrieval and Hypertext. Kluwer Academic Publishers, Hingham, MA.

AGRAWAL, R., GEHRKE, J., GUNOPULOS, D., AND RAGHAVAN, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, Seattle, WA, June). ACM Press, New York, NY, 94–105.

AHLBERG, C. AND SHNEIDERMAN, B. 1994. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proceedings of the ACM Conference on Human Factors in Computing Systems: Celebrating Interdependence (CHI '94, Boston, MA, Apr. 24–28). ACM Press, New York, NY, 313–317.

AHLBERG, C. AND SHNEIDERMAN, B. 1997. The Alphaslider: A compact and rapid selector. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22–27), S. Pemberton, Ed. ACM Press, New York, NY.

AI MAG. 1997. Special issue on intelligent systems on the Internet. AI Mag. 18, 4.

ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.

ANICK, P. G. AND VAITHYANATHAN, S. 1997. Exploiting clustering and phrases for context-based information retrieval. SIGIR Forum 31, 1, 314–323.

ASAKAWA, C. 1996. Enabling the visually disabled to use the WWW in a GUI environment. IEICE Tech. Rep. HC96-29.

BACH, J., PAUL, S., AND JAIN, R. 1993. A visual information management system for the interactive retrieval of faces. IEEE Trans. Knowl. Data Eng. 5, 4, 619–628.

BAEZA-YATES, R. A. 1992. Introduction to data structures and algorithms related to information retrieval. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 13–27.

BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Modern Information Retrieval. Addison-Wesley, Reading, MA.

BALABANOVIC, M. 1997. An adaptive Web page recommendation service. In Proceedings of the First International Conference on Autonomous Agents (AGENTS '97, Marina del Rey, CA, Feb. 5–8), W. L. Johnson, Chair. ACM Press, New York, NY, 378–385.

BALABANOVIC, M. AND SHOHAM, Y. 1995. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments (Stanford, CA, Mar.). AAAI Press, Menlo Park, CA.

BALABANOVIC, M., SHOHAM, Y., AND YUN, T. 1995. An adaptive agent for automated web browsing. Stanford Univ. Digital Libraries Project, working paper 1995-0023. Stanford University, Stanford, CA.

BALDONADO, M. 1997. An interactive, structure-mediated approach to exploring information in a heterogeneous, distributed environment. Ph.D. Dissertation. Computer Systems Laboratory, Stanford Univ., Stanford, CA.

BALDONADO, M. Q. W. AND WINOGRAD, T. 1997. SenseMaker: An information-exploration interface supporting the contextual evolution of a user's interests. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97, Atlanta, GA, Mar. 22–27), S. Pemberton, Ed. ACM Press, New York, NY, 11–18.

BALLESTEROS, L. AND CROFT, W. B. 1998. Resolving ambiguity for cross-language retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98, Melbourne, Australia, Aug. 24–28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 64–71.

BARZILAI AND DAVIDSON. 1997. Computer-based electronic bid, auction and sale system, and a system to teach new/non-registered customers how bidding, auction purchasing works: U.S. Patent no. 60112045.

BEAUDOIN, L., PARENT, M.-A., AND VROOMEN, L. C. 1996. Cheops: A compact explorer for complex hierarchies. In Proceedings of the IEEE Conference on Visualization (San Francisco, CA, Oct. 27–Nov. 1), R. Yagel and G. M. Nielson, Eds. IEEE Computer Society Press, Los Alamitos, CA, 87ff.

BEDERSON, B. B. AND HOLLAN, J. D. 1994. Pad++: A zooming graphical interface for exploring alternate interface physics. In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology (UIST '94, Marina del Rey, CA, Nov. 2–4), P. Szekely, Chair. ACM Press, New York, NY, 17–26.

BERENT, T., HURST, D., PATTON, T., TABERNIK, T., REIG, J. W. D., AND WHITTLE, W. 1998. Electronic on-line motor vehicle auction and information system: U.S. Patent no. 5774873.

BERNERS-LEE, T., CAILLIAU, R., LUOTONEN, A., NIELSEN, H. F., AND SECRET, A. 1994. The World-Wide Web. Commun. ACM 37, 8 (Aug.), 76–82.

BERRY, M. AND BROWNE, M. 1999. Understanding Search Engines. SIAM, Philadelphia, PA.

BERRY, M. W., DUMAIS, S. T., AND O'BRIEN, G. W. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 (Dec.), 573–595.

BHARAT, K. AND BRODER, A. 1998. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the Seventh International Conference on World Wide Web 7 (WWW7, Brisbane, Australia, Apr. 14–18), P. H. Enslow and A. Ellis, Eds. Elsevier Sci. Pub. B. V., Amsterdam, The Netherlands, 379–388.

BOLLE, R., YEO, B.-L., AND YEUNG, M. 1998. Video query: Research directions. IBM J. Res. Dev. 42, 2 (Mar.), 233–251.

BORKO, H. 1979. Inter-indexer consistency. In Proceedings of the Cranfield Conference.

BOTAFOGO, R. A., RIVLIN, E., AND SHNEIDERMAN, B. 1992. Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Trans. Inf. Syst. 10, 2 (Apr.), 142–180.

BRAKE, D. 1997. Lost in cyberspace. New Sci. Mag. www.newscientist.com/keysites/networld/lost.html

BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30, 1-7, 107–117.

BRODER, A. Z., GLASSMAN, S. C., MANASSE, M. S., AND ZWEIG, G. 1997. Syntactic clustering of the Web. Comput. Netw. ISDN Syst. 29, 8-13, 1157–1166.

BUSINESS WEEK. 1997. Special report on speech technologies. Business Week.

COMMUNICATIONS OF THE ACM. 1993. Special issue on the next generation GUIs. Commun. ACM.

COMMUNICATIONS OF THE ACM. 1994. Special issue on internet technology. Commun. ACM.

COMMUNICATIONS OF THE ACM. 1995. Special issues on digital libraries. Commun. ACM.

COMMUNICATIONS OF THE ACM. 1999. Special issues on knowledge discovery. Commun. ACM.

CARD, S., MACKINLAY, J., AND SHNEIDERMAN, B. 1999. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers Inc., San Francisco, CA.

CARD, S. K., ROBERTSON, G. G., AND YORK, W. 1996. The WebBook and the Web Forager: An information workspace for the World-Wide Web. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '96, Vancouver, B.C., Apr. 13–18), M. J. Tauber, Ed. ACM Press, New York, NY, 111ff.

CARL, J. 1995. Protocol gives sites way to keep out the 'bots'. Web Week 1, 7 (Nov.).

CARRIÈRE, J. AND KAZMAN, R. 1997. WebQuery: Searching and visualizing the Web through connectivity. In Proceedings of the Sixth International Conference on the World Wide Web (Santa Clara, CA, Apr.).

CASTELLI, V., BERGMAN, L., KONTOYIANNINS, I., LI,C.-S., ROBINSON, J., AND TUREK, J. 1998.Progressive search and retrieval in large im-age archives. IBM J. Res. Dev. 42, 2 (Mar.),253–268.

CATHRO, W. 1997. Matching discovery andrecovery. In Proceedings of the Seminar onStandards Australia. www.nla.gov.au/staffpa-per/cathro3.html

CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR, S.,RAGHAVAN, P., RAJAGOPALAN,, S., AND TOMKINS,A. 1988. Experiments in topic distillation. InProceedings of the ACM SIGIR Workshop onHypertext Information Retrieval for the Web(Apr.). ACM Press, New York, NY.

CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGO-PALAN, S., GIBSON, D., AND KLEINBERG, J. 1998.Automatic resource compilation by analyzinghyperlink structure and associated text. InProceedings of the Seventh International Con-ference on World Wide Web 7 (WWW7, Bris-bane, Australia, Apr. 14–18), P. H. Enslowand A. Ellis, Eds. Elsevier Sci. Pub. B. V.,Amsterdam, The Netherlands, 65–74.

CHAKRABARTI, S. AND RAJAGOPALAN, S. 1997. Sur-vey of information retrieval research andproducts. Home page: w3.almaden.ibm.com/soumen/ir.html

CHALMERS, M. AND CHITSON, P. 1992. Bead: explo-rations in information visualization. In Pro-ceedings of the 15th Annual InternationalACM Conference on Research and Develop-ment in Information Retrieval (SIGIR ’92,Copenhagen, Denmark, June 21–24), N. Bel-kin, P. Ingwersen, and A. M. Pejtersen, Eds.ACM Press, New York, NY, 330–337.

CHANDRASEKARAN, R. 1998. ”Portals“ offer one-stopsurfing on the net. Int. Herald Tribune 19/21.

CHANG, S.-F. 1995. Compressed domain techniques for image/video indexing and manipulation. In Proceedings of the Conference on Information Processing.

CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. Comput. Netw. ISDN Syst. 30, 1-7, 161–172.

CLARK, D. 2000. Shopbots become agents for business change. IEEE Computer.

CLEVERDON, C. 1970. Progress in documentation. J. Doc. 26, 55–67.

CLIR. 1999. Cross-language information retrieval project, resource page. Tech. Rep. University of Maryland at College Park, College Park, MD.

COHEN, A. 1999. The attic of e. Time Mag.

COMPUT. NETW. ISDN SYST. 2000. World Wide Web conferences, 1995–2000. Comput. Netw. ISDN Syst. www.w3.org/Conferences/Overview-WWW.html

COOPER, W. 1969. Is interindexer consistency a hobgoblin? Am. Doc. 20, 3, 268–278.

CRANOR, L. F. AND LA MACCHIA, B. A. 1998. Spam! Commun. ACM 41, 8, 74–83.

CRESTANI, F., LALMAS, M., VAN RIJSBERGEN, C. J., AND CAMPBELL, I. 1998. Is this document relevant? Probably: A survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4, 528–552.

Information Retrieval on the Web • 167

ACM Computing Surveys, Vol. 32, No. 2, June 2000

CUNNINGHAM, M. 1997. Brewster’s millions. Irish Times. www.irish-times.com/irish-times/paper/1997/0127/cmp1.html

CUTTING, D. R., KARGER, D. R., AND PEDERSEN, J. O. 1993. Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR ’93, Pittsburgh, PA, June 27–July), R. Korfhage, E. Rasmussen, and P. Willett, Eds. ACM Press, New York, NY, 126–134.

DEERWESTER, S., DUMAIS, S. T., FURNAS, G. W., LANDAUER, T. K., AND HARSHMAN, R. 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 6, 391–407.

DEMMEL, J. W. 1997. Applied Numerical Linear Algebra. SIAM, Philadelphia, PA.

DHILLON, I. AND MODHA, D. 1999. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of the Workshop on Large-Scale Parallel KDD Systems (ACM SIGKDD, Aug. 15-18). ACM Press, New York, NY.

DHILLON, I. AND MODHA, D. 2000. Concept decompositions for large sparse text data using clustering. Mach. Learn.

ECHIGO, T., KUROKAWA, M., TOMITA, A., MIYAMORI, H., AND IISAKU, S. 2000. Video enrichment: Retrieval and enhanced visualization based on behaviors of objects. In Proceedings of the Fourth Asian Conference on Computer Vision (ACCV2000, Jan. 8-11). 364–369.

EICHMANN, D., RUIZ, M. E., AND SRINIVASAN, P. 1998. Cross-language information retrieval with the UMLS metathesaurus. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98, Melbourne, Australia, Aug. 24–28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 72–80.

ESTER, M., KRIEGEL, H.-P., SANDER, J., AND XU, X. 1995a. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (Montreal, Canada, Aug. 20-21).

ESTER, M., KRIEGEL, H.-P., AND XU, X. 1995b. A database interface for clustering in large spatial databases. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (Montreal, Canada, Aug. 20-21).

ESTER, M., KRIEGEL, H.-P., AND XU, X. 1995c. Focusing techniques for efficient class identification. In Proceedings of the Fourth International Symposium on Large Spatial Databases.

ETZIONI, O. AND WELD, D. 1995. Intelligent agents on the Internet: Fact, fiction and forecast. Tech. Rep. University of Washington, Seattle, WA.

FALOUTSOS, C. 1996. Searching Multimedia Databases by Content. Kluwer Academic Publishers, Hingham, MA.

FALOUTSOS, C. AND LIN, K. 1995. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, San Jose, CA, May). ACM Press, New York, NY, 163–174.

FALOUTSOS, C. AND OARD, D. W. 1995. A survey of information retrieval and filtering methods. Univ. of Maryland Institute for Advanced Computer Studies Report. University of Maryland at College Park, College Park, MD.

FELDMAN, S. 1998. Web search services in 1998: Trends and challenges. Inf. Today 9.

FERGUSON, A. 1999. Auction nation. Time Mag.

FININ, T., NICHOLAS, C., AND MAYFIELD, J. 1998. Software agents for information retrieval (short course notes). In Proceedings of the Third ACM Conference on Digital Libraries (DL ’98, Pittsburgh, PA, June 23–26), I. Witten, R. Akscyn, and F. M. Shipman, Eds. ACM Press, New York, NY.

FINKELSTEIN, A. AND SALESIN, D. 1995. Fast multiresolution image querying. In Proceedings of the ACM SIGGRAPH Conference on Visualization: Art and Interdisciplinary Programs (SIGGRAPH ’95, Los Angeles, CA, Aug. 6–11), K. O’Connell, Ed. ACM Press, New York, NY, 277–286.

FISHER, D. 1995. Iterative optimization and simplification of hierarchical clusterings. Tech. Rep. Vanderbilt University, Nashville, TN.

FLICKNER, M., SAWHNEY, H., NIBLACK, W., ASHLEY, J., HUANG, Q., DOM, B., GORKANI, M., HAFNER, J., LEE, D., PETKOVIC, D., STEELE, D., AND YANKER, P. 1997. Query by image and video content: the QBIC system. In Intelligent Multimedia Information Retrieval, M. T. Maybury, Ed. MIT Press, Cambridge, MA, 7–22.

FLYNN, L. 1996. Desperately seeking surfers: Web programmers try to alter search engines’ results. New York Times.

FRAKES, W. B. AND BAEZA-YATES, R., EDS. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ.

FRISSE, M. E. 1988. Searching for information in a hypertext medical handbook. Commun. ACM 31, 7 (July), 880–886.

GILBERT, D. 1997. Intelligent agents: The right information at the right time. Tech. Rep. IBM Corp., Research Triangle Park, NC.

GLOOR, P. AND DYNES, S. 1998. Cybermap: Visually navigating the web. J. Visual Lang. Comput. 9, 3 (June), 319–336.

GOLUB, G. H. AND VAN LOAN, C. F. 1996. Matrix Computations. 3rd ed. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD.

GRAVANO, L. 1998. Querying multiple document collections across the Internet. Ph.D. Dissertation. Stanford University, Stanford, CA.

GUDIVADA, V., RAGHAVAN, V., GROSKY, W., AND KASANAGOTTU, R. 1997. Information retrieval on the world wide web. IEEE Internet Comput. 1, 1 (May/June), 58–68.

GUGLIELMO, C. 1997. Upside today (on-line). home page: inc.com/cgi-bin/tech link.cgi?url=http://www.upside.com.

GUHA, S., RASTOGI, R., AND SHIM, K. 1998. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD, Seattle, WA, June). ACM Press, New York, NY.

HAWKING, D., CRASWELL, N., THISTLEWAITE, P., AND HARMAN, D. 1999. Results and challenges in Web search evaluation.

HEARST, M. A. 1995. TileBars: Visualization of term distribution information in full text information access. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’95, Denver, CO, May 7–11), I. R. Katz, R. Mack, L. Marks, M. B. Rosson, and J. Nielsen, Eds. ACM Press/Addison-Wesley Publ. Co., New York, NY, 59–66.

HEARST, M. 1997. Interfaces for searching the web. Sci. Am., 68–72.

HEARST, M. 1999. User interfaces and visualization. In Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Eds. Addison-Wesley, Reading, MA, 257–323.

HEARST, M. A. AND PEDERSEN, J. O. 1996. Visualizing information retrieval results: a demonstration of the TileBar interface. In Proceedings of the CHI ’96 Conference Companion on Human Factors in Computing Systems: Common Ground (CHI ’96, Vancouver, British Columbia, Canada, Apr. 13–18), M. J. Tauber, Ed. ACM Press, New York, NY, 394–395.

HENZINGER, M., HEYDON, A., MITZENMACHER, M., AND NAJORK, M. 1999. Measuring index quality using random walks on the web.

HERNANDEZ, M. 1996. A generalization of band joins and the merge/purge problem. Ph.D. Dissertation. Columbia Univ., New York, NY.

HOWE, A. AND DREILINGER, D. 1997. SavvySearch: A metasearch engine that learns which search engines to query. AI Mag. 18, 2, 19–25.

HUBERMAN, B. AND LUKOSE, R. 1997. Social dilemmas and Internet congestion. Science 277, 535–537.

HUBERMAN, B., PIROLLI, P., PITKOW, J., AND LUKOSE, R. 1998. Strong regularities in world wide web surfing. Science 280, 95–97.

HYLTON, J. 1996. Identifying and merging related bibliographic records. Master’s Thesis.

IEEE. 1999. Special issue on intelligent information retrieval. IEEE Expert.

IEEE. 1998a. News and trends section. IEEE Internet Comput.

IEEE. 1998b. Special issue on knowledge management. IEEE Expert.

IEEE. 1996a. Special issue on intelligent agents. IEEE Expert/Intelligent Systems and Their Applications.

IEEE. 1996b. Special issue on digital libraries: representation and retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 18, 8, 771–859.

IFIP. 1989. Visual Data Base Systems I and II. Elsevier North-Holland, Inc., Amsterdam, The Netherlands.

IWAI, Y., MARUO, J., YACHIDA, M., ECHIGO, T., AND IISAKU, S. 2000. A framework for visual event extraction from soccer games. In Proceedings of the Fourth Asian Conference on Computer Vision (ACCV2000, Jan. 8-11). 222–227.

JACOBY, J. AND SLAMECKA, V. 1962. Indexer consistency under minimal conditions. RADC TR 62-426. Documentation, Inc., Bethesda, MD, US.

KAGEYAMA, T. AND TAKASHIMA, Y. 1994. A melody retrieval method with hummed melody. IEICE Trans. Inf. Syst. J77, 8 (Aug.), 1543–1551.

KAHLE, B. 1999. Archiving the Internet. home page: www.alexa.com/brewster/essays/sciamarticle.html

KEPHART, J., HANSON, J., LEVINE, D., GROSOF, B., SAIRAMESH, J., AND WHITE, R. S. 1998a. Emergent behavior in information economies.

KEPHART, J. O., HANSON, J. E., AND SAIRAMESH, J. 1998b. Price-war dynamics in a free-market economy of software agents. In Proceedings of the Sixth International Conference on Artificial Life (ALIFE, Madison, WI, July 26–30), C. Adami, R. K. Belew, H. Kitano, and C. E. Taylor, Eds. MIT Press, Cambridge, MA, 53–62.

KLEINBERG, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the 1998 ACM-SIAM Symposium on Discrete Algorithms (San Francisco, CA, Jan.). ACM Press, New York, NY.

KOBAYASHI, M., DUPRET, G., KING, O., SAMUKAWA, H., AND TAKEDA, K. 1999. Multi-perspective retrieval, ranking and visualization of web data. In Proceedings of the International Symposium on Digital Libraries (ISDL99, Tsukuba, Japan). 159–162.

KORFHAGE, R. R. 1997. Information Storage and Retrieval. John Wiley and Sons, Inc., New York, NY.

KOSTER, M. 1995. Robots in the web: trick or treat? ConneXions 9, 4 (Apr.).

KOSTER, M. 1996. Examination of the standard for robots exclusion. home page: info.webcrawler.com/mak/projects/robots/eval.html

KUROKAWA, M., ECHIGO, T., TOMITA, T., MAEDA, J., MIYAMORI, H., AND IISAKU, S. 1999. Representation and retrieval of video scene by using object actions and their spatio-temporal relationships. In Proceedings of the International Conference on Image Processing (ICIP). IEEE Press, Piscataway, NJ.

LAGOZE, C. 1996. The Warwick framework: A container architecture for diverse sets of metadata. D-Lib Mag. www.dlib.org

LAMPING, J., RAO, R., AND PIROLLI, P. 1995. A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’95, Denver, CO, May 7–11), I. R. Katz, R. Mack, L. Marks, M. B. Rosson, and J. Nielsen, Eds. ACM Press/Addison-Wesley Publ. Co., New York, NY, 401–408.

LAWRENCE, S. AND GILES, C. 1998a. Context and page analysis for improved web search. IEEE Internet Comput. 2, 4, 38–46.

LAWRENCE, S. AND GILES, C. 1998b. Searching the world wide web. Science 280, 98–100.

LAWRENCE, S. AND GILES, C. 1999a. Accessibility of information on the web. Nature 400, 107–109.

LAWRENCE, S. AND GILES, C. 1999b. Searching the web: General and scientific information access. IEEE Commun. Mag. 37, 1, 116–122.

LAWRENCE, S. AND GILES, C. 1999c. Text and image metasearch on the web. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA99). 829–835.

LEIGHTON, H. AND SRIVASTAVA, J. 1997. Precision among world wide web search engines: AltaVista, Excite, HotBot, Infoseek, and Lycos. home page: www.winona.msus.edu/library/webind2/webind2.htm.

LETSCHE, T. AND BERRY, M. 1997. Large-scale information retrieval with latent semantic indexing (submitted). Inf. Sci. Appl.

LIAO, H., OSADA, M., AND SHNEIDERMAN, B. 1992. A formative evaluation of three interfaces for browsing directories using dynamic queries. Tech. Rep. CS-TR-2841. Department of Computer Science, University of Maryland, College Park, MD.

LIBERATORE, K. 1997. Getting to the source: Is it real or spam, ma’am? MacWorld.

LIDSKY, D. AND KWON, R. 1997. Searching the net. PC Mag., 227–258.

LIECHTI, O., SIFER, M. J., AND ICHIKAWA, T. 1998. Structured graph format: XML metadata for describing Web site structure. Comput. Netw. ISDN Syst. 30, 1-7, 11–21.

LOSEE, R. M. 1998. Text Retrieval and Filtering: Analytic Models of Performance. Kluwer International Series on Information Retrieval. Kluwer Academic Publishers, Hingham, MA.

LYNCH, C. 1997. Searching the Internet. Sci. Am., 52–56.

MAAREK, Y. S., JACOVI, M., SHTALHAIM, M., UR, S., ZERNIK, D., AND BEN-SHAUL, I. Z. 1997. WebCutter: A system for dynamic and tailorable site mapping. Comput. Netw. ISDN Syst. 29, 8-13, 1269–1279.

MACSKASSY, S., BANERJEE, A., DAVISON, B., AND HIRSH, H. 1998. Human performance on clustering web pages: A preliminary study. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (Seattle, WA, June ’98). 264–268.

MANBER, U. 1999. Foreword. In Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Eds. Addison-Wesley, Reading, MA, 5–8.

MANBER, U., SMITH, M., AND GOPAL, B. 1997. WebGlimpse: Combining browsing and searching. In Proceedings of the USENIX 1997 Annual Technical Conference (Jan.). 195–206.

MANJUNATH, B. S. AND MA, W. Y. 1996. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell. 18, 8, 837–842.

MANNING, C. AND SCHUTZE, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

MARCHIONINI, G. 1995. Information Seeking in Electronic Environments. Cambridge Series on Human-Computer Interaction. Cambridge University Press, New York, NY.

MAYBURY, M. 1997. Intelligent Multimedia Information Retrieval. MIT Press, Cambridge, MA.

MAYBURY, M. T. AND WAHLSTER, W., EDS. 1998. Readings in Intelligent User Interfaces. Morgan Kaufmann Publishers Inc., San Francisco, CA.

MCKNIGHT, L. 2000. Pricing internet services: Approaches and challenges. IEEE Computer, 128–129.

MEREU, S. W. AND KAZMAN, R. 1996. Audio enhanced 3D interfaces for visually impaired users. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’96, Vancouver, B.C., Apr. 13–18), M. J. Tauber, Ed. ACM Press, New York, NY, 72–78.

MITCHELL, S. 1998. General internet resource finding tools. home page: library.ucr.edu/pubs/navigato.html

MIYAHARA, T., WATANABE, H., TAZOE, E., KAMIYAMA, Y., AND TAKEDA, K. 2000. Internet Machine Translation. Mainichi Communications, Japan.

MODHA, D. AND SPANGLER, W. 2000. Clustering hypertext with applications to web searching. In Proceedings of the Conference on Hypertext (May 30-June 3).

MONGE, A. AND ELKAN, C. 1998. An efficient domain-independent algorithm for detecting approximately duplicate database records. Tech. Rep. University of California at San Diego, La Jolla, CA.

MONIER, L. 1998. AltaVista CTO responds. www4.zdnet.com/anchordesk/talkback/talkback13066.html.

MOROHASHI, M., TAKEDA, K., NOMIYAMA, H., AND MARUYAMA, H. 1995. Information outlining. In Proceedings of the International Symposium on Digital Libraries (Tsukuba, Japan).

MUNZNER, T. AND BURCHARD, P. 1995. Visualizing the structure of the World Wide Web in 3D hyperbolic space. In Proceedings of the Symposium on Virtual Reality Modeling Language (VRML ’95, San Diego, CA, Dec. 14–15), D. R. Nadeau and J. L. Moreland, Chairs. ACM Press, New York, NY, 33–38.

NAGAO, K. AND HASIDA, K. 1998. Automatic text summarization based on the global document annotation. In Proceedings of the Conference on COLING-ACL.

NAGAO, K., HOSOYA, S., KAWAKITA, Y., ARIGA, S., SHIRAI, Y., AND YURA, J. 1999. Semantic transcoding: Making the world wide web more understandable and reusable by external annotations.

NAVARRO, G. 1998. Approximate text searching. Ph.D. Dissertation. Univ. of Chile, Santiago, Chile.

NG, R. AND HAN, J. 1994. Efficient and effective methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94, Santiago, Chile, Sept.). VLDB Endowment, Berkeley, CA.

NIBLACK, W. 1993. The QBIC project: Querying images by content using color, texture, and shape. In Proceedings of the Conference on Storage and Retrieval for Image and Video Databases. SPIE Press, Bellingham, WA, 173–187.

NIELSEN, J. 1993. Usability Engineering. Academic Press Prof., Inc., San Diego, CA.

NIELSEN, J. 1999. User interface directions for the Web. Commun. ACM 42, 1, 65–72.

NOMIYAMA, H., KUSHIDA, T., URAMOTO, N., IOKA, M., KUSABA, M., KUSABA, J.-K., CHIGONO, A., ITOH, T., AND TSUJI, M. 1997. Information navigation system for multimedia data. Res. Rep. RT-0227. IBM Tokyo Research Laboratory, Tokyo, Japan.

OARD, D. 1997a. Cross-language text retrieval research in the USA. In Proceedings of the Third Delos Workshop on ERCIM (Mar.).

OARD, D. 1997b. Serving users in many languages. D-Lib Mag. 3, 1 (Jan.).

OARD, D. 1997c. Speech-based information retrieval for digital libraries. Tech. Rep. CS-TR-3778. University of Maryland at College Park, College Park, MD.

OARD, D. W. AND DORR, B. J. 1996. A survey of multilingual text retrieval. Tech. Rep. UMIACS-TR-96-19. University of Maryland at College Park, College Park, MD.

OMIECINSKI, E. AND SCHEUERMANN, P. 1990. A parallel algorithm for record clustering. ACM Trans. Database Syst. 15, 4 (Dec.), 599–624.

OOGANE, T. AND ASAKAWA, C. 1998. An interactive method for accessing tables in HTML. In Proceedings of the Third International ACM Conference on Assistive Technologies (Assets ’98, Marina del Rey, CA, Apr. 15–17), M. M. Blattner and A. I. Karshmer, Chairs. ACM Press, New York, NY, 126–128.

PARLETT, B. N. 1998. The Symmetric Eigenvalue Problem. SIAM Classics in Applied Mathematics Series. SIAM, Philadelphia, PA.

PERKOWITZ, M. AND ETZIONI, O. 1999. Adaptive web sites: An AI challenge. Tech. Rep. University of Washington, Seattle, WA.

PIRKOLA, A. 1998. The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98, Melbourne, Australia, Aug. 24–28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 55–63.

PIROLLI, P., PITKOW, J., AND RAO, R. 1996. Silk from a sow’s ear: extracting usable structures from the Web. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’96, Vancouver, B.C., Apr. 13–18), M. J. Tauber, Ed. ACM Press, New York, NY, 118–125.

PLAISANT, C. 1994. Dynamic queries on a health statistics atlas. Tech. Rep. University of Maryland at College Park, College Park, MD.

PRESCHEL, B. 1972. Indexer consistency in perception of concepts and choice of terminology. Final Rep. Columbia Univ., New York, NY.

RAGHAVAN, P. 1997. Information retrieval algorithms: A survey. In Proceedings of the Symposium on Discrete Algorithms. ACM Press, New York, NY.

RAMAN, T. V. 1996. Emacspeak—a speech interface. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’96, Vancouver, B.C., Apr. 13–18), M. J. Tauber, Ed. ACM Press, New York, NY, 66–71.

RAO, R., PEDERSEN, J. O., HEARST, M. A., MACKINLAY, J. D., CARD, S. K., MASINTER, L., HALVORSEN, P.-K., AND ROBERTSON, G. G. 1995. Rich interaction in the digital library. Commun. ACM 38, 4 (Apr.), 29–39.

RASMUSSEN, E. 1992. Clustering algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 419–442.

RATHA, N. K., KARU, K., CHEN, S., AND JAIN, A. K. 1996. A real-time matching system for large fingerprint databases. IEEE Trans. Pattern Anal. Mach. Intell. 18, 8, 799–813.

RENNISON, E. 1994. Galaxy of news: an approach to visualizing and understanding expansive news landscapes. In Proceedings of the 7th Annual ACM Symposium on User Interface Software and Technology (UIST ’94, Marina del Rey, CA, Nov. 2–4), P. Szekely, Chair. ACM Press, New York, NY, 3–12.

RIVLIN, E., BOTAFOGO, R., AND SHNEIDERMAN, B. 1994. Navigating in hyperspace: designing a structure-based toolbox. Commun. ACM 37, 2 (Feb.), 87–96.

ROBERTSON, G. G., MACKINLAY, J. D., AND CARD, S. K. 1991. Cone trees: Animated 3D visualizations of hierarchical information. In Proceedings of the Conference on Human Factors in Computing Systems: Reaching through Technology (CHI ’91, New Orleans, LA, Apr. 27–May 2), S. P. Robertson, G. M. Olson, and J. S. Olson, Eds. ACM Press, New York, NY, 189–194.

SAKAIRI, T. 1999. A site map for visualizing both a web site’s structure and keywords. In Proceedings of the IEEE Conference on System, Man, and Cybernetics (SMC ’99). IEEE Computer Society Press, Los Alamitos, CA, 200–205.

SALTON, G. 1969. A comparison between manual and automatic indexing methods. Am. Doc. 20, 1, 61–71.

SALTON, G. 1970. Automatic text analysis. Science 168, 335–343.

SALTON, G., ED. 1971. The Smart Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ.

SALTON, G., ED. 1988. Automatic Text Processing. Addison-Wesley Series in Computer Science. Addison-Wesley Longman Publ. Co., Inc., Reading, MA.

SALTON, G. AND MCGILL, M. J. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., Hightstown, NJ.

SAMET, H. AND SOFFER, A. 1996. MARCO: MAp Retrieval by COntent. IEEE Trans. Pattern Anal. Mach. Intell. 18, 8, 783–798.

SCHATZ, B. 1997. Information retrieval in digital libraries: Bringing search to the net. Science 275, 327–334.

SCHAUBLE, P. 1997. Multimedia Information Retrieval: Content-Based Information Retrieval from Large Text and Audio Databases. Kluwer Academic Publishers, Hingham, MA.

SCIENTIFIC AMERICAN. 1997. The Internet: Fulfilling the promise: special report. Scientific American, Inc., New York, NY.

SELBERG, E. AND ETZIONI, O. 1995a. The metacrawler architecture for resource aggregation on the web. IEEE Expert.

SELBERG, E. AND ETZIONI, O. 1995b. Multiple service search and comparison using the metacrawler. In Proceedings of the Fourth International Conference on The World Wide Web (Boston, MA).

SHAKES, J., LANGHEINRICH, M., AND ETZIONI, O. 1997. Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Conference on The World Wide Web. 189–200.

SHIVAKUMAR, N. AND GARCIA-MOLINA, H. 1998. Finding near-replicas of documents on the web. In Proceedings of the Workshop on Web Databases (Valencia, Spain, Mar.).

SHNEIDERMAN, B. 1994. Dynamic queries for visual information seeking. Tech. Rep. CS-TR-3022. University of Maryland at College Park, College Park, MD.

SHNEIER, M. AND ABDEL-MOTTALEB, M. 1996. Exploiting the JPEG compression scheme for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 18, 8, 849–853.

SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J. 1995. Database research: Achievements and opportunities into the 21st century. In Proceedings of the NSF Workshop on The Future of Database Research (May).

SMALL, H. 1973. Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24, 265–269.

SMITH, J. R. AND CHANG, S.-F. 1996. VisualSEEk: A fully automated content-based image query system. In Proceedings of the Fourth ACM International Conference on Multimedia (Multimedia ’96, Boston, MA, Nov. 18–22), P. Aigrain, W. Hall, T. D. C. Little, and V. M. Bove, Chairs. ACM Press, New York, NY, 87–98.

SMITH, J. R. AND CHANG, S.-F. 1997a. Querying by color regions using VisualSEEk content-based visual query system. In Intelligent Multimedia Information Retrieval, M. T. Maybury, Ed. MIT Press, Cambridge, MA, 23–41.

SMITH, J. AND CHANG, S.-F. 1997b. Searching for images and videos on the world-wide web. IEEE MultiMedia.

SMITH, Z. 1997. The truth about the web: Crawling towards eternity. Web Tech. Mag. www.webtechniques.com/features/1997/05/burner/burner.html

SMOLIAR, S. W. AND ZHANG, H. 1994. Content-based video indexing and retrieval. IEEE MultiMedia 1, 2 (Summer), 62–72.

SNEATH, P. H. A. AND SOKAL, R. R. 1973. Numerical Taxonomy. Freeman, London, UK.

SOERGEL, D. 1985. Organizing Information: Principles of Data Base and Retrieval Systems. Academic Press Library and Information Science Series. Academic Press Prof., Inc., San Diego, CA.

SOFFER, A. AND SAMET, H. 2000. Pictorial query specification for browsing through spatially-referenced image databases. J. Visual Lang. Comput.

SPARCK JONES, K. AND WILLETT, P., EDS. 1997. Readings in Information Retrieval. Morgan Kaufmann Multimedia Information and Systems Series. Morgan Kaufmann Publishers Inc., San Francisco, CA.

STOLFO, S. AND HERNANDEZ, M. 1995. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data (San Jose, CA, May). 127–138.

STRATEGYALLEY. 1998. White paper on the viability of the internet for business. home page: www.strategyalley.com/articles/inet1.htm.

STRZALKOWSKI, T. 1999. Natural Language Information Retrieval. Kluwer Academic Publishers, Hingham, MA.

TAKEDA, K. AND NOMIYAMA, H. 1997. Information outlining and site outlining. In Proceedings of the International Symposium on Digital Libraries (ISDL97, Tsukuba, Japan).

TETRANET SOFTWARE INC. 1998. Wisebot. Home page for Wisebots: www.tetranetsoftware.com/products/wisebot.htm

TUFTE, E. R. 1986. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.

VAN RIJSBERGEN, C. 1977. A theoretical basis for the use of cooccurrence data in information retrieval. J. Doc. 33, 2.

VAN RIJSBERGEN, C. 1979. Information Retrieval. 2nd ed. Butterworths, London, UK.

WALKER, J., CASE, T., JORASCH, J., AND SPARICO, T. 1996. Method, apparatus, and program for pricing, selling, and exercising options to purchase airline tickets: U.S. Patent no. 5797127.

WALKER, J., SPARICO, T., AND CASE, T. 1997. Method and apparatus for the sale of airline-specified flight tickets: U.S. Patent no. 5897620.

WATANABE, H. AND TAKEDA, K. 1998. A pattern-based machine translation system extended by example-based processing. In Proceedings of the Conference on COLING-ACL. 1369–1373.

WEBSTER, K. AND PAUL, K. 1996. Beyond surfing: Tools and techniques for searching the web. home page: magi.com/mmelick/it96jan.htm.

WESTERA, G. 1996. Robot-driven search engine evaluation overview. www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/.

WHITE, H. AND MCCAIN, K. 1989. Bibliometrics. Annual Review of Information Science and Technology.

WILLETT, P. 1988. Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manage. 24, 5, 577–597.

WILLIAMS, M. 1984. What makes RABBIT run? Int. J. Man-Mach. Stud. 21, 1, 333–352.

WILLIAMSON, C. AND SHNEIDERMAN, B. 1992. The dynamic HomeFinder: Evaluating dynamic queries in a real-estate information exploration system. In Proceedings of the 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR ’92, Copenhagen, Denmark, June 21–24), N. Belkin, P. Ingwersen, and A. M. Pejtersen, Eds. ACM Press, New York, NY, 339–346.

WISE, J., THOMAS, J., PENNOCK, K., LANTRIP, D., POTTIER, M., AND SCHUR, A. 1995. Visualizing the non-visual: spatial analysis and interaction with information from text documents. In Proceedings of the IEEE Conference on Information Visualization. IEEE Computer Society Press, Los Alamitos, CA, 51–58.

WITTEN, I. H., MOFFAT, A., AND BELL, T. C. 1994. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold Co., New York, NY.

WU, J. K. AND NARASIMHALU, A. D. 1994. Identifying faces using multiple retrievals. IEEE MultiMedia 1, 2 (Summer), 27–38.

ZAMIR, O. AND ETZIONI, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98, Melbourne, Australia, Aug. 24–28), W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Chairs. ACM Press, New York, NY, 46–54.

ZAMIR, O., ETZIONI, O., MADANI, O., AND KARP, R. 1997. Fast and intuitive clustering of web documents. In Proceedings of the ACM SIGMOD International Workshop on Data Mining and Knowledge Discovery (SIGMOD-96, Aug.), R. Ng, Ed. ACM Press, New York, NY, 287–290.

ZHANG, T., RAMAKRISHNAN, R., AND LIVNY, M. 1996. Birch: An efficient data clustering method for large databases. In Proceedings of the ACM-SIGMOD Conference on Management of Data (Montreal, Canada, June). ACM, New York, NY.

ZHONG, D. AND CHANG, S.-F. 1997. Video object model and segmentation for content-based video indexing. In Proceedings of the International Conference on Circuits and Systems. IEEE Computer Society Press, Los Alamitos, CA.

Received: November 1998; revised: April 2000; accepted: July 2000
