Page 1: Focused Crawling and Collection Synthesis

December 20, 2002, CUL Metadata WG Meeting

Focused Crawling and Collection Synthesis

Donna Bergmark

Cornell Information Systems

Page 2: Focused Crawling and Collection Synthesis


Outline

• Crawlers

• Collection Synthesis

• Focused Crawling

• Some Results

• Student Project (Fall 2002)

Page 3: Focused Crawling and Collection Synthesis


Definition

Spider = robot = crawler

Crawlers are computer programs that roam the Web, automating specific Web-related tasks.

Page 4: Focused Crawling and Collection Synthesis


Crawlers – some background

• Resource discovery

• Crawlers and internet history

• Crawling and crawlers

• Mercator

Page 5: Focused Crawling and Collection Synthesis


Resource Discovery

• Finding info on the Web
  – Surfing (random strategy; the goal is serendipity)
  – Searching (inverted indices; specific info)
  – Crawling ("all" the info)

• Uses for crawling
  – Find stuff
  – Gather stuff
  – Check stuff

Page 6: Focused Crawling and Collection Synthesis


Crawlers and internet history

• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-registration; Archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries

Page 7: Focused Crawling and Collection Synthesis


Crawling and Crawlers

• The Web overlays the internet

• A crawl overlays the Web, starting from a seed

Page 8: Focused Crawling and Collection Synthesis


Crawler Issues

• The web is so big

• Visit Order

• The URL itself

• Politeness

• Robot Traps

• The hidden web

• System Considerations

Page 9: Focused Crawling and Collection Synthesis


Standard for Robot Exclusion

• Martijn Koster (1994)

• http://any-server:80/robots.txt

• Maintained by the webmaster

• Forbid access to pages, directories

• Commonly excluded: /cgi-bin/

• Adherence is voluntary for the crawler
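A minimal sketch of what honoring the standard can look like, assuming a deliberately simple parser (class and method names are illustrative, not Mercator's API): fetch /robots.txt from the server, collect its Disallow prefixes, and skip matching paths. A production crawler would also honor User-agent sections and Allow rules and cache the file per host.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Fetch http://host/robots.txt and collect its Disallow path prefixes.
    public static List<String> fetchDisallows(String host) throws Exception {
        List<String> disallows = new ArrayList<>();
        URL robots = new URL("http://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallows.add(path);
                }
            }
        }
        return disallows;
    }

    // A polite crawler skips any path under a disallowed prefix.
    public static boolean allowed(String path, List<String> disallows) {
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) return false;   // e.g. /cgi-bin/
        }
        return true;
    }
}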

Page 10: Focused Crawling and Collection Synthesis


Robot Traps

• Cycles in the Web graph

• Infinite links on a page

• Traps set out by the Webmaster
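A common defense against the first two traps, sketched below under simple assumptions (all names illustrative): canonicalize each URL, remember those already visited so a cycle is fetched only once, and cap pages per host so a page generating unlimited links cannot monopolize the crawl.

import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrapGuard {
    private final Set<String> visited = new HashSet<>();
    private final Map<String, Integer> perHost = new HashMap<>();
    private final int maxPerHost;

    public TrapGuard(int maxPerHost) { this.maxPerHost = maxPerHost; }

    // Returns true only for URLs not seen before, within the per-host budget.
    public boolean shouldVisit(String url) throws Exception {
        URI u = new URI(url).normalize();          // collapses ./ and ../ segments
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        String canonical = host + (u.getPath() == null ? "/" : u.getPath());
        if (!visited.add(canonical)) return false; // already crawled: cycle or duplicate
        int n = perHost.merge(host, 1, Integer::sum);
        return n <= maxPerHost;                    // caps damage from infinite-link traps
    }
}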

Page 11: Focused Crawling and Collection Synthesis


The Hidden Web

• Dynamic pages increasing

• Subscription pages

• Username and password pages

• Research in progress on how crawlers can “get into” the hidden web

Page 12: Focused Crawling and Collection Synthesis


System Issues

• Crawlers are complicated systems

• Efficiency is of utmost importance

• Crawlers are demanding of system and network resources

Page 13: Focused Crawling and Collection Synthesis


Page 14: Focused Crawling and Collection Synthesis


Mercator Features

• Written in Java

• One file configures a crawl

• Can add your own code
  – Extend one or more of Mercator's base classes
  – Add totally new classes called by your own

• Industrial-strength crawler:
  – uses its own DNS and java.net package

Page 15: Focused Crawling and Collection Synthesis


Collection Synthesis

• The NSDL
  – National Science Digital Library
  – Educational materials for K-through-grave (lifelong) learners
  – A collection of digital collections

• Collection (automatically derived)
  – 20-50 items on a topic, represented by their URLs; expository in nature; precision trumps recall

Page 16: Focused Crawling and Collection Synthesis


Crawler is the Key

• A general search engine is good for precise results, few in number

• A search engine must cover all topics, not just scientific

• For automatic collection assembly, a Web crawler is needed

• A focused crawler is the key

Page 17: Focused Crawling and Collection Synthesis


Focused Crawling

Page 18: Focused Crawling and Collection Synthesis


Focused Crawling

[Diagram: two crawl trees rooted at a seed R. In the breadth-first crawl, every link is expanded in visit order (1, 2, 3, 4, 5, ...). In the focused crawl, off-topic pages (marked X) are pruned and only promising links are followed.]

Page 19: Focused Crawling and Collection Synthesis


Collections and Clusters

• Traditional: the document universe is divided into clusters, or collections

• Each collection is represented by its centroid

• Web: the size of the document universe is effectively infinite

• Agglomerative clustering is used instead

• Two aspects:
  – Collection descriptor
  – Rule for when items belong to that collection

Page 20: Focused Crawling and Collection Synthesis


[Figure: two example clusterings, labeled Q = 0.2 and Q = 0.6]

Page 21: Focused Crawling and Collection Synthesis


The Setup

A virtual collection of items about Chebyshev Polynomials

Page 22: Focused Crawling and Collection Synthesis


Adding a Centroid

An empty collection of items about Chebyshev Polynomials

Page 23: Focused Crawling and Collection Synthesis


Document Vector Space

• Classic information retrieval technique

• Each word is a dimension in N-space

• Each document is a vector in N-space. Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>

• Normalize the weights

Both the “centroid” and the downloaded document are term vectors
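A short sketch of that measure, assuming term vectors are stored as term-to-weight maps (class and method names are illustrative): normalize each vector to unit length, and the correlation between a document and a centroid is then just their dot product, i.e., the cosine of the angle between them.

import java.util.HashMap;
import java.util.Map;

public class VectorSpace {
    // Rescale a term-to-weight map to unit length ("normalize the weights").
    public static Map<String, Double> normalize(Map<String, Double> v) {
        double norm = Math.sqrt(v.values().stream().mapToDouble(w -> w * w).sum());
        Map<String, Double> unit = new HashMap<>();
        if (norm > 0) v.forEach((term, w) -> unit.put(term, w / norm));
        return unit;
    }

    // Dot product of two unit vectors = cosine of the angle between them.
    public static double correlation(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot;
    }
}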

Page 24: Focused Crawling and Collection Synthesis


Agglomerate

A collection with 3 items about Chebyshev Polynomials

Page 25: Focused Crawling and Collection Synthesis


Where does the Centroid come from?

"Chebyshev Polynomials"

A really good centroid for a collection about Chebyshev Polynomials

Page 26: Focused Crawling and Collection Synthesis


Building a Centroid

1. Google("Chebyshev Polynomials") → {url1 … url-n}

2. Let H be a hash (k, v) where k = word, v = frequency

3. For each url in {url1 … url-n} do
     d ← download(url)
     V ← term vector(d)
     For each term t in V do
       if t is not in H, add it; H(t)++

4. Compute tf-idf weights. C ← top 20 terms.
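The same loop, sketched in Java with the download and term-extraction steps stubbed out (download() and termsOf() are hypothetical helpers) and raw frequency standing in for the final tf-idf weighting:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CentroidBuilder {
    // Steps 2-4 of the slide, with raw frequency in place of tf-idf.
    public static List<String> buildCentroid(List<String> urls) {
        Map<String, Integer> h = new HashMap<>();            // k = word, v = freq
        for (String url : urls) {                            // step 3
            String d = download(url);                        // d <- download(url)
            for (String t : termsOf(d)) {                    // V <- term vector(d)
                h.merge(t, 1, Integer::sum);                 // H(t)++
            }
        }
        return h.entrySet().stream()                         // step 4: keep top 20 terms
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(20)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    private static String download(String url) { return ""; }              // stub
    private static List<String> termsOf(String doc) { return List.of(); }  // stub
}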

Page 27: Focused Crawling and Collection Synthesis


Dictionary

• Given centroids C1, C2, C3, …

• Dictionary is C1 + C2 + C3 + …
  – Terms are the union of terms in the Ci
  – Term frequencies are the total frequency across the Ci
  – Document frequency is how many C's have t
  – Term IDF is as from Berkeley

• Dictionary is 300-500 terms
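A sketch of that merge, assuming each centroid is a term-to-frequency map; since the slide does not spell out the Berkeley weighting, the standard log(N/df) IDF is substituted here as an assumption:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CentroidDictionary {
    // df(t) = how many centroids contain t; idf(t) = log(N / df(t)).
    public static Map<String, Double> buildIdf(List<Map<String, Integer>> centroids) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> c : centroids) {
            for (String term : c.keySet()) {
                df.merge(term, 1, Integer::sum);   // how many C's have t
            }
        }
        int n = centroids.size();
        Map<String, Double> idf = new HashMap<>();
        df.forEach((term, count) -> idf.put(term, Math.log((double) n / count)));
        return idf;
    }
}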

Page 28: Focused Crawling and Collection Synthesis


Focused Crawling

• Recall the cartoon for a focused crawl:

• A simple way to do it is with 2 "knobs"

[Diagram repeated from slide 18: the focused crawl tree rooted at R, with off-topic pages marked X and pruned]

Page 29: Focused Crawling and Collection Synthesis


Focusing the Crawl

• Threshold: a page is on-topic if its correlation to the closest centroid is above this value

• Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than the cutoff
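A minimal sketch of the two knobs as crawl-time rules (class and field names are illustrative):

public class FocusRules {
    private final double threshold;   // e.g. 0.3, as on slide 33
    private final int cutoff;         // e.g. 0 or 1

    public FocusRules(double threshold, int cutoff) {
        this.threshold = threshold;
        this.cutoff = cutoff;
    }

    // A page is on-topic when it correlates strongly enough with some centroid.
    public boolean onTopic(double correlationToClosestCentroid) {
        return correlationToClosestCentroid >= threshold;
    }

    // Distance resets to 0 at an on-topic page, grows by 1 per off-topic hop.
    public int childDistance(double correlation, int parentDistance) {
        return onTopic(correlation) ? 0 : parentDistance + 1;
    }

    // Outlinks are followed only while the distance stays within the cutoff.
    public boolean followLinks(int distance) {
        return distance <= cutoff;
    }
}

With threshold = 0.3 and cutoff = 0, for instance, links are followed only from on-topic pages themselves, matching the setting shown on slide 33.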

Page 30: Focused Crawling and Collection Synthesis


Illustration

[Diagram: a crawl tree with cutoff = 1. Pages with correlation >= threshold count as on-topic; links are followed only from pages within one step of an on-topic page.]

Page 31: Focused Crawling and Collection Synthesis


Min-avg-max correlation vs. crawl length

[Plot: correlation (0 to 0.8) vs. number of documents downloaded (0 to 120,000), with series labeled Maximum, Average, Minimum, Closest, and Furthest.]

Page 32: Focused Crawling and Collection Synthesis


Collection “Evaluation”

• Assume higher correlations are good

• With human relevance assessments, one can also compute a “precision” curve

• Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n. For example, if 15 of the top 20 items are relevant, P(20) = 15/20 = 0.75.

Page 33: Focused Crawling and Collection Synthesis


Cutoff = 0, Threshold = 0.3

Page 34: Focused Crawling and Collection Synthesis


Precision vs. Rank

[Plot: precision (0 to 1) vs. rank (0 to 60), comparing two series: Crawling and Google.]

Page 35: Focused Crawling and Collection Synthesis


Tunneling with Cutoff

• Nugget - dud - dud - … - dud - nugget

  Notation: 0 - X - X - … - X - 0

• Fixed cutoff: 0 - X1 - X2 - … - Xc

• Adaptive cutoff: 0 - X1 - X2 - … - X?

Page 36: Focused Crawling and Collection Synthesis


Statistics Collected

• 500,000 documents

• Number of seeds: 4

• Path data for all but seeds

• 6620 completed paths (0-x…x-0)

• 100,000s incomplete paths (0-x…x..)

Page 37: Focused Crawling and Collection Synthesis


Nuggets that are x steps from a nugget

[Histogram: number of nuggets (0 to 1000) vs. X, the number of links from the nearest nugget (0 to 14).]

Page 38: Focused Crawling and Collection Synthesis


Nuggets that are x steps from a seed and/or a nugget

[Histogram: number of nuggets (0 to 1200) vs. X, the number of links from a seed and/or a nugget (0 to 14), with a separate series for paths from seeds.]

Page 39: Focused Crawling and Collection Synthesis


Better parents have better children.

[Histogram: number of nodes per correlation bracket, comparing the general population with the children of nodes scoring 0.45-0.5; the children skew toward higher correlations.]

Page 40: Focused Crawling and Collection Synthesis


Using the Empirical Observations

• Use the path history

• Use the page quality - cosine correlation

• Current distance should increase exponentially as you get away from quality nodes

Distance = 0 if this is a nugget; otherwise the larger of 1 and (1 - corr) · e^(2 × parent's distance / cutoff)
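The rule above, sketched in Java; reading the slide's "1 or …" as a lower bound of 1 is an interpretation (the ECDL paper has the exact form):

public class AdaptiveCutoff {
    // distance = 0 for a nugget; otherwise at least 1, growing exponentially
    // with the parent's distance so that long off-topic runs die quickly.
    public static double distance(boolean isNugget, double corr,
                                  double parentDistance, double cutoff) {
        if (isNugget) return 0.0;
        double d = (1.0 - corr) * Math.exp(2.0 * parentDistance / cutoff);
        return Math.max(1.0, d);
    }

    // A path is abandoned once its distance exceeds the cutoff.
    public static boolean keepCrawling(double distance, double cutoff) {
        return distance <= cutoff;
    }
}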

Page 41: Focused Crawling and Collection Synthesis


Results

• Details in the ECDL paper

• Smaller frontier → more docs/second

• More documents downloaded in same time

• Higher-scoring documents were downloaded

• Cutoff of 20 averaged 7 steps at the cutoff

Page 42: Focused Crawling and Collection Synthesis


Fall 2002 Student Project

[Dataflow diagram: a query (e.g., "Chebyshev Polynomials") produces a centroid / collection description; the centroids and dictionary configure Mercator, which emits term vectors and the collection's URLs, rendered as HTML.]

Page 43: Focused Crawling and Collection Synthesis


Conclusion

• We’ve covered crawling – history, technology, use

• Focused crawling with tunneling

• Adaptive cutoff with tunneling

• We have a good experimental setup for exploring automatic collection synthesis