Page 1: Focused Crawling and Collection Synthesis

December 20, 2002, CUL Metadata WG Meeting

Focused Crawling and Collection Synthesis

Donna Bergmark

Cornell Information Systems

Page 2: Focused Crawling and Collection Synthesis


Outline

• Crawlers

• Collection Synthesis

• Focused Crawling

• Some Results

• Student Project (Fall 2002)

Page 3: Focused Crawling and Collection Synthesis


Definition

Spider = robot = crawler

Crawlers are computer programs that roam the Web, automating specific Web-related tasks.

Page 4: Focused Crawling and Collection Synthesis


Crawlers – some background

• Resource discovery

• Crawlers and internet history

• Crawling and crawlers

• Mercator

Page 5: Focused Crawling and Collection Synthesis


Resource Discovery

• Finding info on the Web
  – Surfing (random strategy; the goal is serendipity)
  – Searching (inverted indices; specific info)
  – Crawling ("all" the info)

• Uses for crawling
  – Find stuff
  – Gather stuff
  – Check stuff

Page 6: Focused Crawling and Collection Synthesis


Crawlers and internet history

• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-registration; Archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries

Page 7: Focused Crawling and Collection Synthesis


Crawling and Crawlers

• The Web overlays the internet

• A crawl overlays the Web, starting from a seed

Page 8: Focused Crawling and Collection Synthesis


Crawler Issues

• The web is so big

• Visit Order

• The URL itself

• Politeness

• Robot Traps

• The hidden web

• System Considerations

Page 9: Focused Crawling and Collection Synthesis


Standard for Robot Exclusion

• Martijn Koster (1994)

• http://any-server:80/robots.txt

• Maintained by the webmaster

• Forbid access to pages, directories

• Commonly excluded: /cgi-bin/

• Adherence is voluntary for the crawler
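A minimal sketch of what honoring the standard can look like, assuming a deliberately simple parser (class and method names are illustrative, not Mercator's API): fetch /robots.txt from the server, collect its Disallow prefixes, and skip matching paths. A production crawler would also honor User-agent sections and Allow rules and cache the file per host.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Fetch http://host/robots.txt and collect its Disallow path prefixes.
    public static List<String> fetchDisallows(String host) throws Exception {
        List<String> disallows = new ArrayList<>();
        URL robots = new URL("http://" + host + "/robots.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallows.add(path);
                }
            }
        }
        return disallows;
    }

    // A polite crawler skips any path under a disallowed prefix.
    public static boolean allowed(String path, List<String> disallows) {
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) return false;   // e.g. /cgi-bin/
        }
        return true;
    }
}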

Page 10: Focused Crawling and Collection Synthesis


Robot Traps

• Cycles in the Web graph

• Infinite links on a page

• Traps set out by the Webmaster
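A common defense against the first two traps, sketched below under simple assumptions (all names illustrative): canonicalize each URL, remember those already visited so a cycle is fetched only once, and cap pages per host so a page generating unlimited links cannot monopolize the crawl.

import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrapGuard {
    private final Set<String> visited = new HashSet<>();
    private final Map<String, Integer> perHost = new HashMap<>();
    private final int maxPerHost;

    public TrapGuard(int maxPerHost) { this.maxPerHost = maxPerHost; }

    // Returns true only for URLs not seen before, within the per-host budget.
    public boolean shouldVisit(String url) throws Exception {
        URI u = new URI(url).normalize();          // collapses ./ and ../ segments
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        String canonical = host + (u.getPath() == null ? "/" : u.getPath());
        if (!visited.add(canonical)) return false; // already crawled: cycle or duplicate
        int n = perHost.merge(host, 1, Integer::sum);
        return n <= maxPerHost;                    // caps damage from infinite-link traps
    }
}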

Page 11: Focused Crawling and Collection Synthesis


The Hidden Web

• Dynamic pages increasing

• Subscription pages

• Username and password pages

• Research in progress on how crawlers can “get into” the hidden web

Page 12: Focused Crawling and Collection Synthesis


System Issues

• Crawlers are complicated systems

• Efficiency is of utmost importance

• Crawlers are demanding of system and network resources

Page 13: Focused Crawling and Collection Synthesis


Page 14: Focused Crawling and Collection Synthesis


Mercator Features

• Written in Java

• One file configures a crawl

• Can add your own code
  – Extend one or more of Mercator's base classes
  – Add totally new classes called by your own

• Industrial-strength crawler:
  – uses its own DNS and java.net package

Page 15: Focused Crawling and Collection Synthesis


Collection Synthesis

• The NSDL
  – National Science Digital Library
  – Educational materials for K-through-grave (lifelong) learners
  – A collection of digital collections

• Collection (automatically derived)
  – 20-50 items on a topic, represented by their URLs; expository in nature; precision trumps recall

Page 16: Focused Crawling and Collection Synthesis


Crawler is the Key

• A general search engine is good for precise results, few in number

• A search engine must cover all topics, not just scientific

• For automatic collection assembly, a Web crawler is needed

• A focused crawler is the key

Page 17: Focused Crawling and Collection Synthesis


Focused Crawling

Page 18: Focused Crawling and Collection Synthesis


Focused Crawling

[Diagram: two crawl trees rooted at a seed R. In the breadth-first crawl, every link is expanded in visit order (1, 2, 3, 4, 5, ...). In the focused crawl, off-topic pages (marked X) are pruned and only promising links are followed.]

Page 19: Focused Crawling and Collection Synthesis


Collections and Clusters

• Traditional: the document universe is divided into clusters, or collections

• Each collection is represented by its centroid

• Web: the size of the document universe is effectively infinite

• Agglomerative clustering is used instead

• Two aspects:
  – Collection descriptor
  – Rule for when items belong to that collection

Page 20: Focused Crawling and Collection Synthesis


[Figure: two example clusterings, labeled Q = 0.2 and Q = 0.6]

Page 21: Focused Crawling and Collection Synthesis


The Setup

A virtual collection of items about Chebyshev Polynomials

Page 22: Focused Crawling and Collection Synthesis


Adding a Centroid

An empty collection of items about Chebyshev Polynomials

Page 23: Focused Crawling and Collection Synthesis


Document Vector Space

• Classic information retrieval technique

• Each word is a dimension in N-space

• Each document is a vector in N-space. Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>

• Normalize the weights

Both the “centroid” and the downloaded document are term vectors
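A short sketch of that measure, assuming term vectors are stored as term-to-weight maps (class and method names are illustrative): normalize each vector to unit length, and the correlation between a document and a centroid is then just their dot product, i.e., the cosine of the angle between them.

import java.util.HashMap;
import java.util.Map;

public class VectorSpace {
    // Rescale a term-to-weight map to unit length ("normalize the weights").
    public static Map<String, Double> normalize(Map<String, Double> v) {
        double norm = Math.sqrt(v.values().stream().mapToDouble(w -> w * w).sum());
        Map<String, Double> unit = new HashMap<>();
        if (norm > 0) v.forEach((term, w) -> unit.put(term, w / norm));
        return unit;
    }

    // Dot product of two unit vectors = cosine of the angle between them.
    public static double correlation(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot;
    }
}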

Page 24: Focused Crawling and Collection Synthesis


Agglomerate

A collection with 3 items about Chebyshev Polynomials

Page 25: Focused Crawling and Collection Synthesis


Where does the Centroid come from?

"Chebyshev Polynomials"

A really good centroid for a collection about Chebyshev Polynomials

Page 26: Focused Crawling and Collection Synthesis


Building a Centroid

1. Google("Chebyshev Polynomials") → {url1 … url-n}

2. Let H be a hash (k, v) where k = word, v = frequency

3. For each url in {url1 … url-n} do
     d ← download(url)
     V ← term vector(d)
     For each term t in V do
       if t is not in H, add it; H(t)++

4. Compute tf-idf weights. C ← top 20 terms.
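The same loop, sketched in Java with the download and term-extraction steps stubbed out (download() and termsOf() are hypothetical helpers) and raw frequency standing in for the final tf-idf weighting:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CentroidBuilder {
    // Steps 2-4 of the slide, with raw frequency in place of tf-idf.
    public static List<String> buildCentroid(List<String> urls) {
        Map<String, Integer> h = new HashMap<>();            // k = word, v = freq
        for (String url : urls) {                            // step 3
            String d = download(url);                        // d <- download(url)
            for (String t : termsOf(d)) {                    // V <- term vector(d)
                h.merge(t, 1, Integer::sum);                 // H(t)++
            }
        }
        return h.entrySet().stream()                         // step 4: keep top 20 terms
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(20)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    private static String download(String url) { return ""; }              // stub
    private static List<String> termsOf(String doc) { return List.of(); }  // stub
}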

Page 27: Focused Crawling and Collection Synthesis


Dictionary

• Given centroids C1, C2, C3, …

• Dictionary is C1 + C2 + C3 + …
  – Terms are the union of terms in the Ci
  – Term frequencies are the total frequency across the Ci
  – Document frequency is how many C's have t
  – Term IDF is as from Berkeley

• Dictionary is 300-500 terms
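A sketch of that merge, assuming each centroid is a term-to-frequency map; since the slide does not spell out the Berkeley weighting, the standard log(N/df) IDF is substituted here as an assumption:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CentroidDictionary {
    // df(t) = how many centroids contain t; idf(t) = log(N / df(t)).
    public static Map<String, Double> buildIdf(List<Map<String, Integer>> centroids) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> c : centroids) {
            for (String term : c.keySet()) {
                df.merge(term, 1, Integer::sum);   // how many C's have t
            }
        }
        int n = centroids.size();
        Map<String, Double> idf = new HashMap<>();
        df.forEach((term, count) -> idf.put(term, Math.log((double) n / count)));
        return idf;
    }
}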

Page 28: Focused Crawling and Collection Synthesis


Focused Crawling

• Recall the cartoon for a focused crawl:

• A simple way to do it is with 2 "knobs"

[Diagram repeated from slide 18: the focused crawl tree rooted at R, with off-topic pages marked X and pruned]

Page 29: Focused Crawling and Collection Synthesis


Focusing the Crawl

• Threshold: a page is on-topic if its correlation to the closest centroid is above this value

• Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than the cutoff
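A minimal sketch of the two knobs as crawl-time rules (class and field names are illustrative):

public class FocusRules {
    private final double threshold;   // e.g. 0.3, as on slide 33
    private final int cutoff;         // e.g. 0 or 1

    public FocusRules(double threshold, int cutoff) {
        this.threshold = threshold;
        this.cutoff = cutoff;
    }

    // A page is on-topic when it correlates strongly enough with some centroid.
    public boolean onTopic(double correlationToClosestCentroid) {
        return correlationToClosestCentroid >= threshold;
    }

    // Distance resets to 0 at an on-topic page, grows by 1 per off-topic hop.
    public int childDistance(double correlation, int parentDistance) {
        return onTopic(correlation) ? 0 : parentDistance + 1;
    }

    // Outlinks are followed only while the distance stays within the cutoff.
    public boolean followLinks(int distance) {
        return distance <= cutoff;
    }
}

With threshold = 0.3 and cutoff = 0, for instance, links are followed only from on-topic pages themselves, matching the setting shown on slide 33.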

Page 30: Focused Crawling and Collection Synthesis


Illustration

[Diagram: a crawl tree with cutoff = 1. Pages with correlation >= threshold count as on-topic; links are followed only from pages within one step of an on-topic page.]

Page 31: Focused Crawling and Collection Synthesis


Min-avg-max correlation vs. crawl length

[Plot: correlation (0 to 0.8) vs. number of documents downloaded (0 to 120,000), with series labeled Maximum, Average, Minimum, Closest, and Furthest.]

Page 32: Focused Crawling and Collection Synthesis


Collection “Evaluation”

• Assume higher correlations are good

• With human relevance assessments, one can also compute a “precision” curve

• Precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n. For example, if 15 of the top 20 items are relevant, P(20) = 15/20 = 0.75.

Page 33: Focused Crawling and Collection Synthesis


Cutoff = 0, Threshold = 0.3

Page 34: Focused Crawling and Collection Synthesis


Precision vs. Rank

[Plot: precision (0 to 1) vs. rank (0 to 60), comparing two series: Crawling and Google.]

Page 35: Focused Crawling and Collection Synthesis


Tunneling with Cutoff

• Nugget - dud - dud - … - dud - nugget

  Notation: 0 - X - X - … - X - 0

• Fixed cutoff: 0 - X1 - X2 - … - Xc

• Adaptive cutoff: 0 - X1 - X2 - … - X?

Page 36: Focused Crawling and Collection Synthesis


Statistics Collected

• 500,000 documents

• Number of seeds: 4

• Path data for all but seeds

• 6620 completed paths (0-x…x-0)

• 100,000s incomplete paths (0-x…x..)

Page 37: Focused Crawling and Collection Synthesis


Nuggets that are x steps from a nugget

[Histogram: number of nuggets (0 to 1000) vs. X, the number of links from the nearest nugget (0 to 14).]

Page 38: Focused Crawling and Collection Synthesis


Nuggets that are x steps from a seed and/or a nugget

[Histogram: number of nuggets (0 to 1200) vs. X, the number of links from a seed and/or a nugget (0 to 14), with a separate series for paths from seeds.]

Page 39: Focused Crawling and Collection Synthesis


Better parents have better children.

[Histogram: number of nodes per correlation bracket, comparing the general population with the children of nodes scoring 0.45-0.5; the children skew toward higher correlations.]

Page 40: Focused Crawling and Collection Synthesis


Using the Empirical Observations

• Use the path history

• Use the page quality - cosine correlation

• Current distance should increase exponentially as you get away from quality nodes

Distance = 0 if this is a nugget; otherwise the larger of 1 and (1 - corr) · e^(2 × parent's distance / cutoff)
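The rule above, sketched in Java; reading the slide's "1 or …" as a lower bound of 1 is an interpretation (the ECDL paper has the exact form):

public class AdaptiveCutoff {
    // distance = 0 for a nugget; otherwise at least 1, growing exponentially
    // with the parent's distance so that long off-topic runs die quickly.
    public static double distance(boolean isNugget, double corr,
                                  double parentDistance, double cutoff) {
        if (isNugget) return 0.0;
        double d = (1.0 - corr) * Math.exp(2.0 * parentDistance / cutoff);
        return Math.max(1.0, d);
    }

    // A path is abandoned once its distance exceeds the cutoff.
    public static boolean keepCrawling(double distance, double cutoff) {
        return distance <= cutoff;
    }
}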

Page 41: Focused Crawling and Collection Synthesis


Results

• Details in the ECDL paper

• Smaller frontier → more docs/second

• More documents downloaded in same time

• Higher-scoring documents were downloaded

• Cutoff of 20 averaged 7 steps at the cutoff

Page 42: Focused Crawling and Collection Synthesis


Fall 2002 Student Project

[Dataflow diagram: a query (e.g., "Chebyshev Polynomials") produces a centroid / collection description; the centroids and dictionary configure Mercator, which emits term vectors and the collection's URLs, rendered as HTML.]

Page 43: Focused Crawling and Collection Synthesis


Conclusion

• We’ve covered crawling – history, technology, use

• Focused crawling with tunneling

• Adaptive cutoff with tunneling

• We have a good experimental setup for exploring automatic collection synthesis