Crawling the Infinite Web: Five Levels are Enough
Ricardo Baeza-Yates and Carlos Castillo
Center for Web Research
www.cwr.cl
WAW 2004
1 Introduction
2 Models
3 Experiments
4 Summary
Introduction
Dynamic page: “a page which is created on request”
Dynamic pages with links to other dynamic pages
Malicious: loops and/or near-duplicates
Legitimate: recommendation systems, calendars, iterative algorithms, etc.
The number of pages on the Web can be considered infinite
Conflicting interests
Web site administrator: would like to have all of the Web site indexed
Search engine administrator: would like to use the available network and storage capacity efficiently
Search engine user: would like to find what they are looking for
Our approach
Users do not go very deep inside Web sites
If something is important it has to be easily reachable
We will download only a few levels of each Web site
How many levels?
How much do you lose?
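The idea of downloading only a few levels of each site can be sketched as a breadth-first crawl with a depth cutoff. This is a minimal illustration, not the crawler used by the authors: `crawl_levels`, `get_links`, and the toy `calendar_links` site are hypothetical names, and the start page is assumed to be level 0.

```python
from collections import deque


def crawl_levels(start, get_links, max_depth):
    """Breadth-first crawl that records each page's level and stops at max_depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links out of the deepest allowed level
        for target in get_links(page):
            if target not in depth:  # first visit gives the shortest distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def calendar_links(page):
    # Hypothetical "infinite" site: every calendar page links to the next month.
    month = int(page.rsplit("/", 1)[1])
    return [f"/calendar/{month + 1}"]


pages = crawl_levels("/calendar/0", calendar_links, max_depth=4)
print(len(pages))  # 5 pages: levels 0 to 4
```

Without the cutoff this loop would never terminate on the calendar site, which is the sense in which the number of pages can be considered infinite.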
Models
Navigating a tree ≈ Moving through levels
Actions
Possible actions at a given level
Type of models we study
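There is a set of atomic actions $A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$
$\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
$\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
The probability $\Pr(\text{next} \mid \ell)$ is constant
Stationary distribution → how much time users spend at each level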
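To make the link between action probabilities and the stationary distribution concrete, here is a rough sketch that builds a transition matrix over levels and approximates its stationary distribution by power iteration. Several things are assumptions made for illustration only: the chain is truncated at `max_level`, only four of the six actions are used, the same probabilities apply at every level, and the numeric values are invented.

```python
def target_level(action, level, max_level):
    """Level reached from `level` under a simplified reading of each action."""
    if action == "next":
        return min(level + 1, max_level)  # truncate the infinite chain
    if action == "back":
        return max(level - 1, 0)
    if action == "start":
        return 0
    return level  # "stay"


def build_matrix(action_probs, max_level):
    """Row-stochastic matrix over levels 0..max_level."""
    if abs(sum(action_probs.values()) - 1.0) > 1e-12:
        raise ValueError("action probabilities must sum to 1")
    size = max_level + 1
    matrix = [[0.0] * size for _ in range(size)]
    for level in range(size):
        for action, prob in action_probs.items():
            matrix[level][target_level(action, level, max_level)] += prob
    return matrix


def stationary(matrix, iterations=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    size = len(matrix)
    dist = [1.0 / size] * size
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size))
                for j in range(size)]
    return dist


# Invented probabilities, identical at every level (an assumption).
action_probs = {"next": 0.3, "back": 0.3, "start": 0.2, "stay": 0.2}
matrix = build_matrix(action_probs, max_level=5)
dist = stationary(matrix)
for level, share in enumerate(dist):
    print(level, round(share, 4))
```

Each entry of `dist` estimates the fraction of time a user following these probabilities spends at that level.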
Model A
Forwards and backwards one level at a time
Birth and death process
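As an illustration of why a birth and death process gives a simple answer, the sketch below uses one possible parameterization, which is an assumption and not taken from the slides: the user moves to the next level with probability `q`, and otherwise moves back one level (or stays put at level 0). Under that assumption detailed balance gives a geometric distribution with ratio `q / (1 - q)`, which requires `q < 0.5`.

```python
def model_a_level_probability(q, level):
    """Stationary probability of `level` under the assumed parameterization."""
    if not 0 < q < 0.5:
        raise ValueError("this closed form needs 0 < q < 0.5")
    ratio = q / (1 - q)
    return (1 - ratio) * ratio ** level


def model_a_cumulative(q, k):
    """Probability of being at one of the levels 0..k."""
    ratio = q / (1 - q)
    return 1 - ratio ** (k + 1)


q = 0.3  # hypothetical value
for level in range(5):
    print(level, round(model_a_level_probability(q, level), 4))
```

The balance condition being used is that the flow from level i up to level i+1 equals the flow back down: x_i * q = x_{i+1} * (1 - q).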
Model B
Back to first level
Birth and death process with extinction
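A quick simulation can show what "back to first level" does to the time spent per level. The setup is an assumption made for illustration: at each step the user goes one level deeper with probability `q` and otherwise returns to level 0. Under that assumption the stationary probability of level i works out to (1 - q) * q^i, which the simulated frequencies should approach. The function name and the value of `q` are hypothetical.

```python
import random
from collections import Counter


def simulate_model_b(q, steps, seed=0):
    """Fraction of steps spent at each level in a seeded random walk."""
    rng = random.Random(seed)
    level = 0
    visits = Counter()
    for _ in range(steps):
        visits[level] += 1
        # Go one level deeper with probability q, otherwise restart at level 0.
        level = level + 1 if rng.random() < q else 0
    return [visits[i] / steps for i in range(max(visits) + 1)]


q = 0.6  # hypothetical value
observed = simulate_model_b(q, steps=200_000)
expected = [(1 - q) * q ** i for i in range(len(observed))]
for level in range(5):
    print(level, round(observed[level], 3), round(expected[level], 3))
```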
Model C
Back to any previous level
Birth and death process with extinction and disaster?
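One possible reading of "back to any previous level" can be explored numerically. The sketch assumes, without support from the slides, that the user goes one level deeper with probability `q` and otherwise jumps to a level chosen uniformly among 0..current (the current level included). The chain is truncated at `max_level`, and the stationary distribution is approximated by power iteration rather than by a closed form.

```python
def model_c_matrix(q, max_level):
    """Transition matrix for the assumed Model C rules, truncated at max_level."""
    size = max_level + 1
    matrix = [[0.0] * size for _ in range(size)]
    for level in range(size):
        matrix[level][min(level + 1, max_level)] += q  # go deeper (capped)
        for earlier in range(level + 1):
            # Jump back to a uniformly chosen level in 0..level.
            matrix[level][earlier] += (1 - q) / (level + 1)
    return matrix


def stationary(matrix, iterations=1500):
    """Approximate the stationary distribution by repeated multiplication."""
    size = len(matrix)
    dist = [1.0 / size] * size
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size))
                for j in range(size)]
    return dist


model_c = model_c_matrix(q=0.5, max_level=20)  # hypothetical parameters
levels = stationary(model_c)
for level in range(6):
    print(level, round(levels[level], 4))
```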
Cumulative probability of levels 0 . . . k
Based on solutions given in the paper
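Given any per-level distribution, the cumulative probability of levels 0 . . . k is a running sum, and the question "how many levels?" becomes finding the smallest k that reaches a target coverage. The helper names and the example distribution below are hypothetical and are not the solutions from the paper.

```python
from itertools import accumulate


def cumulative(level_probs):
    """Cumulative probability of levels 0..k for each k."""
    return list(accumulate(level_probs))


def levels_needed(level_probs, coverage):
    """Smallest k such that levels 0..k hold at least `coverage` of the time."""
    for k, total in enumerate(cumulative(level_probs)):
        if total >= coverage:
            return k
    raise ValueError("coverage not reached by the given levels")


example = [0.5, 0.25, 0.125, 0.125]  # invented distribution over four levels
print(cumulative(example))           # [0.5, 0.75, 0.875, 1.0]
print(levels_needed(example, 0.8))   # 2
```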