Page 1: Web Categorization Crawler – Part I

Web Categorization Crawler – Part I

Mohammed Agabaria, Adam Shobash

Supervisor: Victor Kulikov
Winter 2009/10

Final Presentation, Sep. 2010

Page 2: Web Categorization Crawler – Part I

Contents

- Crawler Overview: Introduction and Basic Flow; Crawling Problems
- Project Technologies
- Project Main Goals
- System High Level Design
- System Design: Crawler Application Design; Frontier Structure; Worker Structure
- Database Design: ERD of DB
- Storage System Design
- Web Application GUI
- Summary

Page 3: Web Categorization Crawler – Part I

Crawler Overview – Intro.

- A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
- The Crawler starts with a list of URLs to visit, called the seeds list
- The Crawler visits these URLs, identifies all the hyperlinks in the page, and adds them to the list of URLs to visit, called the frontier
- URLs from the frontier are recursively visited according to a predefined set of policies

Page 4: Web Categorization Crawler – Part I

Crawler Overview – Basic Flow

The basic flow of a standard crawler is as seen in the illustration and as follows (a minimal code sketch follows below):

- The Frontier, which contains the URLs to visit, is initialized with the seed URLs
- A URL is picked from the frontier and the page at that URL is fetched from the Internet
- The fetched page is parsed in order to: extract hyperlinks from the page; process the page
- The extracted URLs are added to the Frontier
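For concreteness, here is a minimal C# sketch of that loop. It is only an illustration of the flow on this slide: `ProcessPage` and `ExtractLinks` are hypothetical placeholders, not the project's actual classes.

```csharp
// Minimal sketch of the basic crawl loop described above. The helpers
// ProcessPage and ExtractLinks are hypothetical stand-ins.
using System;
using System.Collections.Generic;
using System.Net.Http;

class BasicCrawler
{
    static readonly HttpClient Http = new HttpClient();

    static void Crawl(IEnumerable<string> seeds, int maxPages)
    {
        var frontier = new Queue<string>(seeds);   // initialized with seed URLs
        var seen = new HashSet<string>(seeds);     // URL-seen test
        int fetched = 0;

        while (frontier.Count > 0 && fetched < maxPages)
        {
            string url = frontier.Dequeue();                   // pick a URL
            string page;
            try { page = Http.GetStringAsync(url).Result; }    // fetch the page
            catch (Exception) { continue; }                    // skip failed fetches
            fetched++;

            ProcessPage(url, page);                            // process the page
            foreach (string link in ExtractLinks(page))        // extract hyperlinks
                if (seen.Add(link))
                    frontier.Enqueue(link);                    // add new URLs to frontier
        }
    }

    static void ProcessPage(string url, string page) { /* categorize, store, ... */ }
    static IEnumerable<string> ExtractLinks(string page) { yield break; /* parse hrefs */ }
}
```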

Page 5: Web Categorization Crawler – Part I

Crawling Problems

- The World Wide Web contains a large volume of data, and a crawler can only download a fraction of the Web pages. Thus there is a need to prioritize and speed up downloads, and to crawl only the relevant pages
- Dynamic page generation may cause duplication in the content retrieved by the crawler; it also causes crawler traps: endless combinations of HTTP requests to the same page
- Fast rate of change: pages that were downloaded may have changed since the last time they were visited, so some crawlers may need to revisit pages in order to keep their data up to date

Page 6: Web Categorization Crawler – Part I

Project Technologies

- C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
- ASP.NET, a web application framework
- Relational database and SQL, a database computer language for managing data
- SVN, a revision control system to maintain current and historical versions of files

Page 7: Web Categorization Crawler – Part I

Project Main Goals

- Design and implement a scalable and extensible crawler
- Multi-threaded design in order to utilize all the system resources
- Increase the crawler's performance by implementing efficient algorithms and data structures
- Design the Crawler in a modular way, with the expectation that new functionality will be added by others
- Build a friendly web application GUI exposing all the features supported for controlling and following the crawl progress

Page 8: Web Categorization Crawler – Part I

System High Level Design

[Diagram: the Main GUI (Web Application) stores configurations to and views results from the Database through the Storage System; the Crawler (a Frontier plus worker1, worker2, worker3, ...) loads configurations from and stores results to the same Storage System.]

There are 3 major parts in the system:
- Crawler (server application)
- Storage System
- Web Application GUI (user)

Page 9: Web Categorization Crawler – Part I

Crawler Application Design

- Maintains and activates both the Frontier and the Workers
  - The Frontier is the data structure that holds the URLs to visit
  - A Worker's role is to fetch and process pages
- Multi-threaded: there is a predefined number of Worker threads and a single Frontier thread
- The shared resources must be protected from simultaneous access; the resource shared between the Workers and the Frontier is the queue that holds the URLs to visit (see the sketch after the class listing below)

Class diagram (flattened from the slide):

Crawler
- Fields: _numWorkers : int, _refreshRate : int, _seedList : List, _operationMode : opertationMode_t, _categories : List, _constraints : Constraints, _initializer : Initializer, _serversQueues : List, _feedBackQueue : Queue, _threadsPool : List, _workersPool : List, _frontiersPool : List
- Methods: -SetFlag(in flag : string, in value : string) : bool, -ParseArgument(in args : string) : bool, -SelectTask(in user : string, in task : string) : void, -SetInitializer(in taskId : string) : void, -InitQueues(in taskId : string) : void, -InvokeThreads() : void, -TerminateThreads() : void, +Main() : void

Worker
- Fields: _tasks : Queue, _feedback : Queue, _timer : int, _shouldStop : bool, _fetchers : FetcherManager, _processors : ResourceProcessorManager
- Methods: +run() : void, +RequestStop() : void, +setPollingTimer(in period : int) : void

Frontier
- Fields: _taskQueue : Queue, _serversQueues : List, _timer : int, _limit : int, _checkStatusLimit : int, _shouldStop : bool
- Methods: +scheduleTasks() : void, +RequestStop() : void, +setPollingTimer(in period : int) : void

Multiplicity: one Crawler holds one Frontier and many Workers (1 .. *).
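Since the URL queue is shared between the single Frontier thread and multiple Worker threads, access to it has to be synchronized. A minimal illustrative sketch of one way to do that in C#, using a lock around a plain Queue (the class and member names here are hypothetical, not the project's):

```csharp
// Illustrative sketch: a URL queue shared by the Frontier thread (producer)
// and Worker threads (consumers), protected by a lock. Hypothetical names.
using System.Collections.Generic;

class SharedUrlQueue
{
    private readonly Queue<string> _queue = new Queue<string>();
    private readonly object _sync = new object();

    // Called by the Frontier thread to hand a URL to a worker.
    public void Enqueue(string url)
    {
        lock (_sync) { _queue.Enqueue(url); }
    }

    // Called by a Worker thread; returns false when no work is available.
    public bool TryDequeue(out string url)
    {
        lock (_sync)
        {
            if (_queue.Count > 0) { url = _queue.Dequeue(); return true; }
            url = null;
            return false;
        }
    }
}
```

On .NET 4 and later, System.Collections.Concurrent.ConcurrentQueue<T> provides the same behavior without an explicit lock.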

Page 10: Web Categorization Crawler – Part I

Frontier Structure

- Maintains the data structure that contains all the URLs that have not been visited yet: a FIFO queue*
- Distributes the URLs uniformly between the workers (see the sketch below)

(*) first implementation

[Diagram: the Frontier Queue handles incoming requests (is-seen test, route request, delete request) and routes them to the per-worker Worker Queues.]
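One simple way to distribute URLs uniformly between the worker queues is to hash the URL's host, so that all URLs of one server land on the same worker. This is only a plausible sketch; the project's actual distribution policy is not specified on this slide.

```csharp
// Hypothetical sketch of uniform URL distribution across worker queues.
// Hashing the host keeps all URLs of one server on the same worker,
// which also makes per-server politeness easier to enforce.
using System;
using System.Collections.Generic;

class FrontierRouter
{
    private readonly List<Queue<string>> _workerQueues;

    public FrontierRouter(int numWorkers)
    {
        _workerQueues = new List<Queue<string>>();
        for (int i = 0; i < numWorkers; i++)
            _workerQueues.Add(new Queue<string>());
    }

    public void Route(string url)
    {
        string host = new Uri(url).Host;
        // Mask off the sign bit so the index is always non-negative.
        int index = (host.GetHashCode() & 0x7fffffff) % _workerQueues.Count;
        _workerQueues[index].Enqueue(url);
    }
}
```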

Page 11: Web Categorization Crawler – Part I

Worker Structure

The Worker fetches a page from the Web and processes the fetched page with the following steps (sketched in code below):

- Extracting all the hyperlinks from the page
- Filtering part of the extracted URLs
- Ranking the URL*
- Categorizing the page*
- Writing the results to the database
- Writing the extracted URLs back to the frontier

(*) will be implemented in part II

[Diagram: Worker Queue → Fetcher → Extractor → URL filter → Page Ranker → Categorizer → DB, with the extracted URLs fed back to the Frontier Queue.]
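A minimal C# sketch of those steps follows. The collaborator interfaces stand in for the project's real classes (FetcherManager, Extractor, Filter, Ranker, Categorizer); all names here are hypothetical, and only the ranker signature follows the class diagram shown later.

```csharp
// Minimal sketch of the worker steps listed above; hypothetical names.
using System.Collections.Generic;

interface IFetcher     { string Fetch(string url); }
interface IExtractor   { List<string> ExtractLinks(string page); }
interface IFilter      { List<string> FilterLinks(List<string> links); }
interface IRanker      { int RankUrl(int parentRank, string parentContent, string url); }
interface ICategorizer { List<string> Classify(string page, string url); }

class WorkerSketch
{
    private readonly IFetcher _fetcher;
    private readonly IExtractor _extractor;
    private readonly IFilter _filter;
    private readonly IRanker _ranker;
    private readonly ICategorizer _categorizer;

    public WorkerSketch(IFetcher f, IExtractor e, IFilter fl, IRanker r, ICategorizer c)
    { _fetcher = f; _extractor = e; _filter = fl; _ranker = r; _categorizer = c; }

    // One unit of work: fetch a page, run the processing steps, and
    // return the ranked URLs that should go back to the frontier.
    public Dictionary<string, int> ProcessOne(string url, int rank)
    {
        string page = _fetcher.Fetch(url);                          // fetch
        List<string> links = _filter.FilterLinks(                   // filter
            _extractor.ExtractLinks(page));                         // extract
        List<string> categories = _categorizer.Classify(page, url); // categorize (part II)
        SaveToDatabase(url, categories);                            // write results to DB

        var ranked = new Dictionary<string, int>();
        foreach (string link in links)
            ranked[link] = _ranker.RankUrl(rank, page, link);       // rank (part II)
        return ranked;                                              // back to the frontier
    }

    private void SaveToDatabase(string url, List<string> categories)
    { /* via the Storage System */ }
}
```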

Page 12: Web Categorization Crawler – Part I

Class Diagram of Worker

FetcherManager
- Fields: timeOut : int, resourceFetchers : Map
- Methods: +fetchResource(in url : string) : ResourceContent, +setTimeOut(in timeout : int) : void, +addProtocol(in protocol : ResourceFetcher) : void, +removeProtocol(in protocolId : string) : void

ResourceProcessorManager
- Fields: resourceProcessors : Map
- Methods: +processResource(in resource : ResourceContent) : void, +attachProcessor(in procId : string, in proc : ResourceProcessor) : void, +deattachProcessor() : string

Worker
- Fields: _tasks : Queue, _feedback : Queue, _timer : int, _shouldStop : bool, _fetchers : FetcherManager, _processors : ResourceProcessorManager
- Methods: +run() : void, +RequestStop() : void, +setPollingTimer(in period : int) : void

Multiplicity: each Worker holds one FetcherManager and one ResourceProcessorManager (1–1).

Page 13: Web Categorization Crawler – Part I

Class Diagram of Worker – Cont.

FetcherManager
- Fields: timeOut : int, resourceFetchers : Map
- Methods: +fetchResource(in url : string) : ResourceContent, +setTimeOut(in timeout : int) : void, +addProtocol(in protocol : ResourceFetcher) : void, +removeProtocol(in protocolId : string) : void

ResourceContent
- Fields: url : string, resourceType : ResourceType, resourceContent : string, returnCode : int, rankOfUrl : int
- Methods: +getResourceURL() : string, +getResourceType() : ResourceType, +getResourceContent() : ResourceContent, +getReturnCode() : int, +getRankOfUrl() : int, +isValid() : bool

ResourceProcessorManager
- Fields: resourceProcessors : Map
- Methods: +processResource(in resource : ResourceContent) : void, +attachProcessor(in procId : string, in proc : ResourceProcessor) : void, +deattachProcessor() : string

«interface» ResourceFetcher
- Methods: +fetch(in url : string, in timeOut : int, in rankOfUrl : int) : ResourceContent, +canFetch(in url : string) : bool

HttpResourceFetcher (implements ResourceFetcher)
- Methods: +fetch(in url : string, in timeOut : int, in rankOfUrl : int) : ResourceContent, +canFetch(in url : string) : bool

«interface» ResourceProcessor
- Methods: +process(in resource : ResourceContent) : void, +canProcess(in resource : ResourceContent) : bool

HtmlPageCategorizationProcessor (implements ResourceProcessor)
- Fields: extractor : Extractor, categorizer : Categorizer, ranker : Ranker, filter : Filter, queueFrontier : Frontier, taskId : string
- Methods: +process(in content : ResourceContent) : void, +canProcess(in content : ResourceContent) : bool, +deployResourceToStorage(in result : Result) : void, +deployLinksToFrontier(in urlProcessed : Url) : void, -hashUrl(in urlName : string) : int

Multiplicities: the managers hold many fetchers/processors (1–*) and handle many ResourceContent objects.

Note: the queue is given to the constructor as a reference; the processor class should not allocate a new queue and should use the given reference instead.
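The fetch side is a small plug-in design: FetcherManager dispatches to whichever registered ResourceFetcher says it canFetch the URL. Here is a C# rendering of that contract from the diagram; the method bodies and the dispatch loop are illustrative guesses, and addProtocol takes an explicit id here, a slight adaptation of the diagram's signature.

```csharp
// C# rendering of the ResourceFetcher contract from the class diagram.
// Bodies and dispatch logic are illustrative, not the project's code.
using System.Collections.Generic;

class ResourceContent { /* url, type, content, return code, rank */ }

interface ResourceFetcher
{
    ResourceContent fetch(string url, int timeOut, int rankOfUrl);
    bool canFetch(string url);
}

class HttpResourceFetcher : ResourceFetcher
{
    public bool canFetch(string url)
        => url.StartsWith("http://") || url.StartsWith("https://");

    public ResourceContent fetch(string url, int timeOut, int rankOfUrl)
    { /* issue the HTTP request */ return new ResourceContent(); }
}

class FetcherManager
{
    private int timeOut = 10000;
    private readonly Dictionary<string, ResourceFetcher> resourceFetchers
        = new Dictionary<string, ResourceFetcher>();

    public void addProtocol(string protocolId, ResourceFetcher protocol)
        => resourceFetchers[protocolId] = protocol;

    public void removeProtocol(string protocolId)
        => resourceFetchers.Remove(protocolId);

    // Dispatch to the first registered fetcher that handles this URL.
    public ResourceContent fetchResource(string url)
    {
        foreach (ResourceFetcher f in resourceFetchers.Values)
            if (f.canFetch(url))
                return f.fetch(url, timeOut, 0);
        return null;   // no fetcher registered for this scheme
    }
}
```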

Page 14: Web Categorization Crawler – Part I

Class Diagram of Worker – Cont.

Extractor
- Methods: +extractLinks(in links : List, in page : string) : List, -getText(in page : string) : void, -removeTags(in content : string) : string, -getLinks(in links : List) : void, +getLink(in tag : string) : string, +isRelative(in link : string) : bool

HtmlPageCategorizationProcessor
- Fields: extractor : Extractor, categorizer : Categorizer, ranker : Ranker, filter : Filter, queueFrontier : Frontier, taskId : string
- Methods: +process(in content : ResourceContent) : void, +canProcess(in content : ResourceContent) : bool, +deployResourceToStorage(in result : Result) : void, +deployLinksToFrontier(in urlProcessed : Url) : void, -hashUrl(in urlName : string) : int

Ranker
- Fields: categorizer : Categorizer
- Methods: +rankUrl(in parentRank : int, in parentConent : string, in url : string) : int

Categorizer
- Fields: CategoryList : List
- Methods: +classifyContent(in resource : string, in url : string) : List

Category
- Fields: categoryID : string, ParentName : string, categoryName : string, keywordList : List, confidenceLevel : int
- Methods: +getCategoryID() : string, +getParentName() : string, +getCategoryName() : string, +getConfidenceLevel() : int, +getKeywords() : List, +getMatchLevel(in WordList : string) : int, -canonicForm(in text : string) : string, -synonymousList(in word : string) : List

Filter
- Fields: prefix : string, constraints : Constraints
- Methods: +filterLinks(in linkList : List) : List, +canonize(in link : string) : string

Constraints
- Fields: linkDepth : int, restrictedNetworks : List, crawlNetworks : List, allowUrlParameters : bool
- Methods: +getAllowedDepth() : uint, +getRestrictionList() : List, +getCrawlList() : List, +isParametrizationAllowed() : bool, +isUrlValid(in url : string) : bool, -getUrlDepth(in url : string) : uint, -getUrlNetwork(in url : string) : string, -containsParameter(in url : string) : bool

Multiplicities: the HtmlPageCategorizationProcessor holds one Extractor, one Ranker, one Categorizer, and one Filter; a Categorizer holds many Category objects (1–*); a Filter holds one Constraints.

«requirement» In extractLinks the output will be a list of LinkItem objects, which contain the URL and a list of neighbouring words for further processing.

Notes:
- The Categorizer is given as a reference to the constructor of the Ranker; the Ranker should not allocate a new Categorizer and should use the given reference instead.
- The Category class will be immutable, which means that every property is defined while constructing the object instance.
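Since the notes specify Category as an immutable class, here is a minimal C# sketch of what that implies: every field is set in the constructor and only exposed through getters. The names follow the diagram; the defensive copying is an assumption about how immutability would be enforced.

```csharp
// Sketch of the immutable Category class described in the notes: every
// property is fixed at construction time and only read afterwards.
using System.Collections.Generic;

class Category
{
    private readonly string categoryID;
    private readonly string parentName;
    private readonly string categoryName;
    private readonly List<string> keywordList;
    private readonly int confidenceLevel;

    public Category(string categoryID, string parentName, string categoryName,
                    List<string> keywordList, int confidenceLevel)
    {
        this.categoryID = categoryID;
        this.parentName = parentName;
        this.categoryName = categoryName;
        // Copy the list so later changes by the caller cannot mutate us.
        this.keywordList = new List<string>(keywordList);
        this.confidenceLevel = confidenceLevel;
    }

    public string getCategoryID() { return categoryID; }
    public string getParentName() { return parentName; }
    public string getCategoryName() { return categoryName; }
    public int getConfidenceLevel() { return confidenceLevel; }
    // Return a copy so the internal keyword list stays immutable.
    public List<string> getKeywords() { return new List<string>(keywordList); }
}
```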

Page 15: Web Categorization Crawler – Part I

ERD of Data Base

Tables in the database:
- Task, contains basic details about the task
- TaskProperties, contains the following properties of a task: seed list, allowed networks, restricted networks*
- Results, contains details about the results that the crawler has reached
- Category, contains details about all the categories that have been defined
- Users, contains details about the users of the system**

Task: PK TaskID; TaskName, Status, ElapsedTime, LinkDepth, AllowUrlParam; FK1 UserID
TaskProperties: PK PropID; Property, Value; FK1 TaskID
Results: PK ResultID; Url, Rank, TrustMeter; FK1 CategoryID, FK2 TaskID
Category: PK CategoryID; CategoryName, Keywords, ParentCategory, ConfidenceLevel; FK1 TaskID
Users: PK UserID; UserName, UserPassword

(*) Any other properties can be added and used easily
(**) Not used in the current GUI

Page 16: Web Categorization Crawler – Part I

Storage System

- The Storage System is the connector class between the GUI and the Crawler on one side and the DB on the other
- Using the Storage System you can save data into the database or extract data from it
- The Crawler uses the Storage System to extract the configurations of a task from the DB and to save the results to the DB
- The GUI uses the Storage System to save the configurations of a task into the DB and to extract the results from the DB

A short sketch of this facade follows below.
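To make the division of labor concrete, here is a hypothetical C# sketch of the facade pattern implied by the class diagram on the next slide: one entry point delegating to per-area implementation classes. The method names are taken from that diagram, but the parameters and bodies are assumptions for illustration only.

```csharp
// Hypothetical sketch of the StorageSystem facade: one entry point that
// delegates to per-area storage implementations. Parameter lists are
// assumptions; only the method names follow the class diagram.
using System.Collections.Generic;

interface SettingsStorage
{
    List<string> getSeedList(string taskId);
    void setSeedList(string taskId, List<string> seeds);
}

interface ResultsStorage
{
    void addURLResult(string taskId, string url, int rank);
    List<string> getURLResults(string taskId);
}

class StorageSystem
{
    private readonly SettingsStorage _settingsStorageImp;
    private readonly ResultsStorage _resultsStorageImp;

    public StorageSystem(SettingsStorage settings, ResultsStorage results)
    { _settingsStorageImp = settings; _resultsStorageImp = results; }

    // The Crawler loads its task configuration through the facade...
    public List<string> getSeedList(string taskId)
        => _settingsStorageImp.getSeedList(taskId);

    // ...and stores crawled results through the same facade.
    public void addURLResult(string taskId, string url, int rank)
        => _resultsStorageImp.addURLResult(taskId, url, rank);

    // The GUI saves task configuration the same way.
    public void setSeedList(string taskId, List<string> seeds)
        => _settingsStorageImp.setSeedList(taskId, seeds);
}
```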

Page 17: Web Categorization Crawler – Part I

Class Diagram of Storage System

StorageSystem
- Fields: _storageSystem : StorageSystem, _configurationStorageImp : ConfigurationStorageImp, _categoriesStorageImp : CategoriesStorageImp, _settingsStorageImp : SettingsStorageImp, _resultsStorageImp : ResultsStorageImp
- Methods: +StorageSystem()
- Used by both the Crawler and the WebApplicationGUI

«interface» ConfigurationStorage / «implementation class» ConfigurationStorageImp
- Methods: +GetWorkDetails() : List, +CreateWorkResources() : string, +ReleaseWorkResources() : void, +ChangeWorkDetails() : void

«interface» CategoriesStorage / «implementation class» CategoriesStorageImp
- Interface: +getCategories() : List, +setCategories() : void
- Implementation adds: +resetCategories() : int, +setParentToSon() : void

«interface» SettingsStorage / «implementation class» SettingsStorageImp
- Interface: +getRestrictions() : Constraints, +getSeedList() : List, +setRestrictions() : void, +setSeedList() : void
- Implementation adds: +removeProperty() : int

«interface» ResultsStorage / «implementation class» ResultsStorageImp
- Interface: +getURLResults() : List, +getURLsFromCategory() : List, +getTotalUrls() : ulong, +addURLResult() : void, +replaceURLResult() : void, +removeURLResult() : void
- Implementation adds: -sortListOfResults() : List, -CompareResultByRank() : int, -CompareResultByTrustMeter() : int

Notes:
- The Web Application GUI will use the Storage System API in order to: create and release works (tasks); save all the crawling configuration and options; view results and status for a specified task.
- The Crawler system will use the Storage System API for: getting the work status and execution updates; getting all the crawling configurations; storing all the crawled results.

Page 18: Web Categorization Crawler – Part I

Web Application GUI

- Simple and convenient to use; user friendly
- The user can do the following: edit and create a task; launch the Crawler; view the results that the crawler has reached; stop the Crawler

Page 19: Web Categorization Crawler – Part I

Web Categorization Crawler – Part II

Mohammed Agabaria, Adam Shobash

Supervisor: Victor Kulikov
Spring 2009/10

Final Presentation, Dec. 2010

Page 20: Web Categorization Crawler – Part I

Contents

- Reminder From Part I: Crawler Overview; System High Level Design; Worker Structure; Frontier Structure
- Project Technologies
- Project Main Goals
- Categorizing Algorithm
- Ranking Algorithm: Motivation; Background; Ranking Algorithm
- Frontier Structure – Enhanced: Ranking Trie; Basic Flow
- Summary

Page 21: Web Categorization Crawler – Part I

Reminder: Crawler Overview

- A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner
- The Crawler starts with a list of URLs to visit, called the seeds list
- The Crawler visits these URLs, identifies all the hyperlinks in the page, and adds them to the list of URLs to visit, called the frontier
- URLs from the frontier are recursively visited according to a predefined set of policies

Page 22: Web Categorization Crawler – Part I

Reminder: System High Level Design

[Diagram: the Main GUI (Web Application) stores configurations to and views results from the Database through the Storage System; the Crawler (a Frontier plus worker1, worker2, worker3, ...) loads configurations from and stores results to the same Storage System.]

There are 3 major parts in the system:
- Crawler (server application)
- Storage System
- Web Application GUI (user)

Page 23: Web Categorization Crawler – Part I

Reminder: Worker Structure

The Worker fetches a page from the Web and processes the fetched page with the following steps:

- Extracting all the hyperlinks from the page
- Filtering part of the extracted URLs
- Ranking the URL
- Categorizing the page
- Writing the results to the database
- Writing the extracted URLs back to the frontier

[Diagram: Worker Queue → Fetcher → Extractor → URL filter → Page Ranker → Categorizer → DB, with the extracted URLs fed back to the Frontier Queue.]

Page 24: Web Categorization Crawler – Part I

Reminder: Frontier Structure

- Maintains the data structure that contains all the URLs that have not been visited yet: a FIFO queue*
- Distributes the URLs uniformly between the workers

(*) first implementation

[Diagram: the Frontier Queue handles incoming requests (is-seen test, route request, delete request) and routes them to the per-worker Worker Queues.]

Page 25: Web Categorization Crawler – Part I

Project Technologies

- C# (C Sharp), a simple, modern, general-purpose, object-oriented programming language
- ASP.NET, a web application framework
- Relational database and SQL, a database computer language for managing data
- SVN, a revision control system to maintain current and historical versions of files

Page 26: Web Categorization Crawler – Part I

Project Main Goals

- Support categorization of web pages: try to match the given content to predefined categories
- Support ranking of web pages: build a ranking algorithm that evaluates the relevance (rank) of an extracted link based on the content of the parent page
- A new implementation of the frontier that passes on requests according to their rank; it should be a fast and memory-efficient data structure

Page 27: Web Categorization Crawler – Part I

Categorization Algorithm

- Tries to match the given content to predefined categories
- Every category is described by a list of keywords
- The final match result has two factors:
  - Match Percent, which describes the match between the category keywords and the given content:
    matchPercent = CAT_ALPHA * (sumOfKeywordsAppeared* / numOfContentWords)
  - Non-Zero match, which describes how many different keywords appeared in the content:
    nonZeroBonus = CAT_GAMMA * (nonZeroCount / numOfKeywords)
- The total match level of the content to the category is obtained from the sum of the two factors aforementioned (see the code sketch below):
    total = CONST * (nonZeroBonus + matchPercent)

(*) each keyword has a max limit on how many times it can be counted; any additional appearances won't be counted
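A small C# sketch of the match-level computation as defined above. CAT_ALPHA, CAT_GAMMA, CONST, and the per-keyword cap are tuning constants whose values on this slide are unspecified; the values below are placeholders.

```csharp
// Sketch of the category match level defined above. CAT_ALPHA, CAT_GAMMA,
// CONST and the per-keyword cap MAX_PER_KEYWORD are assumed constants.
using System;
using System.Collections.Generic;
using System.Linq;

static class MatchLevelSketch
{
    const double CAT_ALPHA = 1.0, CAT_GAMMA = 1.0, CONST = 1.0;
    const int MAX_PER_KEYWORD = 5;   // cap on counted appearances per keyword

    public static double MatchLevel(string[] contentWords, List<string> keywords)
    {
        int sumOfKeywordsAppeared = 0;
        int nonZeroCount = 0;
        foreach (string keyword in keywords)
        {
            int count = contentWords.Count(w =>
                string.Equals(w, keyword, StringComparison.OrdinalIgnoreCase));
            // Extra appearances beyond the cap are not counted.
            sumOfKeywordsAppeared += Math.Min(count, MAX_PER_KEYWORD);
            if (count > 0) nonZeroCount++;   // one more distinct keyword seen
        }

        double matchPercent = CAT_ALPHA * sumOfKeywordsAppeared / contentWords.Length;
        double nonZeroBonus = CAT_GAMMA * (double)nonZeroCount / keywords.Count;
        return CONST * (nonZeroBonus + matchPercent);   // total match level
    }
}
```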

Page 28: Web Categorization Crawler – Part I

Categorization Algorithm cont.

Overall categorization process when matching a certain page to a specific category:

[Diagram: the page-content word list and the category keyword list (Keyword 1 ... Keyword n) feed the NonZero Calculator and the Matcher Calculator; their outputs, the NonZero Bonus and the Match Percent, are combined into the Total Match Level.]

Page 29: Web Categorization Crawler – Part I

Ranking Algorithm – Motivation

- The World Wide Web contains a large volume of data, and a crawler can only download a fraction of the Web pages; thus there is a need to prioritize downloads and crawl only the relevant pages
- Solution: give every extracted URL a rank according to its relevance to the categories defined by the user; the frontier will pass on the URLs with the higher rank first
- Relevant pages will be visited first; the quality of the Crawler depends on the correctness of the ranker

Page 30: Web Categorization Crawler – Part I

Ranking Algorithm – Background

- Ranking is a kind of prediction: the rank must be given to a URL when it is extracted from a page
  - It is meaningless to give the page a rank after we have downloaded it
- The content behind the URL is unavailable when it is extracted; the crawler has not downloaded it yet
- The only information we can make use of when the URL is extracted is the page from which the URL was extracted (a.k.a. the parent page)
- Ranking will be done according to the following factors*:
  - The rank given to the parent page
  - The relevance of the parent page content
  - The relevance of the nearby text content of the extracted URL
  - The relevance of the anchor of the extracted URL (the anchor is the text that appears on the link)

* Based on SharkSearch Algorithm

Page 31: Web Categorization Crawler – Part I

Ranking Algorithm – The Formula*

- Predicts the relevance of the content of the page behind the extracted URL
- The final rank of the URL depends on the following factors (rendered in code below):
  - Inherited, which describes the relevance of the parent page to the categories:
    inherited = ALPHA * ParentUrlRank + (1 - ALPHA) * ParentUrlContentRank
  - Neighborhood, which describes the relevance of the nearby text and the anchor of the URL:
    neighborhood = BETA * anchorRank + (1 - BETA) * ContextRank
    where ContextRank is given by:
    ContextRank = 100 if anchorRank > CL, and ContextRank = NearbyTextRank otherwise
- The total rank given to the extracted URL is obtained from the aforementioned factors:
    totalRank = GAMMA * inherited + (1 - GAMMA) * neighborhood

* Based on SharkSearch Algorithm
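Rendered as code, the rank of an extracted URL under these formulas looks as follows. ALPHA, BETA, GAMMA, and the confidence limit CL are tuning constants whose values the slide does not give; the values below are placeholders.

```csharp
// Sketch of the total rank defined above (Shark-Search style). ALPHA,
// BETA, GAMMA and the anchor confidence limit CL are assumed constants.
static class RankSketch
{
    const double ALPHA = 0.5, BETA = 0.5, GAMMA = 0.5;
    const int CL = 50;   // threshold above which the anchor alone decides

    public static int RankUrl(int parentUrlRank, int parentUrlContentRank,
                              int anchorRank, int nearbyTextRank)
    {
        double inherited = ALPHA * parentUrlRank
                         + (1 - ALPHA) * parentUrlContentRank;

        // A convincing anchor pins the context at the maximum of 100;
        // otherwise the text surrounding the link decides.
        double contextRank = anchorRank > CL ? 100 : nearbyTextRank;
        double neighborhood = BETA * anchorRank + (1 - BETA) * contextRank;

        return (int)(GAMMA * inherited + (1 - GAMMA) * neighborhood);
    }
}
```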

Page 32: Web Categorization Crawler – Part I

Frontier Structure – Ranking Trie

- A customized data structure that saves the URL requests efficiently; holds two sub data structures (sketched below):
  - Trie, a data structure that holds URL strings efficiently for the already-seen test; every seen URL is saved in the trie
  - RankTable, an array of entries; each entry holds a list of all the URL requests that have the same rank level, which is specified by the array index
- Supports the URL-seen test in O(|urlString|)
- Supports handing out the URLs with the higher rank first in O(1)
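A compact C# sketch of such a structure follows. The names and details are illustrative; in particular, the rank range is assumed here to lie in 0..100, which is what makes the highest-rank lookup effectively constant time.

```csharp
// Sketch of the RankingTrie on this slide: a character trie for the
// URL-seen test plus a rank-indexed table of request lists.
using System.Collections.Generic;

class RankingTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsUrlEnd;   // marks a complete, already-seen URL
    }

    private readonly Node _root = new Node();
    private readonly List<string>[] _rankTable = new List<string>[101];
    private int _highestRank = -1;   // highest non-empty bucket

    // URL-seen test plus insert in O(|url|); returns false if seen before.
    public bool Add(string url, int rank)
    {
        Node node = _root;
        foreach (char c in url)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        if (node.IsUrlEnd) return false;   // already seen: drop the request
        node.IsUrlEnd = true;

        if (_rankTable[rank] == null) _rankTable[rank] = new List<string>();
        _rankTable[rank].Add(url);
        if (rank > _highestRank) _highestRank = rank;
        return true;
    }

    // Hands out a highest-ranked pending URL; effectively O(1) because
    // the rank range is a small fixed constant.
    public string Next()
    {
        while (_highestRank >= 0)
        {
            List<string> bucket = _rankTable[_highestRank];
            if (bucket != null && bucket.Count > 0)
            {
                string url = bucket[bucket.Count - 1];
                bucket.RemoveAt(bucket.Count - 1);
                return url;
            }
            _highestRank--;
        }
        return null;   // no pending requests
    }
}
```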

Page 33: Web Categorization Crawler – Part I

Frontier Structure – Overall

- The Frontier is based on the RankingTrie data structure and saves/updates all newly forwarded requests in it
- When a new URL request arrives, the frontier simply adds it to the RankingTrie
- When the frontier needs to route a request, it takes the highest-ranked request saved in the RankingTrie and routes it to the suitable worker queue

[Diagram: Frontier Queue → RankingTrie → Route Request → Worker Queues.]
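Combining the two earlier sketches, one routing step of the frontier could look like this (again, hypothetical names, not the project's code):

```csharp
static class FrontierStep
{
    // One routing step: pop the highest-ranked URL from the RankingTrie
    // (slide 32 sketch) and route it to a worker queue (slide 10 sketch).
    public static void RouteOne(RankingTrie trie, FrontierRouter router)
    {
        string url = trie.Next();   // highest-ranked request, or null if empty
        if (url != null)
            router.Route(url);      // enqueue on the suitable worker queue
    }
}
```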

Page 34: Web Categorization Crawler – Part I

Summary

Goals achieved:
- Understanding ranking methods, especially the Shark Search
- Implementing the categorizing algorithm
- Implementing an efficient frontier which supports ranking
- Implementing a multithreaded Web Categorization Crawler with full functionality