Complex Queries over Web Repositories

Sriram Raghavan and Hector Garcia-Molina, Computer Science Department, Stanford University

Gülfem IŞIKLAR, M.Mirac KOCATÜRK

Jan 13, 2016

Transcript
Page 1: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Complex Queries over Web Repositories

Sriram Raghavan and Hector Garcia-Molina

Computer Science Department

Stanford University

Gülfem IŞIKLAR

M.Mirac KOCATÜRK

Page 2: Gülfem IŞIKLAR M.Mirac KOCATÜRK

19.11.2003 Complex Queries over Web Repositories

Outline

Introduction

Challenges and Solution Approach

Model of a Web Repository

Query Operators

Examples of Complex Queries

Optimizing and Executing Complex Web Queries

Conclusion

Page 3: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Introduction

Web repositories manage large heterogeneous collections of Web pages and associated indexes.

For effective analysis and mining, these repositories must provide a declarative query interface that supports complex expressive Web queries.

In this paper, we model a Web repository in terms of “Web relations” and describe an algebra for expressing complex Web queries.

Page 4: Gülfem IŞIKLAR M.Mirac KOCATÜRK


Page 5: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Example 1: Let S be a weighted set consisting of all the pages in the stanford.edu domain that contain the phrase “Mobile networking”. Compute R, the set of all the “.edu” domains (except stanford.edu) that pages in S point to (we say a page p points to domain D if it points to any page in D). List the top-10 domains in R in descending order of their weights.
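A minimal sketch of how Example 1 could be evaluated over in-memory data. The slide does not say how the weight of a domain is derived, so this sketch assumes it is the sum of the weights of the pages in S that point to it. The helper names (`domain_of`, `top_edu_domains`), the URLs, and the weights are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domain_of(url):
    """Return the last two labels of the host name, e.g. 'stanford.edu'."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def top_edu_domains(S, links, k=10):
    """S maps page URL -> weight; links maps page URL -> list of target URLs.

    Assumption: a domain's weight is the sum of the weights of the pages
    in S that point to at least one page in that domain.
    """
    weights = defaultdict(float)
    for page, w in S.items():
        # A page points to a domain if it points to any page in it,
        # so each domain is counted once per source page.
        targets = {domain_of(dest) for dest in links.get(page, [])}
        for d in targets:
            if d.endswith(".edu") and d != "stanford.edu":
                weights[d] += w
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical weighted set S and link data.
S = {"http://cs.stanford.edu/a": 0.5, "http://ee.stanford.edu/b": 0.25}
links = {
    "http://cs.stanford.edu/a": [
        "http://www.berkeley.edu/x",
        "http://www.berkeley.edu/y",
        "http://www.mit.edu/z",
        "http://www.stanford.edu/",
    ],
    "http://ee.stanford.edu/b": ["http://web.mit.edu/q", "http://example.com/"],
}
top = top_edu_domains(S, links)
```

With this data, mit.edu collects weight from both pages and berkeley.edu from one, while stanford.edu and the non-.edu target are filtered out.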

Page 6: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Example 2: Each comic strip C is associated with a website C_S and a set C_W containing the name of the strip and the names of the characters featured in that strip. For example,

Dilbert_W = {Dilbert, Dogbert, The Boss} and

Dilbert_S = dilbert.com.

Extract a set S of at most 10000 pages from the stanford.edu domain, preferring pages whose URLs either include the “~” character or include the path fragment “/people/”.

For each comic strip C, compute f1(C), the number of pages in S that contain the words in C_W, and f2(C), the number of pages in C_S that pages in S point to.

f1(C) + f2(C) is a measure of popularity for comic strip C.

Page 7: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Challenges and Solution Approach

Query models used in relational or text retrieval systems provide some, but not all of the features required to support Web queries.

Thus, treating a Web repository as an application of a text retrieval system will support the “document collection” view. However, queries involving navigation or relational operators will be extremely hard to formulate and execute.

On the other hand, the relational model provides a rich and well-tested suite of operators for expressing complex predicates over Web page attributes. However, ranks and orders are not intrinsic to the basic relational model.

Page 8: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Model of a Web Repository

Page: We use the term “page” to refer to any Web resource that is referenced by a URL, crawled, and stored in the repository.

Link: We use the term “link” to refer to any hypertext link that is embedded in the pages in the repository. Each link is associated with a source page (the page in which the hypertext link occurs), a destination page (the page that the link refers to), and a unique identifier linkID.

Page 9: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Ordered relation: Given a relation R and a strict partial ordering >_R on the tuples of R, we refer to the pair [R; >_R] as an ordered relation on R.

For instance, we define an ordered relation [R; >_R] = [R; {a >_R d; a >_R e; b >_R d; b >_R e; c >_R d; c >_R e}], where each tuple whose domain attribute is stanford.edu is >_R-related to any tuple outside the stanford.edu domain.
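One way to picture the ordered relation above is as a set of tuples plus a set of (preferred, less preferred) pairs. The sketch below is illustrative only: the tuple names a to e follow the slide, but the concrete domain assigned to d and e is an assumption.

```python
# Hypothetical domain attribute for each tuple; only "stanford.edu vs. other"
# matters here, so the domain chosen for d and e is an assumption.
domain = {
    "a": "stanford.edu", "b": "stanford.edu", "c": "stanford.edu",
    "d": "berkeley.edu", "e": "berkeley.edu",
}

# The strict partial order >_R as a set of (greater, lesser) pairs:
# every stanford.edu tuple is preferred to every non-stanford.edu tuple.
order = {
    (x, y)
    for x in domain for y in domain
    if domain[x] == "stanford.edu" and domain[y] != "stanford.edu"
}

def is_strict_partial_order(pairs):
    """Check irreflexivity, asymmetry, and transitivity of a set of pairs."""
    irreflexive = all(x != y for x, y in pairs)
    asymmetric = all((y, x) not in pairs for x, y in pairs)
    transitive = all(
        (x, w) in pairs for x, y in pairs for z, w in pairs if y == z
    )
    return irreflexive and asymmetric and transitive
```

Tuples within the same group (for example a and b) stay incomparable, which is what makes the order partial rather than total.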

Page 10: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Ranked relation: Given a relation R and a function w that assigns weights (normalized to the range [0,1]) to the tuples of R, we can define a new relation [R; w] that is simply R with an additional implicit real-valued attribute w.

We refer to [R; w] as a ranked relation on R and to R as the “base relation” of [R; w].
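As a rough illustration, a ranked relation can be built by attaching a weight to every tuple of a base relation. The function name `make_ranked`, the attribute names, and the weight function (in-degree scaled by the maximum) are assumptions made for this sketch.

```python
def make_ranked(relation, w):
    """Return [R; w]: each tuple of R extended with an implicit attribute 'w'."""
    ranked = []
    for t in relation:
        weight = w(t)
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weights must be normalized to the range [0, 1]")
        ranked.append({**t, "w": weight})
    return ranked

# Hypothetical base relation R.
R = [{"pageID": 1, "inDegree": 4}, {"pageID": 2, "inDegree": 1}]
max_in = max(t["inDegree"] for t in R)

# Hypothetical weight function: in-degree divided by the largest in-degree.
ranked_R = make_ranked(R, lambda t: t["inDegree"] / max_in)

# Dropping the implicit attribute recovers the base relation.
base_R = [{k: v for k, v in t.items() if k != "w"} for t in ranked_R]
```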

Page 11: Gülfem IŞIKLAR M.Mirac KOCATÜRK

We model a Web repository as a 6-tuple

W = (I_P; I_L; W_R; P; L; F):

I_P (resp. I_L) is an identifier space from which the pageID (resp. linkID) for every page (resp. link) is chosen.

W_R is a set of plain, ranked, or ordered relations called Web relations. A relation R is said to be a Web relation if it contains at least one attribute whose domain is I_P, I_L, 2^{I_P}, or 2^{I_L}.

P ∈ W_R is a universal page relation. P contains one tuple for each page in the repository and one column for each page attribute.

P = (pageID, ...)

Page 12: Gülfem IŞIKLAR M.Mirac KOCATÜRK

L ∈ W_R is a universal link relation. L contains one tuple for each hyperlink in the repository and one column for every available link attribute.

L = (linkID, srcID, destID, ...)

F is a set of predefined page and link ranking functions that have been registered in the repository.
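The components of the repository model can be pictured with a few small data classes. This is only a sketch: the class names, the `url` attribute, the use of integers for the identifier spaces I_P and I_L, and the `in_degree_rank` function are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

PageID = int  # stand-in for the identifier space I_P (assumption)
LinkID = int  # stand-in for the identifier space I_L (assumption)

@dataclass(frozen=True)
class Page:
    pageID: PageID
    url: str  # hypothetical page attribute

@dataclass(frozen=True)
class Link:
    linkID: LinkID
    srcID: PageID
    destID: PageID

@dataclass
class WebRepository:
    P: List[Page]                                   # universal page relation
    L: List[Link]                                   # universal link relation
    WR: Dict[str, list] = field(default_factory=dict)   # named Web relations
    F: Dict[str, Callable] = field(default_factory=dict)  # ranking functions

    def __post_init__(self):
        # P and L are themselves members of the set of Web relations.
        self.WR.setdefault("P", self.P)
        self.WR.setdefault("L", self.L)

def in_degree_rank(page, repo):
    """Hypothetical ranking function: share of all links that point to page."""
    incoming = sum(1 for l in repo.L if l.destID == page.pageID)
    return incoming / max(1, len(repo.L))

repo = WebRepository(
    P=[Page(1, "http://a.example/"), Page(2, "http://b.example/")],
    L=[Link(10, srcID=1, destID=2)],
    F={"inDegree": in_degree_rank},
)
```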

Page 13: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Query Operators

Page 14: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Unary Relational Operators

Select (σ): For an ordered relation, we define σ([R; >_R]) = [S; >_S], where S = σ(R) and >_S is defined as a >_S b iff a >_R b and a, b ∈ S.

For instance, referring to relation R, given the ordered relation [R; >_R] = [R; {a >_R d; a >_R e; b >_R d; b >_R e; c >_R d; c >_R e}], representing “preference for stanford.edu pages over berkeley.edu pages”, σ_{pInDegree≥4}([R; >_R]) will yield [S; >_S], where S = σ_{pInDegree≥4}(R) = {b, d, e}, and >_S includes the two ordering conditions b >_S d and b >_S e.
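The selection example can be replayed in a few lines. The pInDegree values below are hypothetical (the original figure is not part of the transcript); they are chosen only so that b, d, and e satisfy the predicate, as the slide states.

```python
def select_ordered(tuples, order, predicate):
    """Apply a selection to an ordered relation.

    tuples maps a tuple name to its attributes; order is a set of
    (greater, lesser) pairs. The result keeps only the ordering pairs
    whose two sides both survive the selection.
    """
    S = {name for name, attrs in tuples.items() if predicate(attrs)}
    order_S = {(x, y) for (x, y) in order if x in S and y in S}
    return S, order_S

# Hypothetical attribute values consistent with the slide's outcome.
R = {
    "a": {"pInDegree": 2}, "b": {"pInDegree": 5}, "c": {"pInDegree": 1},
    "d": {"pInDegree": 4}, "e": {"pInDegree": 7},
}
order_R = {("a", "d"), ("a", "e"), ("b", "d"), ("b", "e"), ("c", "d"), ("c", "e")}

S, order_S = select_ordered(R, order_R, lambda t: t["pInDegree"] >= 4)
```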

Page 15: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Projection (Π): The semantics of projection carry over from traditional relational algebra, with the following changes:

• The result of the projection must include at least one attribute whose domain is I_P, I_L, 2^{I_P}, or 2^{I_L}, to ensure that the result is a Web relation.

• Projection on a ranked relation [R; w] will retain the ranking attribute R.w even if it is not listed in the projection list.

Page 16: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Group-by (γ) on plain Web relation: The γ operator is used to group the incoming links to WS based on the language of the page in which the links occur.

Page 17: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Group-by (γ) on ranked Web relation: The value of t.w’ for a tuple t ∈ [S; w’] is computed by averaging the ranks of all the tuples of R belonging to the corresponding group.
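The averaging rule for group-by on a ranked relation can be sketched as follows. The relation contents, the `language` grouping attribute, and the function name `group_by_ranked` are assumptions made for the example.

```python
from collections import defaultdict

def group_by_ranked(ranked, key):
    """Group a ranked relation on `key`; each group's rank w' is the
    average of the ranks of the tuples that fall into that group."""
    groups = defaultdict(list)
    for t in ranked:
        groups[t[key]].append(t["w"])
    return [{key: k, "w": sum(ws) / len(ws)} for k, ws in groups.items()]

# Hypothetical ranked relation of links, with the language of the page
# in which each link occurs.
R = [
    {"linkID": 1, "language": "English", "w": 0.5},
    {"linkID": 2, "language": "English", "w": 1.0},
    {"linkID": 3, "language": "French", "w": 0.25},
]
S = group_by_ranked(R, "language")
```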

Page 18: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Group-by (γ) on ordered Web relation: The partial ordering >_R is used to express the following preference: “prefer pages with depths ≤ 3”. Thus, b >_R c, g >_R e, etc., as shown in the diagram.

Page 19: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Binary Relational Operators

Union (∪): We define [X; >_X] ∪ [Y; >_Y] = [Z; >_Z], where

• Z = X ∪ Y

• If a >_X b and a >_Y b, then a >_Z b

• If either a >_X b and b ∉ Y or a >_Y b and b ∉ X, then a >_Z b
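The union rules translate directly into a small function. The sample relations below are hypothetical; the function applies exactly the three bullet rules and nothing more.

```python
def union_ordered(X, order_X, Y, order_Y):
    """Union of two ordered relations given as (set, set of ordering pairs)."""
    Z = X | Y
    order_Z = set()
    for a, b in order_X | order_Y:
        in_x = (a, b) in order_X
        in_y = (a, b) in order_Y
        # Keep a > b if both inputs agree, or if one input asserts it
        # and b does not occur in the other input at all.
        if (in_x and in_y) or (in_x and b not in Y) or (in_y and b not in X):
            order_Z.add((a, b))
    return Z, order_Z

# Hypothetical operands.
X, order_X = {"a", "b", "c"}, {("a", "b"), ("a", "c")}
Y, order_Y = {"a", "b", "d"}, {("a", "b"), ("d", "b")}

Z, order_Z = union_ordered(X, order_X, Y, order_Y)
```

Here d > b is dropped because b also occurs in X and X does not assert that preference, while a > c is kept because c does not occur in Y.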

Page 20: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Set-difference (−): Note that the result has a partial order which is simply >_X restricted to the elements present in the result.

[X; >_X] − [Y; >_Y] = [Z; >_Z],

where Z = X − Y and a >_Z b iff a >_X b and a, b ∈ Z

Page 21: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Intersection (∩):

[X; >_X] ∩ [Y; >_Y] = [Z; >_Z], where

Z = X ∩ Y and a >_Z b iff a, b ∈ Z, a >_X b, and a >_Y b
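The intersection rule, together with the set-difference rule from the previous slide, can be sketched in the same style. The operand contents are hypothetical.

```python
def difference_ordered(X, order_X, Y):
    """[X; >_X] - [Y; >_Y]: keep >_X restricted to the surviving elements."""
    Z = X - Y
    return Z, {(a, b) for a, b in order_X if a in Z and b in Z}

def intersection_ordered(X, order_X, Y, order_Y):
    """[X; >_X] ∩ [Y; >_Y]: keep only orderings asserted by both inputs."""
    Z = X & Y
    return Z, {(a, b) for a, b in order_X & order_Y if a in Z and b in Z}

# Hypothetical operands.
X, order_X = {"a", "b", "c"}, {("a", "b"), ("a", "c"), ("b", "c")}
Y, order_Y = {"b", "c", "d"}, {("b", "c"), ("d", "c")}

diff_set, diff_order = difference_ordered(X, order_X, Y)
inter_set, inter_order = intersection_ordered(X, order_X, Y, order_Y)
```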

Page 22: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Cross-product (×): Cross-product operations can involve any pair of plain, ranked, or ordered relations. The challenge is to define the ordering or ranking of the result for each possible combination of operands.

Page 23: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Ranking and ordering operators

Rank (Ψ): Operator Ψ simply formalizes the act of applying a ranking function to a base relation. Thus, given a relation R and ranking function f : R × {R} → [0, 1], we define Ψ(f; R) = [R; f].

Compose (Θ_{h,op}): The compose operator Θ is used to merge two ranked relations to produce another ranked relation.

Page 24: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Order (Φ): The operator Φ constructs an ordered relation, given either a ranked relation or a plain base relation.

When applied on a ranked relation, Φ([R; f]) returns the corresponding ordered relation [R; >_f].
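A small sketch of Φ applied to a ranked relation. The slide does not spell out >_f, so this example assumes the natural reading: a >_f b exactly when f(a) > f(b), which leaves tuples with equal ranks incomparable. The weights are hypothetical.

```python
def order_from_rank(ranked):
    """Derive the ordering induced by the ranks.

    Assumption: a >_f b iff f(a) > f(b). `ranked` maps a tuple name
    to its weight in [0, 1].
    """
    return {(a, b) for a in ranked for b in ranked if ranked[a] > ranked[b]}

# Hypothetical ranked relation [R; f].
ranked_R = {"a": 0.9, "b": 0.4, "c": 0.4}
order_f = order_from_rank(ranked_R)
```

Because b and c have the same rank, neither is preferred to the other, so the result is a strict partial order rather than a total one.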

Page 25: Gülfem IŞIKLAR M.Mirac KOCATÜRK

Prune (Ω_k): The prune operator provides a mechanism for retrieving a fixed-size subset of tuples from a relation. In particular, given a relation R, Ω_k(R) selects a subset of size min(k, |R|).

For example, consider the ordered relation [R; {a > b; a > c; a > e; f > b; f > c; f > e}] shown in the previous figure, corresponding to the preference for “.com” domains over “.org” domains.

Ω_4 on this relation can yield any set of four tuples as long as at least a and f are part of the result (thus, 6 possible results). One possible result of applying Ω_4 is [{a; f; e; d}; {a > e; f > e}].
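The count of 6 possible results can be checked by enumeration. This sketch assumes that R = {a, b, c, d, e, f}, with d unrelated to the other tuples (d appears in the sample result but in no ordering pair), and that a subset is an acceptable result when no excluded tuple is preferred over an included one. Both points are inferred from the example rather than stated on the slide.

```python
from itertools import combinations

def prune_candidates(tuples, order, k):
    """Enumerate the acceptable results of pruning to min(k, |R|) tuples.

    Assumed rule: a subset is rejected if some tuple left out of it is
    preferred (by `order`) over a tuple kept in it.
    """
    size = min(k, len(tuples))
    results = []
    for subset in combinations(sorted(tuples), size):
        chosen = set(subset)
        if any(x not in chosen and y in chosen for x, y in order):
            continue
        restricted = {(x, y) for x, y in order if x in chosen and y in chosen}
        results.append((chosen, restricted))
    return results

R = {"a", "b", "c", "d", "e", "f"}
order_R = {("a", "b"), ("a", "c"), ("a", "e"), ("f", "b"), ("f", "c"), ("f", "e")}

candidates = prune_candidates(R, order_R, 4)
```

Every acceptable subset contains a and f plus two of the four remaining tuples, which gives the 6 results mentioned on the slide.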

Page 26: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS

NAVIGATION OPERATORS

→Λ denotes forward navigation.

←Λ denotes backward navigation.

These operators are expressed in terms of cross-product and group-by operations.
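To illustrate how navigation reduces to a cross product followed by a group-by, here is a sketch over plain (unranked, unordered) relations. The function names and the sample links are hypothetical, and the handling of ranks and orderings is left out.

```python
from collections import defaultdict
from itertools import product

def forward_navigate(pages, links):
    """Follow links out of `pages`.

    Step 1: cross product of pages and links, keeping the pairs in which
    the link starts at the page. Step 2: group by destination page.
    Returns destination pageID -> set of source pageIDs pointing to it.
    """
    joined = [(p, l) for p, l in product(pages, links) if l["srcID"] == p]
    groups = defaultdict(set)
    for p, l in joined:
        groups[l["destID"]].add(p)
    return dict(groups)

def backward_navigate(pages, links):
    """Follow links into `pages` by reversing every link first."""
    reversed_links = [{"srcID": l["destID"], "destID": l["srcID"]} for l in links]
    return forward_navigate(pages, reversed_links)

# Hypothetical link relation (linkID omitted for brevity).
links = [
    {"srcID": 1, "destID": 3},
    {"srcID": 2, "destID": 3},
    {"srcID": 2, "destID": 4},
    {"srcID": 5, "destID": 1},
]
forward = forward_navigate({1, 2}, links)
backward = backward_navigate({3}, links)
```

A real engine would replace the explicit cross product with an index lookup; it is written out here only to mirror the algebraic definition.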

Page 27: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS

Navigation operators come in two forms:

1. Binary navigation operator

2. Unary navigation operator

Page 28: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS

Binary Navigation Operator

1. Navigation with Ranking

2. Navigation with Ordering

a. Ordering only on pages

b. Ordering both pages and links

Page 29: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS (navigation with ordering)

[R, >_R] = Φ_{pLanguage=English > pLanguage≠English}(R)

[S, >_S] = Φ_{IntraDomain=Yes > IntraDomain=No}(S)

Page 30: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS (navigation with ranking)

In this example, the operands are [R, f] and [S, g].

Page 31: Gülfem IŞIKLAR M.Mirac KOCATÜRK

QUERY OPERATORS

UNARY NAVIGATION OPERATORS

Instead of choosing from a set of tuples, these operators permit navigation using all available data links in the repository.

So if R is ordered and ranked, then each neighbour will also be correspondingly ordered and ranked.

Page 32: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXAMPLES OF COMPLEX QUERIES

Example 1

Page 33: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXAMPLES OF COMPLEX QUERIES

Example 2

Page 34: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXAMPLES OF COMPLEX QUERIES

Example 3

Page 35: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXAMPLES OF COMPLEX QUERIES

Example 4

Page 36: Gülfem IŞIKLAR M.Mirac KOCATÜRK

OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES

An optimizer and execution engine is developed to execute complex queries efficiently.

The challenges for the system are:

1. Certain unique features of Web data sets

2. The storage structures used in Web repositories

3. Characteristics of complex Web queries

Page 37: Gülfem IŞIKLAR M.Mirac KOCATÜRK

OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES

As with join operations in relational queries, optimization of navigation operations is crucial for Web queries.

There are two techniques to optimize navigation operations:

1. Exploit Query Locality

2. Exploit Prune

Page 38: Gülfem IŞIKLAR M.Mirac KOCATÜRK

OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES

PAGE CLUSTERS

To identify and exploit locality during query execution, we partition the entire set of pages in the repository into page clusters.

We attempt to group together “related” pages so that all the pages relevant to a complex query are distributed among a relatively small number of clusters.

Page 39: Gülfem IŞIKLAR M.Mirac KOCATÜRK

OPTIMIZING AND EXECUTING COMPLEX WEB QUERIES

S-NODE REPRESENTATION

The supernode graph resides in memory.

Graph chunks are loaded from disk on demand.
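The two statements above suggest a familiar pattern: a small index kept in memory and larger pieces fetched lazily. The sketch below illustrates that pattern only; it is not the actual S-Node structure. The class name, the page-to-supernode dictionary, the dictionary standing in for the disk, and the least-recently-used eviction policy are all assumptions.

```python
from collections import OrderedDict

class ChunkedGraph:
    """Keeps a page -> supernode map in memory and loads the adjacency
    chunk of a supernode only when one of its pages is first visited."""

    def __init__(self, supernode_of, load_chunk, cache_size=2):
        self.supernode_of = supernode_of  # in-memory index (assumption)
        self.load_chunk = load_chunk      # hypothetical disk reader
        self.cache = OrderedDict()        # supernode -> loaded chunk
        self.cache_size = cache_size
        self.loads = 0                    # number of simulated disk reads

    def out_links(self, page):
        sn = self.supernode_of[page]
        if sn in self.cache:
            self.cache.move_to_end(sn)    # mark as most recently used
        else:
            self.cache[sn] = self.load_chunk(sn)
            self.loads += 1
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[sn].get(page, [])

# A dictionary stands in for the on-disk chunks in this sketch.
DISK = {"s1": {1: [2, 3], 2: [3]}, "s2": {3: [1]}}
graph = ChunkedGraph({1: "s1", 2: "s1", 3: "s2"}, DISK.__getitem__)

first = graph.out_links(1)    # loads chunk s1
second = graph.out_links(2)   # served from the cached chunk s1
loads_after_two = graph.loads
third = graph.out_links(3)    # loads chunk s2
```

When queries touch pages from only a few clusters, most lookups hit an already loaded chunk, which is the locality the page clusters are meant to provide.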

Page 40: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXPERIMENTAL RESULTS

35-million page data set (approximately 600 million links, with 300 GB of HTML)

Page 41: Gülfem IŞIKLAR M.Mirac KOCATÜRK

EXPERIMENTAL RESULTS

30 Web queries over 5 different 20-million page data sets

Page 42: Gülfem IŞIKLAR M.Mirac KOCATÜRK

RELATED WORKS

Drawing inspiration from graph and hypertext query systems, a number of Web query languages have been developed in the past, such as WebSQL, W3QL, and StruQL.

These models do not incorporate the notions of ordering and ranking.

At the implementation level, these systems are intended for “online” queries for Web-site management, as opposed to our “warehouse” model.

Page 43: Gülfem IŞIKLAR M.Mirac KOCATÜRK

CONCLUSION

We addressed the problem of formulating and executing complex queries over Web repositories.

We showed that the key characteristics of Web queries are the combination of navigation, text search, and relational operators which can manipulate ordering and ranking.

Finally, we discussed some of the optimization techniques to execute such queries more efficiently.

Page 44: Gülfem IŞIKLAR M.Mirac KOCATÜRK

THANK YOU