Query Optimization for Semistructured Data. Jason McHugh, Jennifer Widom, Stanford University - Rajendra S. Thapa

Query Optimization for Semistructured Data

Dec 31, 2015


Rebecca Chen

Transcript
Page 1: Query Optimization for Semistructured Data

Query Optimization for Semistructured Data

Jason McHugh, Jennifer Widom, Stanford University

- Rajendra S. Thapa

Page 2: Query Optimization for Semistructured Data

Road Map

Lore System

Query Execution Engine

Statistics and cost model

Performance Results

Page 3: Query Optimization for Semistructured Data

Lore Data Model - OEM

Page 4: Query Optimization for Semistructured Data

Data Guide

Page 5: Query Optimization for Semistructured Data

Path Expression

Simple Path Expression

– specifies a single navigation step in the database

DBGroup.Member y – variable y ranges over all Member-labeled subobjects of the DBGroup object

Path Expression

– an ordered list of simple path expressions

DBGroup.Member x, x.Age y

– variable y ranges over all objects that can be reached by starting with the DBGroup object, following an edge labeled Member, then following an edge labeled Age.
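As a rough illustration of how a path expression binds variables, the sketch below walks a tiny in-memory graph one simple path expression at a time. The `OEMGraph` class, the `evaluate_path` function, and the object ids are hypothetical names invented for this example; they are not part of Lore.

```python
from collections import defaultdict


class OEMGraph:
    """A tiny labeled, directed graph standing in for an OEM database (illustrative only)."""

    def __init__(self):
        # edges[parent][label] -> list of child object ids
        self.edges = defaultdict(lambda: defaultdict(list))
        self.names = {}  # named entry point (e.g. "DBGroup") -> object id

    def add_edge(self, parent, label, child):
        self.edges[parent][label].append(child)

    def children(self, parent, label):
        # .get avoids creating empty entries for objects without children
        return list(self.edges.get(parent, {}).get(label, []))


def evaluate_path(graph, path):
    """Evaluate an ordered list of simple path expressions.

    Each step is (source, label, target): source is a named object or an
    already bound variable, and target is the variable the step binds.
    Returns one dict of variable bindings per path found in the graph.
    """
    bindings = [{}]
    for source, label, target in path:
        extended = []
        for b in bindings:
            start = b[source] if source in b else graph.names[source]
            for child in graph.children(start, label):
                extended.append({**b, target: child})
        bindings = extended
    return bindings


g = OEMGraph()
g.names["DBGroup"] = 1
g.add_edge(1, "Member", 2)
g.add_edge(1, "Member", 3)
g.add_edge(2, "Age", 4)
g.add_edge(3, "Age", 5)

# DBGroup.Member x, x.Age y
result = evaluate_path(g, [("DBGroup", "Member", "x"), ("x", "Age", "y")])
```

Each dictionary in `result` pairs a Member object bound to x with an Age object bound to y, one per path through the graph.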

Page 6: Query Optimization for Semistructured Data

Query language

Query:

SELECT x

FROM DBGroup.Member x

WHERE exists y in x.Age: y<30

Result:

<Member>
  <Name>Smith</Name>
  <Age>28</Age>
  <Office>Gates 252</Office>
  <Office>
    <Building>CIS</Building>
    <Room>411</Room>
  </Office>
</Member>

Page 7: Query Optimization for Semistructured Data

Lore architecture

Page 8: Query Optimization for Semistructured Data

Lore architecture

Textual Interface

Data Engine

Query Processing

Parsing

Preprocessor

Logical Query Plan Generation

Query Optimization

Physical Query Plan Generation

Execution of Physical Query Plan

Page 9: Query Optimization for Semistructured Data

Queries can be executed in many ways

Top down

Bottom Up

Hybrid

SELECT x FROM DBGroup.Member x

WHERE exists y in x.Age: y<30
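To make the difference between the strategies concrete, here is a minimal sketch that answers this query both top-down and bottom-up over a hand-made edge list. The data, the object ids, and the function names are assumptions made for illustration. A real bottom-up plan would use value and link indexes rather than the list scans used here.

```python
# Hypothetical database: (parent, label, child) edges plus atomic values.
EDGES = [
    ("DBGroup", "Member", "m1"),
    ("DBGroup", "Member", "m2"),
    ("DBGroup", "Member", "m3"),
    ("m1", "Age", "a1"),
    ("m2", "Age", "a2"),
    ("m3", "Age", "a3"),
]
VALUES = {"a1": 28, "a2": 46, "a3": 35}


def top_down(root):
    """Follow Member edges from the root, then Age edges, then test y < 30."""
    result = []
    for parent, label, x in EDGES:
        if parent == root and label == "Member":
            ages = [c for p, l, c in EDGES if p == x and l == "Age"]
            if any(VALUES[y] < 30 for y in ages):
                result.append(x)
    return result


def bottom_up(root):
    """Start from atomic objects with value < 30, then walk edges backwards."""
    result = []
    for y, value in VALUES.items():
        if value < 30:
            for x, label, child in EDGES:
                if child == y and label == "Age":
                    # keep x only if it is a Member subobject of the root
                    if (root, "Member", x) in EDGES and x not in result:
                        result.append(x)
    return result


top_down_result = top_down("DBGroup")
bottom_up_result = bottom_up("DBGroup")
```

Both functions return the same members. They differ only in which objects they touch first, and that difference is what the optimizer's choice of strategy depends on.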

Page 10: Query Optimization for Semistructured Data

Top-down preferred

Query:

Select x

from A.B x

where exists y in x.C: y = 5

• top-down would explore only one path, A.B.C

• bottom-up would visit all leaf objects with value 5 and their parents

[Figure: database graph rooted at A with edges labeled B, C, and D; three leaf objects have value 5]

Page 11: Query Optimization for Semistructured Data

Bottom-up preferred

Query:

Select x

from A.B x

where exists y in x.C: y = 5

• Many A.B.C paths

• But only one leaf satisfies the predicate

• bottom-up is a good candidate

[Figure: database graph rooted at A with three B-labeled and three C-labeled edges; leaf values 5, 4, 4]

Page 12: Query Optimization for Semistructured Data

Hybrid preferred

Query:

Select x

from A.B x

where exists y in x.C: y = 5

[Figure: database graph rooted at A with edges labeled B, C, and D; leaf values 5, 4, 4]

Page 13: Query Optimization for Semistructured Data

Query Execution Engine

• Logical Query Plans

-logical query plan operators

- structure of the plan

• Physical Query Plans

-operators

- some physical plans

• Statistics and Cost Model

• Plan Enumeration

Page 14: Query Optimization for Semistructured Data

Query Execution Engine

Logical operators

Discover

Chain

Glue

Create Temp

Project

---

---

---

Logical Query plans

• Variable binding

a variable x in the query is said to be bound if an object o has been assigned to x

• Evaluation

an evaluation of a query plan (or sub-plan) is a list of all variables appearing in the plan, along with the object (if any) bound to each variable.

•Rotation

Page 15: Query Optimization for Semistructured Data

Representation of a path expression in the logical query plan

x.B y, y.C z, z.D v

Chain
  Chain
    Discover(x,"B",y)
    Discover(y,"C",z)
  Discover(z,"D",v)
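One way to picture this representation is a small builder that turns the text of a path expression into Discover nodes joined by Chain nodes. The dataclasses and helper functions below are hypothetical, and the left-deep nesting is an assumption of this sketch rather than something the slide states.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Discover:
    """One simple path expression: source.label target."""
    source: str
    label: str
    target: str


@dataclass(frozen=True)
class Chain:
    """Connects two sub-plans that together form a path expression."""
    left: "Plan"
    right: "Plan"


Plan = Union[Discover, Chain]


def parse_path(text):
    """Turn 'x.B y, y.C z' into a list of Discover nodes."""
    steps = []
    for part in text.split(","):
        step, target = part.split()
        source, label = step.split(".")
        steps.append(Discover(source, label, target))
    return steps


def build_chain(steps):
    """Fold the Discover nodes into a left-deep tree of Chain nodes (assumed shape)."""
    plan = steps[0]
    for step in steps[1:]:
        plan = Chain(plan, step)
    return plan


plan = build_chain(parse_path("x.B y, y.C z, z.D v"))
```

For the three-step expression above, `plan` is a Chain whose left child is itself a Chain over the first two Discover nodes.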

Page 16: Query Optimization for Semistructured Data

Complete logical query plan

SELECT x

FROM DBGroup.Member x

WHERE exists y in x.Age: y<30

[Figure: logical query plan tree built from the nodes Project(t2), CreateTemp(x,t2), Glue, Glue, Chain, Name("DBGroup",t1), Discover(t1,"Member",x), Discover(x,"Age",y), Exists(y), Select(y,<30)]

Page 17: Query Optimization for Semistructured Data

Query Execution Engine

Operators

Scan(x, l, y)

Lindex(x, l, y)

Pindex(Path Expression, x)

Bindex(l, x, y)

Name(x, n)

Vindex(Op, Value, l, x)

---

---

---

Physical Query plans

[Figure: object x with three l-labeled edges to objects a, b, and c, giving y = {a, b, c}]
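The slide lists only the operator signatures. The sketch below gives one plausible reading of four of them over an in-memory edge list. The semantics in the docstrings are assumptions of this sketch, and the list scans stand in for the indexes a real system would consult.

```python
import operator

# Hypothetical data: object x1 has three l-labeled subobjects and one m-labeled one.
EDGES = [("x1", "l", "a"), ("x1", "l", "b"), ("x1", "l", "c"), ("x1", "m", "d")]
VALUES = {"a": 5, "b": 7, "c": 5, "d": 5}


def scan(x, label):
    """Scan(x, l, y): assumed to bind y to each l-labeled subobject of x."""
    return [child for parent, l, child in EDGES if parent == x and l == label]


def lindex(y, label):
    """Lindex(x, l, y): assumed to bind x to each parent of y via an l-labeled edge."""
    return [parent for parent, l, child in EDGES if child == y and l == label]


def bindex(label):
    """Bindex(l, x, y): assumed to return every (parent, child) pair joined by an l edge."""
    return [(parent, child) for parent, l, child in EDGES if l == label]


def vindex(op, value, label):
    """Vindex(Op, Value, l, x): assumed to bind x to each atomic object that has
    an incoming l-labeled edge and whose value satisfies `op value`."""
    return [
        child
        for parent, l, child in EDGES
        if l == label and child in VALUES and op(VALUES[child], value)
    ]


children = scan("x1", "l")
parents = lindex("a", "l")
pairs = bindex("m")
matches = vindex(operator.eq, 5, "l")
```

Scan moves downward from a bound parent, Lindex moves upward from a bound child, and Bindex and Vindex need no bound variable at all, which is what lets a plan start from the middle or the bottom of a path.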

Page 18: Query Optimization for Semistructured Data

Some physical plans for a simple logical Query Plan

A.B x, x.C y

Logical Query Plan

Chain
  Discover(A,"B",x)
  Discover(x,"C",y)

Page 19: Query Optimization for Semistructured Data

physical plans

A.B x, x.C y

Scan Plan: NLJ over Scan(A,"B",x) and Scan(x,"C",y)

Lindex Plan: NLJ over Lindex(x,"C",y), Lindex(t,"B",x), and Name(t, A)

Page 20: Query Optimization for Semistructured Data

more physical plans...

A.B x, x.C y

Bindex Plan: NLJ over Bindex(t,"B",x), Name(t, A), and Scan(x,"C",y)

Pindex Plan: Pindex("A.B x, x.C y", y)
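As a sketch of how two of these plans could produce the same evaluations in different ways, the code below runs a Scan-style and a Bindex-style plan for A.B x, x.C y as nested loops over a made-up edge list. The data, the helper functions, and the operator semantics are assumptions of this example, not Lore's implementation.

```python
# Hypothetical data for the path expression A.B x, x.C y
EDGES = [
    ("A", "B", "b1"),
    ("A", "B", "b2"),
    ("b1", "C", "c1"),
    ("b2", "C", "c2"),
    ("b2", "D", "d1"),
    ("Z", "B", "b9"),  # a B edge that does not start at A
]


def scan(parent, label):
    """Children of `parent` reached over `label` edges (pointer chasing)."""
    return [c for p, l, c in EDGES if p == parent and l == label]


def bindex(label):
    """All (parent, child) pairs connected by a `label` edge."""
    return [(p, c) for p, l, c in EDGES if l == label]


def scan_plan(root):
    """Nested-loop join: the outer Scan binds x, the inner Scan binds y."""
    for x in scan(root, "B"):
        for y in scan(x, "C"):
            yield {"x": x, "y": y}


def bindex_plan(root):
    """Bindex proposes every B edge (t, x), the Name check keeps those
    whose parent t is the named object, and a Scan then binds y."""
    for t, x in bindex("B"):
        if t == root:  # plays the role of Name(t, A)
            for y in scan(x, "C"):
                yield {"x": x, "y": y}


scan_evaluations = list(scan_plan("A"))
bindex_evaluations = list(bindex_plan("A"))
```

The two plans agree on the result, but the Bindex-style plan also touches the B edge under Z before discarding it. Extra work of that kind is what a cost model has to account for.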

Page 21: Query Optimization for Semistructured Data

how physical plans are produced.

• Each logical plan node creates an optimal physical plan given a set of bound variables.

• During plan enumeration, for each variable we track:

1. Whether the variable is bound or not

2. Which plan operator has bound the variable

3. All other plan operators that use the variable

4. Whether the variable is stored within a temporary result.
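The four tracked items map naturally onto a small record per variable. The sketch below is one possible layout, with hypothetical field and function names; the slides do not describe the optimizer's actual data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VariableState:
    """Bookkeeping for one query variable during plan enumeration (hypothetical layout)."""
    name: str
    bound: bool = False                                # 1. is the variable bound?
    bound_by: Optional[str] = None                     # 2. operator that bound it
    used_by: List[str] = field(default_factory=list)   # 3. other operators that use it
    in_temp: bool = False                              # 4. stored within a temporary result?


def bind(state, operator_name):
    """Record that `operator_name` binds the variable; a variable is bound once."""
    if state.bound:
        raise ValueError(f"{state.name} is already bound by {state.bound_by}")
    state.bound = True
    state.bound_by = operator_name


def use(state, operator_name):
    """Record that `operator_name` reads the variable."""
    state.used_by.append(operator_name)


x = VariableState("x")
bind(x, 'Scan(t1,"Member",x)')
use(x, 'Scan(x,"Age",y)')
```

With this record, a plan node can check whether x is already bound before choosing between an operator that binds it and one that only consumes it.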

Page 22: Query Optimization for Semistructured Data

how physical plans are produced.

SELECT x

FROM DBGroup.Member x

WHERE exists y in x.Age: y<30

Logical plan

Page 23: Query Optimization for Semistructured Data

possible physical plans

Fig. (a)

Logical plan

Page 24: Query Optimization for Semistructured Data

possible physical plans

fig. (c)

Logical plan

Physical plans

Page 25: Query Optimization for Semistructured Data

more physical plans...

Fig. (d)

Logical plan

Page 26: Query Optimization for Semistructured Data

Statistics and Cost Model

• Each physical plan is assigned a cost based on the estimated I/O and CPU time required to execute the plan.

• The costing procedure is recursive.

• Plans are compared on estimated I/O first, then on CPU time, to decide which is cheaper.
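A minimal sketch of recursive costing with an I/O-first comparison follows. The `PlanNode` class and the numbers are made up, and summing the children's costs is a deliberate simplification. A real formula for a join, for instance, would scale the inner plan's cost by the number of outer evaluations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlanNode:
    """A physical plan node with hypothetical local cost estimates."""
    name: str
    io: float   # estimated I/O for this node alone
    cpu: float  # estimated CPU time for this node alone
    children: List["PlanNode"] = field(default_factory=list)


def cost(node) -> Tuple[float, float]:
    """Recursively total (io, cpu) over the node and its sub-plans."""
    io, cpu = node.io, node.cpu
    for child in node.children:
        child_io, child_cpu = cost(child)
        io += child_io
        cpu += child_cpu
    return (io, cpu)


def cheaper(a, b):
    """Tuples compare element by element, so I/O decides and CPU breaks ties."""
    return a if cost(a) <= cost(b) else b


scan_plan = PlanNode("NLJ", io=0, cpu=2, children=[
    PlanNode("Scan", io=10, cpu=1),
    PlanNode("Scan", io=30, cpu=3),
])
pindex_plan = PlanNode("Pindex", io=40, cpu=1)

best = cheaper(scan_plan, pindex_plan)
```

Here both plans have the same estimated I/O, so the lower CPU estimate decides the comparison.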

Page 27: Query Optimization for Semistructured Data

Performance Results

Experiment 1

A simple query

SELECT DBGroup.Movie.Title

- 11 different query plans

- the best plan uses Lore's path index to quickly locate all the movie titles

- the second plan uses a top-down strategy

- the worst plan uses Bindex operators and hash joins

Page 28: Query Optimization for Semistructured Data

Performance Results

Experiment 2

Same query with a Genre subobject having value 'Comedy'

- point query

Page 29: Query Optimization for Semistructured Data

Performance Results

Experiment 3

- Same point query

- not all possible plans are executed

- different plans were generated by disallowing the use of particular operators or indexes.

Page 30: Query Optimization for Semistructured Data

Performance Results

Experiment 4

Query selects movies with a certain quality rating.

Page 31: Query Optimization for Semistructured Data

Future Work

• Optimization techniques for branching path expressions

– a query rewrite that moves Where clause predicates into the From clause, and a transformation that introduces a Group-by clause when a large number of paths pass through a small number of objects.

• Partially correlated sub-plans

– similar to correlated subqueries, but they rely on the bindings passed between portions of the physical query plan rather than on the query itself.

• In the area of statistics

– efficient statistics-gathering algorithms

– statistics about the location of objects on disk

– modifications to the cost formulas to generate more accurate cost estimates