Lore: A Database Management System for Semistructured Data

Feb 02, 2016


tavia


Why?

• Although data may exhibit some structure, it may be too varied or irregular to map to a fixed schema.
  – A relational DBMS might use null values in this case.
• It may be difficult to decide on a specific schema in advance.
  – Data elements may change types.
  – The structure changes often (many schema modifications).


Semistructured Data

• Examples:
  – Data from the web.
    • Overall site structure may change often.
    • It would be nice to be able to query a web site.
  – Data integrated from multiple, heterogeneous data sources.
    • Information sources change, or new sources are added.


Object Exchange Model (OEM)

• Data in this model can be thought of as a labeled directed graph.
  – Schema-less and self-describing.
• Vertices in the graph are objects.
  – Each object has a unique object identifier (oid), such as &5.
  – Atomic objects have no outgoing edges and have types such as int, real, string, gif, java, etc.
  – All other objects, which have outgoing edges, are called complex objects.


OEM (Cont.)

• Examples:
  – Object &3 is complex, and its subobjects are &8, &9, &10, and &11.
  – Object &7 is atomic and has the value “Clark”.
• DBGroup is a name that denotes object &1. (Names are entry points into the database.)
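
A minimal Python sketch of one way an OEM graph could be held in memory. The class `OEMObject`, the labels, and most values are assumptions for illustration; only the oids &1, &3, &7 to &11 and the value “Clark” come from the slides, and this is not Lore's actual storage format.

```python
from dataclasses import dataclass, field

@dataclass
class OEMObject:
    oid: int
    value: object = None                        # set only for atomic objects
    edges: list = field(default_factory=list)   # outgoing (label, child oid) pairs

    @property
    def is_atomic(self):
        # Atomic objects have no outgoing edges.
        return not self.edges

# Hypothetical fragment: the oid &2, every label, and every value other
# than "Clark" are assumptions made for this sketch.
db = {
    1: OEMObject(1, edges=[("Member", 2), ("Member", 3)]),
    2: OEMObject(2, edges=[("Name", 7)]),
    3: OEMObject(3, edges=[("Name", 8), ("Age", 9), ("Office", 10), ("Office", 11)]),
    7: OEMObject(7, value="Clark"),
    8: OEMObject(8, value="Smith"),
    9: OEMObject(9, value=28),
    10: OEMObject(10, value="Gates 252"),
    11: OEMObject(11, value="CIS 411"),
}
names = {"DBGroup": 1}   # names are entry points into the database

root = db[names["DBGroup"]]
print(root.is_atomic)                        # False
print(db[7].is_atomic, db[7].value)          # True Clark
print([child for _, child in db[3].edges])   # [8, 9, 10, 11]
```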


OEM to XML

• Example:

    <Member project="&5 &6">
      <name>Jones</name>
      <age>46</age>
      <office>
        <building>gates</building>
        <room>252</room>
      </office>
    </Member>

• This corresponds to the rightmost member in the example OEM graph, where project is an attribute.


Lorel Query Language

• A query language is needed that supports path expressions for traversing graph data and handles ‘typeless’ data.
• A simple path expression is a name followed by a sequence of labels.
  – Example: DBGroup.Member.Office
  – Denotes the set of objects that can be reached by starting at the DBGroup object and following edges labeled Member and then Office.
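
A simple path expression can be evaluated by following one label at a time from the named entry point. The sketch below assumes a small hypothetical graph stored as a dictionary of edges; `eval_path` is an illustrative helper, not part of Lore.

```python
# Hypothetical OEM fragment: oid -> list of (label, child oid) edges.
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 4)],
    3: [("Office", 5), ("Office", 6)],
}
names = {"DBGroup": 1}   # assumed name table

def eval_path(path):
    """Return the set of oids reached by a simple path expression."""
    name, *labels = path.split(".")
    current = {names[name]}
    for label in labels:
        # Follow every outgoing edge that carries the next label.
        current = {child
                   for oid in current
                   for (lbl, child) in edges.get(oid, [])
                   if lbl == label}
    return current

print(sorted(eval_path("DBGroup.Member.Office")))   # [4, 5, 6]
```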


Lorel (cont.)

• Example:
    select DBGroup.Member.Office
    where DBGroup.Member.Age < 30
• Result:
    Office “Gates 252”
    Office
      Building “CIS”
      Room “411”


Lorel Query Rewrite

• Previous query rewritten to:
    select O
    from DBGroup.Member M, M.Office O
    where exists y in M.Age : y < 30
• The comparison on Age is transformed into an existential condition.
  – All properties are set-valued in OEM.
  – A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single-valued, set-valued, or unknown.
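
The existential reading of the comparison can be illustrated with Python's `any`. The member records below are hypothetical, and `young_offices` is only a sketch of the rewritten query's meaning, not of how Lore executes it.

```python
# Hypothetical members: every property is a set, as in OEM.
members = [
    {"Age": {28}, "Office": {"Gates 252"}},
    {"Age": {45}, "Office": {"Gates 110"}},
    {"Age": set(), "Office": {"Gates 300"}},      # Age unknown
    {"Age": {29, 31}, "Office": {"CIS 411"}},     # set-valued Age
]

def young_offices(members, limit=30):
    # select O from DBGroup.Member M, M.Office O
    # where exists y in M.Age : y < limit
    result = []
    for m in members:
        if any(y < limit for y in m["Age"]):
            result.extend(sorted(m["Office"]))
    return result

print(young_offices(members))   # ['Gates 252', 'CIS 411']
```

A member with no Age simply fails the condition, and a member with several ages qualifies if any one of them is below the limit.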


Lorel Query Rewrite

• Why?
  – Breaking the query into simple path expressions is necessary for query optimization.
  – Coercion must be handled explicitly.
    • Atomic objects and values: 0.5 < “0.9” should return true.
    • Comparing objects and sets of objects: DBGroup.Member.Age is a set of objects.
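
A hedged sketch of coercion in comparisons. The rule used here (compare numerically when both sides convert to a real, fall back to string comparison for two strings, otherwise false) is an assumption chosen for illustration; the slides only state that 0.5 < “0.9” should be true.

```python
def coerce(value):
    """Try to view a value as a real number; return None if that is impossible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def less_than(a, b):
    # Assumed rule: numeric comparison whenever both sides coerce to reals.
    x, y = coerce(a), coerce(b)
    if x is not None and y is not None:
        return x < y
    # Otherwise compare two strings as strings; anything else is false.
    if isinstance(a, str) and isinstance(b, str):
        return a < b
    return False

def exists_less_than(values, constant):
    # Comparing a set of objects with a value is an existential test.
    return any(less_than(v, constant) for v in values)

print(less_than(0.5, "0.9"))               # True
print(exists_less_than({"41", 25}, 30))    # True
```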


Lorel (cont.)

• General path expressions are loosely specified patterns for labels in the database (‘|’ denotes disjunction, ‘?’ marks a label pattern as optional).
• Example:
    select DBGroup.Member.Name
    where DBGroup.Member.Office(.Room%|.Cubicle)? like “%252”
• Result:
    Name “Jones”
    Name “Smith”
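
A rough sketch of how the example's pattern could be matched. It assumes `%` stands for any run of characters, both in label patterns and in `like`; the member data and the helpers `like` and `candidates` are hypothetical and handle only this one pattern, not general path expressions.

```python
import re

def like(text, pattern):
    """SQL-style match where '%' stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text) is not None

def candidates(office):
    """Values reached by Office(.Room%|.Cubicle)? from one Office object."""
    if isinstance(office, str):       # atomic Office: the optional step is skipped
        return [office]
    return [value for label, value in office.items()
            if like(label, "Room%") or label == "Cubicle"]

# Hypothetical data: an Office is an atomic string or a set of labeled values.
members = {
    "Jones": [{"Building": "Gates", "Room": "252"}],
    "Smith": ["Gates 252", {"Building": "CIS", "Room": "411"}],
    "Clark": [{"Cubicle": "17"}],
}

result = [name for name, offices in members.items()
          if any(like(v, "%252") for o in offices for v in candidates(o))]
print(result)   # ['Jones', 'Smith']
```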


Query and Update Processing

• The query is parsed.
• The parse tree is preprocessed and translated into a new OQL-like query.
• A query plan is constructed.
• The query plan is optimized.
• The optimized query plan is executed.


System Architecture


Iterators and Object Assignments

• Uses a recursive iterator approach:
  – Execution begins at the top of the query plan.
  – Each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s).
  – Result tuples are passed up to the parent.
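
The iterator approach maps naturally onto Python generators: pulling from the top of the plan pulls one tuple at a time through every node below it. The operators and tuples here are illustrative assumptions, not Lore's operator set.

```python
def scan(rows):
    """Leaf node: hands out one tuple at a time."""
    for row in rows:
        yield row

def select(child, predicate):
    """Requests tuples from its child and passes up those that qualify."""
    for row in child:
        if predicate(row):
            yield row

def project(child, column):
    """Keeps a single column of every tuple received from its child."""
    for row in child:
        yield row[column]

# Hypothetical (name, age) tuples.
rows = [("Smith", 28), ("Jones", 46), ("Clark", 25)]

# Execution begins at the top of the plan; nothing runs until a tuple is requested.
plan = project(select(scan(rows), lambda r: r[1] < 30), 0)
print(next(plan))    # Smith
print(list(plan))    # ['Clark']
```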


Object Assignments (OAs)

• An OA is a data structure containing slots for range variables, with additional slots depending on the query.
• Each slot within an OA holds the oid of a vertex on a path being considered by the query engine.
• Example: if OA1 holds the oid for “Smith”, then OA2 and OA3 can hold the oids for one of Smith’s Office objects and Age objects.


Query Operators

• For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression.
  – Scan(StartingOASlot, Path_expression, TargetOASlot)
    • For each oid in StartingOASlot, check whether the object satisfies Path_expression and place the oid into TargetOASlot.
• Other operators include Join, Project, Select, Aggregation, etc.
• The Join node works like a nested-loop join in a relational DBMS.
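
A simplified sketch of Scan filling slots of an object assignment. It assumes an OA is a plain list of oids and that the path expression is a single label; the graph and slot layout are hypothetical.

```python
# Hypothetical OEM fragment: oid -> list of (label, child oid) edges.
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Age", 9)],
    3: [("Age", 13), ("Office", 10)],
}

def scan(input_oas, start_slot, label, target_slot):
    """For each incoming OA, bind target_slot to every child reached
    from the oid in start_slot via an edge with the given label."""
    for oa in input_oas:
        for lbl, child in edges.get(oa[start_slot], []):
            if lbl == label:
                out = list(oa)          # copy so sibling bindings stay independent
                out[target_slot] = child
                yield out

# Assumed slot layout: [DBGroup, Member, Age].
start = [[1, None, None]]
plan = scan(scan(start, 0, "Member", 1), 1, "Age", 2)
print(list(plan))   # [[1, 2, 9], [1, 3, 13]]
```

Nesting one Scan inside another behaves like the nested-loop join mentioned above: the inner Scan runs once for every binding produced by the outer one.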


Query Optimization

• Performs only a few optimizations:
  – Push selection operators down the query tree.
  – Eliminate or combine redundant query operators.
• Explores query plans that use indexes where possible.
  – Two kinds of indexes:
    • Lindex (link index): provides parent pointers; implemented with hashing.
    • Vindex (value index): implemented as B+-trees.


Indexes

• Because of the non-strict typing system, there are three Vindexes: String Vindex, Real Vindex, and String-coerced-to-real Vindex.
• A separate B-tree is constructed for each type.
• When using a Vindex for a comparison (e.g. Age < 30), consider the following:
  – If the type is string, do a lookup in the String Vindex.
  – If it can be converted to real, then do a lookup in the String-coerced-to-real Vindex.
  – If the type is real?


Other issues

• Update query operator example:
  – Update(Create_Edge, OA1, OA5, “Member”)
  – Creates an edge labeled “Member” from the results in OA1 to those in OA5.
• Lore arranges objects in physical disk pages; each page has a number of slots, with a single object in each slot.
  – Objects are placed according to a first-fit algorithm.
  – Large objects spanning multiple pages are supported.
  – Objects are clustered in depth-first order (since Scan traverses depth-first).
  – A garbage collector removes unreachable objects.
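
For reference, the generic first-fit strategy can be sketched as below. This is the textbook algorithm only: the function name, the sizes, and the page capacity are assumptions, and slots, clustering, and objects larger than a page are not modeled.

```python
def first_fit(sizes, page_capacity):
    """Place each object on the first page with enough free space.
    Returns the page number chosen for every object, in input order.
    Assumes no object is larger than a page."""
    free = []          # remaining free space on each page
    placement = []
    for size in sizes:
        for page, space in enumerate(free):
            if size <= space:
                free[page] -= size
                placement.append(page)
                break
        else:
            # No existing page has room, so open a new one.
            free.append(page_capacity - size)
            placement.append(len(free) - 1)
    return placement

print(first_fit([60, 50, 30, 40], 100))   # [0, 1, 0, 1]
```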


External Data Manager

• Enables retrieval of information from other data sources, transparently to the user.
• An external object in Lore is a “placeholder” for the external data and specifies how Lore interacts with an external data source.
• The specification for an external object includes:
  – The location of a wrapper program that fetches data and converts it to OEM.
  – The time interval until fetched information becomes stale.
  – A set of arguments used to limit the information fetched from the external source.
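
The three parts of an external object's specification suggest a small caching placeholder. The sketch below is an assumption about how such a placeholder might behave; the class `ExternalObject`, its fields, and `fake_wrapper` are hypothetical names, and the wrapper is a local function rather than an external program.

```python
import time

class ExternalObject:
    """Placeholder for external data: wrapper, staleness interval, arguments."""

    def __init__(self, wrapper, stale_after, args):
        self.wrapper = wrapper           # callable that fetches data and converts it
        self.stale_after = stale_after   # seconds until fetched data counts as stale
        self.args = args                 # arguments that limit what is fetched
        self._data = None
        self._fetched_at = None

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        # Fetch on first use, or again once the cached copy is stale.
        if self._fetched_at is None or now - self._fetched_at >= self.stale_after:
            self._data = self.wrapper(**self.args)
            self._fetched_at = now
        return self._data

calls = []

def fake_wrapper(city):
    # Stand-in for a real wrapper program (hypothetical).
    calls.append(city)
    return {"City": city, "Reading": len(calls)}

ext = ExternalObject(fake_wrapper, stale_after=60, args={"city": "Palo Alto"})
ext.get(now=0)     # first access: fetches
ext.get(now=30)    # still fresh: served from the cached copy
ext.get(now=61)    # stale: fetched again
print(len(calls))  # 2
```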


DataGuides

• A DataGuide is a concise and accurate summary of the structure of an OEM database (itself stored as an OEM database, somewhat like a system catalog).
• Why?
  – With no explicit schema, how do we formulate meaningful queries?
  – Large databases: the graph structure cannot simply be viewed.
  – What if a path expression doesn’t exist in the database? (Evaluating it is wasted work.)
• Each possible path expression is encoded once.



DataGuides As Histograms

• Each object in the DataGuide can have a link to its corresponding target set.
  – A target set (TS) is the set of oids reachable by that path.
    • The TS of DBGroup.Member.Age is {9, 13}.
  – This is a path index: it can find the set of objects reachable by a particular path.
  – Statistics can be stored in the DataGuide (more in the next paper).
    • For example, the number of atomic objects of each type reachable by a path p.
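
A small sketch of the idea: walk the database once and record, for every label path, the set of oids it reaches. It assumes an acyclic graph and stores the result as a flat dictionary; a real DataGuide is itself a graph and would share one node among paths with identical target sets. The oids other than 9 and 13 and the function `build_dataguide` are hypothetical.

```python
# Hypothetical OEM fragment; 9 and 13 are the Age objects from the example.
edges = {
    1: [("Member", 3), ("Member", 4)],
    3: [("Name", 8), ("Age", 9)],
    4: [("Name", 12), ("Age", 13)],
}

def build_dataguide(root_name, root_oid):
    """Map each label path to its target set (assumes an acyclic graph)."""
    guide = {}

    def expand(path, targets):
        guide[path] = targets              # each label path is recorded once
        by_label = {}
        for oid in targets:
            for label, child in edges.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        for label, children in by_label.items():
            expand(path + (label,), children)

    expand((root_name,), {root_oid})
    return guide

guide = build_dataguide("DBGroup", 1)
print(sorted(guide[("DBGroup", "Member", "Age")]))    # [9, 13]
print(("DBGroup", "Member", "Salary") in guide)       # False
```

The second lookup shows the "does this path exist" check: a path that is absent from the guide cannot match anything in the database.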


Conclusions

• Takes advantage of structure where it exists.
• Handles lack of structure well (data type coercion, general path expressions).
• The query language allows users to get and update data from semistructured sources.
  – The DataGuide allows users to determine what paths exist and gives useful statistical information.


Query Optimization for Semistructured Data


OEM vs. XML

• OEM’s objects correspond to elements in XML.
• Sub-elements in XML are inherently ordered.
• XML elements may optionally include a list of attribute-value pairs.
• Graph structure with multiple incoming edges is specified in XML with references (ID, IDREF attributes), e.g. the Project attribute.


Indexes

• Vindex(l, op, value, x) places into x all atomic objects that satisfy the “op value” condition and have an incoming edge labeled l.
  – Vindex(“Age”, <, 30, y) places into y the objects with age < 30.
• Lindex(x, l, y) places into x all objects that are parents of y via an edge labeled l.
  – Lindex(x, “Age”, y) places into x all parents of y via label “Age”.
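
The two lookups can be imitated with standard Python containers. In this sketch a sorted list searched with `bisect` stands in for the B+-tree behind the Vindex and a dictionary stands in for the hash-based Lindex; the edges, values, and helper `vindex_less_than` are hypothetical, and only numeric values and the `<` operator are covered.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical (parent, label, child) edges and atomic values.
triples = [(2, "Age", 9), (3, "Age", 13), (3, "Name", 8)]
values = {9: 28, 13: 46, 8: "Jones"}

vindex = defaultdict(list)   # label -> sorted (value, oid) pairs (B+-tree stand-in)
lindex = defaultdict(list)   # (child, label) -> parent oids (hash index stand-in)
for parent, label, child in triples:
    lindex[(child, label)].append(parent)
    if isinstance(values.get(child), (int, float)):
        vindex[label].append((values[child], child))
for entries in vindex.values():
    entries.sort()

def vindex_less_than(label, bound):
    """Oids of atomic objects with an incoming `label` edge and value < bound."""
    entries = vindex[label]
    cut = bisect_left(entries, (bound,))   # first entry whose value is >= bound
    return [oid for _, oid in entries[:cut]]

y = vindex_less_than("Age", 30)                        # like Vindex("Age", <, 30, y)
x = [p for oid in y for p in lindex[(oid, "Age")]]     # like Lindex(x, "Age", y)
print(y, x)   # [9] [2]
```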


Indexes (cont.)

• Bindex(l, x, y) finds all parent-child object pairs connected by an edge labeled l.
  – Bindex(“Age”, x, y) locates all parent-child pairs with label Age.
• Pindex(PathExpression, x) places into x all objects reachable via the path expression.
  – Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C.
  – Uses the DataGuide.


Simple Query

• Query:
    select O
    from DBGroup.Member M, M.Office O
    where exists y in M.Age : y < 30
• Possible plans:
  – Top-down (similar to pointer chasing or a nested-loops join).
  – Bottom-up: use the Vindex to check y < 30, then traverse backwards from child to parent using the Lindex.
  – Hybrid: both top-down and bottom-up, meeting in the middle.
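
The top-down and bottom-up strategies can be compared on a toy graph. Everything below is a hypothetical sketch: the graph, the helper names, and the linear search that stands in for a Vindex lookup. It shows only that both directions reach the same answer, not their relative cost.

```python
# Hypothetical OEM fragment.
children = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Age", 9), ("Office", 10)],
    3: [("Age", 13), ("Office", 11)],
}
values = {9: 28, 13: 46, 10: "Gates 252", 11: "CIS 411"}

# Reverse edges, playing the role of the Lindex.
parents = {}
for p, out in children.items():
    for label, c in out:
        parents.setdefault((c, label), []).append(p)

def sub(oid, label):
    return [c for lbl, c in children.get(oid, []) if lbl == label]

def top_down(root=1):
    # Pointer chasing from DBGroup down to Age, then out to Office.
    result = []
    for m in sub(root, "Member"):
        if any(values[y] < 30 for y in sub(m, "Age")):
            result.extend(sub(m, "Office"))
    return result

def bottom_up(root=1):
    # Start from qualifying Age objects (a linear search stands in for the
    # Vindex), then climb to the parents and confirm they are Members of root.
    young = [oid for (oid, label) in parents
             if label == "Age" and values[oid] < 30]
    result = []
    for y in young:
        for m in parents[(y, "Age")]:
            if root in parents.get((m, "Member"), []):
                result.extend(sub(m, "Office"))
    return result

print(top_down(), bottom_up())   # [10] [10]
```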


Select x
From A.B x
Where exists y in x.C: y = 5


Query Plan Generation (Overview)

• Logical query plan generator creates high-level execution strategy.

• The physical query plan enumerator uses statistics and a cost model to transform the logical query plan into the estimated best physical plan that lies within its search space.


Logical Query Plans (cont.)

• A Glue node represents a ‘rotation point’ that has two independent subplans as its children.
  – Rotating the order between independent components yields different plans.
  – It marks a place where the execution order is not fixed.
• A Discover node chooses the best way to bind variables x and y.
• A Chain node chooses the best evaluation of a path expression.


Logical query plan for:

Select x
From DBGroup.Member x
Where exists y in x.Age: y < 30

[Figure: logical query plan with one subplan for the from clause and one for the where clause]


Physical Query Plans


Physical Query Plans (cont.)

• Scan(x, l, y) places into y all objects that are subobjects of x via an edge labeled l.
  – Top-down (pointer chasing).
• The Lindex plan is the bottom-up approach.
• Bindex: locates edges whose label appears infrequently in the database.
• NLJ: the left subplan passes variables to the right subplan.


Statistics

• The I/O metric uses the estimated number of objects fetched.
• For every label subpath p of length <= k:
  – Number of atomic objects of each type reachable by p.
  – Min and max values of all atomic objects of each type reachable by p.
  – Number of instances of path p, denoted |p|.
  – Number of distinct objects reachable by p, denoted |p|d.
  – Number of l-labeled subobjects of all objects reachable by p.
  – Number of incoming l-labeled edges to any instance of p, denoted |pl|.
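
The difference between path instances and distinct reachable objects shows up as soon as an object has more than one parent. The sketch below is a simplification: it counts only from one assumed root object instead of over every occurrence of a label subpath in the database, and the graph and `path_stats` are hypothetical.

```python
# Hypothetical OEM fragment in which object 5 has two parents.
edges = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Project", 5)],
    3: [("Project", 5), ("Project", 6)],
}

def path_stats(root, labels):
    """Return (instances, distinct): the number of path instances and
    the number of distinct objects the path reaches from root."""
    ends = [root]                      # one entry per path instance
    for label in labels:
        ends = [child
                for oid in ends
                for lbl, child in edges.get(oid, [])
                if lbl == label]
    return len(ends), len(set(ends))

print(path_stats(1, ["Member", "Project"]))   # (3, 2)
```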


Plan Enumeration

• Doesn’t consider joining two simple path expressions together unless they share a common variable.
• Pindex is used only when the path expression begins with a name and no variable except the last is used in the query.
• The select clause always executes last.
• Doesn’t try to reorder multiple independent path expressions.


Results

• Used XML database about movies. Database graph contained 62,256 nodes and 130,402 edges.

• Experiment 1: Select DB.Movie.Title
  – Best plan is Pindex, followed by top-down.
  – Worst plan is Bindex with hash joins.


Results (cont.)

• Experiment 2: All movies with a Genre of “Comedy”.
  – The where clause is very selective; the bottom-up plan does a Vindex lookup for “Comedy” with incoming edge Genre.


Results (cont.)

• Experiment 3: Query with two existentially quantified variables in the where clause.

• Errors due to bad estimates of atomic value distributions and set operation costs.


Results (cont.)

• Experiment 4: Select movies with certain quality rating.

• Quality ratings uncommon in database so optimizer chooses to find all ratings via Bindex, and then work bottom-up.


Conclusions

• Cost estimates are accurate, and the optimizer selects the best plan most of the time.

• Execution times of best and worst plans for a given query can differ by many orders of magnitude.

• Best strategy is highly dependent upon the query and database (Query optimization is good for XML data).