PARTIAL RESULTS AND STRUCTURAL AGGREGATION OVER XML DATA STREAMS Kristin Tufte PhD Defense Dec 17, 2004
Jan 03, 2016
Streams & XML
Nested, structured data (XML)
Streams: network traffic information, environmental sensor data, telephone call records, click streams
That was then… (flat tuple)
  (Jones, Bob, 153 Fir St., Portland)
…this is now. (nested XML)
  person
    lname: Jones
    fname: Bob
    address
      street: 153 Fir St.
      city: Portland
New Challenges
XML
  Data is nested
  New operators, query language
Streams
  Potentially infinite
  Produce results without waiting for end of stream/data
  Arrival rate not in control of database system
XML Streams
  Stock Data
  Data Exchange
  Intelligent Transportation Systems
Talk Preview
Incremental Query Evaluation (IQE)
Merge Operation
Merge Theory
Merge Performance
Context for IQE
Continuous Queries – Tapestry (early 1990s)
  Monotonic queries, append-only databases
Long-running Queries
  Online aggregation (Hellerstein et al.), Nested Aggregates (Tan et al.)
  Incremental Query Evaluation (IQE) (Partial Results)
    General solution for long-running queries over XML data
Stream Processing
  Potentially infinite streams of data
  STREAM, Aurora (Borealis), Niagara West
Triggers (Eric Hanson, NiagaraCQ)
Incremental Query Evaluation*
Motivation: Internet queries (long-running, data in XML)
Get results to users before all of the data arrives
Non-monotonic (blocking) operators are problematic
Modify operators and system framework
Example query plan (bottom to top):
  (Title, Subject, DateTime)
  select DateTime ≥ “12/17/04:12AM”
  count, group by Subject
* Joint work with Jai Shanmugasundaram
(Non-)monotonic Operators
An operator O is monotonic if: A ⊆ B ⇒ O(A) ⊆ O(B)
  Examples: select, join (but often implemented with a blocking algorithm)
O is non-monotonic if it is not monotonic
  Examples: aggregates, nest
On new input, monotonic operators add to their output; non-monotonic operators change their output
Example query plan (bottom to top):
  (Title, Subject, DateTime)
  select DateTime ≥ “12/17/04:12AM”
  count, group by Subject
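The distinction can be checked on a toy input. The sketch below is only an illustration, not code from the system described in the talk: the function names, the integer hour standing in for DateTime, and the sample tuples are all assumptions. It shows that growing the input only adds to the output of select, while it changes the output of count.

```python
from collections import Counter

def select_recent(tuples, cutoff):
    """Monotonic: keep tuples whose DateTime is at or after the cutoff."""
    return {t for t in tuples if t[2] >= cutoff}

def count_by_subject(tuples):
    """Non-monotonic: one (Subject, Count) pair per group."""
    counts = Counter(subject for _, subject, _ in tuples)
    return set(counts.items())

# (Title, Subject, DateTime) tuples; DateTime is simplified to an hour number.
a = {("Title1", "Ukraine", 1), ("Title2", "Ukraine", 3)}
b = a | {("Title3", "Ukraine", 5)}  # a is a subset of b

# select: the old output is still part of the new output.
select_monotonic = select_recent(a, 0) <= select_recent(b, 0)
# count: (Ukraine, 2) is replaced by (Ukraine, 3), so the old output is not contained.
count_monotonic = count_by_subject(a) <= count_by_subject(b)

print(select_monotonic)  # True
print(count_monotonic)   # False
```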
Handling Non-monotonic Operators
Users issue partial result requests
  Re-evaluation – transmit full result on every partial result request
  Differential – avoid retransmitting duplicate data
Operators produce and process tuple inserts, deletes, updates
All tuples contain “old value” and “new value”
Query plan (bottom to top):
  (Title, Subject, DateTime)
  select DateTime ≥ “12/17/04:12AM”
  count, group by Subject
  top10(count)
Tuples out of select (Old Value: Title, Subject, D/T | New Value: Title, Subject, D/T):
  (null, null, null, Title1, Ukraine, 1AM)
  (null, null, null, Title2, Ukraine, 3AM)
  (null, null, null, Title3, Ukraine, 5AM)
Tuples out of count (Old Value: Subject, Count | New Value: Subject, Count):
  (null, null, Ukraine, 2)
  (Ukraine, 2, Ukraine, 3)
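A minimal sketch of the differential idea for the count operator follows. The class name, its methods, and the bookkeeping are hypothetical and are not taken from the actual implementation. On each partial result request the operator emits only the groups whose value changed, as (old value, new value) tuples, which reproduces the two count tuples shown on the slide.

```python
class DifferentialCount:
    """Hypothetical count operator that answers partial result requests differentially."""

    def __init__(self):
        self.current = {}   # subject -> count seen so far
        self.reported = {}  # subject -> count sent with the last partial result

    def insert(self, subject):
        self.current[subject] = self.current.get(subject, 0) + 1

    def partial_result(self):
        """Return (old subject, old count, new subject, new count) for changed groups only."""
        deltas = []
        for subject, count in self.current.items():
            old = self.reported.get(subject)
            if old is None:
                deltas.append((None, None, subject, count))   # insert
            elif old != count:
                deltas.append((subject, old, subject, count))  # update
        self.reported = dict(self.current)
        return deltas

op = DifferentialCount()
op.insert("Ukraine")
op.insert("Ukraine")
first = op.partial_result()   # [(None, None, 'Ukraine', 2)]
op.insert("Ukraine")
second = op.partial_result()  # [('Ukraine', 2, 'Ukraine', 3)]
third = op.partial_result()   # [] since nothing changed
```

A re-evaluation operator would instead return the full contents of `current` on every request.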
Re-evaluation vs. Differential
[Figure: time (seconds) vs. percentage of input seen (9% to 100%). Series: No Partial, Re-evaluation and Differential, each in unordered and ordered variants.]
Skewed Data
[Figure: time (seconds) vs. skew (0 to 2). Series: No Partial, Re-evaluation and Differential, each in unordered and ordered variants.]
Differential Nest
Input: (Google, Title1), (Microsoft, Title2), (Microsoft, Title3)
produce partial result (Old Value: Subject, Title | New Value: Subject, Title):
  (null, null, Google, {Title1})
  (null, null, Microsoft, {Title2, Title3})
Input: (Google, Title4)
  (Google, {Title1}, Google, {Title1, Title4})
Input: (Google, Title5)
  (Google, {Title1, Title4}, Google, {Title1, Title4, Title5})
but what you’d really like to send is: (Google, {Title5}) and “merge” it with: (Google, {Title1, Title4})
Merge:
  Subject: Google (Title: Title1, Title: Title4)
  + Subject: Google (Title: Title5)
  = Subject: Google (Title: Title1, Title: Title4, Title: Title5)
Talk Preview
Incremental Query Evaluation
Merge Operation
Merge Theory
Merge Performance
Merge Operation
Flexible method for combining two XML (nested) documents
  “recursive union” over similarly-structured XML documents
Merge Template guides the process
“Keys” are used to indicate when elements should be combined
Merge Example
Auction Document:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
    item
      iid: 433
      desc: 1971 Martin Guitar
New Bid:
  auction
    item
      iid: 501
      bid
        bidder: Sue
        amt: $1550
Merged Document:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
      bid
        bidder: Sue
        amt: $1550
    item
      iid: 433
      desc: 1971 Martin Guitar
Legend (element highlighting): Combined, Inserted, Used in Match
Merge Template (MT)
Merge Template is an XML document consisting of a tree of Element Merge Templates (EMT)
EMT is a triplet containing: (name, local key, content combine function)
Auction Merge Template:
  (auction, [], NoContentNoAttrs)
    (item, [iid], NoContentNoAttrs)
      (iid, [], ExactMatch)
      (desc, [], ShallowContent - Replace)
      (bid, [bidder, amt], NoContentNoAttrs)
        (bidder, [], ExactMatch)
        (amt, [], ExactMatch)
Document being merged (New Bid):
  auction
    item
      iid: 501
      bid
        bidder: Sue
        amt: $1550
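The key-guided recursive union can be sketched in a few lines. This is an illustrative simplification, not the Niagara implementation: the dictionary-based element representation, the helper names `elem`, `key_of` and `merge`, and the flat `LOCAL_KEYS` table (one entry per element name instead of a tree of EMTs) are all assumptions, and content-combine functions are reduced to "the newer text wins".

```python
import copy

def elem(name, text=None, *children):
    """Hypothetical element representation: a name, optional text, child elements."""
    return {"name": name, "text": text, "children": list(children)}

# Simplified merge template: element name -> local key (names of key children).
LOCAL_KEYS = {"auction": [], "item": ["iid"], "iid": [], "desc": [],
              "bid": ["bidder", "amt"], "bidder": [], "amt": []}

def key_of(e):
    """Element name plus the text of the children named in its local key."""
    values = tuple(c["text"] for k in LOCAL_KEYS[e["name"]]
                   for c in e["children"] if c["name"] == k)
    return (e["name"], values)

def merge(left, right):
    """Recursive union: children with equal keys are combined, others are inserted."""
    result = copy.deepcopy(left)
    if right["text"] is not None:
        result["text"] = right["text"]
    index = {key_of(c): i for i, c in enumerate(result["children"])}
    for child in right["children"]:
        k = key_of(child)
        if k in index:
            i = index[k]
            result["children"][i] = merge(result["children"][i], child)
        else:
            index[k] = len(result["children"])
            result["children"].append(copy.deepcopy(child))
    return result

auction_doc = elem("auction", None,
    elem("item", None, elem("iid", "501"), elem("desc", "Trek Madone 5.9 Bike"),
         elem("bid", None, elem("bidder", "Dave"), elem("amt", "$1500"))),
    elem("item", None, elem("iid", "433"), elem("desc", "1971 Martin Guitar")))
new_bid = elem("auction", None,
    elem("item", None, elem("iid", "501"),
         elem("bid", None, elem("bidder", "Sue"), elem("amt", "$1550"))))

merged = merge(auction_doc, new_bid)
item501 = merged["children"][0]
print([c["name"] for c in item501["children"]])  # ['iid', 'desc', 'bid', 'bid']
```

Because item is keyed on iid, Sue's bid lands under the existing item 501 instead of creating a second item; because bid is keyed on (bidder, amt), it is inserted next to Dave's bid instead of replacing it.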
Merge Template Features
Used as the basis for an Accumulate operator
  Repeated merge over a stream of XML documents to create an Accumulator
  Accumulator is a view of the stream
  Performs structural aggregation
Keys used to identify elements to combine
  Keys external to document
Content-Combine Functions
  aggregate, deep replace
Attributes – handled like elements without children
Outline
Incremental Query Evaluation
  Partial results over XML data
Merge Operation
Merge Theory
Merge Performance
Theoretical Foundations
Why a formal definition?
  Prove Merge is deterministic (unique result)
  Unambiguous definition
Key results:
  Formal definition of Merge as the join of an upper semi-lattice
  Merge is the least upper bound of two documents (under some constraints)
Path Set Representation
  Good for reasoning about XML documents
View Merge as Least Upper Bound
D3 is the “smallest” document that “contains” D1 and D2
Auction Document (D1):
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
    item
      iid: 433
      desc: 1971 Martin Guitar
New Bid (D2):
  auction
    item
      iid: 501
      bid
        bidder: Sue
        amt: $1550
Merged Document (D3):
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
      bid
        bidder: Sue
        amt: $1550
    item
      iid: 433
      desc: 1971 Martin Guitar
What can go wrong?
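D1: auction with one item (iid:501)
D2: auction with one item (iid:433)
D3: auction with one item carrying both iid:501 and iid:433
D4: auction with two items, (iid:501) and (iid:433)
Both D3 and D4 contain D1 and D2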
No unique result (no Least Upper Bound (LUB))
Keys in Merge Template eliminate ambiguity
Know D4 is correct result if we know iid is a key for item
What is a lattice?
An upper semi-lattice is a partially ordered set in which least upper bounds (LUBs) exist and are unique
A set of sets closed under union forms an upper semi-lattice
Ex 1 – Not a lattice: {1, 2}, {2, 3}
  LUB of {1, 2} and {2, 3} does not exist
Ex 2 – Lattice: {1, 2}, {2, 3}, {1, 2, 3}
  Order: S1 ≤ S2 if S1 ⊆ S2
Ex 3 – Lattice: D1, D2, D3
  Order: document containment
(In the diagrams, an edge between two elements implies the order relation.)
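Examples 1 and 2 can be checked mechanically. The sketch below is an illustration only; the function name `least_upper_bound` and the use of Python frozensets ordered by subset are choices made for this example.

```python
def least_upper_bound(family, a, b):
    """Return the smallest member of `family` that contains both a and b, or None."""
    uppers = [s for s in family if a <= s and b <= s]
    least = [u for u in uppers if all(u <= v for v in uppers)]
    return least[0] if least else None

a, b = frozenset({1, 2}), frozenset({2, 3})

ex1 = {a, b}                          # Ex 1: not closed under union
ex2 = {a, b, frozenset({1, 2, 3})}    # Ex 2: closed under union

lub1 = least_upper_bound(ex1, a, b)
lub2 = least_upper_bound(ex2, a, b)
print(lub1)  # None
print(lub2 == a | b)  # True: the LUB is the union {1, 2, 3}
```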
What do I need for a lattice?
Set of documents (LT) (T is a Merge Template)
Order (document containment)
Show LT satisfies the properties of a lattice.
Document Containment Order
D1 is contained in D2 if there is a structure-preserving mapping from D1 into D2
D1:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
D2:
  auction
    item
      iid: 433
      desc: 1971 Martin Guitar
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
Merge Template (T) Defines LT
A Merge Template, T, is specific to a set of documents
  Auction MT specific to “auction” documents
LT is all documents that are “compatible” and “key-respecting” with respect to T
Different lattice for each Merge Template
[Diagram: T picks out LT as a subset of the set of all documents (D1, D2, D3, D4, D5, D8, D10).]
Non-Key-Respecting Documents
D1: auction with one item (iid:501)
D2: auction with one item (iid:433)
D3: auction with one item carrying both iid:501 and iid:433
D4: auction with two items, (iid:501) and (iid:433)
⊑ means contained in. D is contained in D′ if there is a structure-preserving mapping from D into D′
D3 is not key-respecting with respect to T and should not be in LT.
T:
  (auction, [], NoContentNoAttrs)
    (item, [iid], NoContentNoAttrs)
      (iid, [], ExactMatch)
Merge-Lattice Theorem Overview
Associate each document D with a unique path set ρ(D)
ρ(D1) ∪ ρ(D2) is a Least Upper Bound (LUB) for ρ(D1) and ρ(D2)
  ρ(D1) ∪ ρ(D2) is the “smallest” set that contains both ρ(D1) and ρ(D2)
Intuition: Merge of D1 and D2 should be the document associated with ρ(D1) ∪ ρ(D2)
[Diagram: D1 and D2 in LT are mapped to ρ(D1) and ρ(D2); their union ρ(D1) ∪ ρ(D2) is mapped back to D3.]
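The intuition can be tried out directly on path sets. In the sketch below each path is written as a (rooted key value, element content) pair; that encoding, the function name `is_key_respecting`, and the conflicting example are assumptions made for illustration. The key-respecting check follows the idea that no two paths may share a rooted key value while differing in content.

```python
def is_key_respecting(paths):
    """No two paths share a rooted key value (they would differ only in content)."""
    keys = [key for key, _ in paths]
    return len(keys) == len(set(keys))

ITEM = "auction[].item[iid:501]"
SUE = ITEM + ".bid[bidder:Sue,amt:$1550]"

rho_d1 = {("auction[]", ""), (ITEM, ""), (ITEM + ".iid[]", "501"),
          (ITEM + ".desc[]", "Trek Madone 5.9 Bike")}
rho_d2 = {("auction[]", ""), (ITEM, ""), (ITEM + ".iid[]", "501"),
          (SUE, ""), (SUE + ".bidder[]", "Sue"), (SUE + ".amt[]", "$1550")}

# Candidate path set for the merged document: plain set union.
rho_d3 = rho_d1 | rho_d2
print(rho_d1 <= rho_d3 and rho_d2 <= rho_d3)  # True: it contains both inputs
print(is_key_respecting(rho_d3))              # True

# A document that disagrees on desc for the same item is not key-consistent with D1.
rho_conflict = {("auction[]", ""), (ITEM, ""), (ITEM + ".desc[]", "Some other bike")}
print(is_key_respecting(rho_d1 | rho_conflict))  # False
```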
Document and Path Set
Use Merge Template + document to create path set
One element in path set for each element in document
Path comprised of rooted key value and element content
Path set order (subset) identical to document containment order
Document:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
Path set:
  auction[]:
  auction[].item[iid:501]:
  auction[].item[iid:501].iid[]:501
  auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike
  auction[].item[iid:501].bid[bidder:Dave,amt:$1500]:
  auction[].item[iid:501].bid[bidder:Dave,amt:$1500].bidder[]:Dave
  auction[].item[iid:501].bid[bidder:Dave,amt:$1500].amt[]:$1500
Anatomy of a path: in auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike, the part before the last colon is the rooted key value and the part after it is the element content
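One way to derive such a path set is a recursive walk that carries the rooted key prefix down the tree. The sketch below is an assumption-laden illustration: elements are modelled as (name, text, children) tuples, `LOCAL_KEYS` is a flat stand-in for the Merge Template, and `path_set` is a hypothetical helper name.

```python
# Flat stand-in for the Merge Template: element name -> local key.
LOCAL_KEYS = {"auction": [], "item": ["iid"], "iid": [], "desc": [],
              "bid": ["bidder", "amt"], "bidder": [], "amt": []}

def path_set(element, prefix=""):
    """Return one (rooted key value, element content) pair per element."""
    name, text, children = element
    key = ",".join(f"{k}:{ctext}" for k in LOCAL_KEYS[name]
                   for cname, ctext, _ in children if cname == k)
    rooted = f"{prefix}{name}[{key}]"
    paths = {(rooted, text)}
    for child in children:
        paths |= path_set(child, rooted + ".")
    return paths

doc = ("auction", "", [
    ("item", "", [
        ("iid", "501", []),
        ("desc", "Trek Madone 5.9 Bike", []),
        ("bid", "", [("bidder", "Dave", []), ("amt", "$1500", [])]),
    ]),
])
paths = path_set(doc)
print(len(paths))  # 7, one path per element

# Containment shows up as a subset relation between path sets.
smaller = ("auction", "", [("item", "", [("iid", "501", [])])])
print(path_set(smaller) <= paths)  # True
```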
Proof that D3 is in LT
Construct D3 from ρ(D1) ∪ ρ(D2), show D3 is compatible and key-respecting with respect to T
[Diagram: D1, D2 and D3 with their path sets ρ(D1), ρ(D2) and ρ3, related to T through the mappings ρ1, ρ2, σ and their inverses.]
Outline
Incremental Query Evaluation
  Partial results over XML data
Merge Operation
Merge Theory
Merge Performance
Implementation Highlights
Accumulate operator uses repeated binary Merges to combine a series of XML documents into one result document
Accumulate is implemented as a recursive walk over input docs and the Merge Template
Implemented in Niagara v1.0 (UW-Madison)
  Lazy construction of DOM nodes: SAXDOM
  General improvements to Niagara 1.0 code base
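The "repeated binary Merges" structure of Accumulate is essentially a fold over the stream. The sketch below is not the Niagara code: it uses nested Python dicts whose keys already embed the key values (for example "item[iid:501]"), and the function names `merge` and `accumulate` and the sample stream are assumptions made for illustration.

```python
from functools import reduce

def merge(acc, doc):
    """Binary merge of nested dicts: equal keys are combined recursively, new keys inserted."""
    out = dict(acc)
    for key, value in doc.items():
        if key in out and isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

def accumulate(stream, initial=None):
    """Fold a series of documents into one accumulator with repeated binary merges."""
    return reduce(merge, stream, initial or {})

stream = [
    {"item[iid:501]": {"desc": "Trek Madone 5.9 Bike"}},
    {"item[iid:501]": {"bid[Dave]": {"amt": "$1500"}}},
    {"item[iid:433]": {"desc": "1971 Martin Guitar"}},
    {"item[iid:501]": {"bid[Sue]": {"amt": "$1550"}}},
]
accumulator = accumulate(stream)
print(sorted(accumulator["item[iid:501]"]))  # ['bid[Dave]', 'bid[Sue]', 'desc']
```

Since each step only needs the current accumulator and the next document, the accumulator can be inspected at any point in the stream, which is what makes it usable as a view of the stream.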
Performance Environment
866 MHz Pentium III, 512 MB memory, Red Hat Linux 8.0
Sun JVM J2SE 1.4.2, maximum memory 412 MB
Input Data - XMark
[Diagram: structure of the three XMark inputs. Element names shown per panel:]
  Persons: site, people, person*, id, name, email, phone?, profile, education
  Bids: site, open_auctions, open_auction*, id, bid, bidder, personref, person, time
  Items: site, open_auctions, open_auction*, id, seller, person, interval, start, end, reserve?
Legend: * 0 or more; ? optional
Structural Aggregation with Restructuring
Q5.1 – simple structural aggregation query
For each person produce a list of items they bid on and their bids on those items
Q5.1 input (Bids), element names: site, open_auctions, open_auction*, id, bid, bidder, personref, person, time
Q5.1 output:
  people
    person*
      id
      itemsbid
        item*
          id
          bid*
            amt
            time
Restructuring of Input
[Diagram: a Q5.1 input document (site, open_auctions, open_auction with iid:8, bid:$82, time:5:00 and bidder, personref, person:53) is restructured into the output shape (people, person with id:53, itemsbid, item with id:8, bid with amt:$82 and time:5:00) and then accumulated into the Q5.1 output.]
Steps: Q5.1 Input, restructure, Restructured Input, accumulate, Q5.1 Output
Q5.1 query plans
Merge Query Plan (bottom to top):
  filescan
  unnest(site.open_auctions.open_auction)
  unnest(bidder.person_ref.person as bidderid)
  construct (restructured document)
  accumulate
Nest Query Plan (bottom to top):
  filescan
  unnest(site.open_auctions.open_auction)
  unnest(open_auction.id as itemid)
  unnest(bidder)
  unnest(person_ref.person as bidderid)
  unnest(time)
  unnest(amt)
  nest(itemid, bidderid)
  nest(bidderid)
  nest(“”)
Q5.1 Nest Query Plan
Nest Query Plan (bottom to top):
  filescan
  unnest(open_auction, open_auction.id, bidder, person_ref.person, time, amt)
  nest(itemid, bidderid)
  nest(bidderid)
  nest(“”)
Q5.1 Output:
  people
    person
      id: 53
      itemsbid
        item
          id: 8
          bid
            amt: $82
            time: 5:00
Q5.1 Execution Time
[Figure: execution time (seconds) vs. MB of data (0 to 70). Series: Merge, MagicMerge, Nest.]
Q5.2 Execution Time
[Figure: execution time (seconds) vs. MB of data (0 to 80). Series: Merge, MagicMerge, Nest.]
Q5.2 output:
  items
    item*
      id
      bidder*
        id
        bid*
          amt
          time
Q5.2: for every item, list of bidders and their bids
Q5.1: for every person, list of items bid on and bids on those items
Execution time breakdown, Q5.2

Merge Plan
  Operator                                          Avg Execution Time (sec)
  filescan                                          3.8
  unnest: open_auction, itemid, bidder, bidderid    5.4 (0.9, 1.2, 1.7, 1.6)
  construct                                         4.9
  accumulate                                        6.3
  Total                                             20.4
  Avg Query Exec Time                               30.7 sec
  Avg GC Time                                       9.5 sec

Nest Plan
  Operator                                          Avg Execution Time (sec)
  filescan                                          3.7
  unnest: open_auction, itemid, bidder, bidderid    4.7 (0.9, 1.1, 1.4, 1.3)
  unnest (time)                                     1.3
  unnest (amt)                                      1.6
  nest(itemid, bidderid)                            5.2
  nest(itemid)                                      4.9
  nest(“”)                                          2.0
  Total                                             23.4
  Avg Query Exec Time                               42.5 sec
  Avg GC Time                                       17 sec
Simplified Q5.4-A Output
people
person*
name email
profile
education
phone?
id
bid
bidder*
personref
time
open_auction*
seller interval
start end
itemssold
id
reserve?
open_auction* id
itemsbid
person
person
For each person, provide person information, list of items put up for auction (itemssold) and items bid on (itemsbid)
Simplified Q5.4-B Output
people
person*
name email
profile
education
phone?
id
time
interval
start end
itemssold
id
reserve?
id
itemsbid
renamed
Key:
deleted
seller person
personref person
bid
item* item*
amt
Q5.4-A and Q5.4-B Results
[Figures: execution time (seconds) vs. MB of data (0 to 60) for Query 5.4-A and Query 5.4-B. Series: Merge, MagicMerge, Nest.]
Q5.4-B is faster despite having to unnest the input more deeply
Key factor: Q5.4-B has fewer elements in the result
Merge-Ready Structural Aggregation
No restructuring; input structured similar to output
Best case for Merge
[Figures: execution time (seconds) vs. MB of data (0 to 70) for Q5.5 (small documents) and Q5.6 (big documents). Series: Merge, Nest.]
Sliding Structural Aggregation
Extend accumulate to handle sliding windows
For each element, maintain range of windows
Test vs. sliding nest
[Figure: Q6.1 (group bids by item then person); execution time (seconds) vs. range (0 to 1250). Series: Merge, MagicMerge, Nest.]
Conclusion
Studied processing of XML Streams
IQE
  General framework for partial results over initial portion of stream
Merge
  Flexible operator for combining XML documents
  Formal definition in terms of lattice theory
  Outperforms nest-based alternatives
Extras/Deletes
Re-evaluation vs. differential
Query plan for re-evaluation vs. differential
Query plan (bottom to top):
  Inputs: (Author, Address), (Author, Book)
  Join on Author
  Nest on Author
Partially-Ordered Set (POSet)
Let P be a set. A partial order (≤) on P is such that for all x, y, z ∈ P:
  (i) x ≤ x
  (ii) x ≤ y and y ≤ x ⇒ x = y
  (iii) x ≤ y and y ≤ z ⇒ x ≤ z
Example: set of sets {1}, {1, 2}, {2, 3}, {1, 2, 3} (in the diagram, an edge implies ≤)
  S1 ≤ S2 if S1 ⊆ S2
Sliding Accum query plan Q6.1
Query plan (bottom to top):
  filescan + series of unnests
  construct
  bucket
  sliding accumulate
Tuples into bucket (document, timestamp):
  t1 = (D1, 12:01 PM)
  t2 = (D2, 12:20 PM)
  p1 = (*, 2:00 PM)
Tuples out of bucket (document, timestamp, window-min, window-max):
  t1′ = (D1, 12:01 PM, 0, 7)
  t2′ = (D2, 12:20 PM, 1, 8)
  p1′ = (*, 2:00 PM, 0, 0)
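How a bucket step might attach a window range to each tuple can be sketched as follows. Everything specific here is an assumption: the function name `window_bounds`, the window numbering, and the parameters (stream start at 12:00 PM, a 15-minute slide, a 2-hour range) were chosen only because they reproduce the window-min and window-max values of t1′ and t2′ on the slide. The slide does not state the actual parameters, and the punctuation tuple p1 is not modelled.

```python
from datetime import datetime, timedelta

def window_bounds(timestamp, start, slide, span):
    """Return (window-min, window-max) for a tuple under the assumed numbering."""
    bucket = (timestamp - start) // slide   # index of the slide interval holding the tuple
    windows_per_bucket = span // slide      # number of windows overlapping one interval
    return bucket, bucket + windows_per_bucket - 1

# Assumed parameters, not taken from the talk.
START = datetime(2004, 12, 17, 12, 0)
SLIDE = timedelta(minutes=15)
SPAN = timedelta(hours=2)

t1 = window_bounds(datetime(2004, 12, 17, 12, 1), START, SLIDE, SPAN)
t2 = window_bounds(datetime(2004, 12, 17, 12, 20), START, SLIDE, SPAN)
print(t1)  # (0, 7)
print(t2)  # (1, 8)
```

Storing only the two bounds per element, instead of one copy per window, is what keeps the accumulator compact when windows overlap heavily.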
Sliding Nest Query Plan Q6.1
Query plan (bottom to top):
  filescan + series of unnests
  construct
  bucket
  sliding nest(itemid, bidderid, windowid)
  sliding nest(bidderid, windowid)
  sliding nest(windowid)
Tuples into bucket (document, timestamp):
  t1 = (D1, 12:01 PM)
  t2 = (D2, 12:20 PM)
  p1 = (*, 2:00 PM)
Tuples out of bucket (document, timestamp, window-min, window-max):
  t1′ = (D1, 12:01 PM, 0, 7)
  t2′ = (D2, 12:20 PM, 1, 8)
  p1′ = (*, 2:00 PM, 0, 0)
Merge-Lattice Theorem
The Merge-Lattice Theorem states that, given a Merge Template T, the set of XML documents that are “compatible” with T and “key-respecting” with respect to T is an upper semi-lattice under a specific ordering based on T.
Compatibility Mapping
Auction Status Document:
  auction
    item
      id: 501
      desc: Trek Madone 5.9 Bike
      bid
        bidder: Dave
        amt: $1500
      bid
        bidder: Sue
        amt: $1550
Auction Merge Template:
  (auction, [])
    (item, [id])
      (id, [])
      (desc, [])
      (quantity, [])
      (bid, [bidder, amt])
        (bidder, [])
        (amt, [])
Identify Document Containment
D1:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
ρ(D1):
  auction[]:
  auction[].item[iid:501]:
  auction[].item[iid:501].iid[]:501
  auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike
D2:
  auction
    item
      iid: 501
      desc: Trek Madone 5.9 Bike
    item
      iid: 433
      desc: 1971 Martin Guitar
ρ(D2):
  auction[]:
  auction[].item[iid:501]:
  auction[].item[iid:501].iid[]:501
  auction[].item[iid:501].desc[]:Trek Madone 5.9 Bike
  auction[].item[iid:433]:
  auction[].item[iid:433].iid[]:433
  auction[].item[iid:433].desc[]:1971 Martin Guitar
Term (applies to): Definition

compatible, compatibility mapping (Document D, Merge Template T):
  D is compatible with T if there exists an operation-preserving function that maps elements in D to EMTs in T such that the root of D maps to T.root and, for every element E in D, E has the same name as the EMT it maps to and the parent of E maps to the parent of that EMT. This function is called a compatibility mapping. (Section 4.3)

key-exact (Element E, the EMT that E maps to):
  E is key-exact with respect to its EMT if, for every path p in the Local Key of that EMT, patheval(E, D, p) is a singleton set. (Section 4.4)

key-exact (D, T, compatibility mapping):
  D is key-exact with respect to T if every element E in D is key-exact with respect to the EMT it maps to. (Section 4.5)

key-respecting (D, T, compatibility mapping):
  D is key-respecting with respect to T and the compatibility mapping if no two elements of D have the same rooted key value. D must be key-exact with respect to T and the compatibility mapping. (Section 4.5)

key-respecting (Path set P):
  P is key-respecting if there do not exist p1 and p2 in P such that p1 and p2 differ only in the value string of the terminal element. If a document is key-respecting, its path set is key-respecting. (Section 4.5)

Path-Containment ordering (Documents D1 and D2):
  D1 is contained in D2 (D1 ⊑ D2) if there exists a 1-1 homomorphism from D1 into D2 that maps D1.root to D2.root and, for every element E in D1, preserves the name and value of E and maps the parent of E to the parent of the image of E. (Section 4.6)

key-consistent (D1, D2, T):
  D1 and D2 are key-consistent with respect to T if the union of their path sets is key-respecting. (Section 4.7)

mergeable (D1, D2, T):
  D1 and D2 are mergeable if they are key-consistent and lkv(D1.root) = lkv(D2.root). (Section 4.7)
Nest Operator Example
Input (Subject, Title):
  (Google, “New Google chapter…”)
  (Google, “Google pens new…”)
  (Microsoft, “Microsoft launches…”)
  (Google, “Google speaks volumes…”)
  (Microsoft, “MSN ships…”)
nest
Output (Subject, Title):
  (Google, {“New Google chapter…”, “Google pens new…”, “Google speaks volumes…”})
  (Microsoft, {“Microsoft launches…”, “MSN ships…”})
Result in XML:
  result
    Subject: Google
      Title: New Google chapter…
      Title: Google pens new…
      Title: Google speaks volumes…
    Subject: Microsoft
      Title: Microsoft launches…
      Title: MSN ships…
Is Nest Monotonic?
An operator O is monotonic if: A “less than” B ⇒ O(A) “less than” O(B)
nest on subject:
  Input 1 (I1): [ (Google, Title1), (Microsoft, Title2), (Microsoft, Title3) ]
  Result 1 (R1): [ (Google, {Title1}), (Microsoft, {Title2, Title3}) ]
  Input 2 (I2): [ (Google, Title1), (Microsoft, Title2), (Microsoft, Title3), (Google, Title4) ]
  Result 2 (R2): [ (Google, {Title1, Title4}), (Microsoft, {Title2, Title3}) ]
Answer: it depends on how you define “less than”
  If “less than” is ⊆, answer is no
  If “less than” is substructure, answer is yes
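Both answers can be reproduced on the inputs I1 and I2 above. The sketch below is illustrative only; `nest` and `is_substructure` are hypothetical helper names, and "substructure" is modelled here as "every group of R1 appears in R2 with a superset of its titles".

```python
def nest(tuples):
    """Group (Subject, Title) tuples into (Subject, set of Titles) pairs."""
    groups = {}
    for subject, title in tuples:
        groups.setdefault(subject, set()).add(title)
    return {(s, frozenset(ts)) for s, ts in groups.items()}

def is_substructure(r1, r2):
    """Every group in r1 has a group in r2 with the same subject and a superset of titles."""
    by_subject = dict(r2)
    return all(s in by_subject and ts <= by_subject[s] for s, ts in r1)

i1 = [("Google", "Title1"), ("Microsoft", "Title2"), ("Microsoft", "Title3")]
i2 = i1 + [("Google", "Title4")]  # I1 is contained in I2
r1, r2 = nest(i1), nest(i2)

subset_monotonic = r1 <= r2                        # (Google, {Title1}) is not a tuple of R2
substructure_monotonic = is_substructure(r1, r2)   # {Title1} is contained in {Title1, Title4}
print(subset_monotonic)        # False
print(substructure_monotonic)  # True
```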