Evaluating XML-Extended OLAP Queries Based on a Physical Algebra Xuepeng Yin and Torben B. Pedersen Department of Computer Science Aalborg University
Feb 12, 2016
# 2
Problem
• OLAP systems are good for complex analysis queries
  - Easy to use
  - Fast
  - Business, science, ...
• Problems with physical integration in existing OLAP systems
  - Integrating new data requires a (partial) cube rebuild, which is too slow
• Problems arise with dynamic data
  - Stock quotes, competitors' prices, disease info, ...
• Data will often be available in Extensible Markup Language (XML) format
  - Weather data, map info, price lists, ...
# 3
Solution
• Allows the use of external XML data as virtual dimensions
  - Decoration (extra info): type information
  - Selection: condition on XML data
  - Grouping: categories by XML data
[Figure: logical federation. An OLAP/XML query is split into an OLAP query for the OLAP component and an XML query for the XML documents.]
• Goal: flexible access to XML data from OLAP systems
# 4
Overview
• Contributions
• Architecture of the federation
• Linking OLAP and XML
• The federation query semantics
  - The logical algebra
  - The physical algebra
  - Conversion from logical to physical plans
• Plan execution
• Query optimization
  - The query optimizer
  - Execution of an optimized plan
• Performance
• Conclusion
# 5
Contributions of This Paper
• Previous OLAP-XML federation efforts
  - A logical algebra
  - A partial, straightforward implementation
• Problems with previous work
  - The logical algebra does not accurately reflect query execution tasks
  - Query optimization is based on an abstract level
  - Implementation is very limited
• Novelties of this paper
  - A physical algebra and simplified query semantics
  - Practical query optimization techniques
  - A full-function, robust query engine
  - Experiments with the query engine
# 6
Architecture of the federation
• OLAP and XML components
• Auxiliary components
• Query engine
  - Query analyzer
  - Query optimizer
  - Query evaluator
# 7
Linking OLAP and XML
• Links
  - Relation between a set of dimension values and a set of XML nodes
• Level expressions
  - <level>/<link>/<XPath expression> specifies a concrete link usage
  - Nation/Nlink/Population links nations to populations
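[Figure: example cube with labels Time, Orders, Suppliers, EC and Quantity, and levels Year, Quarter, Month, Customer, Order, Region, Nation, Supplier, Man., Brand and Part; the link Nlink connects the Nation level to the XML document below.]

<Nations>
  <Nation>
    <NationName>Denmark</NationName>
    <Population>5.3</Population>
  </Nation>
</Nations>

Nlink = {(DK, n1), (CN, n2), (UK, n3)}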
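A minimal sketch of how a level expression such as Nation/Nlink/Population could be resolved, assuming a link is stored as a mapping from dimension values to XML nodes. The function names, the country names for CN and UK, and the name-to-code table are illustrative assumptions; only the Denmark entry and the three link pairs come from the slide.

```python
import xml.etree.ElementTree as ET

# Only the Denmark entry appears on the slide; the other two entries are
# assumed here so that all three link pairs (DK, CN, UK) have a target node.
XML_DOC = """
<Nations>
  <Nation><NationName>Denmark</NationName><Population>5.3</Population></Nation>
  <Nation><NationName>China</NationName><Population>1264.5</Population></Nation>
  <Nation><NationName>United Kingdom</NationName><Population>19.1</Population></Nation>
</Nations>
"""

def build_nlink(root):
    """Build the link as {dimension value: XML node} (n1, n2, n3 on the slide)."""
    codes = {"Denmark": "DK", "China": "CN", "United Kingdom": "UK"}  # assumed mapping
    return {codes[n.findtext("NationName")]: n for n in root.findall("Nation")}

def evaluate_level_expression(expr, links):
    """Evaluate '<level>/<link>/<XPath>' and return {dimension value: node text}."""
    _level, link_name, xpath = expr.split("/", 2)
    result = {}
    for value, node in links[link_name].items():
        hit = node.find(xpath)  # XPath is evaluated relative to the linked node
        if hit is not None:
            result[value] = hit.text
    return result

root = ET.fromstring(XML_DOC)
links = {"Nlink": build_nlink(root)}
populations = evaluate_level_expression("Nation/Nlink/Population", links)
print(populations)
```

The XPath part is evaluated relative to each linked node, so the same link can serve several level expressions.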
# 8
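The Federation Query Semantics
• The logical algebra
  - Decoration, federation selection, federation generalized projection
• The federation query language: SQLXM
  SELECT SUM(Quantity), Brand(Part), Nation/Nlink/Population
  FROM TC
  WHERE Nation/Nlink/Population<30
  GROUP BY Brand(Part), Nation/Nlink/Population
• The corresponding logical plan, top operator first (N/Nl/P abbreviates Nation/Nlink/Population, Q abbreviates Quantity):
  $\Pi_{Fed[Brand(Part),\, N/Nl/P]\langle SUM(Q)\rangle}$ (federation generalized projection)
  $\sigma_{Fed[N/Nl/P<30]}$ (federation selection)
  $\delta_{N/Nl/P}$ (decoration)
  $\mathcal{F}_{TC}$ (the federation over cube TC)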
# 9
The Physical Algebra
• Includes data retrieval and manipulation operators
• A physical plan models real execution tasks
  - i.e., when, where and how data is processed
• Nine physical operators
  - Querying the OLAP component: cube selection and cube generalized projection
  - Data transfer between components: fact-, dimension- and XML-transfer operators
  - Temporary data manipulations: decoration, federation selection and federation generalized projection
  - Inlining XML data: inlining
# 10
Querying the OLAP Component
• Cube selection
  - Has no references to XML data
  - Performs selection over the OLAP cube
  - Intuitively, a SQL SELECT statement
• Cube generalized projection
  - Has no references to XML data
  - Rolls up dimensions and aggregates specified measures at specified levels
  - Intuitively, a SQL SELECT statement with a GROUP BY clause
# 11
Data Transfer Between Components
• Fact-transfer
  - Transfers the OLAP fact data to the temporary component
  - The temporary facts can then be decorated
  - Intuitively, a SQL SELECT INTO statement
• Dimension-transfer
  - Transfers dimension data to the temporary component
  - Used when higher-level dimension data is required in the temporary component
• XML-transfer
  - Transfers XML data to the temporary component
  - Uses XPath expressions to identify XML nodes with decoration values
# 12
Temporary Data Manipulations
• Decoration
  - Decorates the cube by adding a new dimension
  - Intuitively, adds a table with dimension and decoration XML data
    SELECT * FROM t(supplier, nation) t1, t(nation, population) t2
    WHERE t1.nation = t2.nation
• Federation selection
  - Performs selection over the cube in the temporary component
  - Intuitively, a SQL selection over the temporary tables
    SELECT t1.* FROM tfact t1, t(supplier, population) t2
    WHERE t1.supplier = t2.supplier AND population < 30
• Federation generalized projection
  - Rolls up and aggregates the cube in the temporary component
  - Intuitively, a SQL selection with a GROUP BY clause
    SELECT SUM(Quantity), t2.population FROM tfact t1, t(supplier, population) t2
    WHERE t1.supplier = t2.supplier GROUP BY t2.population
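The three temporary-component operators can be tried out with an in-memory SQLite database standing in for the temporary component. This is only a sketch: the table names (`tfact`, `t_supplier_nation`, `t_nation_population`, `t_supplier_population`) are assumed stand-ins for the slide's `t(...)` notation, the fact table is cut down to three columns, and the rows are the example values used elsewhere in these slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the temporary component
cur = con.cursor()
cur.executescript("""
CREATE TABLE tfact (quantity INT, supplier TEXT, part TEXT);
CREATE TABLE t_supplier_nation (supplier TEXT, nation TEXT);
CREATE TABLE t_nation_population (nation TEXT, population REAL);
INSERT INTO tfact VALUES (17,'S1','P3'), (28,'S2','P4'), (2,'S3','P3'), (26,'S4','P2');
INSERT INTO t_supplier_nation VALUES ('S1','DK'), ('S2','DK'), ('S3','CN'), ('S4','UK');
INSERT INTO t_nation_population VALUES ('DK',5.3), ('CN',1264.5), ('UK',19.1);

-- Decoration: pair each supplier with the population of its nation.
CREATE TABLE t_supplier_population AS
  SELECT t1.supplier, t2.population
  FROM t_supplier_nation t1, t_nation_population t2
  WHERE t1.nation = t2.nation;
""")

# Federation selection: keep facts whose decoration value satisfies the predicate.
selected = cur.execute("""
    SELECT t1.* FROM tfact t1, t_supplier_population t2
    WHERE t1.supplier = t2.supplier AND t2.population < 30
    ORDER BY t1.supplier
""").fetchall()

# Federation generalized projection: roll up and aggregate by the decoration value.
grouped = cur.execute("""
    SELECT SUM(t1.quantity), t2.population
    FROM tfact t1, t_supplier_population t2
    WHERE t1.supplier = t2.supplier
    GROUP BY t2.population
    ORDER BY t2.population
""").fetchall()

print(selected)
print(grouped)
```

Every operator here joins the full fact table with decoration data inside the temporary component, so the cost grows with the number of transferred facts.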
# 13
Inlining XML Data
• Denoted as (symbol not preserved)
• Comparing federated data in the temporary component is expensive
• Inlining refers to integrating XML data into the OLAP selections
• A resulting predicate
  - Only references dimension levels and constants
  - Can be evaluated in the OLAP component

Example: the predicate Nation/Nlink/Population<30 combined with the XML data

Nation  Population
DK      5.3
CN      1264.5
UK      19.1

yields the predicate Nation='DK' OR Nation='UK'
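The rewriting in this example can be sketched as a small function that turns a comparison on decoration values into a disjunction over dimension values. The function name `inline_predicate`, the supported comparison set, and the `FALSE` fallback for an empty match are assumptions made for illustration, not the engine's actual interface.

```python
import operator

# Comparison operators assumed for this sketch.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq}

def inline_predicate(level, decoration, op, constant):
    """Rewrite '<level>/<link>/<xpath> op constant' into a predicate that
    only references the dimension level and constants."""
    matches = [value for value, deco in decoration.items()
               if OPS[op](deco, constant)]
    if not matches:
        return "FALSE"  # assumed convention when no dimension value qualifies
    return " OR ".join(f"{level}='{value}'" for value in matches)

# Decoration values fetched from the XML component (the table above).
population = {"DK": 5.3, "CN": 1264.5, "UK": 19.1}
predicate = inline_predicate("Nation", population, "<", 30)
print(predicate)  # Nation='DK' OR Nation='UK'
```

Because the result mentions only the Nation level and constants, it can be handed to the OLAP component as an ordinary cube selection.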
# 14
From Logical to Physical Plans
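Abbreviations: N/Nl/P is Nation/Nlink/Population, B(P) is Brand(Part), Q is Quantity, S is Supplier, N is Nation, P is Part, B is Brand.

Logical plan, top operator first:
  $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$
  $\sigma_{Fed[N/Nl/P<30]}$
  $\delta_{N/Nl/P}$
  $\mathcal{F}_{TC}$

Physical plan, top operator first:
  $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$
  $\sigma_{Fed[N/Nl/P<30]}$
  $\delta_{N/Nl/P}$
  Data brought into the temporary component: XML-transfer of $N/Nl/P$, dimension-transfer of $[S, N]$ and $[P, B]$, fact-transfer of $\mathcal{F}_{TC,ext}$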
# 15
Plan Execution
Executing the physical plan step by step (N/Nl/P abbreviates Nation/Nlink/Population):

Step 1. Fact-transfer of $\mathcal{F}_{TC,ext}$ copies the facts to the temporary component:

Quantity  ExtPrice  Supplier  Part  Order  Day
17        17954     S1        P3    11     2/12/96
28        29983     S2        P4    42     30/3/94
2         2388      S3        P3    4      8/12/96
26        26374     S4        P2    20     10/11/93

Step 2. XML-transfer of N/Nl/P and dimension-transfer of [S, N] fetch:

Nation  Population
DK      5.3
CN      1264.5
UK      19.1

Supplier  Nation
S1        DK
S2        DK
S3        CN
S4        UK

Step 3. Decoration $\delta_{N/Nl/P}$ pairs suppliers with populations:

Supplier  Population
S1        5.3
S2        5.3
S3        1264.5
S4        19.1

Step 4. Federation selection $\sigma_{Fed[N/Nl/P<30]}$ removes the fact of supplier S3:

Quantity  ExtPrice  Population  Part  Order  Day
17        17954     5.3         P3    11     2/12/96
28        29983     5.3         P4    42     30/3/94
26        26374     19.1        P2    20     10/11/93

Step 5. Dimension-transfer of [P, B] fetches:

Part  Brand
P2    B2
P3    B3
P4    B4

Step 6. Federation generalized projection $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$ produces the result:

Quantity  Population  Brand
17        5.3         B3
28        5.3         B4
26        19.1        B2
# 16
The Query Optimizer
[Figure: optimizer pipeline. Initial plan -> Plan Rewriting -> Logical Plan Conversion -> Cost Estimation -> Plan Space Pruning -> Final execution plan.]
• Based on the Volcano optimizer
• Four-phase optimization at one stage
  - Logically equivalent plan enumeration
  - One-to-one logical-to-physical conversion
  - Estimating the cost of physical plans: $t_{plan} = t_{root} + Max(t_1, \ldots, t_n)$
  - Cost-based plan space pruning
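A minimal sketch of the cost estimation and pruning phases, assuming the cost of a plan is the cost of its root operator plus the maximum cost of its child subplans (that is, children are taken to run in parallel). The `PlanNode` class, the operator names and all cost figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    name: str
    cost: float  # estimated time of this operator alone (hypothetical units)
    children: List["PlanNode"] = field(default_factory=list)

def plan_cost(node: PlanNode) -> float:
    """t_plan = t_root + Max(t_1, ..., t_n); a leaf costs only t_root."""
    child_costs = [plan_cost(child) for child in node.children]
    return node.cost + (max(child_costs) if child_costs else 0.0)

def cheapest(plans: List[PlanNode]) -> PlanNode:
    """Cost-based pruning reduced to its simplest form: keep the cheapest plan."""
    return min(plans, key=plan_cost)

# Plan that transfers all facts and filters in the temporary component.
naive = PlanNode("fed_gen_projection", 2.0, [
    PlanNode("fed_selection", 3.0, [
        PlanNode("decoration", 1.0, [
            PlanNode("xml_transfer", 4.0),
            PlanNode("dimension_transfer", 1.0),
            PlanNode("fact_transfer", 10.0),
        ])])])

# Plan that inlines the XML predicate and aggregates in the OLAP component first.
optimized = PlanNode("fed_gen_projection", 1.0, [
    PlanNode("decoration", 1.0, [
        PlanNode("xml_transfer", 4.0),
        PlanNode("fact_transfer", 2.0, [
            PlanNode("cube_gen_projection", 1.0, [
                PlanNode("cube_selection", 1.0, [
                    PlanNode("inlining", 0.5, [
                        PlanNode("xml_transfer", 4.0)])])])]),
    ])])

print(plan_cost(naive), plan_cost(optimized))
```

With these made-up numbers the naive plan costs 16.0 and the optimized plan 10.5, so `cheapest([naive, optimized])` keeps the optimized plan.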
# 17
An Optimized Query Plan
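Abbreviations: N/Nl/P is Nation/Nlink/Population, B(P) is Brand(Part), Q is Quantity, N is Nation, B is Brand. (N/Nl/P<30)' denotes the predicate after inlining.

Optimized logical plan, top operator first:
  $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$
  $\delta_{N/Nl/P}$
  $\Pi_{Fed[B(P),\, N]\langle SUM(Q)\rangle}$
  $\sigma_{Fed[(N/Nl/P<30)']}$
  $\mathcal{F}_{TC}$

Optimized physical plan, top operator first:
  $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$
  $\delta_{N/Nl/P}$
  $\Pi_{Cube[B,\, N]\langle SUM(Q)\rangle}$
  $\sigma_{Cube[(N/Nl/P<30)']}$
  Inlining of $[N/Nl/P<30]$
  XML-transfer of $N/Nl/P$
  $\mathcal{F}_{TC,ext}$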
# 18
Execution of the Optimized Plan

Step 1. XML-transfer of N/Nl/P (Nation/Nlink/Population) fetches the decoration data:

Nation  Population
DK      5.3
CN      1264.5
UK      19.1

Step 2. Inlining rewrites the predicate N/Nl/P<30 into Nation='DK' OR Nation='UK'.

Step 3. Cube selection $\sigma_{Cube[Nation='DK'\ OR\ Nation='UK']}$ is evaluated in the OLAP component:

Quantity  ExtPrice  Supplier  Part  Order  Day
17        17954     S1        P3    11     2/12/96
28        29983     S2        P4    42     30/3/94
26        26374     S4        P2    20     10/11/93

Step 4. Cube generalized projection $\Pi_{Cube[B,\, N]\langle SUM(Q)\rangle}$ aggregates in the OLAP component, and the result is transferred to the temporary component:

Quantity  Nation  Brand
17        DK      B3
28        DK      B4
26        UK      B2

Step 5. Decoration $\delta_{N/Nl/P}$ combines the transferred facts with the Nation and Population data from Step 1.

Step 6. Federation generalized projection $\Pi_{Fed[B(P),\, N/Nl/P]\langle SUM(Q)\rangle}$ produces the result:

Quantity  Population  Brand
17        5.3         B3
28        5.3         B4
26        19.1        B2
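The two execution strategies can be compared on the example data with a short sketch. The functions `unoptimized_plan` and `optimized_plan` are illustrative names, plain Python dictionaries stand in for the OLAP, XML and temporary components, and only the Quantity measure is carried along; none of this is the engine's actual code.

```python
from collections import defaultdict

# Example data from the slides: (quantity, supplier, part).
FACTS = [(17, "S1", "P3"), (28, "S2", "P4"), (2, "S3", "P3"), (26, "S4", "P2")]
SUPPLIER_NATION = {"S1": "DK", "S2": "DK", "S3": "CN", "S4": "UK"}
PART_BRAND = {"P2": "B2", "P3": "B3", "P4": "B4"}
POPULATION = {"DK": 5.3, "CN": 1264.5, "UK": 19.1}  # result of the XML-transfer

def unoptimized_plan():
    """Transfer every fact, then decorate, select and aggregate temporarily."""
    result = defaultdict(int)
    for quantity, supplier, part in FACTS:
        population = POPULATION[SUPPLIER_NATION[supplier]]  # decoration
        if population < 30:                                  # federation selection
            result[(PART_BRAND[part], population)] += quantity  # federation gen. projection
    return dict(result)

def optimized_plan():
    """Inline the XML predicate, aggregate in the cube, decorate afterwards."""
    # Inlining: the XML predicate becomes a set of qualifying Nation values.
    nations = {n for n, p in POPULATION.items() if p < 30}
    # Cube selection and cube generalized projection (OLAP component).
    cube = defaultdict(int)
    for quantity, supplier, part in FACTS:
        nation = SUPPLIER_NATION[supplier]
        if nation in nations:
            cube[(PART_BRAND[part], nation)] += quantity
    # Fact-transfer of the aggregate, decoration, federation generalized projection.
    result = defaultdict(int)
    for (brand, nation), quantity in cube.items():
        result[(brand, POPULATION[nation])] += quantity
    return dict(result)

print(optimized_plan())
print(optimized_plan() == unoptimized_plan())
```

Both plans return the same three result rows; the optimized one moves only the aggregated rows out of the OLAP component instead of every fact.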
# 19
Performance
[Figure: two bar charts of evaluation time (in seconds, logarithmic axis) for query types 1 to 16. Legend: Federated, Cached, Integrated.]
• One experiment compared:
  a. Our federated solution
  b. Physical integration
  c. Federating cached XML data
• Data
  - 100M fact data based on the TPC-H benchmark
  - 11MB and 2KB XML data
• Queries
• Result:
  - The federated solution (a) is comparable to physical integration (b) for small amounts of data
  - Use cached XML data (c) for large amounts of data
# 20
Related Work
• Generic data integration
  - Relational, XML, semi-structured, OO, ... and combinations
  - Do not consider OLAP DB properties such as automatic aggregation, dimension hierarchies and correct aggregation
• OLAP-object federations
  - Current solution offers much more general use of external data
  - Current solution is not restricted to rigid object schemas
  - Current solution allows irregular data
• Previous OLAP-XML federation efforts
  - A logical algebra
  - A partial, straightforward implementation
# 21
Conclusion
• OLAP handles schema changes and dynamic data poorly
• Solutions
  - Logical federation of OLAP and XML
  - A physical algebra models actual execution tasks
  - Optimized query evaluation
• Experiments suggest feasibility
• Future work
  - More optimization techniques
  - Advanced evaluation techniques
  - Co-operative development with an OLAP query tool vendor