A Paradigm Shift in Database Optimization: From Indices to Aggregates
Presented to: The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress
Ryan LaBrie • Robert St. Louis • Lin Ye
Arizona State University
Transcript
A Paradigm Shift in Database Optimization: From Indices to Aggregates
  Presented to: The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress
  Ryan LaBrie • Robert St. Louis • Lin Ye

A need for a shift in optimization strategy
What our research is focusing on
How we performed this research
Update on our results
Next steps

Why a Shift, Why Now?
  HISTORICALLY
    Relational database technology is really good at what it does…
    Transaction-oriented, operational systems
    Optimized for data INPUT
    FOCUS: Storage of DATA
  TODAY'S ENVIRONMENT
    Large Data Warehouses
    Used for decision support
    Need to be optimized for information OUTPUT
    FOCUS: Retrieval of INFORMATION

The Decision Support Problem
  Relational DBMS limitations
    Too much data
      Tera- and petabytes, quickly approaching exabytes
    Too complex queries
      Structured Query Language
    Too long for results
      Indexing limitations
        Usage of (i.e. Table Scans)
        B+ Trees

A Possible Decision Support Solution
  Multidimensional Databases
    New effective storage techniques
    Simpler modeling techniques
    Potential for easier query interfaces
    Appropriate use of redundancy
    More effective indexing algorithms
      Bitmapped indices

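To make the bitmapped-index bullet concrete, here is a minimal sketch of the general technique, not of any particular product's implementation. The column names (`region`, `priority`), the sample values, and both helper functions are hypothetical and chosen only for illustration. Each distinct column value gets one bitmap, and a multi-condition filter reduces to a bitwise AND of bitmaps.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Return the row ids whose bits are set in the bitmap, in ascending order."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical columns of a small fact table (illustrative data only).
region = ["WEST", "EAST", "WEST", "SOUTH", "EAST", "WEST"]
priority = ["HIGH", "LOW", "LOW", "HIGH", "HIGH", "HIGH"]

region_index = build_bitmap_index(region)
priority_index = build_bitmap_index(priority)

# WHERE region = 'WEST' AND priority = 'HIGH' becomes a single bitwise AND.
matching_rows = rows_from_bitmap(region_index["WEST"] & priority_index["HIGH"])
print(matching_rows)
```

Bitmaps of this kind tend to suit the low-cardinality dimension attributes common in decision support, because each extra predicate costs one more bitwise operation rather than another tree traversal.
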
The Focus of Our Research
  CURRENT RESEARCH
    1. Cost comparisons of Relational vs. Multidimensional Decision Support Systems
    2. Working towards a multidimensional benchmarking system
         TPC-H is positioned as a Decision Support benchmark, however it is based on relational technologies
         GOAL: Vendor neutral benchmark for comparing multidimensional database products
  FUTURE RESEARCH
    In the long term, show that decisions can be made more easily with multidimensional technology

Why Develop a Multidimensional Benchmark?
  Benchmarking is an established method for creating vendor-neutral tests
    Transaction Processing Performance Council (TPC)
  Benchmarking has been examined in other IS fields, including
    Server Platforms: Johnson & Gray, 1993
    eCommerce: Menasce, 2002
  It has been called for specifically in the data warehousing academic community
    Nemati et al., 2000
  and
    Has yet to be done

How Are We Building Our Benchmark
  Based on the TPC-H relational decision support benchmark
  Create a relational dimensional model that forms the basis for the data mart
  Build a multidimensional cube off the dimensional model
  Convert the SQL statement to the equivalent MDX
  Run both the SQL query and the MDX query, report results

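The build-and-compare procedure above can be sketched in miniature. This is a rough illustration under stated assumptions, not the benchmark itself: it uses Python's built-in `sqlite3`, a synthetic fact table, and a precomputed summary table as a stand-in for the multidimensional cube, so the second query is plain SQL rather than MDX. The table names `order_fact` and `order_agg`, the columns, and the generated rows are all hypothetical.

```python
import sqlite3
import time


def timed(conn, sql):
    """Run one query and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


conn = sqlite3.connect(":memory:")

# Hypothetical fact table of the dimensional model: one row per order.
conn.execute(
    "CREATE TABLE order_fact (order_year INTEGER, priority TEXT, order_count INTEGER)"
)
priorities = ["1-URGENT", "2-HIGH", "3-MEDIUM", "4-NOT SPECIFIED", "5-LOW"]
fact_rows = [(1992 + i % 7, priorities[i % 5], 1) for i in range(200_000)]
conn.executemany("INSERT INTO order_fact VALUES (?, ?, ?)", fact_rows)

# Stand-in for the cube (an assumption): an aggregate precomputed once at load time.
conn.execute("""
    CREATE TABLE order_agg AS
    SELECT order_year, priority, SUM(order_count) AS order_count
    FROM order_fact
    GROUP BY order_year, priority
""")

# The same question asked of the detail rows and of the aggregate.
detail_sql = """
    SELECT priority, SUM(order_count) FROM order_fact
    WHERE order_year = 1993 GROUP BY priority ORDER BY priority
"""
aggregate_sql = """
    SELECT priority, SUM(order_count) FROM order_agg
    WHERE order_year = 1993 GROUP BY priority ORDER BY priority
"""

detail_rows, detail_secs = timed(conn, detail_sql)
aggregate_rows, aggregate_secs = timed(conn, aggregate_sql)

# Both paths must return the same answer before timings are worth comparing.
results_match = detail_rows == aggregate_rows
print(f"results match: {results_match}")
print(f"detail scan:   {detail_secs:.6f} s")
print(f"aggregate:     {aggregate_secs:.6f} s")
```

The detail query has to read every fact row, while the aggregate query reads a table with one row per year and priority combination. Timings from a toy in-memory run like this say nothing about the 1GB and 10GB TPC-H data sets; the sketch only shows the shape of the comparison.
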
What We Have Done To Date
  Initially have mapped all 22 TPC-H relational queries to potential data marts
    3-4 data marts necessary
  Built 2 TPC-H data sets (1GB and 10GB)
  Converted TPC-H Query #4 to MDX
  Ran comparisons on both data sets
  In the process of converting a second query (TPC-H Query #7) for additional analysis/confirmation of gains

For a very modest investment organizations will be able to process very large data warehouses
The multidimensional data mart is the only practical (speed, processing time) way to support the end-user decision maker.
Aggregation truly is a substitute for expensive hardware

Next Steps
  Acquire a larger server
  Build 100GB and 300GB TPC-H data sets
  Benchmark both relational and dimensional queries
  Publish results
  Consider ROLAP, HOLAP, MOLAP issues
  Possible extensions to some data mining research
  Possible extensions to decision making through technology research