© Copyright 2000 M. Rodriguez-Martinez, All Rights Reserved MOCHA : A Self-Extensible Database Middleware System for Distributed Data Sources Manuel Rodriguez-Martinez.
MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources
Manuel Rodriguez-Martinez
Nick Roussopoulos
SIGMOD 2000 M. Rodriguez-Martinez – N. Roussopoulos 2
Motivation
Data sources are distributed and heterogeneous: a fact of life.
[Figure: a Client and, across the Internet, four data sources: Oracle 8i, Informix, XML Data, Text Data]

Client-Server Connectivity
A 2-tier architecture means FAT clients.
Not a good idea.
[Figure: a Client and, across the Internet, four data sources: Oracle 8i, Informix, XML Data, Text Data]

Middleware Integration Service
Middleware is a 3-tier connectivity solution: thin clients.
[Figure: Client, Integration Server with Catalog, Internet, and four Translators in front of Oracle 8i, Informix, XML Data, Text Data]

Problem 1: Code Deployment
• User-defined types and functions
  – Polygon
  – Composite(): image aggregation
• Porting and manual installation of code
  – Operating system
  – Hardware platform
• Expensive software maintenance
  – Updates
  – Version management
• Security
  – Software certification

Problem 1: Code Deployment
Not scalable: expensive system growth.
[Figure: Client, Integration Server with Catalog, Internet, and four Translators in front of Oracle 8i, Informix, XML Data, Text Data]

Problem 2: Query Processing
• Operator placement options
  – Limited by site-dependent software
    • Composite(): got to have it before using it!
• Most processing at Integration Server
  – Powerful data servers are under-utilized
    • I/O nodes
  – Excessive data movement over the network
    • Network bottleneck
    • Unfeasible in WANs, Internet

Problem 2: Query Processing
Not scalable: inefficient evaluation of queries.
[Figure: Client, Integration Server with Catalog, Internet, and four Translators in front of Oracle 8i, Informix, XML Data, Text Data; three data-volume labels of 100MB each]

MOCHA Solution: Ship Code!
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Figure: Client, QPC with Code Repository and Catalog, Internet, and two DAPs in front of Oracle and Informix; site labels: Maryland, Virginia, Texas]

MOCHA Solution: Filter Data!
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 and t2
Group By location
[Figure: Client, QPC with Code Repository and Catalog, Internet, and two DAPs in front of Oracle and Informix; site labels: Maryland, Virginia, Texas; data labels: 200MB tuples, 100MB tuples, results 200KB, results 150KB, results 350KB]
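The sizes labelled on this slide allow a quick back-of-the-envelope comparison. The sketch below is illustrative only. It assumes the 200MB and 100MB labels are the tuples read at the two data sources, and that the 200KB and 150KB labels are the results each DAP sends back (together, the 350KB label). The function and variable names are hypothetical.

```python
MB = 1024 * 1024
KB = 1024

def bytes_moved(volumes):
    """Total bytes that cross the network for a list of transfers."""
    return sum(volumes)

# Assumption: sizes read off the slide's labels.
source_tuples = [200 * MB, 100 * MB]   # raw tuples at the two data sources
dap_results = [200 * KB, 150 * KB]     # Composite() output from each DAP

# Data shipping: every raw tuple travels to the integration server.
data_shipping = bytes_moved(source_tuples)
# Code shipping: Composite() runs at the DAPs, so only results travel.
code_shipping = bytes_moved(dap_results)

ratio = code_shipping / data_shipping
print(f"data shipping: {data_shipping / MB:.0f} MB")
print(f"code shipping: {code_shipping / KB:.0f} KB")
print(f"fraction of data moved: {ratio:.4%}")
```

Under these assumptions, evaluating Composite() at the DAPs moves roughly a thousandth of the data that shipping the raw tuples would.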

MOCHA Goals
• Automatic deployment of code (self-extensible)
  – QPC ships compiled Java classes
    • User-defined types and functions
  – XML for their metadata (easy exchange)
• Data processing at data source sites
  – Utilize powerful machines
    • On-site data distillation
• Processing based on data movement reduction
  – “Filter” data at the data sources
  – “Expand” data near the clients
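The code-shipping idea can be sketched in a few lines. This is a loose Python analogue, not MOCHA's implementation: MOCHA ships compiled Java classes, whereas this sketch ships function source text and loads it with `exec`. The `QPC` and `DAP` classes, their methods, and the pixel-wise-maximum meaning given to `Composite` are all assumptions made for illustration.

```python
# Hypothetical stand-in for the code repository: operator name -> source text.
CODE_REPOSITORY = {
    "Composite": (
        "def Composite(images):\n"
        "    # Toy aggregate: pixel-wise maximum over equally sized images.\n"
        "    return [max(px) for px in zip(*images)]\n"
    ),
}

class DAP:
    """Runs shipped operators next to the data it fronts."""

    def __init__(self, tuples):
        self.tuples = tuples
        self.loaded = {}  # operators received so far

    def load_code(self, name, source):
        namespace = {}
        # exec on untrusted text is unsafe; a real system needs certification.
        exec(source, namespace)
        self.loaded[name] = namespace[name]

    def run_aggregate(self, name, group_key, arg_key):
        groups = {}
        for row in self.tuples:
            groups.setdefault(row[group_key], []).append(row[arg_key])
        op = self.loaded[name]
        return {key: op(values) for key, values in groups.items()}

class QPC:
    """Ships missing code to a DAP, then asks it to evaluate the operator."""

    def __init__(self, repository):
        self.repository = repository

    def execute(self, dap, name, group_key, arg_key):
        if name not in dap.loaded:  # deploy on demand, no manual install
            dap.load_code(name, self.repository[name])
        return dap.run_aggregate(name, group_key, arg_key)

dap = DAP([
    {"location": "Maryland", "image": [1, 5, 2]},
    {"location": "Maryland", "image": [4, 0, 3]},
    {"location": "Texas", "image": [7, 7, 7]},
])
qpc = QPC(CODE_REPOSITORY)
result = qpc.execute(dap, "Composite", "location", "image")
print(result)
```

Only the per-location composites leave the DAP, and the operator is installed the first time a query needs it.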

The MOCHA Architecture
• Multi-threaded
• Distributed objects
[Figure: two Clients, QPC with Code Repository and Catalog, and two DAPs in front of Informix and Oracle; thread labels: one Coordination Thread, two Execution Threads]

QPC: The Integration Server
QPC controls and coordinates query execution.
[Figure: QPC components: Client API, Query Parser, Catalog Manager, Query Optimizer, Execution Engine, Code Loader, SQL & XML Proc. Interface, DAP Access API; connected to XML Catalog, Code Repository, and DAP]

DAP: The Facilitator of Data
DAP provides QPC with remote access to the data.
[Figure: DAP components: DAP Access API, Control Module, Execution Engine, Code Loader, SQL & XML Proc. Interface, Data Source Access Layer (JDBC, I/O API, DOM, JNI), on top of the Data Source; data labels: 100MB tuples, results 150KB]

Road Map
• Introduction
• Problem Definition
• MOCHA Architecture
• Query Processing
• Experiments
• Summary

Processing the Queries
• Issue 1: Placement and deployment of operators
  – Which operators go to QPC, and which go to the DAPs?
• Issue 2: How to determine this placement?
  – Dynamic programming [SAC+79], [ML86]
  – But search space is enormous
    • Placement of UDF, joins, execution sites …
    • Plenty of “bad” plans
• In MOCHA: query optimization based on heuristics
  – Network usually is the critical factor, so optimize for it first
  – CPU and I/O are cheaper, so optimize for them later
  – Quickly converge to a “good” plan
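A toy model makes the contrast between exhaustive search and the network-first heuristic concrete. This is a sketch under strong assumptions, not MOCHA's optimizer: every cost figure is made up, each operator's cost is treated as independent of the others, and the plan cost is compared as a `(network, cpu)` tuple so that network bytes dominate and CPU only breaks ties.

```python
from itertools import product

# Hypothetical estimates: operator -> site -> (network bytes, CPU seconds).
OPERATORS = {
    "Composite":        {"DAP": (350_000, 40.0),     "QPC": (300_000_000, 25.0)},
    "DoubleResolution": {"DAP": (400_000_000, 30.0), "QPC": (100_000_000, 20.0)},
    "Overlaps":         {"DAP": (1_000_000, 5.0),    "QPC": (50_000_000, 5.0)},
}

def plan_cost(plan):
    """Cost as (network, cpu); tuple comparison ranks network first."""
    net = sum(OPERATORS[op][site][0] for op, site in plan.items())
    cpu = sum(OPERATORS[op][site][1] for op, site in plan.items())
    return (net, cpu)

def exhaustive_search():
    """Enumerate every placement: 2**n plans for n operators."""
    names = list(OPERATORS)
    plans = [dict(zip(names, sites))
             for sites in product(("DAP", "QPC"), repeat=len(names))]
    return min(plans, key=plan_cost), len(plans)

def heuristic_plan():
    """Pick each operator's site on its own: one pass, no enumeration."""
    return {op: min(sites, key=lambda s: sites[s])
            for op, sites in OPERATORS.items()}

best, plans_examined = exhaustive_search()
quick = heuristic_plan()
print(plans_examined, best)
print(quick)
```

With three operators the exhaustive search already examines 8 placements, and the count doubles with every added operator. In this simplified additive model the one-pass heuristic reaches the same plan; with joins and interacting costs that is not guaranteed, which is why the slide speaks of a “good” plan rather than an optimal one.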

Operator Placement
• Data-reducing operators
  – “Filter” the data
  – Aggregates, predicates, projections, semi-joins
    • Composite(), Overlaps(), AvgEnergy()
• Push to the DAPs
  – Code Shipping policy (unique to MOCHA)
  – Only send back distilled results
  + Less data movement
• Cost:
  – Computation cost
  – Transfer of filtered results

Operator Placement
• Data-inflating operators
  – “Expand” the data
  – Projections, image processing, some joins …
    • DoubleResolution(), RotateSolid()
• Pull to the QPC
  – Data Shipping policy [FJK96]
  – Only send back raw arguments
  + Less data movement
• Cost:
  – Computation cost
  – Transfer of raw argument values

Placement Metric: VRF
Volume Reduction Factor: given an operator $\mathit{op}$ and a relation $R$,

$$\mathrm{VRF}(\mathit{op}) = \frac{\mathrm{VDT}}{\mathrm{VDA}}$$

• VDT: volume of data transmitted after applying $\mathit{op}$ to $R$
• VDA: volume of data originally present in $R$
• $\mathit{op}$ is data-reducing iff $\mathrm{VRF}(\mathit{op}) < 1$, e.g. Composite()
• $\mathit{op}$ is data-inflating iff $\mathrm{VRF}(\mathit{op}) \geq 1$, e.g. DoubleRes()
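The VRF definition and the placement rule built on it fit in a few lines. This is a minimal sketch: the function names are hypothetical, the Composite() volumes reuse the 300MB-in, 350KB-out reading of the earlier slide, and the factor of 4 for DoubleRes() assumes that doubling resolution in both dimensions quadruples the pixel count.

```python
def vrf(vdt, vda):
    """Volume Reduction Factor: data transmitted / data originally present."""
    if vda <= 0:
        raise ValueError("VDA must be positive")
    return vdt / vda

def placement(vrf_value):
    """Data-reducing operators go to the DAP, data-inflating ones to the QPC."""
    return "DAP" if vrf_value < 1 else "QPC"

# Assumed volumes in KB, for illustration only.
composite_vrf = vrf(vdt=350, vda=300 * 1024)   # aggregate shrinks the data
double_res_vrf = vrf(vdt=4 * 1024, vda=1024)   # 2x per dimension -> 4x pixels

print(f"Composite(): VRF = {composite_vrf:.5f} -> {placement(composite_vrf)}")
print(f"DoubleRes(): VRF = {double_res_vrf:.1f} -> {placement(double_res_vrf)}")
```

An operator with VRF exactly 1 is treated as data-inflating here, matching the threshold in the definition above.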

Goal: Plans with small CVRF
Cumulative Volume Reduction Factor: given a plan $P$ to solve query $Q$ over relations $R_1, \ldots, R_n$,

$$\mathrm{CVRF}(P) = \frac{\mathrm{CVDT}}{\mathrm{CVDA}}$$

• CVDT: volume of data transmitted by applying all operators in $P$ to $R_1, \ldots, R_n$
• CVDA: volume of data originally present in $R_1, \ldots, R_n$
Search space: the optimizer searches for plans that move a minimal amount of data.
$\mathrm{CVRF}(\mathit{Plan}) \in [0, 1]$
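A worked comparison of two plans shows how CVRF ranks them. The model is a deliberate simplification and an assumption of this sketch: one operator per relation, an operator at the DAP transmits only its output (input size times its VRF), and an operator at the QPC requires the raw input to be transmitted. Relation sizes and VRF values are invented.

```python
def transmitted(size_mb, op_vrf, site):
    """Data crossing the network for one operator over one relation."""
    # At the DAP only the operator's output travels; at the QPC the raw input does.
    return size_mb * op_vrf if site == "DAP" else size_mb

def cvrf(plan, sizes):
    """Cumulative Volume Reduction Factor: CVDT / CVDA for a whole plan."""
    cvdt = sum(transmitted(sizes[rel], op_vrf, site) for rel, op_vrf, site in plan)
    cvda = sum(sizes.values())
    return cvdt / cvda

# Assumed relation sizes in MB.
sizes = {"R1": 200.0, "R2": 100.0}

# Each entry: (relation, VRF of the operator applied to it, execution site).
vrf_guided = [("R1", 0.001, "DAP"),   # data-reducing operator pushed to the DAP
              ("R2", 4.0, "QPC")]     # data-inflating operator pulled to the QPC
all_at_qpc = [("R1", 0.001, "QPC"),
              ("R2", 4.0, "QPC")]

guided_score = cvrf(vrf_guided, sizes)
baseline_score = cvrf(all_at_qpc, sizes)
print(f"VRF-guided plan: CVRF = {guided_score:.3f}")
print(f"all at QPC:      CVRF = {baseline_score:.3f}")
```

Evaluating everything at the QPC ships all 300MB (CVRF of 1), while the VRF-guided plan ships about a third of that, so an optimizer minimizing CVRF would prefer it.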

Performance Evaluation
• Goals of this study:
  – Measure how good code shipping can be
  – Validate heuristics being proposed
    • VRF
    • CVRF
  – Guide implementation of the optimizer
• Configured MOCHA with plans that place operators based on heuristics.

Experimental Environment
• Sequoia 2000 Benchmark
  – Scientific data: points, polygons, satellite images
  – Distributed applications
• Software and hardware:
  – JDK 1.2
  – QPC: Sun Ultra 60, Solaris 2.6
  – DAPs: Sun Ultra 1, Sun Ultra 5, Solaris 2.6
  – Data sources
    • 2 Informix IUS 9.12 servers
  – 10 Mbps Ethernet

Reducing vs. Inflating
[Figure: bar chart of running time (secs, axis 0 to 1800) by query class (Q1, Q2, Q3), with QPC and DAP bars for each class, broken down into DB, CPU, and NET time]
• Query classes
  – Composite of all images
  – Clipping and sub-setting
  – Double resolution of images
• Performance gains
  – Composites
    • 99% data reduction
    • 4-to-1 better performance
  – Clipping and expansion
    • 80% data reduction
    • 3-to-1 better performance
• Validates heuristics

VRF vs Selectivity
• Select graph identifiers based on number of vertices and arc length
• Selectivity [HS93] and cardinality [HKWY97] are not enough for distributed predicate placement
• Need to also consider size of arguments for predicates!
• Consider 50% selectivity
  – DAP CVRF = 0.01
  – QPC CVRF = 1
[Figure: bar chart of running time (secs, axis 0 to 800) by selectivity (0, .25, .50, .75, 1), with QPC and DAP bars for each value, broken down into DB, CPU, and NET time]
VRF is a better metric
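A small calculation shows why selectivity alone misleads here. The sketch assumes each tuple holds a small identifier plus a large graph argument, that the query returns only identifiers, and that evaluating the predicate at the QPC requires shipping every full tuple. The byte sizes are hypothetical and were chosen so that the 50% case reproduces the slide's 0.01 figure.

```python
ID_BYTES = 8        # assumed size of a graph identifier
GRAPH_BYTES = 392   # assumed size of the graph argument the predicate reads

def cvrf_at_dap(n_tuples, selectivity):
    """Predicate runs at the DAP: only qualifying identifiers are sent."""
    sent = selectivity * n_tuples * ID_BYTES
    present = n_tuples * (ID_BYTES + GRAPH_BYTES)
    return sent / present

def cvrf_at_qpc(n_tuples):
    """Predicate runs at the QPC: every full tuple must be sent first."""
    sent = n_tuples * (ID_BYTES + GRAPH_BYTES)
    present = n_tuples * (ID_BYTES + GRAPH_BYTES)
    return sent / present

for selectivity in (0.0, 0.25, 0.50, 0.75, 1.0):
    dap = cvrf_at_dap(1000, selectivity)
    qpc = cvrf_at_qpc(1000)
    print(f"selectivity {selectivity:.2f}: DAP CVRF = {dap:.3f}, QPC CVRF = {qpc:.0f}")
```

Even at selectivity 1, where the predicate filters nothing, placement at the DAP still moves only 2% of the data in this model, because the large arguments never leave the data source. Selectivity and cardinality cannot express that difference; a volume-based metric can.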

Implementation Status
• Operational system
  – SIGMOD 2000 demo
• Experimental deployment of MOCHA
  – NASA Earth Scientists (ESIP Federation)
  – Goddard Space Flight Center
  – NCSA
[Figure: Land Cover Visualization Tool]

Summary and Conclusions
• Proposed a new middleware architecture: MOCHA
  – Automatic code deployment (self-extensible)
    • Shipping Java classes
  – Query processing based on data movement reduction
• Proposed VRF metric for placement of functions
  – Better than selectivity and result cardinality
• Future work
  – Deployment of MOCHA for NASA ESIP Federation
  – Full implementation of MOCHA optimizer
• More info:
  – http://mocha.umiacs.umd.edu/

Problem 2: Query Processing
Not scalable: inefficient evaluation of queries.
[Figure: Client, Integration Server with Catalog, Internet, and four Translators in front of Oracle 8i, Informix, XML Data, Text Data; data-volume labels: three of 100MB and three of 200MB]