Denodo Data Virtualization (DV)
Single View of Truth
“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”
Source: SPA (Strategic Planning Assumption) Gartner published 2017 predictions research.
What is Data Virtualization
• A method of data integration that does not physically move data or create new copies of it
• A method of data integration that isolates users from the format, location, technologies, and protocols used for storing and accessing data
• Real-time access to any data type: structured, unstructured, or semi-structured
• Many-to-one approach: virtually join any number of data sources and source types into a single view
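The "many-to-one, no copies" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Denodo's implementation: an in-memory SQLite database stands in for a relational CRM, a CSV string stands in for a flat-file source, and the names `crm`, `ORDERS_CSV`, and `customer_orders_view` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational CRM (in-memory SQLite as a stand-in).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Hypothetical source 2: order data kept as CSV text.
ORDERS_CSV = "customer_id,amount\n1,100.0\n1,50.0\n2,75.5\n"


def customer_orders_view():
    """Join both sources on demand; nothing is copied into a new store."""
    # Each call reads the sources as they are at that moment.
    names = dict(crm.execute("SELECT id, name FROM customers"))
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield {
            "customer": names[int(row["customer_id"])],
            "amount": float(row["amount"]),
        }


rows = list(customer_orders_view())
```

Because the view is evaluated at query time, a later change in the CRM table shows up the next time `customer_orders_view()` is called, with no reload step.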
Problem: IT Architecture is Unmanageable
Data sources:
• Log files (.txt/.log files)
• CRM (MySQL)
• Billing System (Web Service - REST)
• Big Data, Cloud (Hadoop, Web)
• Inventory System (MS SQL Server)
• Product Catalog (Web Service - SOAP)
• Customer Voice (Internet, Unstructured)
• Product Data (CSV)
ETL
Traditional Issues: high data growth, IT complexity, data silos, high latency
New Trends: real time, big data, unstructured data, external data, move to cloud
Solution: Virtual Data Layer
Data Virtualization
Data sources:
• Log files (.txt/.log files)
• CRM (MySQL)
• Billing System (Web Service - REST)
• Big Data, Cloud (Hadoop, Web)
• Inventory System (MS SQL Server)
• Product Catalog (Web Service - SOAP)
• Customer Voice (Internet, Unstructured)
• Product Data (CSV)
ETL
Traditional Issues: high data growth, IT complexity, data silos, high latency
New Trends: real time, big data, unstructured data, external data, move to cloud
Denodo DV: Connectivity to Any Data Type
• Relational DBs: Oracle, DB2, Sybase, MS SQL Server, MySQL, PostgreSQL, Informix, MS Access…
• Cloud, SaaS: Salesforce, Google, Amazon, LinkedIn, Facebook, Twitter via APIs; any website, form, or web-based app…
• Enterprise Service Bus: JMS message queues, WebSphere MQ, Sonic, ActiveMQ…
• Custom Connector SDK: access any application via API and procedural interfaces
• Semi-Structured Data: websites, forms, applications, PDF, MS Word, MS Excel
• Unstructured Data: websites, file systems, email servers, databases, knowledge bases, indexes (Lucene, MS FAST, HP Autonomy…), RSS feeds…
Denodo Platform Architecture
Business Solutions: Access Information-as-a-Service
Denodo Platform: Right Information at the Right Time
• Publish: Real-time (Right-time) Data Services
• Combine: Transform, Improve Quality, Integrate
• Connect: Normalized Views of Disparate Data
• Library of Wrappers, Web Automation, Any Data or Content, Read and Write
Data Virtualization: Design Tools, Optimizer, Cache, Scheduler, Monitoring, Governance, Metadata, Security
Disparate Data: Any Source, Any Format
Common Data Virtualization Use Cases
BIG DATA, CLOUD INTEGRATION
• Advanced Analytics
• Data Warehouse Offloading
• Big Data for Enterprise
• Cloud / SaaS Integration
AGILE BUSINESS INTELLIGENCE
• Logical Data Warehouse
• Virtual Data Marts
• Self-Service BI
• Operational BI / Analytics
SINGLE VIEW APPLICATIONS
• Single Customer View - Call Centers, Portals
• Single Product View - Catalogs
• Single Inventory View - Inventory Reconciliation
• Vertical Specific - Single View of Wells
DATA SERVICES
• Unified Data Services Layer
• Logical Data Abstraction
• Agile Application Development
• Linked Data Services
Diagram axes: Analytical / Operational; Business Use Cases / IT Use Cases
What are analysts saying?
“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”
Source: Gartner Research, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure Modernization.
“Through 2020, 35% of enterprises will implement some form of data virtualization as one enterprise production option for data integration.”
Source: Gartner Research, Market Guide for Data Virtualization, 2016.
Performance & Security
Performance: Core Concepts
Architecture designed for both informational and operational scenarios
Focused on 3 core concepts:
1. Dynamic Multi-Source Query Execution Plans
  • Leverages processing power & architecture of data sources
  • Dynamic to support ad hoc queries
2. Selective Materialization
  • Intelligent caching of only the most relevant and often used information
3. Optimized Resource Management
  • Smart allocation of resources to handle high concurrency
  • Throttling to control and mitigate source impact
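As an illustration of selective materialization (concept 2), the sketch below caches a result only after a query has been requested often enough. It is a toy model, not how the Denodo cache works: the `SelectiveCache` class, the `min_hits` threshold, and the `slow_source` function are all assumptions made for the example.

```python
from collections import Counter


class SelectiveCache:
    """Cache a query result only once the query has proven to be frequently used."""

    def __init__(self, fetch, min_hits=2):
        self.fetch = fetch        # callable that runs the query at the source
        self.min_hits = min_hits  # hypothetical threshold for "often used"
        self.hits = Counter()     # how many times each query went to the source
        self.store = {}           # materialized results

    def query(self, key):
        if key in self.store:
            return self.store[key]  # served from cache, source untouched
        self.hits[key] += 1
        result = self.fetch(key)
        if self.hits[key] >= self.min_hits:
            self.store[key] = result  # materialize only popular queries
        return result


source_calls = []


def slow_source(key):
    """Stand-in for an expensive query against a remote source."""
    source_calls.append(key)
    return f"result for {key}"


cache = SelectiveCache(slow_source, min_hits=2)
for key in ["sales_by_region", "sales_by_region", "sales_by_region", "one_off_report"]:
    cache.query(key)
```

In this run the repeated query is materialized on its second request and served from the cache on its third, while the one-off report is never stored.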
Performance: Optimization Techniques
Ad hoc querying requires an architecture that generates efficient plans at execution time.
Denodo borrows many techniques from traditional RDBMSs, such as:
• Cost-based execution plans, based on statistics, indexes, transfer rates, etc.
• Redundant filter detection, unnecessary JOIN pruning, etc.
But since data is stored in multiple heterogeneous sources, DV has to apply other techniques to minimize network traffic and processing in the virtual layer:
• Maximize query push-down (process at the source)
  • Query rewriting to maximize delegation to sources
  • Data transformations push-up to maximize delegation
  • On-the-fly data movement (shipping)
• Abstract source capabilities
  • Emulate in the virtual layer the operations that cannot be pushed down (e.g., a GROUP BY on a flat file)
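The push-down versus emulation choice can be sketched as follows. This is a minimal example under assumptions, not Denodo's optimizer: the `SqlSource` and `FlatFileSource` classes and the `supports_group_by` flag are hypothetical stand-ins for "abstract source capabilities", with in-memory SQLite playing the SQL-capable source and a CSV string playing the flat file.

```python
import csv
import io
import sqlite3
from collections import defaultdict


class SqlSource:
    """A source that can execute GROUP BY itself (in-memory SQLite as a stand-in)."""

    supports_group_by = True

    def __init__(self, rows):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        self.db.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    def sum_by_region(self):
        # Pushed down: the source aggregates and returns one row per group.
        query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
        return dict(self.db.execute(query))


class FlatFileSource:
    """A source that can only be scanned row by row (CSV text as a stand-in)."""

    supports_group_by = False

    def __init__(self, text):
        self.text = text

    def scan(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield row["region"], float(row["amount"])


def sum_by_region(source):
    """Delegate the GROUP BY when the source supports it, otherwise emulate it."""
    if source.supports_group_by:
        return source.sum_by_region()
    totals = defaultdict(float)  # emulation in the virtual layer
    for region, amount in source.scan():
        totals[region] += amount
    return dict(totals)


sql_result = sum_by_region(SqlSource([("east", 10.0), ("west", 5.0), ("east", 2.5)]))
file_result = sum_by_region(FlatFileSource("region,amount\neast,10.0\nwest,5.0\neast,2.5\n"))
```

Both calls return the same totals; the difference is where the work happens. With the SQL source only the grouped rows cross the network, while the flat file has to be read in full by the virtual layer.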
Proven Performance in IBM Labs
• Queries to a single source: Denodo only adds 3-5% overhead
• Join across multiple sources: Denodo optimization engine faster than in-house solution
Source: Denodo testing in IBM labs – TPCDS Benchmark and DataShip Performance Tests
When "Data Lakes" Become "Data Swamps"
Uncontrolled dumping of data in Hadoop leads to poor performance.
Two approaches compared:
• Denodo DV query across Impala and Exadata: MDM data in Exadata (Oracle), large data sets in Hadoop (Impala)
• ETL all data into Impala and run the full query there: MDM and large data sets in Hadoop (Impala)
Big data queries run faster using DV because:
• DV automatically collects statistics and source capabilities, then
• Rewrites optimized queries and pushes processing down to the sources
• Thus, heavy processing is performed in the systems designed to do so:
  • Impala (Hadoop) performs heavy aggregations on top of very large data sets
  • Oracle Exadata is faster than Impala at processing dimensional queries
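A rough sketch of this division of labor is shown below, assuming two in-memory SQLite databases as stand-ins: `facts` for the large-data system (Impala in the slide) and `dims` for the dimensional system (Exadata in the slide). Table names, columns, and the `total_sales_by_customer` helper are hypothetical, and the real optimizer chooses such plans from collected statistics rather than from a hard-coded rule.

```python
import sqlite3

# Stand-in for the big-data source holding the large fact table.
facts = sqlite3.connect(":memory:")
facts.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
facts.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0), (3, 7.5)],
)

# Stand-in for the source holding the master (dimension) data.
dims = sqlite3.connect(":memory:")
dims.execute("CREATE TABLE customers (id INTEGER, name TEXT, segment TEXT)")
dims.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "retail"), (2, "Globex", "retail"), (3, "Initech", "wholesale")],
)


def total_sales_by_customer(segment):
    """Push the aggregation and the filter to their sources, then join small results."""
    # Heavy aggregation runs where the large data set lives.
    totals = dict(
        facts.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id")
    )
    # The dimensional filter runs where the master data lives.
    customers = dims.execute(
        "SELECT id, name FROM customers WHERE segment = ? ORDER BY id", (segment,)
    )
    # Only the reduced result sets are combined in the virtual layer.
    return [(name, totals.get(cid, 0.0)) for cid, name in customers]


result = total_sales_by_customer("retail")
```

Neither source ships its raw rows: the fact source returns one row per customer and the dimension source returns only the matching segment, so the join in the middle stays small.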
Big Data Queries Faster Using DV
Performance comparison of 5 different queries:

Query     Impala Hadoop-only runtime (s)   Denodo runtime (s)   Denodo runtime w/ cache (s)
Query 1   199                              120                  68
Query 2   187                              96                   88
Query 3   120                              212                  115
Query 4   timeout                          328                  69
Query 5   46                               91                   56

Data volumes:
• Queries 1, 2, 3, 5: Exadata row count ~5M; Impala row count ~500k
• Query 4: Exadata row count ~5M; Impala row count ~2M

• DV delivers better performance and saves replicating data into Hadoop
• DV leverages data source architectures for what they are good at
Performance: Benchmarks, Logical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmark TPC-DS* in the following scenario:
• Compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL
• Customer Dim.: 2 M rows
• Sales Facts: 290 M rows
• Items Dim.: 400 K rows
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.