Transcript
New York City
9th June, 2016
Logical Data Warehouse, Data Lakes, and Data Services Marketplaces
Agenda
1. Introductions
2. Logical Data Warehouse and Data Lakes
3. Coffee Break
4. Data Services Marketplaces
5. Q&A
HEADQUARTERS
Palo Alto, CA.
DENODO OFFICES, CUSTOMERS, PARTNERS
Global presence throughout North America, EMEA, APAC, and Latin America.
CUSTOMERS
250+ customers, including many F500 and G2000 companies across every major industry, have gained significant business agility and ROI.
LEADERSHIP
Longest continuous focus on data
virtualization and data services.
Product leadership.
Solutions expertise.
THE LEADER IN DATA VIRTUALIZATION
Denodo provides agile, high performance data integration and data abstraction across the broadest range of enterprise, cloud, big data and unstructured data sources, and real-time data services at half the cost of traditional approaches.
Speakers
Paul Moxon
Senior Director of Product Management, Denodo
Pablo Álvarez
Principal Technical Account Manager, Denodo
Rubén Fernández
Technical Account Manager, Denodo
Logical Data Warehouse and Data Lakes
New York City
June 2016
Agenda
1. The Logical Data Warehouse
2. Different Types, Different Needs
3. Performance in a LDW
4. Customer Success Stories
5. Q&A
What is a Logical Data Warehouse?
A logical data warehouse is a data system that follows the ideas of a traditional EDW (star or snowflake schemas) and includes, in addition to one (or more) core DWs, data from external sources.
The main motivations are improved decision making and/or cost reduction.
Logical Data Warehouse
Gartner Definition (Gartner Hype Cycle for Enterprise Information Management, 2012):
“The Logical Data Warehouse (LDW) is a new data management architecture for analytics combining the strengths of traditional repository warehouses with alternative data management and access strategy. The LDW will form a new best practice by the end of 2015.”
“The LDW is an evolution and augmentation of DW practices, not a replacement.”
“A repository-only style DW contains a single ontology/taxonomy, whereas in the LDW a semantic layer can contain many combinations of use cases, many business definitions of the same information.”
“The LDW permits an IT organization to make a large number of datasets available for analysis via query tools and applications.”
Logical Data Warehouse
Description:
• A semantic layer on top of the data warehouse that keeps the business data definition.
• Allows the integration of multiple data sources including enterprise systems, the data warehouse, additional processing nodes (analytical appliances, Big Data, …), Web, Cloud and unstructured data.
• Publishes data to multiple applications and reporting tools.
Three Integration/Semantic Layer Alternatives
Gartner’s View of Data Integration
1. Application/BI Tool as Data Integration/Semantic Layer
2. EDW as Data Integration/Semantic Layer
3. Data Virtualization as Data Integration/Semantic Layer
[Diagram: each alternative shown as a layer sitting over EDW and ODS sources]
Application/BI Tool as the Data Integration Layer
• Integration is delegated to end user tools and applications
  • e.g. BI Tools with ‘data blending’
• Results in duplication of effort – integration defined many times in different tools
• Impact of change in data schema?
• End user tools are not intended to be integration middleware
  • Not their primary purpose or expertise
EDW as the Data Integration Layer
• Access to ‘other’ data (query federation) via EDW
  • Teradata QueryGrid, IBM FluidQuery, SAP Smart Data Access, etc.
• Often coupled with traditional ETL replication of data into EDW
• EDW as the ‘center of the data universe’
  • Provides data integration and semantic layer
• Appears attractive to organizations heavily invested in EDW
• More than one EDW? EDW costs?
Data Virtualization as the Data Integration Layer
• Move data integration and semantic layer to an independent Data Virtualization platform
• Purpose built for supporting data access across multiple heterogeneous data sources
• Separate layer provides semantic models for underlying data
  • Physical to logical mapping
• Enforces common and consistent security and governance policies
• Gartner’s recommended approach
Logical Data Warehouse
[Diagram: a data virtualization layer spanning an EDW, a Hadoop cluster (sales HDFS files), document collections, a NoSQL database, an ERP system, a database, and Excel]
Logical Data Warehouse
Reference Architecture by Denodo
The State and Future of Data Integration. Gartner, 25 May 2016:
“Physical data movement architectures that aren’t designed to support the dynamic nature of business change, volatile requirements and massive data volume are increasingly being replaced by data virtualization.”
“Evolving approaches (such as the use of LDW architectures) include implementations beyond repository-centric techniques.”
What about the Logical Data Lake?
A Data Lake will not have a star or snowflake schema, but rather a more heterogeneous collection of views with raw data from heterogeneous sources.
The virtual layer acts as a common umbrella under which these different sources are presented to the end user as a single system.
However, from the virtualization perspective, a Virtual Data Lake shares many technical aspects with a LDW, and most of this content also applies to a Logical Data Lake.
Different Types, Different Needs
Common Patterns for a Logical Data Warehouse
Common Patterns for a Logical Data Warehouse
1. The Virtual Data Mart
2. DW + MDM
   Data Warehouse extended with master data
3. DW + Cloud
   Data Warehouse extended with cloud data
4. DW + DW
   Integration of multiple Data Warehouses
5. DW historical offloading
   DW horizontal partitioning with historical data in cheaper storage
6. Slim DW extension
   DW vertical partitioning with rarely used data in cheaper storage
Virtual Data Marts
Business-friendly models defined on top of one or multiple systems, often “flavored” for a particular division.
Motivation:
• Hide the complexity of star schemas for business users
• Simplify the model for a particular vertical
• Reuse semantic models and security across multiple reporting engines
Typical queries:
• Simple projections, filters and aggregations on top of curated “fat tables” that merge data from facts and many dimensions
[Diagram: simplified semantic models for business users — “Sales” and “Product” views defined over the EDW star schema (time dimension, sales fact, product and retailer dimensions) and other sources holding product details]
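The “fat table” idea above can be sketched with a plain SQL view. The following is a minimal illustration using Python’s sqlite3 (the schema, names, and data are invented for the example; in a real LDW the view would live in the virtual layer over live sources):

```python
import sqlite3

# Toy star schema standing in for the EDW.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales    (product_fk INTEGER, retailer_fk INTEGER, amount REAL);

INSERT INTO product VALUES (1, 'Widget', 'ACME'), (2, 'Gadget', 'Globex');
INSERT INTO retailer VALUES (1, 'North'), (2, 'South');
INSERT INTO sales VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0);

-- The "fat table": facts pre-joined with the dimensions business users need.
CREATE VIEW sales_mart AS
SELECT p.name AS product, p.brand AS brand, r.name AS retailer, s.amount AS amount
FROM sales s
JOIN product p  ON s.product_fk  = p.id
JOIN retailer r ON s.retailer_fk = r.id;
""")

# Business users run simple projections, filters and aggregations on the
# mart, never touching the underlying star schema.
rows = conn.execute(
    "SELECT retailer, SUM(amount) FROM sales_mart "
    "WHERE brand = 'ACME' GROUP BY retailer ORDER BY retailer"
).fetchall()
print(rows)  # [('North', 10.0), ('South', 20.0)]
```

The same query shape (project, filter, aggregate over one wide view) is what the reporting tools above would send to the virtual data mart.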
DW + MDM
Slim dimensions, with extended information maintained in an external MDM system.
Motivation:
• Keep a single copy of golden records in the MDM that can be reused across systems and managed in a single place
Typical queries:
• Join a large fact table (DW) with several MDM dimensions, with aggregations on top
Example:
• Revenue by customer, projecting the address from the MDM
DW + MDM dimensions
[Diagram: time dimension and sales fact table in the EDW, joined with product and retailer dimensions from the MDM]
DW + Cloud dimensional data
Fresh data from cloud systems (e.g. SFDC) is mixed with the EDW, usually on the dimensions. The DW is sometimes also in the cloud.
Motivation:
• Take advantage of “fresh” data coming straight from SaaS systems
• Avoid local replication of cloud systems
Typical queries:
• Dimensions are joined with cloud data to filter based on some external attribute not available (or not current) in the EDW
Example:
• Report on current revenue for accounts where the potential for an expansion is higher than 80%
DW + Cloud dimensional data
[Diagram: time and product dimensions and the sales fact in the EDW; the customer dimension is joined with customer data from the SFDC CRM]
Multiple DW integration
Use of multiple DWs as if they were only one.
Motivation:
• Mergers and acquisitions
• Different DWs by department
• Transition to new EDW deployments (migration to Spark, Redshift, etc.)
Typical queries:
• Joins across fact tables in different DWs, with aggregations before or after the JOIN
Example:
• Get customers with purchases higher than 100 USD that do not have a fidelity card (purchases and fidelity card data in different DWs)
Multiple DW integration
[Diagram: Finance EDW (time dimension, sales fact, product dimension, region, city, store) joined with Marketing EDW (customer, fidelity facts, product dimension)]
*Real examples: Nationwide POC, IBM tests
DW Historical Partitioning (horizontal partitioning)
Only the most current data (e.g. the last year) is in the EDW. Historical data is offloaded to a Hadoop cluster.
Motivations:
• Reduce storage cost
• Transparently use the two datasets as if they were all together
Typical queries:
• Facts are defined as a partitioned UNION based on date
• Queries join the “virtual fact” with dimensions and aggregate on top
Example:
• Queries on current data only need to go to the DW, but longer timespans need to merge with Hadoop
DW Historical offloading (horizontal partitioning)
[Diagram: the virtual sales fact is a UNION of current sales in the EDW and historical sales in cheaper storage, joined with the time, product and retailer dimensions]
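The partitioned-UNION idea, and the partition pruning that makes it cheap, can be sketched in a few lines of Python (the boundary date, row data, and function are invented for illustration; the real analysis is done by the virtual layer at query time):

```python
from datetime import date

# Toy horizontal partitioning: "current" sales live in the EDW partition,
# older sales in a cheap-storage partition (e.g. Hadoop).
BOUNDARY = date(2016, 1, 1)
edw_partition = [(date(2016, 3, 1), 100.0), (date(2016, 4, 1), 50.0)]
hadoop_partition = [(date(2014, 6, 1), 30.0), (date(2015, 2, 1), 20.0)]

def query_sales(since):
    """Virtual fact = UNION of the partitions, but prune any partition
    whose date range cannot match the filter (pre-execution analysis)."""
    partitions = [edw_partition]
    if since < BOUNDARY:            # only touch cheap storage when needed
        partitions.append(hadoop_partition)
    return sum(amt for part in partitions for d, amt in part if d >= since)

print(query_sales(date(2016, 1, 1)))  # 150.0 -- Hadoop partition pruned
print(query_sales(date(2015, 1, 1)))  # 170.0 -- both partitions queried
```

A query on current data never leaves the EDW; only longer timespans fan out to both systems, exactly as the example above describes.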
Slim DW extension (vertical partitioning)
Minimal DW, with more complete raw data in a Hadoop cluster.
Motivation:
• Reduce cost
• Transparently use the two datasets as if they were all together
Typical queries:
• Tables are defined virtually as 1-to-1 joins between the two systems
• Queries join the facts with dimensions and aggregate on top
Example:
• Common queries only need to go to the DW, but some queries need attributes or measures from Hadoop
Slim DW extension (vertical partitioning)
[Diagram: the virtual sales fact is a 1-to-1 join of slim sales in the EDW and extended sales in cheaper storage, joined with the time, product and retailer dimensions]
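The 1-to-1 join, and the branch pruning that avoids it when the extended columns are not requested, can be sketched like this (stores, column names, and the `project` helper are all invented for illustration):

```python
# Toy vertical partitioning: the slim DW keeps the hot columns, the
# extended store keeps the rarely used ones; rows share the same key.
slim_dw = {1: {"amount": 100.0}, 2: {"amount": 50.0}}
extended = {1: {"coupon_code": "X1"}, 2: {"coupon_code": None}}

def project(columns):
    """Virtual table = 1-to-1 join of both stores, but the join branch to
    the extended store is pruned when no extended column is requested."""
    need_extended = any(c in next(iter(extended.values())) for c in columns)
    rows = []
    for key, base in slim_dw.items():
        row = dict(base)
        if need_extended:           # only hit cheap storage when needed
            row.update(extended[key])
        rows.append({c: row[c] for c in columns})
    return rows

print(project(["amount"]))                 # served from the slim DW alone
print(project(["amount", "coupon_code"]))  # needs the 1-to-1 join
```

Common queries are answered from the slim DW alone; only queries that actually reference an extended attribute pay for the second system.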
Performance in a LDW
It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL:
1. There is a large amount of data moved through the network for each query
2. Network transfer is slow
But is this really true?
Debunking the myths of virtual performance
1. Complex queries can be solved transferring moderate data volumes when the right techniques are applied
   • Operational queries: predicate delegation produces small result sets
   • Logical Data Warehouse and Big Data: Denodo uses characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BYs) and minimize data movement
2. Current networks are almost as fast as reading from disk
   • 10 Gb and 100 Gb Ethernet are a commodity
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the following scenario: a federated approach in Denodo (Sales Facts, 290 M rows; Customer Dim., 2 M rows; Items Dim., 400 K rows, spread across systems) is compared against an MPP system where all the data has been replicated via ETL.
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
Performance and optimizations in Denodo
Focused on 3 core concepts:
1. Dynamic Multi-Source Query Execution Plans
   • Leverages processing power & architecture of data sources
   • Dynamic to support ad hoc queries
   • Uses statistics for cost-based query plans
2. Selective Materialization
   • Intelligent caching of only the most relevant and often used information
3. Optimized Resource Management
   • Smart allocation of resources to handle high concurrency
   • Throttling to control and mitigate source impact
   • Resource plans based on rules
Performance and optimizations in Denodo
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
• Uses relational logic
• Metadata is equivalent to that of a database
• Enables ad hoc querying
Key difference between ETL engines and DV:
• ETL engines are optimized for static bulk movements (fixed data flows)
• Data virtualization is optimized for queries (a dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
Query Optimizer
How Dynamic Query Optimizer Works: Step by Step
1. Metadata Query Tree
   • Maps query entities (tables, fields) to actual metadata
   • Retrieves execution capabilities and restrictions for the views involved in the query
2. Static Optimizer
   • Query delegation
   • SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
   • Data movement query plans
3. Cost Based Optimizer
   • Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
4. Physical Execution Plan
   • Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
How Dynamic Query Optimizer Works
Example: total sales by retailer and product during the last month for the brand ACME, over the DW + MDM schema (time dimension and sales fact in the EDW; product and retailer dimensions in the MDM):

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
Example: non-optimized plan
Without optimization, each source is queried separately and the DV layer performs all the JOINs (producing about 10,000,000 rows) and the GROUP BY (product.name, retailer.name) itself, transferring about 1,000,000,000 rows from the sales source:

SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount FROM sales  (1,000,000,000 rows)
SELECT retailer.name, retailer.id FROM retailer  (100 rows)
SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'  (10 rows)
SELECT time.date, time.id FROM time WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)  (30 rows)
Step 1: Apply JOIN reordering to maximize delegation
The JOIN between sales and time is delegated to the EDW, so the date filter is applied at the source and only about 100,000,000 rows are transferred; the remaining JOINs in the DV layer produce about 10,000,000 rows that feed the GROUP BY (product.name, retailer.name):

SELECT sales.retailer_fk, sales.product_fk, sales.amount FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)  (100,000,000 rows)
SELECT retailer.name, retailer.id FROM retailer  (100 rows)
SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'  (10 rows)
Step 2: Partial aggregation push-down
Since the JOINs are on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, the optimizer applies the partial aggregation push-down optimization: the aggregation by the foreign keys is delegated to the EDW, so only about 10,000 pre-aggregated rows are transferred, and the final GROUP BY (product.name, retailer.name) in the DV layer produces about 1,000 rows:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(CURRENT_TIMESTAMP, -1) GROUP BY sales.retailer_fk, sales.product_fk  (10,000 rows)
SELECT retailer.name, retailer.id FROM retailer  (100 rows)
SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'  (10 rows)
Step 3: Select the right JOIN strategy based on cost and data volume estimations
The cost-based optimizer picks a NESTED JOIN with the product dimension (injecting the matching keys into the delegated query as the additional condition product.id IN (1, 2, …)) and a HASH JOIN with the retailer dimension:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(CURRENT_TIMESTAMP, -1) GROUP BY sales.retailer_fk, sales.product_fk — plus the nested JOIN condition product.id IN (1, 2, …)  (1,000 rows)
SELECT retailer.name, retailer.id FROM retailer  (100 rows)
SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'  (10 rows)
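The key claim of the partial aggregation push-down in Step 2 is that the rewritten plan returns exactly the same result as the naive one. That equivalence can be checked with a tiny self-contained sketch (Python + SQLite purely for illustration; the tables and row counts are miniature stand-ins for the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_fk INTEGER, retailer_fk INTEGER, amount REAL);
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO retailer VALUES (1, 'North'), (2, 'South');
INSERT INTO sales VALUES (1,1,10),(1,1,15),(1,2,20),(2,1,5),(2,2,40),(2,2,1);
""")

# Naive plan: ship every fact row, join, then aggregate by dimension names.
naive = conn.execute("""
    SELECT p.name, r.name, SUM(s.amount)
    FROM sales s JOIN product p ON s.product_fk = p.id
                 JOIN retailer r ON s.retailer_fk = r.id
    GROUP BY p.name, r.name ORDER BY 1, 2
""").fetchall()

# Rewritten plan: aggregate by the FKs inside the source first (the part
# that would be delegated to the DW), then join the tiny result with the
# dimensions and re-aggregate by the requested names.
pushed = conn.execute("""
    SELECT p.name, r.name, SUM(agg.total)
    FROM (SELECT product_fk, retailer_fk, SUM(amount) AS total
          FROM sales GROUP BY product_fk, retailer_fk) agg
    JOIN product p ON agg.product_fk = p.id
    JOIN retailer r ON agg.retailer_fk = r.id
    GROUP BY p.name, r.name ORDER BY 1, 2
""").fetchall()

assert naive == pushed
print(pushed)
```

Because the JOINs are on keys, pre-aggregating by the foreign keys loses nothing: the outer GROUP BY over the already-reduced rows yields the same totals, with far fewer rows crossing the network.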
How Dynamic Query Optimizer Works: Summary
1. Automatic JOIN reordering
   • Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer
   • End users don’t need to worry about the optimal “pairing” of the tables
2. The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (by the foreign keys) down to the DW
   • Leverages the processing power of the DW, optimized for these aggregations
   • Significantly reduces the data transferred through the network (from 1 billion rows to 10,000)
3. The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc.
   • Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into consideration the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.)
Other relevant optimization techniques for LDW and Big Data
Automatic data movement
• Creation of temp tables in one of the systems to enable complete delegation
• Only considered as an option if the target source has the “data movement” option enabled
• Uses native bulk load APIs for better performance
Execution alternatives
• If a view exists in more than one system, Denodo can decide at execution time which one to use
• The goal is to maximize query delegation depending on the other tables involved in the query
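The automatic data movement idea can be sketched with two separate SQLite databases standing in for two source systems (all names and data are invented; Denodo would use the target source’s native bulk load API rather than row-by-row inserts):

```python
import sqlite3

# Two independent "systems": one holds the big fact, one the small dimension.
fact_db = sqlite3.connect(":memory:")
dim_db = sqlite3.connect(":memory:")

fact_db.execute("CREATE TABLE sales (product_fk INTEGER, amount REAL)")
fact_db.executemany("INSERT INTO sales VALUES (?, ?)",
                    [(1, 10.0), (1, 5.0), (2, 7.0)])
dim_db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
dim_db.executemany("INSERT INTO product VALUES (?, ?)",
                   [(1, 'Widget'), (2, 'Gadget')])

# Data movement: bulk-copy the small dimension into a temp table on the
# fact system, so the whole JOIN + aggregation can be delegated there.
fact_db.execute("CREATE TEMP TABLE product_tmp (id INTEGER, name TEXT)")
fact_db.executemany("INSERT INTO product_tmp VALUES (?, ?)",
                    dim_db.execute("SELECT id, name FROM product"))

# The entire query now runs inside one source; only the final small
# result returns to the virtual layer.
rows = fact_db.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sales s JOIN product_tmp p ON s.product_fk = p.id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('Gadget', 7.0), ('Widget', 15.0)]
```

Moving the small side of the join is what makes complete delegation possible when the two tables start out in different systems.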
Optimizations for virtual partitioning
• Eliminates unnecessary queries and processing based on a pre-execution analysis of the views and the queries
• Pruning of unnecessary JOIN branches: relevant for horizontal partitioning and “fat” semantic models when queries do not need attributes from all the tables
• Pruning of unnecessary UNION branches: enables detection of unnecessary UNION branches in vertical partitioning scenarios
• Push-down of JOINs under UNION views: enables the delegation of JOINs with dimensions
• Automatic data movement for partition scenarios: enables the delegation of JOINs with dimensions
Caching
Real time vs. caching
Sometimes, real-time access and federation are not a good fit:
• Sources are slow (e.g. text files, cloud apps like Salesforce.com)
• A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
• Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
Caching Overview
Denodo’s cache system is based on an external relational database:
• Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
• MPP (Teradata, Netezza, Vertica, Redshift, etc.)
• In-memory storage (Oracle TimesTen, SAP HANA)
It works at the view level and allows hybrid access (real-time / cached) within an execution tree.
Cache control (population / maintenance):
• Manually – user initiated at any time
• Time based – using the TTL or the Denodo Scheduler
• Event based – e.g. using JMS messages triggered in the DB
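The manual and time-based (TTL) control modes above can be sketched as a minimal cache wrapper in Python (the class, the TTL value, and the source function are all invented for illustration and are not Denodo’s API):

```python
import time

class CachedView:
    """Minimal sketch of TTL-based cache population with manual invalidation."""
    def __init__(self, fetch_fn, ttl_seconds):
        self.fetch_fn = fetch_fn      # the slow source behind the view
        self.ttl = ttl_seconds
        self._value = None
        self._loaded_at = None

    def get(self):
        now = time.monotonic()
        if self._loaded_at is None or now - self._loaded_at > self.ttl:
            self._value = self.fetch_fn()  # repopulate from the slow source
            self._loaded_at = now
        return self._value

    def invalidate(self):
        """Manual control: force repopulation on the next access."""
        self._loaded_at = None

calls = 0
def slow_source():
    global calls
    calls += 1
    return [("Widget", 25.0)]

view = CachedView(slow_source, ttl_seconds=60)
view.get(); view.get()
print(calls)  # 1 -- the second read is served from the cache
view.invalidate()
view.get()
print(calls)  # 2 -- manual invalidation forced a refresh
```

An event-based mode would simply call `invalidate()` (or repopulate) from a message handler instead of relying on the clock.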
References
Further Reading
Data Virtualization Blog (http://www.datavirtualizationblog.com)
Check the following articles written by our CTO Alberto Pan on our blog:
• Myths in data virtualization performance
• Performance of Data Virtualization in Logical Data Warehouse scenarios
• Physical vs Logical Data Warehouse: the numbers
• Cost Based Optimization in Data Virtualization
Denodo Cookbook
• Data Warehouse Offloading
Success Stories
Customer Case Studies
Autodesk Overview
• Founded 1982 (NASDAQ: ADSK)
• Annual revenues (FY 2015): $2.5B
• Over 8,800 employees
• 3D modeling and animation software; flagship product is AutoCAD
• Market sectors:
  • Architecture, Engineering, and Construction
  • Manufacturing
  • Media and Entertainment
  • Recently started 3D printing offerings
Business Drivers for Change
• The software consumption model is changing
  • Perpetual licenses to subscriptions
  • Users want more flexibility in how they use software
• Autodesk needed to transition to subscription pricing
  • 2016 – some products will be subscription only
• Lifetime revenue is higher with subscriptions
  • Over 3-5 years, subscriptions = more revenue
• Changing a licensing model is disruptive
Technology Challenges
• The current ‘traditional’ BI/EDW architecture was not designed for data streams from online apps
  • Weblogs, clickstreams, cloud/desktop apps, etc.
• Existing infrastructure can’t simply ‘go away’
  • Regulatory reporting (e.g. SEC)
  • Existing ‘perpetual’ customers
• The ‘subscription’ infrastructure must work in parallel
  • Extend and enhance existing systems
  • With a single access point to all data
• Solution – a ‘Logical Data Warehouse’
Logical Data Warehouse at Autodesk
[Architecture diagrams: the overall logical data warehouse; traditional BI/reporting; ‘new data’ ingestion; reporting on combined data]
Case Study: Autodesk Successfully Changes Their Revenue Model and Transforms Business
Autodesk, Inc. is an American multinational software corporation that makes software for the architecture, engineering, construction, manufacturing, media, and entertainment industries.
Problem:
• Autodesk was changing its business revenue model from a conventional perpetual license model to a subscription-based license model.
• Inability to deliver high quality data in a timely manner to business stakeholders.
• Evolution from a traditional operational data warehouse to a contemporary logical data warehouse was deemed necessary for faster speed.
Solution:
• A general purpose platform to deliver data through a logical data warehouse.
• The Denodo abstraction layer helps live invoicing with SAP.
Results:
• Data virtualization enabled a culture of “see before you build”.
• Successfully transitioned to subscription-based licensing.
• For the first time, Autodesk can enforce security at a single point and has a uniform data environment for access.
Q&A
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.