
I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Jul 12, 2015

Page 1: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Scalable DWH Architecture

Rohit Chatter Architect at Yahoo!

[email protected]

Web footprint:

http://searchbusinessintelligence.techtarget.in/expert/Rohit-Chatter

http://rentmylens.blogspot.com

http://www.slideshare.net/tdwiindia/4-rohit-yahoo-tdwi

Page 2: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Agenda

What Fits Where?: Judgemental
Big Data: Information
Biz Case: Experience
Architectural Principles: Theory
Hour Glass model: As I see it
End to End view: As I see it
Live use cases: Experience
What if it didn’t work as expected?: When does it?
Comparative Study: IMO
Questions: Rag me

Page 3: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

What Fits Where?

Traditional RDBMS (Oracle, MySQL, SQL Server) | 10 – 20 GB | Small corporates [Mostly transactional]
Appliance based DBs (Teradata, Exadata, Netezza) | < 100 TB | Medium corporates [OLTP & Batch]
DaaS (Amazon AWS, SalesForce) | < 1 PB | Medium to Large corporates [Batch]
Proprietary (InHouse built DB) | < Few PB | Large corporates [Batch]
Cloud based DB (Hadoop, Cassandra) | >= PB | Large corporates [Batch]
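
The size bands on this slide can be read as a simple lookup from data volume to platform category. The sketch below is only an illustration and rests on two assumptions: that the platforms, size bands and customer types pair up in the order the slide lists them, and that "Few PB" means roughly 5 PB (the slide gives no number). The names `Tier`, `TIERS` and `suggest_tier` are hypothetical.

```python
from dataclasses import dataclass

GB_PER_TB = 1000
GB_PER_PB = 1000 * GB_PER_TB

# "Few PB" is not quantified on the slide; 5 PB is an assumed stand-in.
ASSUMED_FEW_PB = 5 * GB_PER_PB


@dataclass(frozen=True)
class Tier:
    platform: str
    examples: str
    upper_bound_gb: float  # exclusive upper bound of the size band, in GB
    customers: str


# Rows in the order the slide lists them (the pairing itself is assumed).
TIERS = [
    Tier("Traditional RDBMS", "Oracle, MySQL, SQL Server",
         20, "Small corporates [Mostly transactional]"),
    Tier("Appliance based DBs", "Teradata, Exadata, Netezza",
         100 * GB_PER_TB, "Medium corporates [OLTP & Batch]"),
    Tier("DaaS", "Amazon AWS, SalesForce",
         1 * GB_PER_PB, "Medium to Large corporates [Batch]"),
    Tier("Proprietary", "InHouse built DB",
         ASSUMED_FEW_PB, "Large corporates [Batch]"),
    Tier("Cloud based DB", "Hadoop, Cassandra",
         float("inf"), "Large corporates [Batch]"),
]


def suggest_tier(volume_gb: float) -> Tier:
    """Return the first tier whose size band is still above volume_gb."""
    for tier in TIERS:
        if volume_gb < tier.upper_bound_gb:
            return tier
    return TIERS[-1]  # reached only when volume_gb is not finite


if __name__ == "__main__":
    for volume in (15, 50 * GB_PER_TB, 2 * GB_PER_PB):
        tier = suggest_tier(volume)
        print(f"{volume} GB: {tier.platform} ({tier.examples})")
```

Where bands overlap (the slide lists both "< Few PB" and ">= PB"), this sketch picks the smaller platform first; volume is also only one of the selection criteria the deck names.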

Page 4: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Big Data – Three Vs

Volume – Needs no explanation

Velocity – collecting, processing and using data

Variety – Video, Images, Text, Audio

The combination of the above presents the challenge of a scalable DW

Page 5: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Biz Case – Search Marketing

[Figure: reporting needs for search marketing. The labels that pair each reporting frequency with a report set did not survive in the transcript.]

Reporting frequencies:
Daily, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly

Report sets:
Performance, Credit Summary
Performance, Budget Headroom, AM performance, competitive analysis
Performance, Feature Adoption
Competitive analysis, cross sell, upsell, performance

Page 6: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Architectural Principles

Conformed dimensions: e.g. Customer, Account, Campaign

Data Model – Grain: Term/Keyword

Answers known business questions: Tactical & Operational

Allows exploration of the unknown: Strategy & Mining

Design for future growth with incremental development

Identify the right platform based on access pattern, load [data & report concurrency], BI tool and throughput requirement

Retention & Archival policy

SLAs – ETL & Report

DQ framework with data lineage built in [in-process & out-of-process]
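
One way to picture a DQ framework with lineage built in is an ETL step that validates rows as it processes them and writes down what it read, produced and rejected. The sketch below is a minimal illustration, not the framework described in the talk. It assumes "in-process" means checks run inside the ETL step and "out-of-process" means a later reconciliation against an independent count; `LineageRecord`, `run_step`, `reconcile` and the row fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Dimension keys named on the slide, plus the stated grain (keyword).
REQUIRED_KEYS = ("customer", "account", "campaign", "keyword")


@dataclass(frozen=True)
class LineageRecord:
    step: str
    sources: tuple
    rows_in: int
    rows_out: int
    rows_rejected: int


def is_valid(row: dict) -> bool:
    """In-process check: every dimension key must be present and non-empty."""
    return all(row.get(key) for key in REQUIRED_KEYS)


def run_step(step: str, sources: Iterable[str], rows: Iterable[dict],
             transform: Callable[[dict], dict]):
    """Transform the valid rows and record lineage for the step."""
    rows = list(rows)
    good = [row for row in rows if is_valid(row)]
    out = [transform(row) for row in good]
    record = LineageRecord(step, tuple(sources), len(rows), len(out),
                           len(rows) - len(good))
    return out, record


def reconcile(record: LineageRecord, independent_count: int) -> bool:
    """Out-of-process check: compare the recorded output with a separate count."""
    return record.rows_out == independent_count


if __name__ == "__main__":
    rows = [
        {"customer": "c1", "account": "a1", "campaign": "k1", "keyword": "shoes", "clicks": 3},
        {"customer": "c1", "account": None, "campaign": "k1", "keyword": "boots", "clicks": 1},
    ]
    out, record = run_step("load_keyword_fact", ["event_log"], rows,
                           lambda row: {**row, "clicks": int(row["clicks"])})
    print(record, reconcile(record, independent_count=1))
```

Keeping the lineage record separate from the data lets the same counts feed both SLA monitoring and later audits.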

Page 7: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Hour Glass model

Business Performance monitoring: RDBMS, Facts

Level 1 & 2 analysis: Home Grown App, Granular aggregates

What-if analysis and deep dive data analysis: Home Grown App, Most granular data (event level model)

Tactical & Operational reporting

Improvement & Alignment

Excellence & Strategic

Page 8: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

End to End

[Diagram: components of the end-to-end architecture, as recovered from the slide]

BI Apps: Drill Down - Detail level
BI Apps Foundation (Workflow based)
Input Connector/Adaptors
Common Data Structure
Transformer & Aggregates
Output Connector/Adaptors
Enterprise Warehouse
Data Store
Data [Text, Audio, Video, etc.]
ETL Framework
Vertical Mart Aware
Data Access Layer
Semantic Layer
SOA (Web Services, Data APIs)
Ext Data
Event Pipeline
Vertical Federated Mart(s)
Vertical 1 (Search Insights)
Vertical 2 (Display Insights)
Vertical 3
Vertical 4
Vertical 5
Enterprise Wide Reporting (Search & Display & others)

Page 9: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Use Case - Avoid recording data that is not analyzed at all

Data recorded but never used for any analysis

Record and analyze data that business needs
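
A quick way to spot data that is recorded but never analyzed is to compare the fields being captured with the fields that reports actually reference. The sketch below assumes both lists are available as plain field names; the function name and all field and report names are made up for illustration.

```python
def unused_fields(recorded_fields, report_fields):
    """Return recorded fields that no report references, sorted by name."""
    referenced = set()
    for fields in report_fields.values():
        referenced.update(fields)
    return sorted(set(recorded_fields) - referenced)


if __name__ == "__main__":
    recorded = ["clicks", "impressions", "user_agent", "referrer"]
    reports = {
        "performance": ["clicks", "impressions"],
        "competitive_analysis": ["impressions"],
    }
    print(unused_fields(recorded, reports))
```

Fields that show up in the result are candidates to review with the business before dropping them from collection.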

Page 10: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Use case – Get more juice out

Data from a few properties

ETL – daily data volume close to 15GB

Grain – Link

Database – Oracle 10g Enterprise

Reports approx 400 and growing [scheduled and online]

Users – approx 50

Page 11: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Use case – Problem at hand

Page 12: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Use case - Facts

Metric | 1 Year Back | 6 Months Back | Now
Input Data Size | 15GB | 25GB | 50GB
Elapsed Time * | 35 hrs | 16 hrs | 4 hrs
No. of Properties | 14 | 15 | 21
CPU Utilization | 50% | 80% | 30%
Mart Size | 5TB (Uncompressed) | 7TB (Uncompressed) | 8.5TB (Compressed), 16TB (Uncompressed)
Scheduled Report Execution Time | 6 hrs | 8 hrs | < 1 hr

* Explanation of elapsed time > 24 hours for 1 day data processing in Appendix
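
The input size and elapsed time figures above can be turned into an ETL throughput for each point in time, which makes the improvement easier to compare. The sketch below only does arithmetic on the numbers from the slide; treating "input size divided by elapsed time" as the throughput measure is an assumption of this example, not a metric the deck defines.

```python
# (input data size in GB, elapsed ETL time in hours), taken from the slide
SNAPSHOTS = {
    "1 Year Back": (15, 35),
    "6 Months Back": (25, 16),
    "Now": (50, 4),
}


def throughput_gb_per_hr(size_gb: float, elapsed_hrs: float) -> float:
    """Daily input size divided by the time taken to process it."""
    return size_gb / elapsed_hrs


def fits_daily_window(elapsed_hrs: float, window_hrs: float = 24) -> bool:
    """One day of data must be processed in under a day to avoid a backlog."""
    return elapsed_hrs < window_hrs


RATES = {label: throughput_gb_per_hr(*values) for label, values in SNAPSHOTS.items()}
SPEEDUP = RATES["Now"] / RATES["1 Year Back"]

if __name__ == "__main__":
    for label, (size_gb, hrs) in SNAPSHOTS.items():
        print(f"{label}: {RATES[label]:.2f} GB/hr, fits in a day: {fits_daily_window(hrs)}")
    print(f"Throughput gain, a year back to now: {SPEEDUP:.1f}x")
```

On these figures throughput goes from about 0.43 GB/hr to 12.5 GB/hr, roughly a 29x gain, while the daily input grew from 15GB to 50GB.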

Page 13: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

What all has been done?
▶ Proper networking – VLANs & Switches
▶ Multiple volumes on filer for different data
▶ Backup optimization – SnapMirror backup of only what is required
▶ Leverage Oracle features:
  • Use of external tables
  • Use of materialized views & Query Rewrite for commonly joined huge dimension tables (loads new dimension during ETL)
  • Proper indexes on materialized views
  • Worked closely with MSTR folks to educate them about the physical schema and, therefore, used proper joins to take advantage of partitioned dimensions
  • Use of other features like parallel loading, exchange partition, instance parameter tuning, proper block size, daily tablespace, optimized archival
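
Two of the Oracle features listed above, exchange partition and materialized views with query rewrite, come down to short DDL statements. The sketch below builds the statement text in Python without executing anything. The table, partition and view names are hypothetical, the one-partition-per-day naming is an assumption, and the refresh options shown are just one valid combination, not necessarily what was used in this use case.

```python
from datetime import date


def daily_partition_name(day: date) -> str:
    """Assumed naming scheme: one partition per day, e.g. P20110305."""
    return f"P{day:%Y%m%d}"


def exchange_partition_sql(fact_table: str, partition: str, staging_table: str) -> str:
    """Swap a fully loaded staging table into the fact table's partition.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    return (
        f"ALTER TABLE {fact_table} EXCHANGE PARTITION {partition} "
        f"WITH TABLE {staging_table} INCLUDING INDEXES WITHOUT VALIDATION"
    )


def materialized_view_sql(view_name: str, select_sql: str) -> str:
    """Materialized view that the optimizer may use through query rewrite."""
    return (
        f"CREATE MATERIALIZED VIEW {view_name} "
        "BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND "
        f"ENABLE QUERY REWRITE AS {select_sql}"
    )


if __name__ == "__main__":
    partition = daily_partition_name(date(2011, 3, 5))
    print(exchange_partition_sql("search_fact", partition, "search_fact_stage"))
    print(materialized_view_sql(
        "mv_campaign_dim",
        "SELECT c.campaign_id, c.campaign_name, a.account_name "
        "FROM campaign_dim c JOIN account_dim a ON a.account_id = c.account_id",
    ))
```

The appeal of exchange partition for a daily load is that the data is loaded and indexed in the staging table first, and the swap itself is a dictionary operation rather than a bulk insert into the live fact table.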

Page 14: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

What if it didn’t work as expected?

•  What are the symptoms?
   a.  Customers/Users complaining
   b.  Operations/SLA issues

•  What is the outcome of diagnosis?
   a.  Architectural issue
   b.  Design issue
   c.  Growth/HW limitation issue

•  What are feasible and cost effective solutions?

•  Outcome

Page 15: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Comparative Study - IMO

Criteria | DaaS | RDBMS | Appliance Based | InHouse Proprietary | Grid/Cloud based
Maturity | Low | High | High | Low | Low
Cost | Low | Medium | High | High | Medium
SQL Access | No | Yes | SQL Like | No | No
Large Joins | No | Slow | Faster | Fast | Fastest
Complex data types | No | No | No | Yes | Yes
Type | Transactional & Light DW | Transactional & Medium DW | Medium DW | Large DW | Large DW
Skill Availability | High | High to Medium | Medium | Low | Low
TCO | Low | Medium | High | High | Low

Page 16: I4E Scalable DWH Architecture, Rohit Chatter, Yahoo

Questions?