Scalable DWH Architecture
Rohit Chatter Architect at Yahoo!
Web footprint:
http://searchbusinessintelligence.techtarget.in/expert/Rohit-Chatter
http://rentmylens.blogspot.com
http://www.slideshare.net/tdwiindia/4-rohit-yahoo-tdwi
Agenda
What Fits Where? [Judgemental]
Big Data [Information]
Biz Case [Experience]
Architectural Principles [Theory]
Hour Glass model [As I see]
End to End view [As I see]
Live use cases [Experience]
What if it didn’t work as expected? [When does it?]
Comparative Study [IMO]
Questions [Rag me]
What Fits Where?
Traditional RDBMS Oracle, MySQL, SQL Server
Appliance based DBs Teradata, Exadata, Netezza
DaaS Amazon AWS, SalesForce
Proprietary InHouse built DB
Cloud based DB Hadoop, Cassandra
10 – 20 GB
< 100TB
< 1 PB
< Few PB
>= PB
Small corporates [Mostly transactional]
Medium corporates [OLTP & Batch]
Medium to Large corporates [Batch]
Large corporates [Batch]
Large corporates [Batch]
Big Data – Three Vs
Volume – Needs no explanation
Velocity – collecting, processing and using data
Variety – Video, Images, Text, Audio
The combination of the above presents the challenge of a scalable DW
Biz Case – Search Marketing
Daily, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly
Daily, Weekly, Monthly & Yearly
Daily, Hourly, Weekly, Monthly & Yearly
Performance, Credit Summary
Performance, Budget Headroom, AM performance, competitive analysis
Performance, Feature Adoption
Competitive analysis, cross sell, upsell, performance
Architectural Principles
Conformed Dimensions: e.g. Customer, Account, Campaign
Data Model – Grain: Term/Keyword
Answers known business questions: Tactical & Operational
Allows exploration of the unknown: Strategy & Mining
Design for future growth with incremental development
Identify the right platform
Based on access pattern, load [Data & report concurrency], BI tool, throughput requirement
Retention & Archival policy
SLAs – ETL & Report
DQ framework with data lineage built in [in-process & out-of-process]
Business Performance monitoring
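One way to read the DQ principle is that every load step records where rows came from and how many were kept, and a separate job later compares that record with the target. The sketch below assumes this reading. `LineageRecord`, `load_with_lineage`, `out_of_process_check`, and the source and target names are all hypothetical and are not taken from the deck.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class LineageRecord:
    """Hypothetical lineage entry written once per load step."""
    source: str
    target: str
    rows_in: int
    rows_out: int
    rows_rejected: int


def load_with_lineage(
    source: str,
    target: str,
    rows: List[Row],
    is_valid: Callable[[Row], bool],
) -> Tuple[List[Row], LineageRecord]:
    """In-process check: validate rows while loading and record the counts."""
    good = [r for r in rows if is_valid(r)]
    record = LineageRecord(
        source=source,
        target=target,
        rows_in=len(rows),
        rows_out=len(good),
        rows_rejected=len(rows) - len(good),
    )
    return good, record


def out_of_process_check(record: LineageRecord, target_row_count: int) -> bool:
    """Out-of-process check: a separate job compares the recorded count
    with an independent count taken from the target table."""
    return record.rows_out == target_row_count


# Hypothetical rows at the Term/Keyword grain.
sample_rows: List[Row] = [
    {"keyword": "shoes", "clicks": 3},
    {"keyword": "", "clicks": 1},       # rejected: empty keyword
    {"keyword": "boots", "clicks": 2},
]


def keyword_row_is_valid(row: Row) -> bool:
    return bool(row["keyword"]) and int(row["clicks"]) >= 0


loaded, lineage = load_with_lineage(
    "event_feed", "keyword_fact", sample_rows, keyword_row_is_valid
)
```

Keeping the counts next to the source and target names lets a later audit trace a suspicious report figure back to the load step that produced it.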
Hour Glass model
RDBMS: Facts [Tactical & Operational reporting]
Home Grown App: Granular aggregates, Level 1 & 2 analysis [Improvement & Alignment]
Home Grown App: Most granular data (event level model), what-if analysis and deep dive data analysis [Excellence & Strategic]
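The layering of the hour glass (event-level data at the base, granular aggregates in the middle, facts at the top) can be sketched as two successive roll-ups. This is only an illustration: the field names, the sample events, and the choice of (day, campaign, keyword) and (day, campaign) as grouping keys are assumptions, not the deck's actual data model.

```python
from collections import defaultdict

# Hypothetical event-level records: the most granular layer.
events = [
    {"day": "d1", "campaign": "c1", "keyword": "shoes", "clicks": 3, "cost": 1.5},
    {"day": "d1", "campaign": "c1", "keyword": "shoes", "clicks": 2, "cost": 0.5},
    {"day": "d1", "campaign": "c1", "keyword": "boots", "clicks": 1, "cost": 0.25},
    {"day": "d1", "campaign": "c2", "keyword": "hats", "clicks": 4, "cost": 2.0},
]


def granular_aggregates(event_rows):
    """Roll events up to one row per (day, campaign, keyword)."""
    totals = defaultdict(lambda: {"clicks": 0, "cost": 0.0})
    for e in event_rows:
        key = (e["day"], e["campaign"], e["keyword"])
        totals[key]["clicks"] += e["clicks"]
        totals[key]["cost"] += e["cost"]
    return dict(totals)


def facts(aggregates):
    """Roll granular aggregates up to one row per (day, campaign)."""
    totals = defaultdict(lambda: {"clicks": 0, "cost": 0.0})
    for (day, campaign, _keyword), measures in aggregates.items():
        totals[(day, campaign)]["clicks"] += measures["clicks"]
        totals[(day, campaign)]["cost"] += measures["cost"]
    return dict(totals)


keyword_level = granular_aggregates(events)   # middle of the hour glass
campaign_level = facts(keyword_level)         # top: reporting facts
```

Each layer is derived from the one below it, so the narrow reporting layer stays small while deep-dive questions can still go back to the event level.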
End to End
[Architecture diagram; recoverable labels follow]
BI Apps Drill Down ‐ Detail level
BI Apps Foundation (Workflow based)
Input Connector/Adaptors
Common Data Structure
Transformer & Aggregates
Output Connector/Adaptors
Enterprise Warehouse
Data Store
Data [Text, Audio, Video, Etc]
ETL Framework
Vertical Mart Aware
Data Access Layer
Semantic Layer
SOA (Web Services, Data APIs)
Ext Data
Event Pipeline
Vertical Federated Mart(s)
Vertical 1 (Search Insights)
Vertical 2 (Display Insights)
Vertical 3
Vertical 4
Vertical 5
Enterprise Wide Reporting (Search & Display & others)
Use Case - Avoid recording data that is not analyzed at all
Data recorded but never used for any analysis
Record and analyze data that business needs
Use case – Get more juice out
Data from a few properties
ETL – daily data volume close to 15GB
Grain – Link
Database – Oracle 10g Enterprise
Reports approx 400 and growing [scheduled and online]
Users – approx 50
Use case – Problem at hand
Use case - Facts
Metric                            1 Year Back          6 Months Back        Now
Input Data Size                   15GB                 25GB                 50GB
Elapsed Time *                    35 hrs               16 hrs               4 hrs
No. of Properties                 14                   15                   21
CPU Utilization                   50%                  80%                  30%
Mart Size                         5TB (Uncompressed)   7TB (Uncompressed)   8.5TB (Compressed), 16TB (Uncompressed)
Scheduled Report Execution Time   6 hrs                8 hrs                < 1 hr
* Explanation of elapsed time > 24 hours for 1 day data processing in Appendix
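To make the improvement concrete, the input size and elapsed time figures above can be turned into an effective ETL throughput. The short sketch below uses only those two rows of the facts. The function names are illustrative, and the 24-hour threshold simply reflects that one day of data arrives every day.

```python
# (input GB per day, elapsed hours), copied from the facts above.
runs = {
    "1 Year Back": (15, 35),
    "6 Months Back": (25, 16),
    "Now": (50, 4),
}


def throughput_gb_per_hour(input_gb, elapsed_hours):
    """Effective ETL throughput for one day's worth of input."""
    return input_gb / elapsed_hours


def falls_behind(elapsed_hours, window_hours=24):
    """A daily load that takes longer than a day can never catch up."""
    return elapsed_hours > window_hours


for label, (gb, hours) in runs.items():
    rate = throughput_gb_per_hour(gb, hours)
    status = "falls behind" if falls_behind(hours) else "keeps up"
    print(f"{label}: {rate:.2f} GB/hr, {status} with daily loads")

speedup = throughput_gb_per_hour(50, 4) / throughput_gb_per_hour(15, 35)
```

With these figures the throughput works out to roughly 0.43, 1.56, and 12.5 GB/hr, about a 29-fold improvement from the first column to the last, while the number of properties grew from 14 to 21.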
What has been done?
▶ Proper networking – VLANs & Switches
▶ Multiple volumes on filer for different data
▶ Backup optimization – SnapMirror backup of only what is required
▶ Leverage Oracle features:
  Use of external tables
  Use of Materialized views & Query Rewrite for commonly joined huge dimension tables (loads new dimension during ETL)
  Proper indexes on Materialized views
  Worked closely with MSTR folks to educate them about the physical schema, so that proper joins are used to take advantage of partitioned dimensions
  Use of other features such as parallel loading, exchange partition, instance parameter tuning, proper block size, daily tablespaces, optimized archival
What if it didn’t work as expected?
• What are the symptoms?
  a. Customers/Users complaining
  b. Operations/SLA issues
• What is the outcome of diagnosis?
  a. Architectural issue
  b. Design issue
  c. Growth/HW limitation issue
• What are feasible and cost effective solutions?
• Outcome
Comparative Study - IMO
                     DaaS                       RDBMS                       Appliance Based   InHouse Proprietary   Grid/Cloud based
Mature               Low                        High                        High              Low                   Low
Cost                 Low                        Medium                      High              High                  Medium
SQL Access           No                         Yes                         SQL Like          No                    No
Large Joins          No                         Slow                        Faster            Fast                  Fastest
Complex data types   No                         No                          No                Yes                   Yes
Type                 Transactional & Light DW   Transactional & Medium DW   Medium DW         Large DW              Large DW
Skill Availability   High                       High to Medium              Medium            Low                   Low
TCO                  Low                        Medium                      High              High                  Low
Questions?