Data Integration for Big Data
Post on 25-Feb-2016
Data Integration for Big Data
Pierre Skowronski, Prague, 23.04.2013
Informatica Corporation Confidential – Do Not Distribute
IT is struggling with the cost of Big Data
• Growing data volume is quickly consuming capacity
• Need to onboard, store, & process new types of data
• High expense and lack of big data skills
Delivery: Innovate Faster With Big Data
(onboard, discover, operationalize)
Risk: Minimize Risk of New Technologies
(design once, deploy anywhere)
Cost: Lower Big Data Project Costs
(helps self-fund big data projects)
Prove the Value with Big Data: Deliver Value Along the Way
INTRODUCING THE INFORMATICA POWERCENTER BIG DATA EDITION
PowerCenter Big Data Edition: Lower Costs
Transactions, OLTP, OLAP
Social Media, Web Logs
Machine/Device, Scientific
Documents and Emails
EDW
ODS
MDM
Traditional Grid
Optimize processing with low-cost commodity hardware
Increase productivity up to 5X
Hadoop Complements Existing Infrastructure on Low-Cost Commodity Hardware
5x better productivity for similar performance
Project domain | Cluster size | Processing | Compared to expert hand-coding
Finance | 3 | Cleanse, transform, sort, group | 40% faster than Pig
(not shown) | (not shown) | Extract, process, load | 50% faster than Pig
Finance | 10 | Extract, process, load | 20% slower than Pig
At worst only 20% slower than hand-coding; mostly equal or faster.
Informatica: 1 week vs. hand-coding: 5-6 weeks
Traditional Grid
PowerCenter Big Data Edition: Minimize Risk
Deploy On-Premise or in the Cloud
Pushdown to RDBMS or DW Appliance
Quickly staff projects with trained data integration experts
Design once and deploy anywhere
Graphical Processing Logic: Test on Native, Deploy on Hadoop
Mapping steps:
• Separate partial records from completed records (partial records only vs. completed records only)
• Separate incomplete from complete partial records; select the incomplete partial records
• Aggregate all completed and partially completed records
• Sort records by calling number
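The mapping steps on this slide can be sketched in plain Python. This is only an illustration of the flow, not Informatica's implementation; the field names (`calling_number`, `status`, `duration`) and the sample data are assumptions.

```python
# Illustrative sketch of the slide's mapping: split partial from completed
# records, aggregate both streams, and sort by calling number.
# Field names and values are hypothetical, chosen for demonstration only.

from collections import defaultdict

records = [
    {"calling_number": "555-0101", "status": "complete", "duration": 120},
    {"calling_number": "555-0101", "status": "partial", "duration": 30},
    {"calling_number": "555-0199", "status": "complete", "duration": 45},
]

# Router step: separate partial records from completed records
partial = [r for r in records if r["status"] == "partial"]
completed = [r for r in records if r["status"] == "complete"]

# Aggregator step: total duration per calling number, over both streams
totals = defaultdict(int)
for r in partial + completed:
    totals[r["calling_number"]] += r["duration"]

# Sorter step: sort records by calling number
result = sorted(totals.items())
print(result)  # [('555-0101', 150), ('555-0199', 45)]
```

In the graphical tool, each of these steps is a transformation in the mapping; the point of the slide is that the same logic can run natively or be pushed down to Hadoop without being rewritten.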
Run It Simply on Hadoop
• Choose the execution environment
• View the Hive query
• Press Run
Minimize Risk with Informatica Partners and Certified Developer Community
Global Systems Integrators | Informatica Developers
• 45,000+ developers in Informatica TechNet
• 3x more developers than any other vendor*
[Chart: trained developers by vendor: Ab Initio, Business Objects, IBM, Informatica]
9,000+ trained developers
* Source: U.S. resume search on dice.com, December 2008
[Diagram: Achieving Operational Efficiency with Informatica: People (expertise & best practices), Technology, and best practices & reusability]
WHAT ARE CUSTOMERS DOING WITH INFORMATICA AND BIG DATA?
Large Global Financial Institution: Lower Costs of Big Data Projects
Saved $20M, plus $2-3M ongoing, through archiving & optimization

The Challenge: The data warehouse was exploding with over 200TB of data, and user activity generated up to 5 million queries a day, impacting query performance.

The Solution: ERP, CRM, and custom sources feed the EDW and business reports; archived data and interaction data are offloaded (Phase 2).

The Result:
• Saved 100TB of space over the past 2½ years
• Reduced a rearchitecture project from 6 months to 2 weeks
• Improved performance by 25%
• Return on investment in less than 6 months
Large Global Financial Institution: Lower Costs of Big Data Projects

The Challenge: Increasing demand for faster data-driven decision making and analytics as data volumes and processing loads rapidly increase.

The Solution: Web logs and RDBMS sources flow through a traditional grid, in near real time, into data marts and the data warehouse (Phase 2).

The Result:
• Cost-effectively scale performance
• Lower hardware costs
• Increased agility by standardizing on one data integration platform
Large Government Agency: Flexible Architecture to Support Rapidly Changing Business Needs

The Challenge: Data volumes growing 3-5x over the next 2-3 years.

The Solution: Data virtualization over mainframe, RDBMS, and unstructured data sources, feeding multiple data warehouses, the EDW, and business reports through a traditional grid (Phase 2).

The Result:
• Manage data integration and load of 10+ billion records from multiple disparate data sources
• Flexible data integration architecture to support changing business requirements in a heterogeneous data management environment
Why PowerCenter Big Data Edition
• Repeatability
  • Predictable, repeatable deployments and methodology
• Reuse of existing assets
  • Apply existing integration logic to load data to/from Hadoop
  • Reuse existing data quality rules to validate Hadoop data
• Reuse of existing skills
  • Enable ETL developers to leverage the power of Hadoop
• Governance
  • Enforce and validate data security, data quality, and regulatory policies
• Manageability
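The "reuse existing data quality rules" point can be shown with a minimal sketch: a validation rule written once as a plain function applies unchanged whether records arrive from a relational extract or from files landed in Hadoop. The rule, field names, and sample rows below are hypothetical, for illustration only.

```python
import re

# Hypothetical reusable data quality rule: a record is valid when it has a
# well-formed calling number and a non-negative duration.
PHONE_RE = re.compile(r"^\d{3}-\d{4}$")

def is_valid(record):
    """Apply the same rule regardless of where the record was loaded from."""
    return (bool(PHONE_RE.match(record.get("calling_number", "")))
            and record.get("duration", -1) >= 0)

# One rule, two sources: a warehouse extract and Hadoop-landed data.
warehouse_rows = [{"calling_number": "555-0101", "duration": 120}]
hadoop_rows = [{"calling_number": "bad", "duration": 10}]

print([is_valid(r) for r in warehouse_rows + hadoop_rows])  # [True, False]
```

This is the essence of "design once, deploy anywhere": the rule's definition is independent of the execution environment, so governance policies are enforced consistently across both.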