Informatica & Big Data, Sanjeev Kumar, VP & MD, Informatica India, Apache Hadoop India Summit 2011

Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Sanjeev Kumar

Jan 27, 2015

Transcript
Page 1: Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Sanjeev Kumar

Informatica & Big Data

Sanjeev Kumar, VP & MD, Informatica India

Apache Hadoop India Summit 2011

Page 2

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

Page 3

Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.

Why “Big Data” Now? Exploding Data Volumes

Relational

Complex, Unstructured

• 2,500 exabytes of new information in 2012, with the Internet as the primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
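
The growth figures on this slide can be sanity-checked with a little arithmetic. The sketch below is not from the talk. It assumes decimal (SI) units, where 1 zettabyte = 1,000 exabytes = 1,000,000 petabytes, and works out the size of the digital universe before the 62% growth and the growth rate implied by going from 800K petabytes to 1.2 zettabytes. All variable names are illustrative.

```python
# Assumption: decimal (SI) storage units, expressed in petabytes.
PB_PER_EB = 1_000
PB_PER_ZB = 1_000_000

last_year_pb = 800_000      # "800K petabytes" from the slide
growth_last_year = 0.62     # "grew by 62% last year"
this_year_zb = 1.2          # "1.2 zettabytes this year"

# Size of the digital universe before last year's 62% growth.
previous_pb = last_year_pb / (1 + growth_last_year)

# Growth rate implied by moving from 800K PB to 1.2 ZB.
this_year_pb = this_year_zb * PB_PER_ZB
implied_growth = this_year_pb / last_year_pb - 1

print(f"Before the 62% growth: about {previous_pb / PB_PER_EB:.0f} EB")
print(f"Implied growth this year: {implied_growth:.0%}")
```

Under these assumptions, the slide's numbers imply a starting point of roughly 494 exabytes and a projected growth of about 50% for the current year.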

Page 4

Why Now? Exploding Data Volumes

• Explosion in user-generated content
  • e.g. blogs, Twitter, Facebook, etc.

• Proliferation of web-connected devices
  • Smartphone interactions with the web

• Increased consumption of digital content
  • Netflix, Hulu, Pandora, etc.

• Internet of things
  • Smart-grid and smart-meters
  • Machine-generated data via the web

Page 5

Why Now? New Apps/Use-cases

• Analyze customer/market sentiment
  • Text analytics on social media, blogs

• Achieve operational efficiency
  • e.g. Analyze CDRs to optimize cell tower placements

• Make recommendations
  • Data mining on click-stream, purchase history

• Predict the future
  • e.g. Flightcast predicts flight delays

Page 6

Big Data Challenges

• Storage
  • Cost-effective scalability: to multi-terabytes and petabytes
  • Non-traditional data models: complex, semi-structured data

• Processing
  • Data mining, collaborative filtering for structured data
  • Text analytics, classification, etc. for unstructured data

• Regulatory Compliance
  • Data Privacy / Masking
  • Data Archival

Page 7

Addressing Big Data Challenges

• Storage
  • Parallel Databases
    • Greenplum (EMC), Vertica, AsterData
  • Distributed Key/Value Stores
    • HBase, Google’s BigTable, Amazon’s SimpleDB
  • Distributed File Systems
    • HDFS, GFS, ParAccel

• Analytics
  • SQL with extensions
  • MapReduce
  • DataFlow languages: Pig, Sawzall, etc.
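
To make the MapReduce item above concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases, using word counting as the example job. It is an illustration only: it does not use the Hadoop API, runs on a single machine, and the names `map_words`, `reduce_counts`, and `run_map_reduce` are hypothetical helpers invented for this sketch.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce phase: sum all the values collected for one key."""
    return word, sum(counts)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run mapper over all records, group values by key, then reduce each group."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Illustrative input; a real job would read its records from a distributed file system.
lines = ["big data in the enterprise", "big data challenges"]
word_counts = run_map_reduce(lines, map_words, reduce_counts)
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on many nodes and the grouping step moves data across the network. The sketch only shows the shape of the programming model.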

Page 8

Hadoop Technology Stack

HDFS

HBase

Map/Reduce

Pig, Hive, Cascading

ZooKeeper

Page 9

Hadoop Momentum

Search Volume Index

News Reference Volume

Job Trends from Indeed.com

Page 10

Big Data in the Enterprise – Hadoop Usage

Page 11

Big Data in the Enterprise – Case Studies: Hadoop World 2009

• Yahoo!: Social Graph Analysis

• VISA: Large Scale Transaction Analysis

• China Mobile: Data Mining Platform for Telecom Industry

• JP Morgan Chase: Data Processing for Financial Services

• eHarmony: Matchmaking in the Hadoop Cloud

• Rackspace: Cross Data Center Log Processing

• Visible Technologies: Real-Time Business Intelligence

• Booz Allen Hamilton: Protein Alignment using Hadoop

Slides and Videos at http://www.cloudera.com/hadoop-world-nyc

Page 12

Big Data in the Enterprise – Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay

• Twitter: The Hadoop Ecosystem at Twitter

• General Electric: Sentiment Analysis powered by Hadoop

• Yale University: MapReduce and Parallel Database Systems

• AOL: AOL’s Data Layer

• Facebook: HBase in Production

• Bank of America: The Business of Big Data

• StumbleUpon: Mixing Real-Time and Batch Processing

• Raytheon: SHARD: Storing and Querying Large-Scale Data

More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Page 13

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

Page 14

Informatica – Our Singular Mission: Enabling the Information Economy

We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives

Page 15

Application Partner Data

SWIFT NACHA HIPAA …

Cloud Computing Unstructured

Informatica – What We Do
Comprehensive, Unified, Open and Economical platform

Database

Data Warehouse

Data Migration

Test Data Management & Archiving

Master Data Management

Data Synchronization

B2B Data Exchange

Data Consolidation

Complex Event Processing

Ultra Messaging

Page 16

Informatica & Data
Verbs on Data – We do things to data!

INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation |
Conversion | De-duping | Exchange | Extraction | Federation |
Hub | Identity | Integration | Life-cycle Management |
Loading | Masking | Mastering | Matching | Migration | On Demand |
Privacy | Profiling | Provisioning | Quality | Quality Assessment |
Registry | Replication | Retirement | Services | Stewardship |
Sub-setting | Synchronization | Test Management | Transformation |
Validation | Virtualization | Warehousing ]

Page 17

Informatica & Big Data

• HDFS as a source and a target - Enable universal data connectivity for Hadoop developers

• Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic

• Lower the barrier to Hadoop-entry by using Informatica Developer as a development tool

• Support virtualized access to data split across HDFS and (relational) data-warehouses

Page 18

Informatica & Hadoop – Big Picture

Diagram labels:

• Hadoop Cluster: HDFS Data Node, HDFS Name Node, HDFS Job Tracker

• Weblogs, Enterprise Applications, Databases, Semi-structured, Un-structured

• BI, DW/DM

• Metadata Repository

• Graphical IDE for Hadoop Development

• Enterprise Connectivity for Hadoop programs

• Transformation Engine for custom data processing
