Hadoop India Summit, Feb 2011 - Informatica

1

Informatica & Big Data

Sanjeev Kumar, VP & MD, Informatica India

Apache Hadoop India Summit 2011

2

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

3

Why “Big Data” Now? Exploding Data Volumes

• 2,500 exabytes of new information in 2012, with the Internet as the primary driver

• The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year

Chart legend: Relational; Complex, Unstructured

Source: IDC white paper sponsored by EMC, “As the Economy Contracts, the Digital Universe Expands,” May 2009.
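The volume figures on this slide mix petabytes and zettabytes, so a quick unit check helps when comparing them. The sketch below assumes decimal units (1 zettabyte = 1,000,000 petabytes), which the slide does not state; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, 1 ZB = 1,000,000 PB. The slide does not say which convention it uses.
PB_PER_ZB = 1_000_000

current_pb = 800_000        # "800K petabytes" from the slide
growth_last_year = 0.62     # "grew by 62% last year"
projected_zb = 1.2          # "will grow to 1.2 zettabytes this year"

# Size one year earlier, implied by 62% growth to the current size.
previous_pb = current_pb / (1 + growth_last_year)

# Current size expressed in zettabytes, for comparison with the projection.
current_zb = current_pb / PB_PER_ZB

# Growth rate implied by going from the current size to the projected size.
implied_growth = projected_zb / current_zb - 1

print(f"Previous year: about {previous_pb:,.0f} PB")
print(f"Current size:  {current_zb:.1f} ZB")
print(f"Implied growth this year: {implied_growth:.0%}")
```

Under that assumption, 800K petabytes is 0.8 zettabytes, the prior-year size works out to roughly 494K petabytes, and the projection to 1.2 zettabytes implies growth of about 50%.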

4

Why Now? Exploding Data Volumes

• Explosion in user-generated content

    • e.g. blogs, Twitter, Facebook, etc.

• Proliferation of web-connected devices

    • Smartphone interactions with the web

• Increased consumption of digital content

    • Netflix, Hulu, Pandora, etc.

• Internet of things

    • Smart grids and smart meters

• Machine-generated data via the web

5

Why Now? : New Apps/Use-cases

• Analyze customer/market sentiment

    • Text analytics on social media and blogs

• Achieve operational efficiency

    • e.g. analyze CDRs to optimize cell tower placements

• Make recommendations

    • Data mining on click-stream and purchase history

• Predict the future

    • e.g. Flightcast predicts flight delays
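As one illustration of the "make recommendations" use case, the sketch below mines a toy purchase history for items that are bought together. It is a minimal co-occurrence approach chosen for illustration, not a method named in the talk; the function names and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(baskets):
    """Count, for each item, how often every other item appears in the same basket."""
    co = {}
    for basket in baskets:
        # set() ignores repeated items within one basket.
        for a, b in permutations(set(basket), 2):
            co.setdefault(a, Counter())[b] += 1
    return co

def recommend(co, item, k=2):
    """Return up to k items most often bought with `item` (ties broken alphabetically)."""
    counts = co.get(item, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

# Hypothetical purchase history: one list of items per order.
purchase_history = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "lamp"],
    ["pen", "desk"],
]

co = build_cooccurrence(purchase_history)
print(recommend(co, "book"))
```

At real click-stream scale the counting step would be distributed across a cluster; the in-memory dictionary here only shows the logic.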

6

Big Data Challenges

• Storage

    • Cost-effective scalability: to multi-terabytes and petabytes

    • Non-traditional data models: complex, semi-structured data

• Processing

    • Data mining and collaborative filtering for structured data

    • Text analytics, classification, etc. for unstructured data

• Regulatory Compliance

    • Data privacy / masking

    • Data archival

7

Addressing Big Data Challenges

• Storage

    • Parallel Databases: Greenplum (EMC), Vertica, AsterData

    • Distributed Key/Value Stores: HBase, Google’s BigTable, Amazon’s SimpleDB

    • Distributed File Systems: HDFS, GFS, ParAccel

• Analytics

    • SQL with extensions

    • MapReduce

    • Dataflow Languages: Pig, Sawzall, etc.
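To make the MapReduce item concrete, here is a minimal single-process sketch of the programming model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step aggregates each group. It does not use the Hadoop API; `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names, and the input lines are made up.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one word."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Run map, group-by-key (the 'shuffle'), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input: in a real cluster these would be blocks of a file in HDFS.
lines = ["big data big", "data volumes"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

In a real deployment the framework runs many mappers and reducers in parallel on different nodes and handles the grouping over the network; the model the programmer writes against stays the same two functions.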

8

Hadoop Technology Stack

HDFS

HBase

Map/Reduce

Pig, Hive, Cascading

9

Hadoop Momentum

Search Volume Index

News Reference Volume

Job Trends from Indeed.com

10

Big Data in the Enterprise – Hadoop Usage

11

Big Data in the Enterprise – Case Studies: Hadoop World 2009

• Yahoo!: Social Graph Analysis

• VISA: Large Scale Transaction Analysis

• China Mobile: Data Mining Platform for Telecom Industry

• JP Morgan Chase: Data Processing for Financial Services

• eHarmony: Matchmaking in the Hadoop Cloud

• Rackspace: Cross Data Center Log Processing

• Visible Technologies: Real-Time Business Intelligence

• Booz Allen Hamilton: Protein Alignment using Hadoop

Slides and Videos at http://www.cloudera.com/hadoop-world-nyc

12

Big Data in the Enterprise – Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay

• Twitter: The Hadoop Ecosystem at Twitter

• General Electric: Sentiment Analysis powered by Hadoop

• Yale University: MapReduce and Parallel Database Systems

• AOL: AOL’s Data Layer

• Facebook: HBase in Production

• Bank of America: The Business of Big Data

• StumbleUpon: Mixing Real-Time and Batch Processing

• Raytheon: SHARD: Storing and Querying Large-Scale Data

More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/

13

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

14

Informatica – Our Singular Mission: Enabling the Information Economy

We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives

15

Informatica – What We Do: Comprehensive, Unified, Open and Economical Platform

Application

Partner Data

SWIFT, NACHA, HIPAA, …

Cloud Computing

Unstructured

Database

Data Warehouse

Data Migration

Test Data Management & Archiving

Master Data Management

Data Synchronization

B2B Data Exchange

Data Consolidation

Complex Event Processing

Ultra Messaging

16

Informatica & Data: Verbs on Data – We do things to data!

INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation |
Conversion | De-duping | Exchange | Extraction | Federation |
Hub | Identity | Integration | Life-cycle Management |
Loading | Masking | Mastering | Matching | Migration | On Demand |
Privacy | Profiling | Provisioning | Quality | Quality Assessment |
Registry | Replication | Retirement | Services | Stewardship |
Sub-setting | Synchronization | Test Management | Transformation |
Validation | Virtualization | Warehousing ]

17

Informatica & Big Data

• HDFS as a source and a target - Enable universal data connectivity for Hadoop developers

• Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic

• Lower the barrier to entry for Hadoop by using Informatica Developer as a development tool

• Support virtualized access to data split across HDFS and (relational) data-warehouses

18

Informatica & Hadoop – Big Picture

Architecture diagram; labels recovered from the slide:

• Hadoop Cluster: HDFS; Data Node; Name Node; Job Tracker

• Weblogs; Enterprise Applications; Databases; Semi-structured / Unstructured

• BI; DW/DM

• Metadata Repository

• Graphical IDE for Hadoop Development

• Enterprise Connectivity for Hadoop programs

• Transformation Engine for custom data processing
