Informatica & Big Data. Sanjeev Kumar, VP & MD, Informatica India. Apache Hadoop India Summit 2011.
Hadoop India Summit, Feb 2011 - Informatica

Jan 27, 2015

Technology

Sanjeev Kumar

Lightning talk by Sanjeev Kumar of Informatica India, presented at the Hadoop India Summit on Feb 16, 2011.
Transcript
Page 1: Hadoop India Summit, Feb 2011 - Informatica

Informatica & Big Data

Sanjeev Kumar, VP & MD, Informatica India

Apache Hadoop India Summit 2011

Page 2

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

Page 3

Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.

Why “Big Data” Now? Exploding Data Volumes

Chart categories: Relational; Complex, Unstructured

• 2,500 exabytes of new information in 2012, with the Internet as the primary driver

• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
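
As a rough check on the units in the figures above, the sketch below converts them to a common scale. It assumes decimal (SI) prefixes, where 1 exabyte = 1,000 petabytes and 1 zettabyte = 1,000 exabytes; the function and constant names are illustrative and do not come from the talk.

```python
# Assumption: SI (decimal) prefixes. All names here are illustrative only.
PB_PER_EB = 1_000  # petabytes per exabyte
EB_PER_ZB = 1_000  # exabytes per zettabyte


def petabytes_to_zettabytes(petabytes: float) -> float:
    """Convert petabytes to zettabytes using decimal prefixes."""
    return petabytes / (PB_PER_EB * EB_PER_ZB)


def implied_growth(current_zb: float, projected_zb: float) -> float:
    """Fractional growth implied by going from current_zb to projected_zb."""
    return (projected_zb - current_zb) / current_zb


current_zb = petabytes_to_zettabytes(800_000)  # the slide's "800K petabytes"
growth = implied_growth(current_zb, 1.2)       # the slide's projected 1.2 zettabytes

print(f"{current_zb:.1f} ZB now, implied growth {growth:.0%}")
```

Under these assumptions, 800K petabytes is 0.8 zettabytes, so the projected 1.2 zettabytes corresponds to growth of roughly 50%.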

Page 4

Why Now? Exploding Data Volumes

• Explosion in user-generated content
  • e.g., blogs, Twitter, Facebook

• Proliferation of web-connected devices
  • Smartphone interactions with the web

• Increased consumption of digital content
  • Netflix, Hulu, Pandora, etc.

• Internet of things
  • Smart grids and smart meters
  • Machine-generated data via the web

Page 5

Why Now? New Apps/Use-cases

• Analyze customer/market sentiment
  • Text analytics on social media and blogs

• Achieve operational efficiency
  • e.g., analyze CDRs (call detail records) to optimize cell tower placement

• Make recommendations
  • Data mining on click-stream and purchase history

• Predict the future
  • e.g., FlightCaster predicts flight delays

Page 6

Big Data Challenges

• Storage
  • Cost-effective scalability to multiple terabytes and petabytes
  • Non-traditional data models: complex, semi-structured data

• Processing
  • Data mining and collaborative filtering for structured data
  • Text analytics, classification, etc. for unstructured data

• Regulatory Compliance
  • Data privacy / masking
  • Data archival
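
To make the data privacy / masking challenge concrete, here is a minimal sketch of one common approach: replacing a sensitive field with a salted hash, so that records can still be joined on the masked value without exposing the original. The record layout, field names, and salt are hypothetical and do not come from the talk, and a real deployment would also need proper key management.

```python
import hashlib
from typing import Any, Dict, Set


def mask_value(value: str, salt: str) -> str:
    """Return a deterministic pseudonym: salted SHA-256, truncated to 16 hex chars."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_record(record: Dict[str, Any], sensitive: Set[str], salt: str) -> Dict[str, Any]:
    """Copy the record, masking only the fields named in `sensitive`."""
    return {
        key: mask_value(str(val), salt) if key in sensitive else val
        for key, val in record.items()
    }


# Hypothetical call record; only the phone number is treated as sensitive.
call = {"phone": "555-0100", "tower": "T-17", "seconds": 42}
masked = mask_record(call, sensitive={"phone"}, salt="demo-salt")
print(masked["tower"], masked["seconds"], len(masked["phone"]))
```

Because the mapping is deterministic for a given salt, the same phone number always masks to the same token, which keeps aggregate analysis possible on the masked data.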

Page 7

Addressing Big Data Challenges

• Storage
  • Parallel databases
    • Greenplum (EMC), Vertica, Aster Data
  • Distributed key/value stores
    • HBase, Google’s BigTable, Amazon’s SimpleDB
  • Distributed file systems
    • HDFS, GFS, ParAccel

• Analytics
  • SQL with extensions
  • MapReduce
  • Dataflow languages: Pig, Sawzall, etc.
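
For readers new to the MapReduce model listed above, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to word counting. It is plain Python rather than the Hadoop API, and the function names are illustrative; a real job would run these stages in parallel across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big Hadoop"])
print(counts)
```

The map and reduce functions are independent per line and per key, which is what lets a framework distribute them across many machines.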

Page 8

Hadoop Technology Stack

HDFS

HBase

Map/Reduce

Pig, Hive, Cascading

Page 9

Hadoop Momentum

Search Volume Index

News Reference Volume

Job Trends from Indeed.com

Page 10

Big Data in the Enterprise – Hadoop Usage

Page 11

Big Data in the Enterprise
Case Studies: Hadoop World 2009

• Yahoo!: Social Graph Analysis

• VISA: Large Scale Transaction Analysis

• China Mobile: Data Mining Platform for Telecom Industry

• JP Morgan Chase: Data Processing for Financial Services

• eHarmony: Matchmaking in the Hadoop Cloud

• Rackspace: Cross Data Center Log Processing

• Visible Technologies: Real-Time Business Intelligence

• Booz Allen Hamilton: Protein Alignment using Hadoop

Slides and Videos at http://www.cloudera.com/hadoop-world-nyc

Page 12

Big Data in the Enterprise
Case Studies: Hadoop World 2010

• eBay: Hadoop at eBay

• Twitter: The Hadoop Ecosystem at Twitter

• General Electric: Sentiment Analysis powered by Hadoop

• Yale University: MapReduce and Parallel Database Systems

• AOL: AOL’s Data Layer

• Facebook: HBase in Production

• Bank of America: The Business of Big Data

• StumbleUpon: Mixing Real-Time and Batch Processing

• Raytheon: SHARD: Storing and Querying Large-Scale Data

More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/

Page 13

Agenda

• Big Data

• Big Data in Enterprise

• Informatica & Data

• Informatica & Big Data

Page 14

Informatica – Our Singular Mission: Enabling The Information Economy

We enable organizations to gain a competitive advantage from all their information assets to drive their top business imperatives.

Page 15

Informatica – What We Do
Comprehensive, Unified, Open and Economical platform

Application, Partner Data

SWIFT, NACHA, HIPAA, …

Cloud Computing, Unstructured

Database

Data Warehouse

Data Migration

Test Data Management & Archiving

Master Data Management

Data Synchronization

B2B Data Exchange

Data Consolidation

Complex Event Processing

Ultra Messaging

Page 16

Informatica & Data
Verbs on Data – We do things to data!

INFA = Data + [ Archival | As a Service | Cleansing | Clustering | Consolidation |
Conversion | De-duping | Exchange | Extraction | Federation |
Hub | Identity | Integration | Life-cycle Management |
Loading | Masking | Mastering | Matching | Migration | On Demand |
Privacy | Profiling | Provisioning | Quality | Quality Assessment |
Registry | Replication | Retirement | Services | Stewardship |
Sub-setting | Synchronization | Test Management | Transformation |
Validation | Virtualization | Warehousing ]

Page 17

Informatica & Big Data

• HDFS as a source and a target: enable universal data connectivity for Hadoop developers

• Enable Hadoop developers to leverage prebuilt Data Transformation and Data Quality logic

• Lower the barrier to entry for Hadoop by using Informatica Developer as a development tool

• Support virtualized access to data split across HDFS and (relational) data-warehouses

Page 18

Informatica & Hadoop – Big Picture

Diagram labels:

• Hadoop Cluster: HDFS, Data Node, Name Node, Job Tracker

• Surrounding systems: Weblogs, Enterprise Applications, Databases, Semi-structured / Un-structured data, BI, DW/DM

• Informatica: Metadata Repository; Graphical IDE for Hadoop Development; Enterprise Connectivity for Hadoop programs; Transformation Engine for custom data processing
