Tackling Big Data with Hadoop and Graphical Open Source Integration Michaël Hirt Data Integration Product Manager
May 12, 2015
© Talend 2011 2
Agenda
1. What is Big Data?
2. Talend’s Goal
3. What’s next? Big Data Quality and Big Data management
4. Talend Open Studio for Big Data in action
What is Big Data?
2015
What Is BIG Data?
"Big data" is information of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011
2020
275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000)
200 billion intelligent devices (200,000,000,000)
50 gigabytes of data per person on Earth (50,000,000,000); 300 exabytes total
2,300 tweets per second(June 2011)
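As a quick sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. This assumes decimal (SI) units, which is what the digit strings on the slide imply; the variable names are illustrative only, and the derived values (implied population, tweets per day) are not stated in the deck.

```python
# Decimal (SI) units, matching the digit strings quoted on the slide.
GIGABYTE = 10**9
EXABYTE = 10**18

bytes_per_person = 50 * GIGABYTE   # "50 gigabytes of data per person"
total_bytes = 300 * EXABYTE        # "300 exabytes total"

# Population implied by the two figures taken together.
implied_population = total_bytes // bytes_per_person

# Daily tweet volume implied by 2,300 tweets per second.
SECONDS_PER_DAY = 24 * 60 * 60
tweets_per_day = 2_300 * SECONDS_PER_DAY

print(implied_population)  # 6000000000
print(tweets_per_day)      # 198720000
```

The two storage figures are therefore consistent with a world population of roughly six billion.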
Key Takeaway #1
Big data is defined by volume, variety, velocity
Hans Rosling – uses big data to analyze world health trends
The 6 Dimensions of BIG Data
Primary challenges: Volume, Velocity, Variety, Complexity
And also: Validation, Lineage
Forces us to think differently
Key Takeaway #2
Traditional Data Flows
[Diagram labels: CRM, ERP, Finance; ETL, Data Quality; Normalized Data; Traditional Data Warehouse; Business Analyst, Business User, Warehouse Administrator, Executives]
• Scheduled: daily or weekly, sometimes more frequently.
• Volumes rarely exceed terabytes
The new world of big data
[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices, Transactions, Network Devices, Sensors; Big Data]
Data driven business
[Diagram labels: data, information, decisions, your business; supports, drives, enables; governance]
Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
BIG Data Management
Turn Big Data into actionable information
Stages: Big Data Production, Big Data Management, Big Data Consumption
Functions: Storage, Processing, Filtering, Mining, Analytics, Search, Enrichment
Sources: RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated
Big Data Integration
Big Data Quality
BIG data driven business
[Diagram labels: BIG data, BIG information, BIG decisions, BIG business; supports, drives, enables; governance]
Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs.
Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).
Our goal
Talend – The Market Leading Unified Integration Platform
Talend Open Studio for Data Quality, Data Integration, MDM, ESB: open source license, free of charge, optional support
Talend Enterprise Data Quality, Data Integration, MDM, ESB, BPM: commercial license, subscription model
Talend Unified Platform: Monitoring, Execution, Deployment, Repository, Studio
Recognized as the open source leader in each of its market categories by all industry analysts
Trying to get from this…
to this…
Only Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary “engine” to provide a truly unique and powerful set of tools for big data.
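To make "code that is executed within MapReduce" more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in plain Python. It only illustrates the programming model: it is not code generated by Talend, it does not use Hadoop, and the sample records and the function names (`map_record`, `reduce_counts`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: emit one (key, 1) pair per record, keyed by source system."""
    yield record["source"], 1


def reduce_counts(key, values):
    """Reduce phase: sum all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in one process (a stand-in for a cluster)."""
    # Shuffle: group mapped values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input records tagged with the system they came from.
records = [{"source": "CRM"}, {"source": "ERP"}, {"source": "CRM"}]
counts = run_job(records)
print(counts)  # {'CRM': 2, 'ERP': 1}
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the grouping step is handled by the framework rather than by a dictionary.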
Why Talend…
“Big Data for the Masses”
…an open source ecosystem
Talend Open Studio for Big Data: “Big Data for the Masses”
Goal: Democratize Big Data
• Improves efficiency of big data job design with graphic interface
• Abstracts and generates code
• Run transforms inside Hadoop
• Native support for HDFS, Pig, HBase, Sqoop and Hive
• Apache License 2.0
• Embedded in Hortonworks Data Platform
• Certified with Cloudera, MapR and Greenplum
Big Data – How about Data Quality?
In big data…poor data quality can be magnified at huge scale
Poor Data Quality + Big Data = Big Problems
Poor Data Quality * Big Data = Big Problems^2
Key Takeaway #3
Two methods for inserting data quality into a big data job
1. Pipelining: as part of the load process
2. Load the cluster, then implement and execute a data quality MapReduce job
Extract - Transform - Load
E-T-L
Extract - Improve/Cleanse - Load
E-DQ-L
Pipelining: data quality with big data
[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices; DQ, DQ; Big Data]
• Use traditional data quality tools
• No new programming, no PhDs
• Once and done
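The pipelining approach (Extract, Improve/Cleanse, Load) can be sketched with Python generators, where each row is cleansed on its way into the cluster. This is an illustrative sketch only: the quality rule (rows need a non-empty email), the field names and the list standing in for the cluster are all assumptions, not part of any Talend component.

```python
def extract(rows):
    """Extract: yield raw rows one at a time from a source."""
    yield from rows


def cleanse(rows):
    """Data quality step: trim and lowercase emails, drop rows without one."""
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if not email:
            continue  # reject rows that fail the (hypothetical) quality rule
        yield {**row, "name": row.get("name", "").strip(), "email": email}


def load(rows, target):
    """Load: append cleansed rows to the target (a list stands in for the cluster)."""
    for row in rows:
        target.append(row)
    return target


# Hypothetical raw input containing one good row and two bad ones.
raw = [
    {"name": " Ada ", "email": " ADA@example.com "},
    {"name": "Bob", "email": ""},
    {"name": "Cy", "email": None},
]
cluster = load(cleanse(extract(raw)), [])
print(cluster)  # [{'name': 'Ada', 'email': 'ada@example.com'}]
```

Because the stages are chained generators, bad rows never reach the target, which is the "once and done" property of cleansing during the load.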
Big data alternative: Load and improve within the cluster
• Load first, improve later
• Really complex to build, limited tools
• Constant on, increments
• Insane performance
[Diagram labels: CRM, ERP, Finance, Social Networking, Mobile Devices; Big Data; DQ, DQ]
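The load-first alternative can be sketched as a data quality job written in map/reduce style and run over records that are already stored. The example below deduplicates records by normalized email in a single Python process; the rule, the field names (`email`, `updated`) and the function names are hypothetical, and a real job would run on the cluster rather than in memory.

```python
from collections import defaultdict


def map_dedupe(record):
    """Map: key each stored record by normalized email so duplicates group together."""
    yield record["email"].strip().lower(), record


def reduce_dedupe(key, records):
    """Reduce: keep the most recently updated record and store the normalized key."""
    newest = max(records, key=lambda r: r["updated"])
    return {**newest, "email": key}


def improve_in_cluster(stored):
    """Run the dedupe job over already-loaded data (dict stands in for the shuffle)."""
    grouped = defaultdict(list)
    for record in stored:
        for key, value in map_dedupe(record):
            grouped[key].append(value)
    return [reduce_dedupe(key, values) for key, values in grouped.items()]


# Hypothetical data loaded as-is, duplicates and formatting problems included.
stored = [
    {"email": "ada@example.com", "updated": 1},
    {"email": "ADA@example.com ", "updated": 3},
    {"email": "bob@example.com", "updated": 2},
]
cleaned = improve_in_cluster(stored)
print(cleaned)
```

Compared with pipelining, the data is usable immediately after loading, and the quality job can be rerun incrementally as new records arrive.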
Let us show you…
What’s next for Talend Big Data?
Talend Open Studio for Big Data
4.0: HDFS
4.1: Hive & Sqoop
4.2: Pig
5.0: HBase
5.1: HCatalog & Oozie
Talend Open Studio for Big Data
Packaged within Hortonworks Data Platform
…Eclipse tools for Hive, HDFS, Pig, Sqoop
…supports Oozie, HCatalog, Kerberos
Free to download and use under the Apache license
…democratizing big data through intuitive tools
[Timeline labels: big data; now, Q4 2012, 2013]
Questions / Thanks for attending
mhirt_at_talend.com