© Copyright 2000-2014 TIBCO Software Inc. Hadoop and Data Warehouse – Friends, Enemies or Profiteers? What about Real Time? Kai Wähner [email protected] @KaiWaehner www.kai-waehner.de
Jan 27, 2015
Disclaimer
These opinions are my own and do not necessarily represent my employer
Key Messages
Big Data is not just Hadoop, concentrate on Business Value!
A good Big Data Architecture combines DWH, Hadoop and Real Time!
The Integration Layer is getting even more important in the Big Data Era!
Agenda
• Terminology
• Data Warehouse and Business Intelligence
• Big Data Processing with Hadoop
• Big Data Processing in Real Time
Big Data Architecture
[Figure: Big Data Architecture combining DWH / BI, Hadoop and Real Time]
DWH means analyzing OLAP Cubes
http://www.exforsys.com/tutorials/msas/data-warehouse-database-and-oltp-database.html
Big Data means analyzing Everything
http://blogs.teradata.com/international/tag/hadoop/
• Store everything
• Even without structure
• Use whatever you need (now or later)
Big Data: Three shifts in the Way we analyze Information
• Messiness: Using ALL data, not just samples
  – Also bad data (e.g. Word spell checker, Google auto-complete and "did you mean..." recommendation)
• Correlations: Instead of causalities
  – May not tell us WHY something is happening, but THAT it is happening
  – In many situations, this is good enough
  – What drug substance cures cancer? When should I buy an airplane ticket?
• Datafication: Store, process, combine, reuse, enhance all data!
  – Digitalisation (Amazon Kindle → Read) vs. Datafication (Google Books → Read, Search, Process, ...)
  – Words become data: Google Books: not just read, but also search, analyse, etc.
  – Locations become data: GPS: not just navigation, but also insurance costs, economic routes, etc.
What is Big Data? The combined Vs of Big Data
• Volume (terabytes, petabytes)
• Variety (social networks, blog posts, logs, sensors, etc.)
• Velocity (real time)
• Value
Real Time
Wikipedia Definition:
• Real time programs must guarantee response within strict time constraints, often referred to as "deadlines". Real time responses are often understood to be in the order of milliseconds, and sometimes microseconds.
• The term "near real time" refers to the time delay introduced by automated data processing or network transmission.
• The distinction between the terms "near real time" and "real time" is somewhat nebulous and must be defined for the situation at hand.
Hereby, for this talk, I define:
– Real time == response in nanoseconds || microseconds || milliseconds || <= one second
– Near real time == (response time > one second)
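The working definition above can be written down as a tiny check. This is only a sketch of the talk's own threshold (one second); the function name `classify_latency` is made up for illustration.

```python
def classify_latency(response_seconds: float) -> str:
    """Classify a response time using this talk's working definition.

    Assumption: the one-second boundary is the talk's convention,
    not a general standard.
    """
    if response_seconds < 0:
        raise ValueError("response time cannot be negative")
    # <= one second counts as real time, everything slower as near real time
    return "real time" if response_seconds <= 1.0 else "near real time"
```

For example, a 5 ms response (`classify_latency(0.005)`) falls under "real time", while a query that answers after 2.5 seconds is "near real time".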
DWH vs. BI
• Data Warehouse (DWH) → Storage
• Business Intelligence (BI) → Analytics
• Both terms are often used as synonyms, i.e. when someone talks about a DWH, this might include analytics
• BI can be used without a DWH
Typical DWH Process
http://wikibon.org/blog/not-your-fathers-data-analytics/
A DWH is "Business Case driven":
• Reporting
• Dashboards
• Drill Down Analytics
Different DWH Options:
• Enterprise DWH (== EDW)
• Department / Project DWH
• Embedded BI (into Applications)
BI == Reporting + Statistics + Data Discovery
DWH
BI
BI Visualization
Products
DWH
• SQL: e.g. MySQL
• MPP: e.g. Teradata, EMC Greenplum, IBM Netezza
  – Scale very well (almost linear), very high performance, hardware / software costs also increase a lot
BI
• Microsoft Excel
• BI Tools: e.g. TIBCO Spotfire, Tableau, MicroStrategy
Hint: Good BI tools
• allow data discovery / visualization using different sources, not just DWH
• are easy to use
BI Tool Example: TIBCO Spotfire
The whole team needs analytics. Spotfire is for everyone, helping users with a variety of skill levels to visualize, explore and share information. It has
• At-a-glance business facts for managers
• Dashboards for front-line decision-makers
• Visual discovery for business users
• Deep data exploration for analysts
• Advanced predictive analytics for statisticians
• And beautiful visualizations to impress your executives
Example: TIBCO Spotfire
Live Demo
"TIBCO Spotfire" in action...
DWH Real World Use Case
http://spotfire.tibco.com/resources/content-center?Content%20Type=Case%20Studies
Embedded BI Real World Use Case
https://www.jaspersoft.com/embeddedShowcase/periscope.html
Problems of a DWH
No flexibility / agility
• Just structured data
• Just some (maybe aggregated) history data
• Just good for already known business cases
Low speed
• ETL is batch, usually takes hours or sometimes even days
• No proactive reactions possible → "too late architecture"
High costs (per GB)
• Just selected data
• Too old data is often outsourced to archives
Classic BI vs. Big Data BI
Why no longer DWH, but Hadoop?
Hadoop was built to solve problems of RDBMS and DWH…
Benefits of Hadoop:
• Store and analyze all data
  – all data == not just selected (maybe aggregated) data
  – all data == structured + semi-structured + unstructured
  → be more flexible, adapt to changing business cases
• Better performance (massively parallel)
• Ad hoc data discovery – also for big data volumes
• Save money (commodity hardware, open source software)
What is Hadoop?
Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
MapReduce
Simple example:
• Input: (very large) text files with lists of strings, such as: "318, 0043012650999991949032412004...0500001N9+01111+99999999999..."
• We are interested in just some of the content: year and temperature (highlighted in red on the original slide)
• The MapReduce function has to compute the maximum temperature for every year
Example from the book "Hadoop: The Definitive Guide, 3rd Edition"
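The map, shuffle and reduce steps of this example can be sketched in plain Python. This is not Hadoop code and not the book's implementation: the record format `"year,temperature"` is a simplified stand-in (the real fixed-width weather records are elided on the slide), and all function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (year, temperature) for one input line.

    Assumption: each line looks like "1949,111" (simplified format).
    """
    year, temperature = line.strip().split(",")
    yield year, int(temperature)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key (the year)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for year, temperature in pairs:
        groups[year].append(temperature)
    return groups


def reduce_max(year: str, temperatures: List[int]) -> Tuple[str, int]:
    """Reduce step: keep only the maximum temperature per year."""
    return year, max(temperatures)


def max_temperature_by_year(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in a single process."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_max(year, temps) for year, temps in shuffle(pairs).items())
```

With the made-up input `["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]`, `max_temperature_by_year` returns `{"1949": 111, "1950": 22}`. In Hadoop, the map and reduce functions run in parallel on many machines and the framework performs the shuffle.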
Hadoop Products
[Figure: Hadoop products ordered by features included, from few to many. Apache Hadoop: MapReduce, HDFS, Ecosystem]
Hadoop Ecosystem
Hadoop Products
[Figure: Hadoop products ordered by features included, from few to many. Apache Hadoop: MapReduce, HDFS, Ecosystem. Hadoop Distribution: Apache Hadoop + Packaging, Deployment-Tooling, Support]
Hadoop Distributions
(… some more arising)
EMR
Hadoop Products
[Figure: Hadoop products ordered by features included, from few to many. Apache Hadoop: MapReduce, HDFS, Ecosystem. Hadoop Distribution: Apache Hadoop + Packaging, Deployment-Tooling, Support. Big Data Suite: Hadoop Distribution + Tooling / Modeling, Code Generation, Scheduling, Integration]
Big Data Integration Suite: TIBCO BusinessWorks
Live Demo
"TIBCO BusinessWorks" in action...
Hadoop Real World Use Case: Replace ETL to improve Performance
"The advantage of their new system is that they can now look at their data [from their log processing system] in anyway they want:
• Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
• When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system."
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
(no TIBCO reference)
Hadoop Real World Use Case: Storage to reduce Costs
Global Parcel Service
• A lot of data must be stored "forever"
• Numbers increase exponentially
• Goal: As cheap as possible
• Problem: Queries must still be possible (compliance!)
• Solution: Commodity servers and "Hadoop querying"
http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up
(no TIBCO reference)
DWH or Hadoop?
         | DWH                                      | Hadoop
Data     | Structured                               | All data
Maturity | Established in Enterprise                | New concepts
Tooling  | Installed, good knowledge and experience | New tools, coding required, business can still use SQL-similar queries or same BI tool
Costs    | High (per GB)                            | Low (per GB)
DWH plus Hadoop?
DWH and Hadoop complement each other very well
• Store all data in Hadoop (cheap per GB)
• ETL from Hadoop to DWH (expensive per GB)
• Create specific reports / dashboards in DWH (leverage existing products and knowledge)
• Do Ad Hoc (Big) Data Discovery directly in Hadoop, no DWH needed
Good BI tools support both DWH and Hadoop! For example, TIBCO Spotfire has connectors to:
• RDBMS (e.g. MySQL)
• MPP (e.g. Teradata, IBM Netezza, Greenplum)
• Hadoop (e.g. Hive, Impala)
• In-Memory (e.g. TIBCO ActiveSpaces, SAP HANA)
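The "store everything, ETL only the aggregate" idea can be illustrated with a small sketch. Everything here is a stand-in: a Python list plays the role of the raw data kept in Hadoop, an in-memory SQLite database plays the role of the DWH, and the table name `daily_revenue` and the sample events are made up.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple

# Made-up raw events, standing in for detailed records kept cheaply in Hadoop.
RAW_EVENTS = [
    {"day": "2015-01-26", "product": "A", "amount": 10.0},
    {"day": "2015-01-26", "product": "A", "amount": 5.0},
    {"day": "2015-01-26", "product": "B", "amount": 7.0},
    {"day": "2015-01-27", "product": "A", "amount": 3.0},
]


def etl_daily_revenue(raw_events: Iterable[dict], dwh: sqlite3.Connection) -> None:
    """Aggregate raw events and load only the summary into the DWH stand-in."""
    totals: Dict[Tuple[str, str], float] = {}
    for event in raw_events:
        key = (event["day"], event["product"])
        totals[key] = totals.get(key, 0.0) + event["amount"]
    dwh.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue ("
        "day TEXT, product TEXT, revenue REAL, PRIMARY KEY (day, product))"
    )
    dwh.executemany(
        "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?, ?)",
        [(day, product, revenue) for (day, product), revenue in totals.items()],
    )
    dwh.commit()


def report(dwh: sqlite3.Connection) -> List[tuple]:
    """A fixed report that runs against the small, structured DWH table."""
    query = "SELECT day, product, revenue FROM daily_revenue ORDER BY day, product"
    return dwh.execute(query).fetchall()
```

Usage: `dwh = sqlite3.connect(":memory:")`, then `etl_daily_revenue(RAW_EVENTS, dwh)` and `report(dwh)`. The report only sees three aggregated rows, while the four raw events stay available for ad hoc discovery outside the DWH.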
Recommendation DWH vs. Hadoop vs. XYZ
• Short term: Use Hadoop (only) when you can save (a lot of) money or when you cannot solve your business problem without Hadoop. A lot of things have to be improved, e.g. governance, security, performance, and tool support.
• Long term: Hadoop can replace DWH (as you can create a DWH on top of Hadoop with SQL interface already today)!
• Be aware: A lot of other options emerge for analyzing big data besides Hadoop, e.g.
  – Analytical databases with SQL interface (MemSQL, Citus Data)
  – Log Analytics (Splunk, TIBCO LogLogic)
  – Graph databases (Neo4j, InfiniteGraph)
Vendors Strategy...
Hadoop vendors push Hadoop as DWH replacement
→ Called e.g. "Enterprise Data Hub" (Cloudera) or "Data Lake" (Hortonworks)
http://gigaom.com/2013/10/29/clouderas-plan-to-become-the-center-of-your-data-universe/
http://hortonworks.com/wp-content/uploads/downloads/2013/04/Hortonworks.ApacheHadoopPatternsOfUse.v1.0.pdf
Vendors Strategy...
MPP / DWH vendors add Hadoop support as a complementary add-on to their DWH
→ Reason (probably): Market pressure!
→ Benefit: One platform (including tooling and support) for DWH and Hadoop
Example: EMC combines DWH and Hadoop
http://wikibon.org/wiki/v/EMC_Integrates_Greenplum_DB_and_Hadoop_with_Pivotal_HD
http://www.gopivotal.com/big-data/pivotal-hd
Example: Teradata combines DWH and Hadoop
http://www.teradata.com/Teradata-Enterprise-Access-for-Hadoop/
http://gigaom.com/2014/04/07/teradata-says-hadoop-is-good-for-business-but-for-how-long/
Hadoop evolving from Batch to Near Real Time
Hadoop is MapReduce == Batch (== hours, minutes, seconds)
• Good for complex transformations / computations of big data volumes
• Not so good for ad hoc data exploration
• Improvements: Hive Stinger (Hortonworks) etc.
Non-MapReduce processing engines added in the meantime (YARN makes it possible)
• Ad hoc data discovery (== seconds)
• Hive / Pig with Apache Tez replacing MapReduce under the hood for data processing
• New query engines, e.g. Impala (Cloudera) or Apache Drill (MapR)
MPP vendors (e.g. Teradata, EMC Greenplum) also add their own query engines
• Offer fast data exploration (without MapReduce)
Some Hadoop problems remain
• No good, easy tooling (Hadoop ecosystem) → might be solved in the next years
• Missing maturity (alpha / beta versions) → might be solved in the next years
• No "real time" (== ms, ns), but "near real time" (> 1 sec) → "too late architecture"
Real Time: “The Two-Second Advantage”
"A little bit of the right information, just a little bit beforehand – whether it is a couple of seconds, minutes or hours – is more valuable than all of the information in the world six months later… this is the two-second advantage."
Vivek Ranadivé, Founder and CEO of TIBCO
The Value of Data decreases over Time
What is Big Data? The combined Vs of Big Data
• Volume (terabytes, petabytes)
• Variety (social networks, blog posts, logs, sensors, etc.)
• Velocity (real time)
Fast Data
Real Time Architecture?
Transaction Based Architectures
• EVENTS, Mainframe/ERP/DB/App, ACTION
• Transaction
• Not elastic, doesn't scale, "always late" architecture and analytics
Behavior Based Architectures
• EVENTS, Mainframe/ERP/DB/App, ACTION
• Data, Event and Analytics
• Elastic, scales, real time architecture (Events, Data and Analytics)
Complex Event / Stream Processing / In-Memory
Concepts
• Streams: Monitoring millions of events in a specific time window to react proactively
• Stateful: Collect, filter and correlate events with state to anticipate outcomes and react proactively
• Transactional: Highly performant transactional event processing
Products vs. Frameworks
• Products are mature, mission-critical, in production, e.g. TIBCO StreamBase, IBM InfoSphere Streams
• Open Source Frameworks, e.g. "Apache Spark" and "Apache Storm"
  – Future will tell us about performance, tooling, support, etc.
  – Can be combined with Hadoop
  – Are complementary to Products such as TIBCO StreamBase
In-Memory
• Can also be used for "big data" (Terabytes possible!)
• Usually complementary, i.e. they can be / have to be combined with stream processing / complex event processing
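The "events in a specific time window" concept can be sketched without any streaming product. The class below is a hypothetical, single-process illustration (it is not StreamBase, Spark or Storm code): it keeps the timestamps of recent events and signals when a threshold is reached inside the window. It assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque


class SlidingWindowCounter:
    """Count events inside a sliding time window and flag bursts."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps: Deque[float] = deque()  # state kept between events

    def on_event(self, timestamp: float) -> bool:
        """Register one event; return True if the window holds enough events.

        Assumption: timestamps are passed in non-decreasing order.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With a 30-second window and a threshold of 3, events at seconds 0, 10 and 20 trigger on the third event, while later events at seconds 100 and 105 do not, because the earlier ones have expired. Real products add what this sketch lacks: distribution, persistence, out-of-order handling and tooling.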
Stream Processing Architecture
[Figure: stream processing architecture. Labels: LiveView Datamart (Continuous Query, Continuous Query Processor, Ad Hoc Query, Alerts), CEP, Messaging (low latency), Messaging (JMS), Social Media Data, Market Data, In-Memory, ESB Integration, Sensor Data, Historical Data, JDBC, ActiveSpaces, Enterprise data]
Stream Processing Architecture (Example: TIBCO StreamBase)
[Figure: TIBCO StreamBase and TIBCO LiveView. Labels: Continuous Query, Continuous Query Processor, Ad Hoc Query, Alerts, Active Tables, Trading Signal, Transaction Cost, Orders / Executions, Market Data, Alert Setting]
• Snapshot AND always-live updates
• Quickly connect to streams
• Anticipate opportunities, proactive action
Example: TIBCO StreamBase Tooling
StreamBase Development Studio
• Visual Development
• Visual Debugging
• Feed Simulation
• Unit Testing
StreamBase LiveView
• Real Time Analytics and Visualization
• Ad hoc queries
• Alerts and Notifications
• Web, Mobile and API Integration
Real World: Real-Time Trade Surveillance
[Figure: real-time trade surveillance, from Inputs to Outputs. Labels: Market Data, Trade Data, Static Data, Systems Data, Performance Benchmarks; Adapters and Handlers (on both sides); StreamBase Server(s) with Applications Integration, Normalization, Aggregation, Correlation, Rules, Alerts, Automation; StreamBase Studio for Developing EventFlow Applications; Data Management, Persistence Stores, Logs; Automation, Desktop, Alerts]
Real Time (Stream Processing) Real World Use Case
Real-Time Fraud Detection
"The firm needs to monitor machine-driven algorithms, and look for suspicious patterns. Sounds simple, right? Not so simple! In this case, the patterns of interest required correlation of 5 streams of real-time data. Patterns happen within 15-30 second windows, during which thousands of dollars could be lost. Attacks come in bursts. The data required to find these patterns was loaded into a data warehouse and reports were checked each day. Decisions to act were made every day. LiveView now intercepts the data before it hit the warehouse by connecting LiveView to the source of data. It took 3 days to integrate these sources because it took that long to find someone who knew where 3 of the data streams came from! StreamBase detects fraud patterns in milliseconds.
But the really interesting part came next. Once this firm could see patterns of fraud, they were faced with a new challenge: what to DO about it? How many times did the pattern need to be repeated until active surveillance is started? Should the action be quarantined for a period, or halted immediately? All these questions were new, and the answers to them keeps changing. The fact that the answers keep changing highlights the importance of ease of use. Analytics must be changed quickly and be made available to fraud experts - in some cases, in hours - as understanding deepens, and as the bad guys change their tactics.
Better, higher value-add customer service for highly automated industries. Knowledge workers who anticipate sales opportunities. Spotting fraud in high-speed transactions streams and taking action."
Some more use cases: http://streambase.typepad.com/streambase_stream_process/2012/04/streambase-liveview-10-3-stories-from-the-trenches.html
Real Time (CEP + In-Memory) Real World Use Case
"With 38 million fans, MGM knows how to put its customers first, it takes more than a smile too. Customers want a personalized, tailored experience, one that knows their name and can anticipate their needs. With the help of TIBCO technologies that leverage big data and give customers a digital identity, MGM can send personalized offers directly to customers, save them a seat, and have their favorite drink on the way. With multiple customer touch points and channels, MGM can reach customers in more ways, and in more places, than ever before."
https://www.youtube.com/watch?v=X-7S3kCOx9k
CEP:
• Correlate
• Analyze
• Action
In-Memory:
• Enable Real Time
• Only customers that have checked in
Live Demo
"TIBCO StreamBase" in action...
Real Time plus Hadoop?
Hadoop:
• Storage
• Complex computing (MapReduce)
Real Time:
• Immediate (proactive) reactions – automated or manually by user
• Monitor streaming data in Real Time
Example: TIBCO StreamBase and its Apache Flume connector for reading streaming data from Hadoop / HDFS or sending streaming data to Hadoop / HDFS
Real Time plus Hadoop Real World Use Case
Use Case:
• Predict pricing movement in live bets
Hadoop:
• Store all history information about all past bets
• Use MapReduce to precompute odds for new matches, based on all history data
TIBCO StreamBase:
• Compute new odds in real time to react within a live game after events (e.g. when a team scores a goal)
• Monitor stream data in real time dashboards
http://www.casestudyu.com/news/2014/04/04/7762652.htm
http://vimeo.com/91461315
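The split between a batch step and a real-time step in this use case can be sketched as follows. This is purely illustrative: the history data, the function names and especially the update rule in `on_goal` are made-up placeholders, not the model used in the referenced case study.

```python
from typing import Dict, Iterable, Tuple


def precompute_baseline(history: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Batch step (the Hadoop role): share of past matches won by the home team.

    Assumption: history is a list of (fixture, home_team_won) pairs.
    """
    games: Dict[str, int] = {}
    wins: Dict[str, int] = {}
    for fixture, home_won in history:
        games[fixture] = games.get(fixture, 0) + 1
        wins[fixture] = wins.get(fixture, 0) + (1 if home_won else 0)
    return {fixture: wins[fixture] / games[fixture] for fixture in games}


def on_goal(probability: float, scored_by_home: bool, shift: float = 0.25) -> float:
    """Real-time step (the stream processing role): adjust after a live event.

    Placeholder rule: move the home-win probability a fixed fraction
    toward 1.0 (home goal) or 0.0 (away goal).
    """
    target = 1.0 if scored_by_home else 0.0
    return probability + shift * (target - probability)
```

For example, a history of three home wins in four matches gives a baseline of 0.75; a home goal during the live game then moves it to 0.8125 with the default `shift`. The point of the sketch is the division of labour: the expensive pass over all history runs ahead of time, while the per-event update is cheap enough to run within the live game.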
Recap: Big Data Architecture
[Figure: Big Data Architecture combining DWH / BI, Hadoop and Real Time]
Off Topic
What about Integration?
Off Topic
Integration is not a topic of this session… However: It gets even more important in the future!
The number of different data sources and technologies increases even more than in the past
– CRM, ERP, Host, B2B, etc. will not disappear
– DWH, Hadoop cluster, event / streaming server, In-Memory DB have to communicate
– Cloud, Mobile, Internet of Things are not optional, they are our future!
Recap: Key Messages
Big Data is not just Hadoop, concentrate on Business Value!
A good Big Data Architecture combines DWH, Hadoop and Real Time!
The Integration Layer is getting even more important in the Big Data Era!
Questions? Kai Wähner [email protected], @KaiWaehner, www.kai-waehner.de