The Power of Hadoop to Transform Business
©MapR Technologies - Confidential
Aug 20, 2015
My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR Founding member of Apache Drill
MapR Technologies
Silicon Valley Startup
– Top investors
– Top technical and management team
  • Google, Microsoft, EMC, NetApp, Oracle
Enterprise quality distribution for Hadoop
Many extensions to basic Hadoop function
Strong supporter of Apache Drill
The Old Future of Hadoop
Map-reduce and HDFS
– more and more, but not really different
Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query
Stands apart from other computing
– Required by HDFS and other limitations
The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should interface directly
Fast and flexible computation
– Drill logical plan language
Recommendation based on cooccurrence
Cooccurrence gives item-item mapping
One row and column per thing
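As a rough illustration of the item-item mapping, the sketch below counts how often two items appear in the same user history. It is a minimal Python sketch, not Mahout's implementation: the function name `cooccurrence`, the toy item names, and the choice to keep raw counts (with no weighting or filtering) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count how often each pair of items appears in the same user history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories:
        # Use a set so repeated interactions by one user count once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1  # the matrix is symmetric
    return counts


# Hypothetical toy data: one list of item ids per user.
histories = [
    ["apple", "bread", "cheese"],
    ["apple", "bread"],
    ["bread", "cheese"],
]
matrix = cooccurrence(histories)
print(dict(matrix["bread"]))
```

Each key of `matrix` plays the role of one row, and each nested key one column, so every item gets exactly one row and one column.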
[Diagram: offline indexing. Labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr indexing (two Solr indexers); Index shards]
[Diagram: online serving. Labels: User history; Web tier; Solr search; Item meta-data; two Solr indexers; Index shards]
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
Presentation tier (d3 + node.js)
Analytic output
Browser query
Raw logs
Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
Big Data Processing – Hadoop
                     Batch processing
Query runtime        Minutes to hours
Data volume          TBs to PBs
Programming model    MapReduce
Users                Developers
Google project       MapReduce
Open source project  Hadoop MapReduce
Big Data Processing – Hadoop and Storm
                     Batch processing   Stream processing
Query runtime        Minutes to hours   Never-ending
Data volume          TBs to PBs         Continuous stream
Programming model    MapReduce          DAG (pre-programmed)
Users                Developers         Developers
Google project       MapReduce          -
Open source project  Hadoop MapReduce   Storm or Apache S4
Big Data Processing – The missing part
                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   -                         Never-ending
Data volume          TBs to PBs         -                         Continuous stream
Programming model    MapReduce          -                         DAG (pre-programmed)
Users                Developers         -                         Developers
Google project       MapReduce          -                         -
Open source project  Hadoop MapReduce   -                         Storm and S4
Big Data Processing – The missing part
                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries (ad hoc)          DAG (pre-programmed)
Users                Developers         Analysts and developers   Developers
Google project       MapReduce          -                         -
Open source project  Hadoop MapReduce   -                         Storm and S4
Big Data Processing
                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries                   DAG
Users                Developers         Analysts and developers   Developers
Google project       MapReduce          Dremel                    -
Open source project  Hadoop MapReduce   -                         Storm and S4
Big Data Processing
                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries                   DAG
Users                Developers         Analysts and developers   Developers
Google project       MapReduce          Dremel                    -
Open source project  Hadoop MapReduce   Apache Drill              Storm and S4
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources
Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages
Dependable
• No SPOF
• Instant recovery from crashes
Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Logical Plan Syntax:
query: [
  { op: "sequence",
    do: [
      { op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"} },
      { op: "transform",
        transforms: [
          { ref: "donuts.quantity", expr: "donuts.sales" }
        ] },
      { op: "filter",
        expr: "donuts.ppu < 1.00" },
      …
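To make the plan fragment above more concrete, here is a minimal Python sketch of how a transform step followed by a filter step could be read as operations over a stream of records. It is not Drill's execution engine. The `run_sequence` function, the sample donut records, and the use of Python lambdas in place of parsed expression strings are all assumptions made for illustration.

```python
def run_sequence(records, steps):
    """Apply each step to the list of records, in order."""
    for step in steps:
        if step["op"] == "transform":
            # Add a derived field to every record without mutating the input.
            records = [
                {**r, step["ref"]: step["expr"](r)} for r in records
            ]
        elif step["op"] == "filter":
            # Keep only the records for which the expression is true.
            records = [r for r in records if step["expr"](r)]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return records


# Hypothetical stand-in for the output of the "scan" step.
donuts = [
    {"name": "glazed", "sales": 35, "ppu": 0.55},
    {"name": "filled", "sales": 12, "ppu": 1.25},
]

# Hypothetical encoding of the transform and filter steps from the slide.
steps = [
    {"op": "transform", "ref": "quantity", "expr": lambda r: r["sales"]},
    {"op": "filter", "expr": lambda r: r["ppu"] < 1.00},
]

result = run_sequence(donuts, steps)
print(result)
```

The point of the sketch is only that a "sequence" is an ordered list of operators, each consuming the output of the previous one.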
Logical Streaming Example
{ @id: <refnum>, op: "window-frame", input: <input>, keys: [ <name>,... ], ref: <name>, before: 2, after: here }
Input:   0 1 2 3 4
Frames:  [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
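The windowing behaviour in this example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming that `after: here` means "no rows ahead of the current one" and ignoring the `keys` (grouping) parameter. The function name `window_frames` is hypothetical.

```python
def window_frames(values, before=2, after=0):
    """For each position, collect values from `before` rows back to `after` rows ahead."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)             # clip at the start of the stream
        end = min(len(values), i + after + 1)  # clip at the end of the stream
        frames.append(values[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2, after=0)
for frame in frames:
    print(frame)
```

For the input 0 1 2 3 4 this yields the frames [0], [0, 1], [0, 1, 2], [1, 2, 3], and [2, 3, 4], one per input row.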
Representing a DAG
{
  @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [
    {ref: <name>, expr: <aggexpr> },...
  ]
}
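Because each operator names its upstream operator through `input`, a list of such operators describes a DAG. The following Python sketch shows one way to recover an execution order from those references, using the standard library's `graphlib` (Python 3.9+). The operator list is hypothetical: only the `@id`, `op`, and `input` fields mirror the slide, and the `scan`, `filter`, and `union` entries, along with the list-valued `input`, are assumptions for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_order(ops):
    """Order operator ids so that every operator follows its inputs."""
    graph = {}
    for op in ops:
        inputs = op.get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]  # a single upstream id
        # TopologicalSorter expects a mapping of node -> predecessors.
        graph[op["@id"]] = set(inputs)
    return list(TopologicalSorter(graph).static_order())


# Hypothetical operators, deliberately listed out of order.
ops = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 17, "op": "scan"},
    {"@id": 18, "op": "filter", "input": 17},
    {"@id": 20, "op": "union", "input": [19, 18]},
]
order = execution_order(ops)
print(order)
```

Referring to operators by id rather than nesting them is what lets one operator's output feed several downstream operators, as operator 18 does here.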
Design Principles
Flexible• Pluggable query languages• Extensible execution engine• Pluggable data formats
• Column-based and row-based• Schema and schema-less
• Pluggable data sources
Easy• Unzip and run• Zero configuration• Reverse DNS not needed• IP addresses can change• Clear and concise log messages
Dependable• No SPOF• Instant recovery from crashes
Fast• C/C++ core with Java support
• Google C++ style guide• Min latency and max throughput
(limited only by hardware)
Get Involved!
Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013
Join the Drill project
– [email protected]
– #apachedrill
Contact me:
– [email protected]
– [email protected]
– @ted_dunning
Join MapR (in Japan!)
– [email protected]