MapR: The Next Generation Big Data Platform
©MapR Technologies - Confidential
Big is the next big thing
Big data and Hadoop are exploding
Companies are being funded
Books are being written
Applications sprouting up everywhere
Why Now?
But Moore’s law has applied for a long time
Why is Hadoop exploding now?
Why not 10 years ago?
Why not 20?
6/1/2012
Size Matters, but …
If it were just availability of data then existing big companies would adopt big data technology first
They didn’t
Or Maybe Cost
If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
They didn’t
Backwards adoption
Under almost any threshold argument startups would not adopt big data technology first
They did
Everywhere at Once?
Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease
Analytics Scaling Laws
Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
Cost/performance has changed radically
– IF you can use many commodity boxes
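The scaling argument above can be made concrete with a small numeric sketch. Everything specific here is an assumption for illustration, not taken from the slides: the saturating value curve, the two cost curves, and every constant (`knee`, `base`, `growth`, `fixed`, `per_unit`) are hypothetical, chosen only to fit a 0 to 2,000 scale axis.

```python
import math

# All curve shapes and constants below are illustrative assumptions.

def value(scale, knee=200.0):
    """80-20 style value: big early gains, then rapidly diminishing returns."""
    return 1.0 - math.exp(-scale / knee)

def exponential_cost(scale, base=0.01, growth=400.0):
    """'Old school' cost model: cost grows exponentially with scale."""
    return base * math.exp(scale / growth)

def linear_cost(scale, fixed=0.05, per_unit=0.00005):
    """'Big data' cost model: a fixed overhead plus a low per-unit cost."""
    return fixed + per_unit * scale

def net_value(scale, cost):
    """Net value is the value delivered minus the cost of operating at that scale."""
    return value(scale) - cost(scale)

def best_scale(cost, scales=range(0, 2001, 10)):
    """Grid search for the scale that maximizes net value under a cost model."""
    return max(scales, key=lambda s: net_value(s, cost))

if __name__ == "__main__":
    for name, cost in [("exponential", exponential_cost), ("linear", linear_cost)]:
        s = best_scale(cost)
        print(f"{name:12s} optimum scale={s:5d} net value={net_value(s, cost):.3f}")
```

With these assumed constants, the linear model is more expensive at very small scale because of its fixed overhead, and much cheaper at large scale, so its net-value optimum sits further out along the scale axis.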
We knew that
We should have known that
We didn’t know that!
You’re kidding, people do that?
[Figure: value (0 to 1) versus scale (0 to 2,000). Labeled points along the curve: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
[Figure: value (0 to 1) versus scale (0 to 2,000), two panels]
Net value optimum has a sharp peak well before maximum effort
[Figure: value (0 to 1) versus scale (0 to 2,000)]
More than just a little
[Figure: value (0 to 1) versus scale (0 to 2,000)]
They are changing a LOT!
[Figure: value (0 to 1) versus scale (0 to 2,000)]
Initially, linear cost scaling actually makes things worse
A tipping point is reached and things change radically …
Prerequisites for Tipping
To reach the tipping point,
Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
For startups
History is always small
The future is huge
Must adopt new technology to survive
Compatibility is not as important
– In fact, incompatibility is assumed
Physics of large companies
Startup phase
Absolute growth still very large
For large businesses
Present state is always large
Relative growth is much smaller
Absolute growth rate can be very large
Must adopt new technology to survive
– Cautiously!
– But must integrate technology with legacy
Compatibility is crucial
The startup technology picture
Old computers and software
Current computers and software
Expected hardware and software growth
No compatibility requirement
The large enterprise picture
Proof of concept Hadoop cluster
Long-term Hadoop cluster
Current hardware and software
?
Must work together
So that is why, and why now
What can you do with it?
And how?
Scale-free Computing
Map-reduce
– pure functions for practical batch parallel computation
– high level languages like Hive and Pig available
– MapR provides standard access systems via NFS and ODBC
BSP
– pure functions for synchronous iterative actor-based compute
– Apache Giraph provides practical implementation
Actors
– tuple passing with transformations
– Storm provides practical implementation
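As a rough illustration of "pure functions for practical batch parallel computation", here is a single-process word count written in the map-reduce style. It is a sketch only: `map_reduce`, `map_words` and `reduce_counts` are hypothetical names, not the Hadoop API, and the in-memory grouping step stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict

def map_words(line):
    """Pure map function: one input record in, a list of (key, value) pairs out."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Pure reduce function: one key and all of its values in, one result out."""
    return word, sum(counts)

def map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: keys are independent of each other
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

if __name__ == "__main__":
    lines = ["big data", "Big deal"]
    print(map_reduce(lines, map_words, reduce_counts))
```

Because neither function touches shared state, the map calls can run on any machine holding a slice of the input and the reduce calls can run independently per key, which is what lets the same program run at very different scales.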
Future Proof Schemas
Denormalize data where possible to avoid seeks
– use embedded lists
– duplicate data
Flexible Schemas
– use standard system for data serialization
– must provide protocol migration without versioning
– Protobufs (Google), Avro (Apache) and Thrift can all be used
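The two ideas on this slide can be sketched with plain JSON. This is an assumption-laden illustration, not the Protobufs, Avro or Thrift API: those systems provide defaults and unknown-field handling through their own schema mechanisms. The record layout, the field names and `ORDER_DEFAULTS` are hypothetical.

```python
import copy
import json

# Hypothetical reader-side schema: every field has a default, so records
# written before a field existed can still be read.
ORDER_DEFAULTS = {
    "order_id": None,
    "customer_name": "",   # duplicated from the customer record to avoid a lookup
    "items": [],           # embedded list instead of a separate line-item table
    "currency": "USD",     # field added later; older records do not carry it
}

def read_order(raw):
    """Decode one denormalized order, tolerating missing and unknown fields."""
    record = json.loads(raw)
    order = {}
    for field, default in ORDER_DEFAULTS.items():
        # Copy the default so mutable defaults are never shared between records.
        order[field] = record.get(field, copy.deepcopy(default))
    return order  # fields unknown to this reader are silently ignored

if __name__ == "__main__":
    old = '{"order_id": 1, "customer_name": "Ada", "items": [{"sku": "A1", "qty": 2}]}'
    new = '{"order_id": 2, "customer_name": "Bob", "items": [], "currency": "EUR", "coupon": "X"}'
    print(read_order(old))
    print(read_order(new))
```

Old and new records can sit side by side in the same files, and each reader fills in or ignores whatever it does not know about, so stored data does not need to be rewritten when the schema changes.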
Open Compute and Storage
Big data has mass and inertia
– once it lands, it should not move
Computation must move to the data
– map-reduce, Storm, Giraph … all OK
– conventional relational models … not OK
One model is not enough
– must allow access by multiple models of computation
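A minimal sketch of "computation must move to the data": each partition stays where it is, a small function runs against it locally, and only the compact partial results travel. The partition layout, the node names and all function names are hypothetical, and the loop over a dictionary stands in for dispatching work to the machines that hold the data.

```python
def local_sum_count(rows):
    """Runs next to the data: reduces a whole partition to two numbers."""
    return sum(rows), len(rows)

def combine_mean(partials):
    """Runs centrally on the small partial results only."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")

def run_where_data_lives(partitions, local_fn, combine_fn):
    """Apply local_fn to each partition in place, then combine the partials."""
    partials = [local_fn(rows) for rows in partitions.values()]
    return combine_fn(partials)

if __name__ == "__main__":
    partitions = {"node-a": [1, 2, 3], "node-b": [4, 5]}
    print(run_where_data_lives(partitions, local_sum_count, combine_mean))
```

However large the partitions grow, each one contributes only a `(sum, count)` pair to the network traffic, which is the property the slide asks of any model of computation that shares the stored data.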
More Information
Contact:
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-paris-05-2012