Big data & frameworks: no book for you anymore
Post on 18-Jul-2015
Transcript
2
WHAT WE WANT
CHEAPER: no more reinventing the wheel.
FASTER: shorter time to market, since part of the job is already done.
BETTER: the quality of proven approaches.
FRAMEWORKS
4
CAN CHIMPS DO BIG DATA?
A book with this genuinely shocking title is available for pre-order. This is exactly what is happening in the Big Data industry right now.
Roses are red.
Violets are blue.
We do Hadoop
What about YOU?
8
FRAMEWORK: an essential supporting structure of a building, vehicle, or object.
In computer programming, a software framework is an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software.
9
FRAMEWORKS DICTATE APPROACH
Frameworks exist to lower the amount of work through reuse. The more you can reuse, the better. But complex frameworks are too massive to be flexible: they limit your solutions.
Doing Big Data, you usually build a unique solution.
13
An OPEN SOURCE framework for Big Data, covering both distributed storage and processing.
Provides RELIABILITY and fault tolerance by SOFTWARE design. Example: the file system uses a replication factor of 3 by default. Horizontal scalability from a single computer up to thousands of nodes.
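As a minimal sketch of the replication setting mentioned above (assuming a standard HDFS deployment; 3 is already the default value, so this only makes it explicit), the replication factor lives in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: set the default block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```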
INFRASTRUCTURE
3 SIMPLE HADOOP PRINCIPLES
14
HADOOP INFRASTRUCTURE AS
A FRAMEWORK
● Formed from a large number of unified nodes.
● Nodes are replaceable.
● Simple hardware without sophisticated I/O.
● Reliability by software.
● Horizontal scalability.
16
How everyone (who is usually selling something) depicts Hadoop complexity:
[Diagram: a GREAT BIG INFRASTRUCTURE around a SMALL CUTE CORE, with YOUR APPLICATION on top. SAFE and FRIENDLY.]
17
How it looks from the real user's point of view: a feeling that something is wrong.
[Diagram: YOUR APPLICATION (something you understand) sits on the CORE HADOOP, which is a COMPLETELY UNKNOWN INFRASTRUCTURE. FEAR OF the unknown.]
19
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to build unique solutions using the same approaches.
● So the bricks have to be flexible.
20
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to build robust solutions with high reliability.
● The bricks have to be simple and replaceable.
21
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to be able to change our solution over time.
● The bricks have to be small.
22
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● As flexible as possible.
● Focused on a specific aspect, without requiring a large infrastructure.
● Simple and interchangeable.
23
HADOOP 2.x CORE AS A FRAMEWORK BASIC BLOCKS
● ZooKeeper as the coordination service.
● HDFS as the file system layer.
● YARN as resource management.
● MapReduce as the basic distributed processing option.
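As a toy illustration of the MapReduce building block above, here is the classic word count expressed as plain single-process Java (no Hadoop API; class and method names are ours), just to show the map / shuffle / reduce shape:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyMapReduce {
    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.toLowerCase().split("\\s+"))
                if (!w.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        return pairs;
    }

    // "Shuffle + reduce" phase: group the pairs by key and sum the values.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = reduce(map(Arrays.asList("big data", "big frameworks")));
        System.out.println(c); // {big=2, data=1, frameworks=1}
    }
}
```

Hadoop distributes exactly this shape of computation across nodes, moving the code to where the data blocks live.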
25
PACKAGING ... RUBIK'S CUBE STYLE
● Hadoop packaging is a non-trivial task.
● It gets even more complex when you add Apache Spark, Solr, or the HBase indexer.
26
Hadoop: don't do it yourself
REUSE AS IS
● The BASIC infrastructure is reusable enough to build on, at least until you know it really well.
● Do you have the manpower to re-implement it? If so, you had better contribute instead.
30
WHAT DO WE USUALLY EXPECT FROM A NEW FRAMEWORK?
● FASTER: frameworks provide a higher layer of abstraction, so coding goes faster.
● CHEAPER: some part of the work is already done.
● BETTER: top framework contributors are usually top engineers.
31
OOOPS...
● FASTER: a higher layer of abstraction, but additional time spent learning the new approach.
● CHEAPER: part of the work is already done, but there is the additional cost of maintaining the new framework.
● BETTER: top contributors are top engineers, but you get lots of defects due to your own lack of experience with the new framework.
32
BETTER, CHEAPER, FASTER: set the same promises against the additional maintenance cost, the learning time, and the defects born of inexperience, and the net benefit may turn out to be NONEXISTENT.
ONLY TWO?
33
JUST A FEW EXAMPLES
● Spring Batch: the main thread that started the Spring context forgot to check the task completion status.
● Apache Spark: persistence to disk was limited to 2 GB due to the ByteBuffer int limitation.
● Apache HBase still has no effective guard against client RPC timeouts.
● What about binary data like hashes? Still no effective out-of-the-box support.
ONLY REAL EXPERIENCE reveals such issues.
NEW FRAMEWORKS ARE ALWAYS A HEADACHE
35
JUST A LONGER PERSPECTIVE?
When you use the same approach for a long time, you apply it more and more effectively.
36
JAVA MESSAGE SERVICE releases: 1.0.2b (June 25, 2001), 1.1 (April 12, 2002), 2.0 (May 21, 2013).
APACHE SPARK releases: 0.9.0 (Feb 2, 2014), 1.0 (May 30, 2014), 1.1 (Sep 11, 2014), 1.2 (Dec 18, 2014).
JUST FEEL THE SPEED DIFFERENCE. BUT...
38
SO BIG DATA TECHNOLOGY BOOKS ARE ALWAYS OUTDATED
Great books, but by the time they are printed they are already old. Read the original e-books, which get updates.
40
FRAMEWORKS IN BIG DATA HAMSTERS vs HIPSTERS
"We hate frameworks! Only hardcore, only the JDK!"
"Give me a framework for every step!"
41
FRAMEWORKS IN BIG DATA HAMSTERS vs HIPSTERS
● Hive over HBase: significant overhead even compared to MapReduce access, but the simplest way to get at your HBase data for analytics.
● Apache HBase is the top OLTP solution for Hadoop, and Hive can provide a SQL connector to it.
● Use direct HBase RPC for OLTP, MapReduce or Spark when you need performance, and Hive when you need faster implementation.
● Crazy idea: Hive running over HBase table snapshots.
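A sketch of that Hive-to-HBase SQL connector; the table name, column family, and columns here are hypothetical, and the Hive HBase storage handler is assumed to be on the classpath with the HBase table already created:

```sql
-- Map an existing HBase table 'events' into Hive (hypothetical names).
CREATE EXTERNAL TABLE events (rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');

-- Analytics over OLTP data via SQL: slow to run, but fast to write.
SELECT payload, COUNT(*) FROM events GROUP BY payload;
```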
42
OUR BIG DATA FRAMEWORKS CRITERIA
● FAST FEATURE DEVELOPMENT
● ACTIVE COMMUNITY
● STABLE, REUSABLE ARCHITECTURE
43
ETL: THE COST OF FRAMEWORKS
● We do object transformations when we ETL from SQL to NoSQL objects.
● Practically any ORM framework eats at least 10% of CPU resources.
● Is that a small or a large amount? It depends on who pays...
[Diagram: an SQL server JOINing Table1 through Table4, feeding parallel ETL streams into BIG DATA shards.]
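To make the ORM cost concrete, here is a hypothetical micro-example (all names invented) contrasting a hand-written row-to-object mapping with a generic reflection-based one, the mechanism many ORM-style mappers build on. The reflective path does extra work per field (field lookup, access checks, boxing), which is where overhead like that 10% comes from:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class MappingCost {
    public static class User {
        public String name;
        public int age;
    }

    // Direct, hand-written mapping: one plain field assignment per column.
    public static User mapDirect(Map<String, Object> row) {
        User u = new User();
        u.name = (String) row.get("name");
        u.age = (Integer) row.get("age");
        return u;
    }

    // Generic, reflection-based mapping: per-column field lookup and
    // reflective set, the extra work a generic mapper pays for flexibility.
    public static User mapReflective(Map<String, Object> row) {
        User u = new User();
        try {
            for (Map.Entry<String, Object> col : row.entrySet()) {
                Field f = User.class.getField(col.getKey());
                f.set(u, col.getValue());
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        return u;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "alice");
        row.put("age", 30);
        User a = mapDirect(row);
        User b = mapReflective(row);
        System.out.println(a.name.equals(b.name) && a.age == b.age); // true
    }
}
```

Both produce the same object; only the per-row CPU cost differs, and on a cluster that cost is multiplied by every ETL stream.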
44
10% overhead...
● A single desktop application: computers usually have unused CPU power, so a 10% overhead is barely noticeable and the user accepts it.
● The user pays for the electricity and hardware.
45
10% overhead...
● Lots of mobile clients, which can tolerate a 10% performance degradation; the application still works.
● But all of your users pay for your 10% performance overhead.
46
10% overhead...
● A single-server solution: OK, you usually have 10% of capacity to spare.
● So you pay for the overhead but don't notice it until that capacity is needed. You still have the same one server.
47
10% overhead...
● A 10% overhead across 1000 servers with a properly distributed job means up to 100 additional servers needed.
● That is a direct maintenance cost.
IN CLUSTERS YOU PAY FOR OVERHEAD DIRECTLY, WITH ADDITIONAL CLUSTER NODES.
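The arithmetic above can be sketched with a hypothetical helper (names are ours), using the slide's simple estimate of overhead fraction times cluster size:

```java
public class OverheadCost {
    // The slide's back-of-the-envelope estimate: extra nodes needed is
    // roughly the overhead fraction multiplied by the cluster size.
    public static long extraNodes(long nodes, double overhead) {
        return Math.round(nodes * overhead);
    }

    public static void main(String[] args) {
        System.out.println(extraNodes(1000, 0.10)); // 100
    }
}
```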
48
WHAT FRAMEWORK IS REALLY GOOD FOR YOU?
● If you know the amount (and cost) of work it would take to replace a framework, that framework is really good for you.
49
MAKING YOUR OWN FRAMEWORK
● The most common reason for building your own framework is … the growing complexity and support cost of what you already have.
● Developing a new framework and migrating to it can be cheaper than supporting existing solutions.
● Or you don't want to depend on an existing framework's development.
50
MAKING A FRAMEWORK, LAZY STYLE
● First build multiple solutions, then integrate them into a single approach.
● GOOD: you only integrate what is already used, so less wasted work.
● BAD: you act reactively.
51
MAKING A FRAMEWORK, PROACTIVE STYLE
● You improve the framework before the actual need arises.
● GOOD: you are guided by the approach, not the need, so the design is usually cleaner.
● BAD: you are more likely to build things that are never needed.
52
OUTSIDE YOUR TEAM
● Great, you have an additional workforce. But from now on you also have external support tickets.
● You can usually still control your users, so major changes remain possible, just harder.
● Pay more attention to documentation and training for the other teams. It pays back.
53
OUTSIDE YOUR COMPANY
● You gain an additional workforce as people start contributing to your framework. Don't be too optimistic, though.
● Community support is good, but you have to support community applications in return.
● You are no longer flexible: you don't control the users of your framework.
54
LESSONS LEARNED: CORE
● Avoid inventing a unique approach for every Big Data solution. It is critical to have a good, relatively stable foundation.
● Your Big Data CORE architecture should be a layered infrastructure built from small, simple, unified, replaceable components (the UNIX way).
● Be ready for packaging issues, but try to reuse as much as possible at the CORE layer.
55
LESSONS LEARNED: BEYOND THE CORE
● When selecting frameworks to extend your Big Data core, prefer solutions with a stable approach, flexible functionality, and a healthy community. Revise your choices regularly: the world changes fast.
● Prefer contributing to a good existing solution over starting your own.
● The more frequently you change something, the higher-layer tool you need for it. But in Big Data you pay directly for any performance overhead.
● If you have started your own framework: the more popular it becomes, the less freedom you have to modify it, so flexibility alone is a bad reason to start one.