Big data & frameworks: no book for you anymore
Post on 18-Jul-2015
Transcript
2
WHAT WE WANT
CHEAPER: no more reinventing the wheel.
FASTER: shorter time to market, since part of the job is already done.
BETTER: the quality of proven approaches.
FRAMEWORKS
4
CAN CHIMPS DO BIG DATA?
A book with this genuinely shocking title is available for pre-order. This is exactly what is happening in the Big Data industry right now.
Roses are red.
Violets are blue.
We do Hadoop
What about YOU?
8
FRAMEWORK: an essential supporting structure of a building, vehicle, or object.
In computer programming, a software framework is an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software.
9
FRAMEWORKS DICTATE APPROACH
Frameworks exist to lower the amount of work through reuse. The more you can reuse, the better. But complex frameworks are too massive to be flexible: they limit your solutions.
Doing Big Data, you usually build a unique solution.
13
An OPEN SOURCE framework for Big Data, covering both distributed storage and processing.
Provides RELIABILITY and fault tolerance by SOFTWARE design. Example: the file system uses a replication factor of 3 by default. Horizontal scalability from a single computer up to thousands of nodes.
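As a minimal sketch of the replication setting mentioned above (assuming a standard HDFS deployment; 3 is already the default value, so this only makes it explicit), the replication factor lives in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: set the default block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```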
INFRASTRUCTURE
3 SIMPLE HADOOP PRINCIPLES
14
HADOOP INFRASTRUCTURE AS
A FRAMEWORK
● Formed from a large number of unified nodes.
● Nodes are replaceable.
● Simple hardware without sophisticated I/O.
● Reliability by software.
● Horizontal scalability.
16
How everyone (who is usually selling something) depicts Hadoop complexity:
[Diagram: a GREAT BIG INFRASTRUCTURE around a SMALL CUTE CORE, with YOUR APPLICATION on top. SAFE and FRIENDLY.]
17
How it looks from the real user's point of view: a feeling that something is wrong.
[Diagram: YOUR APPLICATION (something you understand) sits on the CORE HADOOP, which is a COMPLETELY UNKNOWN INFRASTRUCTURE. FEAR OF the unknown.]
19
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to build unique solutions using the same approaches.
● So the bricks have to be flexible.
20
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to build robust solutions with high reliability.
● The bricks have to be simple and replaceable.
21
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● We have to be able to change our solution over time.
● The bricks have to be small.
22
WHAT BRICKS SHOULD WE TAKE TO BUILD A BIG DATA SOLUTION?
● As flexible as possible.
● Focused on a specific aspect, without requiring a large infrastructure.
● Simple and interchangeable.
23
HADOOP 2.x CORE AS A FRAMEWORK BASIC BLOCKS
● ZooKeeper as the coordination service.
● HDFS as the file system layer.
● YARN as resource management.
● MapReduce as the basic distributed processing option.
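As a toy illustration of the MapReduce building block above, here is the classic word count expressed as plain single-process Java (no Hadoop API; class and method names are ours), just to show the map / shuffle / reduce shape:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyMapReduce {
    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.toLowerCase().split("\\s+"))
                if (!w.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        return pairs;
    }

    // "Shuffle + reduce" phase: group the pairs by key and sum the values.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = reduce(map(Arrays.asList("big data", "big frameworks")));
        System.out.println(c); // {big=2, data=1, frameworks=1}
    }
}
```

Hadoop distributes exactly this shape of computation across nodes, moving the code to where the data blocks live.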
25
PACKAGING ... RUBIK'S CUBE STYLE
● Hadoop packaging is a non-trivial task.
● It gets even more complex when you add Apache Spark, Solr, or the HBase indexer.
26
Hadoop: don't do it yourself
REUSE AS IS
● The BASIC infrastructure is reusable enough to build on, at least until you know it really well.
● Do you have the manpower to re-implement it? If so, you had better contribute instead.
30
WHAT DO WE USUALLY EXPECT FROM A NEW FRAMEWORK?
● FASTER: frameworks provide a higher layer of abstraction, so coding goes faster.
● CHEAPER: some part of the work is already done.
● BETTER: top framework contributors are usually top engineers.
31
OOOPS...
● FASTER: a higher layer of abstraction, but additional time spent learning the new approach.
● CHEAPER: part of the work is already done, but there is the additional cost of maintaining the new framework.
● BETTER: top contributors are top engineers, but you get lots of defects due to your own lack of experience with the new framework.
32
BETTER, CHEAPER, FASTER: set the same promises against the additional maintenance cost, the learning time, and the defects born of inexperience, and the net benefit may turn out to be NONEXISTENT.
ONLY TWO?
33
JUST A FEW EXAMPLES
● Spring Batch: the main thread that started the Spring context forgot to check the task completion status.
● Apache Spark: persistence to disk was limited to 2 GB due to the ByteBuffer int limitation.
● Apache HBase still has no effective guard against client RPC timeouts.
● What about binary data like hashes? Still no effective out-of-the-box support.
ONLY REAL EXPERIENCE reveals such issues.
NEW FRAMEWORKS ARE ALWAYS A HEADACHE
35
JUST A LONGER PERSPECTIVE?
When you use the same approach for a long time, you apply it more and more effectively.
36
JAVA MESSAGE SERVICE releases: 1.0.2b (June 25, 2001), 1.1 (April 12, 2002), 2.0 (May 21, 2013).
APACHE SPARK releases: 0.9.0 (Feb 2, 2014), 1.0 (May 30, 2014), 1.1 (Sep 11, 2014), 1.2 (Dec 18, 2014).
JUST FEEL THE SPEED DIFFERENCE. BUT...
38
SO BIG DATA TECHNOLOGY BOOKS ARE ALWAYS OUTDATED
Great books, but by the time they are printed they are already old. Read the original e-books, which get updates.
40
FRAMEWORKS IN BIG DATA HAMSTERS vs HIPSTERS
"We hate frameworks! Only hardcore, only the JDK!"
"Give me a framework for every step!"
41
FRAMEWORKS IN BIG DATA HAMSTERS vs HIPSTERS
● Hive over HBase: significant overhead even compared to MapReduce access, but the simplest way to get at your HBase data for analytics.
● Apache HBase is the top OLTP solution for Hadoop, and Hive can provide a SQL connector to it.
● Use direct HBase RPC for OLTP, MapReduce or Spark when you need performance, and Hive when you need faster implementation.
● Crazy idea: Hive running over HBase table snapshots.
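A sketch of that Hive-to-HBase SQL connector; the table name, column family, and columns here are hypothetical, and the Hive HBase storage handler is assumed to be on the classpath with the HBase table already created:

```sql
-- Map an existing HBase table 'events' into Hive (hypothetical names).
CREATE EXTERNAL TABLE events (rowkey STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');

-- Analytics over OLTP data via SQL: slow to run, but fast to write.
SELECT payload, COUNT(*) FROM events GROUP BY payload;
```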
42
OUR BIG DATA FRAMEWORKS CRITERIA
● FAST FEATURE DEVELOPMENT
● ACTIVE COMMUNITY
● STABLE, REUSABLE ARCHITECTURE
43
ETL: THE COST OF FRAMEWORKS
● We do object transformations when we ETL from SQL to NoSQL objects.
● Practically any ORM framework eats at least 10% of CPU resources.
● Is that a small or a large amount? It depends on who pays...
[Diagram: an SQL server JOINing Table1 through Table4, feeding parallel ETL streams into BIG DATA shards.]
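To make the ORM cost concrete, here is a hypothetical micro-example (all names invented) contrasting a hand-written row-to-object mapping with a generic reflection-based one, the mechanism many ORM-style mappers build on. The reflective path does extra work per field (field lookup, access checks, boxing), which is where overhead like that 10% comes from:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class MappingCost {
    public static class User {
        public String name;
        public int age;
    }

    // Direct, hand-written mapping: one plain field assignment per column.
    public static User mapDirect(Map<String, Object> row) {
        User u = new User();
        u.name = (String) row.get("name");
        u.age = (Integer) row.get("age");
        return u;
    }

    // Generic, reflection-based mapping: per-column field lookup and
    // reflective set, the extra work a generic mapper pays for flexibility.
    public static User mapReflective(Map<String, Object> row) {
        User u = new User();
        try {
            for (Map.Entry<String, Object> col : row.entrySet()) {
                Field f = User.class.getField(col.getKey());
                f.set(u, col.getValue());
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        return u;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "alice");
        row.put("age", 30);
        User a = mapDirect(row);
        User b = mapReflective(row);
        System.out.println(a.name.equals(b.name) && a.age == b.age); // true
    }
}
```

Both produce the same object; only the per-row CPU cost differs, and on a cluster that cost is multiplied by every ETL stream.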
44
10% overhead...
● A single desktop application: computers usually have unused CPU power, so a 10% overhead is barely noticeable and the user accepts it.
● The user pays for the electricity and hardware.
45
10% overhead...
● Lots of mobile clients, which can tolerate a 10% performance degradation; the application still works.
● But all of your users pay for your 10% performance overhead.
46
10% overhead...
● A single-server solution: OK, you usually have 10% of capacity to spare.
● So you pay for the overhead but don't notice it until that capacity is needed. You still have the same one server.
47
10% overhead...
● A 10% overhead across 1000 servers with a properly distributed job means up to 100 additional servers needed.
● That is a direct maintenance cost.
IN CLUSTERS YOU PAY FOR OVERHEAD DIRECTLY, WITH ADDITIONAL CLUSTER NODES.
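The arithmetic above can be sketched with a hypothetical helper (names are ours), using the slide's simple estimate of overhead fraction times cluster size:

```java
public class OverheadCost {
    // The slide's back-of-the-envelope estimate: extra nodes needed is
    // roughly the overhead fraction multiplied by the cluster size.
    public static long extraNodes(long nodes, double overhead) {
        return Math.round(nodes * overhead);
    }

    public static void main(String[] args) {
        System.out.println(extraNodes(1000, 0.10)); // 100
    }
}
```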
48
WHAT FRAMEWORK IS REALLY GOOD FOR YOU?
● If you know the amount (and cost) of work it would take to replace a framework, that framework is really good for you.
49
MAKING YOUR OWN FRAMEWORK
● The most common reason for building your own framework is … the growing complexity and support cost of what you already have.
● Developing a new framework and migrating to it can be cheaper than supporting existing solutions.
● Or you don't want to depend on an existing framework's development.
50
MAKING A FRAMEWORK, LAZY STYLE
● First build multiple solutions, then integrate them into a single approach.
● GOOD: you only integrate what is already used, so less wasted work.
● BAD: you act reactively.
51
MAKING A FRAMEWORK, PROACTIVE STYLE
● You improve the framework before the actual need arises.
● GOOD: you are guided by the approach, not the need, so the design is usually cleaner.
● BAD: you are more likely to build things that are never needed.
52
OUTSIDE YOUR TEAM
● Great, you have an additional workforce. But from now on you also have external support tickets.
● You can usually still control your users, so major changes remain possible, just harder.
● Pay more attention to documentation and training for the other teams. It pays back.
53
OUTSIDE YOUR COMPANY
● You gain an additional workforce as people start contributing to your framework. Don't be too optimistic, though.
● Community support is good, but you have to support community applications in return.
● You are no longer flexible: you don't control the users of your framework.
54
LESSONS LEARNED: CORE
● Avoid inventing a unique approach for every Big Data solution. It is critical to have a good, relatively stable foundation.
● Your Big Data CORE architecture should be a layered infrastructure built from small, simple, unified, replaceable components (the UNIX way).
● Be ready for packaging issues, but try to reuse as much as possible at the CORE layer.
55
LESSONS LEARNED: BEYOND THE CORE
● When selecting frameworks to extend your Big Data core, prefer solutions with a stable approach, flexible functionality, and a healthy community. Revise your choices regularly: the world changes fast.
● Prefer contributing to a good existing solution over starting your own.
● The more frequently you change something, the higher-layer tool you need for it. But in Big Data you pay directly for any performance overhead.
● If you have started your own framework: the more popular it becomes, the less freedom you have to modify it, so flexibility alone is a bad reason to start one.