The Acceleration of Innovation in Big Data // Stefan Groschupf, Datameer [FirstMark's Data Driven]

Apr 14, 2017

Transcript
Page 1: The Acceleration of Innovation in Big Data // Stefan Groschupf, Datameer [FirstMark's Data Driven]

Stay Curious

Accelerated Innovation

Page 2

Working With Over 200 Customers

Page 3

Gordon Moore

Page 4

Moore's Law

[Chart: Calculations / Sec / $1000 (log scale, 10^2 to 10^16) by year, 1950 to 2025. Labeled machines: DEC PDP-1, Altair 8800, Apple II, Compaq Deskpro 386, Pentium, Core 2 Duo, Core i7 Quad]

Page 5

Ray Kurzweil

Page 6

Accelerated Innovation

[Chart: Calculations / Sec / $1000 (log scale, 10^2 to 10^16) by year, 1950 to 2025, with a "Stone Age" label added. Labeled machines: DEC PDP-1, Altair 8800, Apple II, Compaq Deskpro 386, Pentium, Core 2 Duo, Core i7 Quad]

Page 7

Acceleration of Innovation in Big Data

[Chart: Hadoop ecosystem projects stacked per year; each year's stack contains all earlier projects plus the additions listed]

2006: Core Hadoop (HDFS, MR)
2008: HBase, ZooKeeper
2009: Hive, Pig, Mahout
2010: Sqoop, Whirr, Avro
2011: Flume, Bigtop, Oozie, MRUnit, HCatalog
2012: Spark, Impala, Solr, Kafka
2013: Storm, Parquet, Sentry
2014: Flink, Drill, Ranger, Ambari, Ignite
Present: Samza, Kudu, Samsara, Atlas, Apex, NiFi

Page 8

Acceleration of Innovation in Big Data

[Chart: the same Hadoop ecosystem timeline as on Page 7, from Core Hadoop (HDFS, MR) in 2006 to the full stack at Present]

Acceleration of Complexity

Page 9

Hadoop - Built for NoSQL

DB | ETL | Reporting | Schema on Write
Hadoop | Raw Load | View | Schema on Read
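
The contrast between schema on write and schema on read can be illustrated with a minimal Python sketch. The record layout, field names, and helper functions below are hypothetical and do not come from the talk. The sketch only assumes that schema on write enforces the schema during the ETL load, while schema on read stores raw lines and applies the schema when a view is computed.

```python
import json

# Hypothetical raw input lines; the fields are illustrative only.
RAW_LINES = [
    '{"user": "a", "amount": "12.5"}',
    '{"user": "b", "amount": "oops"}',
    '{"user": "c", "amount": "7"}',
]


def parse(line):
    """Apply the assumed schema: user is a string, amount is a float."""
    record = json.loads(line)
    return {"user": str(record["user"]), "amount": float(record["amount"])}


def load_schema_on_write(lines):
    """Schema on write: the ETL step enforces the schema before storing.

    A single malformed record raises ValueError and nothing is stored.
    """
    return [parse(line) for line in lines]


def load_schema_on_read(lines):
    """Schema on read: the raw lines are stored untouched."""
    return list(lines)


def view_total(raw_store):
    """Apply the schema only when the view is computed.

    Records that do not fit are skipped and counted; that policy is a
    choice of this sketch, not something the slide prescribes.
    """
    total = 0.0
    skipped = 0
    for line in raw_store:
        try:
            total += parse(line)["amount"]
        except (ValueError, KeyError):
            skipped += 1
    return total, skipped
```

In this sketch the raw load always succeeds and the malformed record only surfaces when the view is computed, whereas the schema-on-write load rejects the whole batch up front.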

Page 10

Hadoop - Disruptive

[Chart: time axis from 0 s to 30,000 s; Hadoop vs. DBMS-X for three configurations: 25x40GB, 50x20GB, 200x10GB]

https://wiki.umiacs.umd.edu/ccc/images/8/8c/CLuE-Madden.pdf

Page 11

Spark - Disruptive?

[Chart: time axis from 0 min to 60 min; workloads: Join, Machine Learning; engines compared: Map Reduce, Tez, Spark]

Datameer Benchmark

Page 12

Flink - Already Faster

[Chart: time axis from 0 s to 5000 s; data size: 10GB/Node, 20GB/Node, 40GB/Node, 80GB/Node, 160GB/Node; engines compared: Flink, Spark]

http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink

Page 13

Stack

Hardware

Data Center OS (e.g. Mesosphere)

Storage

Compute

More Difficult to Change

Hadoop Killer!

Page 14

Yarn vs. Mesosphere

Yarn | Mesosphere
Unix Process | Linux Containers
Resources Requested | Resources Offered
Batch Centric | Flexible on Job
Internal Scheduling | Client Scheduling

Page 15

Data Scientist

80% Data Preparation

20% Machine Learning

Feature Selection

Page 16

Deep Learning

Unsupervised Learning of Features

Data Scientist Replacement?

Page 17

@StefanGroschupf